Sunday, March 4, 2007

Profiling code on one processor

Before starting on the multiprocessor work, I would like to do some general optimization and profiling of the current JPEG code running on one processor. It's better to do it now than after partitioning. Here is what I did:

1) Adding a barrel shifter to the processor. The barrel shifter significantly improves performance and code size because there is a lot of shifting in the DCT, quantizer and color conversion. Now it takes only one instruction to shift a register by any number of positions, whereas before a shift took several instructions, one instruction per bit. dct() went from 3.91s to 2.26s, and zzq() went from 3.31s to 2.05s.

2) Using 32-bit local variables as much as possible. A RISC processor is most efficient when dealing with 32-bit variables. In fact, most RISC processors shift left and right for every manipulation of 16-bit and 8-bit variables. This reduced dct() from 6s to 3.91s.
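The two optimizations above can be sketched in plain C. This is only an illustration, not the actual JPEG code: the function names are made up for this example, and the exact instructions emitted depend on the compiler and on how the processor is configured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Without a barrel shifter: shifting by n positions behaves like
   n single-bit shifts, roughly one instruction per bit. */
uint32_t shift_left_serial(uint32_t x, unsigned n)
{
    assert(n < 32);
    for (unsigned i = 0; i < n; i++) {
        x <<= 1;
    }
    return x;
}

/* With a barrel shifter: the same result from a single shift. */
uint32_t shift_left_barrel(uint32_t x, unsigned n)
{
    assert(n < 32);
    return x << n;
}

/* 16-bit local accumulator: the value has to be narrowed back to
   16 bits after every addition, which costs extra instructions. */
int16_t sum_narrow(const int16_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int16_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = (int16_t)(acc + v[i]);
    }
    return acc;
}

/* 32-bit local accumulator: stays in a full register the whole time. */
int32_t sum_wide(const int16_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += v[i];
    }
    return acc;
}
```

Both shift functions return the same value, and both sums agree as long as the 16-bit accumulator does not overflow. The difference is only in how much work the processor does to get there, which is what the disassembly should show.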

Besides that, I also did some work to simplify the code, which also improved the speed.

1) Simplify VLC and remove unnecessary functions.
2) Use the fast DCT algorithm from the well-known open source Telenor H.263 codec to replace the original code.
3) Add a visual progress indication so that I can see the result of an optimization more easily.
4) Move the code that writes the result to the CF card out of VLC. VLC simply writes the result into a buffer in external memory, and the buffer is written to the CF card after the encoding is finished. This way the time consumed by VLC can be determined more precisely.
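The idea in item 4 can be sketched as below. This is a minimal sketch and not the real encoder code: the out_buf type and the out_* functions are hypothetical names, and a standard FILE pointer stands in for whatever file API the CF card driver provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Output buffer filled by the VLC stage. In the setup described
   above, the storage would be placed in external memory. */
typedef struct {
    uint8_t *data;
    size_t   cap;
    size_t   len;
} out_buf;

void out_init(out_buf *b, uint8_t *storage, size_t cap)
{
    assert(b != NULL && storage != NULL);
    b->data = storage;
    b->cap  = cap;
    b->len  = 0;
}

/* Called by VLC for every output byte. No card access happens here.
   Returns 0 on success, -1 if the buffer is full. */
int out_put(out_buf *b, uint8_t byte)
{
    assert(b != NULL);
    if (b->len >= b->cap) {
        return -1;
    }
    b->data[b->len++] = byte;
    return 0;
}

/* Called once after encoding has finished. Returns bytes written. */
size_t out_flush(const out_buf *b, FILE *fp)
{
    assert(b != NULL && fp != NULL);
    return fwrite(b->data, 1, b->len, fp);
}
```

With this split, the profile of VLC only contains the cost of out_put(), and the slow card write shows up separately under out_flush().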

Xilinx provides a tool for profiling, mb-gprof. Its use is described in the Xilinx "Platform Studio User Guide", which has a detailed description of how to profile on an FPGA board. Basically we need to:
1) add a profile timer core. You can use opb_timer_0. Set the correct address and other signals for the timer. Connect the timer interrupt to the processor.
2) enable sw_intrusive_profiling and set profile_timer.
3) rebuild the bitstream and software libraries.
4) compile the code with -pg and download it.

If you just download it, the code with profiling doesn't always work like normal code. You need to use XMD. The steps are:
1) start XMD.
2) 'connect mb mdm'.
3) 'profile -config sampling_freq_hz 10000 binsize 4 profile_mem 0x3e000000' to set profiler parameters.
4) 'dow mb-bmp2jpg/executable.elf' to download the code.
5) 'bps exit' to set a breakpoint at exit.
6) 'con' to run.
7) when it stops at exit, use 'profile' to read the data back from the target and save the profiling information to 'gmon.out' in the project directory.
8) use 'mb-gprof mb-bmp2jpg/executable.elf gmon.out > profile.txt' to save the profiling result.
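To make the sampling_freq_hz and binsize parameters more concrete, here is a rough sketch of how a gprof-style sampling profiler generally works: on every timer interrupt the interrupted program counter is mapped to a histogram bin and that bin is incremented. This is not the Xilinx implementation. The pc_hist type and its functions are hypothetical, and treating binsize as bytes of code per bin is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t  text_start; /* lowest profiled code address */
    uint32_t  bin_size;   /* bytes of code per bin (assumed unit) */
    uint32_t *bins;       /* one sample counter per bin */
    size_t    nbins;
} pc_hist;

/* What a profile timer interrupt handler would do with the
   interrupted program counter. */
void pc_hist_sample(pc_hist *h, uint32_t pc)
{
    assert(h != NULL && h->bin_size > 0);
    if (pc < h->text_start) {
        return; /* outside the profiled range */
    }
    size_t idx = (pc - h->text_start) / h->bin_size;
    if (idx < h->nbins) {
        h->bins[idx]++;
    }
}

/* Number of samples whose bin starts inside [lo, hi), for example
   the address range of one function. */
uint32_t pc_hist_samples(const pc_hist *h, uint32_t lo, uint32_t hi)
{
    assert(h != NULL);
    uint32_t total = 0;
    for (size_t i = 0; i < h->nbins; i++) {
        uint32_t addr = h->text_start + (uint32_t)i * h->bin_size;
        if (addr >= lo && addr < hi) {
            total += h->bins[i];
        }
    }
    return total;
}
```

The time attributed to a function is then roughly its sample count divided by the sampling frequency, so at 10000 Hz each sample stands for 0.1 ms. The handler itself runs on the same processor, which is one reason profiled code runs slower.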

It's quite interesting to see the profiling result of every optimization I did because it gives quantitative feedback. However, the result might be inaccurate, because I can easily notice that code with profiling runs much slower than code without profiling. Also, the profiler relies on opb_timer and external memory to read and store timing information, and a software profiling tool may not be able to measure the delay on the OPB bus and external memory.

It would be worthwhile to explore the accuracy of the software profiler. On the other hand, for further work I think I will probably need a better hardware profiler. It needs to be precise and predictable, and it also needs to support multiprocessor profiling. In the meantime, I can start to partition the code across two processors and measure the improvement.

Some hints:
1) For disassembly I use mb-objdump -D -S *.elf. Checking the generated assembly code is very important for optimizing.
2) The CF card driver from Xilinx does not seem stable. First, it doesn't recognize a CF card formatted by WinXP. The card must be formatted by a camera or under Linux (mkdosfs /dev/sdb1). Second, if a CF card read or write is interrupted (for instance by starting XMD before reading finishes), I sometimes need to reformat the CF card. I think using the network with tftp is a good alternative.
3) It can be useful to measure the impact of using external memory as a buffer and the impact of the cache. Currently the accesses to external memory are not aligned to cache lines, I suppose.
4) Now I use a 1600x1200 24-bit BMP file as the baseline. Meanwhile I removed color subsampling because that code does not seem stable. This in fact doubles the computation required.
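The "doubles" in hint 4 can be checked by counting 8x8 blocks. The sketch below assumes the removed subsampling halved the chroma resolution in both directions (4:2:0 style), which the notes above do not state explicitly. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8x8 blocks needed to cover one plane, rounding up. */
uint32_t blocks_in_plane(uint32_t w, uint32_t h)
{
    return ((w + 7) / 8) * ((h + 7) / 8);
}

/* Total blocks for one luma and two chroma planes.
   sub = 1: no subsampling, chroma at full resolution.
   sub = 2: chroma halved horizontally and vertically (assumed). */
uint32_t blocks_per_image(uint32_t w, uint32_t h, uint32_t sub)
{
    assert(sub == 1 || sub == 2);
    uint32_t cw = (w + sub - 1) / sub;
    uint32_t ch = (h + sub - 1) / sub;
    return blocks_in_plane(w, h) + 2 * blocks_in_plane(cw, ch);
}
```

For the 1600x1200 baseline, blocks_per_image(1600, 1200, 2) gives 30000 + 2 * 7500 = 45000 blocks, and blocks_per_image(1600, 1200, 1) gives 3 * 30000 = 90000 blocks. Since DCT, quantization and VLC all work per block, that is twice the work.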
