Since dct() takes most of the time in the one-processor implementation, it is logical to move it onto another processor as my first step toward a soft multiprocessor. The hardware work is similar to what I did before: I add one more MicroBlaze with its own local bus and local memory, plus a communication memory. The two MicroBlaze processors talk to each other through an on-chip 8 KB dual-port memory.
I chose dual-port memory because it is efficient for high-volume data transfer. The color conversion function can write its output directly into the dual-port memory, and dct() can read its input from there. The same goes for the dct() output. No additional copy is involved. The design is stable and works well.
The software design, however, is a bit tricky. At first I implemented the system in RPC mode: MicroBlaze 0 writes the dct() input into the dual-port memory, waits for dct() to finish, and then continues with zzq and VLC encoding. Not surprisingly, profiling shows that this design is actually slower than the one-MicroBlaze implementation.
The reason is that the processors do not run concurrently in RPC mode. RPC only makes sense if the remote processor is much faster than the caller at that 'procedure'. That is not my case.
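The argument can be written down as a rough per-macroblock timing model. This is only a sketch: the symbols are introduced here for illustration (T_cc for reading plus color conversion, T_dct for the local DCT, T'_dct for the DCT on the second processor, T_vlc for zzq, VLC and write-back, T_sync for the handshake cost) and none of them were measured individually.

```latex
\begin{align*}
T_{\text{single}} &= T_{\text{cc}} + T_{\text{dct}} + T_{\text{vlc}} \\
T_{\text{rpc}}    &= T_{\text{cc}} + T_{\text{sync}} + T'_{\text{dct}} + T_{\text{vlc}} \\
T_{\text{rpc}} < T_{\text{single}} &\iff T'_{\text{dct}} + T_{\text{sync}} < T_{\text{dct}} \\
T_{\text{concurrent}} &\approx \max\left(T_{\text{cc}} + T_{\text{vlc}},\; T'_{\text{dct}}\right)
\end{align*}
```

With two identical MicroBlaze cores, T'_dct equals T_dct, so the RPC version can only be slower than the single-processor one by the handshake cost. The last line is the best case when both processors overlap their work and enough buffers are available.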
To get a concurrent design, the software must be restructured. I partition the main loop on processor 0 into two tasks: task 1 does the external memory reading and color conversion, and task 2 does zzq, VLC and writing back to external memory.
The best way to run two tasks concurrently is an RTOS, but porting one is too complex at this moment, so I chose an easier way. Task 1 lives in the main loop and always checks whether task 2 is ready. If it is, task 2 gets the CPU cycles.
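A minimal sketch of this polling scheme, written so it can run on a PC. Everything here is hypothetical: the counters stand in for real macroblock work, and `sim_processor1_tick` is only a stand-in for the second MicroBlaze, which in reality runs on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pipeline state; the names are illustrative only. */
typedef struct {
    int total;               /* macroblocks in the image                   */
    int converted;           /* blocks task 1 has handed to the DCT        */
    int dct_done;            /* blocks the DCT processor has finished      */
    int encoded;             /* blocks task 2 (zzq + VLC) has consumed     */
    int dct_ticks_per_block; /* simulation only: DCT cost in loop passes   */
    int dct_timer;           /* simulation only                            */
} pipe_state;

static bool task2_ready(const pipe_state *s)   { return s->encoded < s->dct_done; }
static bool task1_can_run(const pipe_state *s) { return s->converted < s->total; }
static void task1_step(pipe_state *s) { s->converted++; } /* read + color convert */
static void task2_step(pipe_state *s) { s->encoded++; }   /* zzq + VLC + write    */

/* Stand-in for processor 1: finishes one DCT every dct_ticks_per_block passes. */
static void sim_processor1_tick(pipe_state *s) {
    if (s->dct_done < s->converted && ++s->dct_timer >= s->dct_ticks_per_block) {
        s->dct_done++;
        s->dct_timer = 0;
    }
}

/* Main loop on processor 0. Task 2 gets the CPU whenever it is ready;
 * otherwise task 1 runs. Returns the number of passes with nothing to do. */
int run_main_loop(pipe_state *s) {
    int idle = 0;
    assert(s->dct_ticks_per_block >= 1);
    while (s->encoded < s->total) {
        if (task2_ready(s))        task2_step(s);
        else if (task1_can_run(s)) task1_step(s);
        else                       idle++;
        sim_processor1_tick(s);
    }
    return idle;
}
```

The returned idle count is a cheap indicator of how often processor 0 waits on the DCT, which is the number a scheme like this tries to keep small.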
That leads to another problem: more buffers are needed. There was only one buffer available for each task, which is not enough. Suppose dct() is slower than color conversion. Processor 0 then cannot do anything after color conversion, because task 2 cannot run before dct() is done. In that case processor 0 should keep running task 1, that is, color conversion for the next macroblock.
I designed a linked buffer list to replace the static buffers. Every time color conversion starts a new macroblock, it allocates a new buffer dynamically, and that buffer is freed by processor 1 after it finishes the DCT. Likewise, processor 1 allocates a new buffer when it starts the DCT for a new macroblock, and that buffer is freed after processor 0 finishes VLC on it. All buffers are located in the dual-port memory, so, as before, no additional copy is needed.
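A minimal sketch of such a linked buffer list, assuming a fixed pool of macroblock buffers inside the shared memory. The pool size, the payload size, the index-based links and all names are assumptions made for illustration, not the original code. The sketch is single-threaded: on real hardware both processors touch the lists, so the updates would additionally need mutual exclusion or a strict single-writer discipline, and the shared region would be accessed as volatile.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BUFS 8      /* assumed pool size; must fit in the 8 KB memory   */
#define MB_WORDS 64     /* assumed payload: one 8x8 block of 32-bit samples */
#define NIL      0xFFu  /* "no buffer" marker                               */

typedef struct {
    uint8_t next;               /* index of the next buffer in the same list */
    int32_t data[MB_WORDS];
} mb_buf;

typedef struct { uint8_t head, tail; } buf_list;

/* Layout of the shared region. Indices are used instead of pointers so the
 * links stay valid even if the two processors map the memory differently. */
typedef struct {
    mb_buf   bufs[NUM_BUFS];
    buf_list free_list;         /* unused buffers              */
    buf_list to_dct;            /* color conversion -> DCT     */
    buf_list to_vlc;            /* DCT -> zzq/VLC              */
} shared_mem;

/* Append buffer idx to the tail of list l. */
void list_put(shared_mem *m, buf_list *l, uint8_t idx) {
    assert(idx < NUM_BUFS);
    m->bufs[idx].next = NIL;
    if (l->head == NIL) l->head = idx;
    else                m->bufs[l->tail].next = idx;
    l->tail = idx;
}

/* Remove and return the head of list l, or NIL if the list is empty. */
uint8_t list_get(shared_mem *m, buf_list *l) {
    uint8_t idx = l->head;
    if (idx != NIL) {
        l->head = m->bufs[idx].next;
        if (l->head == NIL) l->tail = NIL;
    }
    return idx;
}

/* Put every buffer on the free list and empty both queues. */
void shm_init(shared_mem *m) {
    const buf_list empty = { NIL, NIL };
    m->free_list = empty;
    m->to_dct    = empty;
    m->to_vlc    = empty;
    for (uint8_t i = 0; i < NUM_BUFS; i++) list_put(m, &m->free_list, i);
}

uint8_t buf_alloc(shared_mem *m)             { return list_get(m, &m->free_list); }
void    buf_free(shared_mem *m, uint8_t idx) { list_put(m, &m->free_list, idx); }
```

In this sketch color conversion would call `buf_alloc`, fill the buffer and `list_put` it on `to_dct`. The DCT side would `list_get` it, allocate an output buffer, free the input and queue the output on `to_vlc`. When `buf_alloc` returns `NIL`, the producer simply has to wait.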
After this work, something interesting happens: the system gets noticeably faster. Profiling shows that the total time on processor 0 is 2.2 s less than in the one-processor implementation. That is proof that the processors run concurrently! It is the first time I have seen a performance improvement from a soft multiprocessor, although I already knew it was possible.
Later I made two additional improvements. The first was to move zzq() onto processor 1, because the load on processor 1 is much lower than on processor 0. The second was to add one more MicroBlaze processor for zzq and VLC. Both give a noticeable improvement.
From this exercise it is clear that the programming model and the communication for a multiprocessor are quite different from those for a single processor. To put more processors on a chip, an efficient and robust linked buffer list is essential. Fortunately it does not look too difficult at this moment.
The workload for buffer management and processor management can grow as more processors are added. A PowerPC is better suited than a MicroBlaze to this job. It would also be better to replace the CF card with a network connection.
(By the way, the Ethernet driver in EDK is not free. Xilinx says that EDK users can evaluate that IP for one year, but I cannot find out how to activate the evaluation license.)
In the long run we probably need both dual-port memory and a message-passing mechanism. Dual-port memory (or DMA) can be used for bulk data, while short messages are more flexible and cheaper.
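One possible shape for the short-message side is a small single-producer, single-consumer mailbox of 32-bit words. This is a sketch under assumptions, not an existing Xilinx API: the names and the slot count are made up, and it relies on each index being written by only one processor. `volatile` only keeps the compiler from caching the values. Whether the hardware preserves the order of the two writes in `mbox_send` has to be checked for the actual memory and bus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MBOX_SLOTS 8u   /* assumed depth */

/* The free-running indices wrap at 2^32, so the depth must be a power of two. */
static_assert((MBOX_SLOTS & (MBOX_SLOTS - 1u)) == 0, "MBOX_SLOTS must be a power of two");

typedef struct {
    volatile uint32_t head;             /* written only by the consumer */
    volatile uint32_t tail;             /* written only by the producer */
    volatile uint32_t slot[MBOX_SLOTS];
} mailbox;

/* Producer side. Returns false if the mailbox is full. */
bool mbox_send(mailbox *mb, uint32_t msg) {
    uint32_t tail = mb->tail;
    if (tail - mb->head == MBOX_SLOTS) return false;
    mb->slot[tail % MBOX_SLOTS] = msg;
    mb->tail = tail + 1u;               /* publish after the payload is stored */
    return true;
}

/* Consumer side. Returns false if the mailbox is empty. */
bool mbox_recv(mailbox *mb, uint32_t *msg) {
    uint32_t head = mb->head;
    if (head == mb->tail) return false;
    *msg = mb->slot[head % MBOX_SLOTS];
    mb->head = head + 1u;               /* release the slot after reading it */
    return true;
}
```

A message could be as small as the index of a buffer that has just become ready, so the bulk data stays in the dual-port memory and only the notification travels through the mailbox.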
It seems that Xilinx does not offer many tools for multiprocessor design. In my design, processor 1 is almost a black box to me. I only read some statistics from the dual-port memory. To fine-tune the system I need accurate timing information, and the current software profiling is not accurate enough either.
A six-processor implementation (as below) would be interesting to try. But before that, I probably need to build some tools to ease design and profiling.
                               /----> dct1(2)  -> vlc1(4)  ---\
getMB(0) -> ColorConversion(1)                                 ----> Writeback(0)
                               \----> dct2(3)  -> vlc2(5)  ---/
Sunday, March 25, 2007
Sunday, March 4, 2007
Profiling code on one processor
Before starting on the multiprocessor design, I would like to do some general optimization and profiling of the current JPEG code running on one processor. It is better to do it now than after partitioning. Basically, I did the following:
1) Added a barrel shifter to the processor. The barrel shifter significantly improves performance and code size because there is a lot of shifting in the DCT, quantizer and color conversion. A shift by any number of positions now takes a single instruction, whereas without the barrel shifter it takes one instruction per bit. dct() dropped from 3.91 s to 2.26 s and zzq() from 3.31 s to 2.05 s.
2) Used 32-bit local variables as much as possible. A RISC processor is most efficient when dealing with 32-bit variables. With 16-bit and 8-bit variables, the compiler typically has to add extra shift or masking instructions to keep the value within its narrow range after each manipulation. This change reduced dct() from 6 s to 3.91 s.
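Both points can be illustrated with a small fixed-point luma computation. This is a sketch, not the original color conversion: the function names are made up and the weights 77/150/29 are a common Q8 approximation of the JFIF luma formula, used here as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Q8 weights for Y = 0.299 R + 0.587 G + 0.114 B (assumed approximation). */
#define W_R 77u
#define W_G 150u
#define W_B 29u
static_assert(W_R + W_G + W_B == 256u, "weights must sum to 1.0 in Q8");

/* Narrow accumulator: every update has to be truncated back to 16 bits,
 * which on a 32-bit core can cost extra instructions per statement. */
uint8_t luma_narrow(uint8_t r, uint8_t g, uint8_t b) {
    uint16_t acc = (uint16_t)(W_R * r);
    acc = (uint16_t)(acc + W_G * g);
    acc = (uint16_t)(acc + W_B * b);
    return (uint8_t)(acc >> 8);
}

/* Register-width accumulator: no intermediate truncation is needed, and
 * the final shift by 8 is a single instruction with a barrel shifter. */
uint8_t luma_wide(uint8_t r, uint8_t g, uint8_t b) {
    uint32_t acc = W_R * r + W_G * g + W_B * b;
    return (uint8_t)(acc >> 8);
}
```

Both functions return the same values. Whether the narrow version really costs more depends on the compiler and its flags, so the generated assembly is the thing to check.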
Besides that, I also did some work to simplify the code, which also improved the speed:
1) Simplified VLC and removed unnecessary functions.
2) Replaced the original DCT with the fast DCT algorithm from the well-known open-source Telenor H.263 codec.
3) Added a visual progress indication so that I can see the result of an optimization more easily.
4) Moved the code that writes the result to the CF card out of VLC. VLC now simply writes its output into a buffer in external memory, and the buffer is written to the CF card after encoding has finished. This way the time consumed by VLC can be determined more precisely.
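The idea in item 4 can be sketched as a bit writer that only touches a memory buffer, so that no storage access ends up inside the measured VLC time. The names are hypothetical and this is not the original encoder: JPEG byte stuffing after 0xFF is left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *buf;    /* output buffer (external memory in the text above) */
    size_t   cap;    /* capacity of buf in bytes                          */
    size_t   len;    /* complete bytes written so far                     */
    uint32_t acc;    /* bit accumulator; only the low nbits are pending   */
    int      nbits;  /* pending bits, always < 8 between calls            */
} bit_writer;

void bw_init(bit_writer *w, uint8_t *buf, size_t cap) {
    w->buf = buf; w->cap = cap; w->len = 0; w->acc = 0; w->nbits = 0;
}

/* Append the low n bits of code (1 <= n <= 16), most significant bit first.
 * Returns 0 on success, -1 if the buffer is full. */
int bw_put(bit_writer *w, uint32_t code, int n) {
    assert(n >= 1 && n <= 16);
    w->acc = (w->acc << n) | (code & ((1u << n) - 1u));
    w->nbits += n;
    while (w->nbits >= 8) {
        if (w->len == w->cap) return -1;
        w->nbits -= 8;
        w->buf[w->len++] = (uint8_t)(w->acc >> w->nbits);
    }
    return 0;
}

/* Pad the last partial byte with 1 bits. Returns 0 on success, -1 if full. */
int bw_finish(bit_writer *w) {
    if (w->nbits > 0) {
        int pad = 8 - w->nbits;
        return bw_put(w, (1u << pad) - 1u, pad);
    }
    return 0;
}
```

After `bw_finish`, the first `len` bytes of `buf` can be handed to the storage layer in one call outside the timed region, for example with `fwrite` on a PC.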
Xilinx provides a tool for profiling, mb-gprof. Its use is described in the Xilinx "Platform Studio User Guide", which has a detailed description of how to profile on an FPGA board. Basically we need to:
1) Add a profile timer core. You can use opb_timer_0. Set the correct address and other signals for the timer, and connect the timer interrupt to the processor.
2) Enable sw_intrusive_profiling and set profile_timer.
3) Rebuild the bitstream and software libraries.
4) Compile the code with -pg and download it.
If you just download it, the code with profiling does not always work like normal code. You need to use XMD. The steps are:
1) Start XMD.
2) 'connect mb mdm'.
3) 'profile -config sampling_freq_hz 10000 binsize 4 profile_mem 0x3e000000' to set the profiler parameters.
4) 'dow mb-bmp2jpg/executable.elf' to download the code.
5) 'bps exit' to set a breakpoint at exit.
6) 'con' to run.
7) When it stops at exit, use 'profile' to read the profiling data back from the target and save it as 'gmon.out' in the project directory.
8) Use 'mb-gprof mb-bmp2jpg/executable.elf gmon.out > profile.txt' to save the profiling result.
It is quite interesting to see the profiling result of every optimization I make, because it is quantitative feedback. However, the result might be inaccurate: the code with profiling runs noticeably slower than the code without it. Moreover, the profiler relies on opb_timer and external memory to read and store the timing information, and a software profiling tool may not be able to measure the delay on the OPB bus and external memory.
It is worth exploring the accuracy of the software profiler. For further work, though, I think I will probably need a better, hardware profiler. It needs to be precise and predictable, and it needs to support multiprocessor profiling. In the meantime, I can start to partition the code onto two processors and measure the improvement.
Some hints:
1) For disassembly I use 'mb-objdump -D -S *.elf'. Checking the generated assembly code is very important for optimizing.
2) The CF card driver from Xilinx does not seem stable. First, it does not recognize a CF card formatted by WinXP. The card must be formatted by a camera or under Linux (mkdosfs /dev/sdb1). Second, if a CF card read or write is interrupted (for instance, by starting XMD before reading finishes), I sometimes need to reformat the CF card. I think a network with TFTP is a good alternative.
3) It can be useful to measure the impact of using external memory as a buffer, and of the cache. Currently the accesses to external memory are not aligned to cache lines, I suppose.
4) I now use a 1600x1200 24-bit BMP file as the baseline. I also removed color subsampling because that code does not seem stable. This in fact doubles the computation required.