After I finished the six-processor implementation, I realized that I need a tool to generate the multiprocessor architecture, instead of designing it by hand, before I move on to more MicroBlaze processors. The architecture description file should be simple, straightforward and, as is my preference, similar to natural language. In the future, the description could be an SVG graph.
Now the first version of BlazeCluster is done. It's available for free at http://www.opencores.org/projects.cgi/web/mpdma/overview. The previous six-MicroBlaze architecture can be generated from a script as simple as this:
microblaze_m, microblaze, opb-master, 64k on-chip ram, jtag, opb-uartlite baudrate 9600 stdio, opb-cf-card readwrite
microblaze_cc, microblaze, 8k on-chip ram
microblaze_dct0, microblaze, 8k on-chip ram
microblaze_dct1, microblaze, 8k on-chip ram
microblaze_vlc0, microblaze, 8k on-chip ram
microblaze_vlc1, microblaze, 8k on-chip ram
dp_m_cc, dpram, 8k, left microblaze_m address 0x20000000, right microblaze_cc address 0x20000000
dp_cc_dct0, dpram, 8k, left microblaze_cc address 0x21000000, right microblaze_dct0 address 0x21000000
dp_dct0_vlc0, dpram, 8k, left microblaze_dct0 address 0x22000000, right microblaze_vlc0 address 0x22000000
dp_vlc0_m, dpram, 8k, left microblaze_vlc0 address 0x23000000, right microblaze_m address 0x23000000
dp_cc_dct1, dpram, 8k, left microblaze_cc address 0x24000000, right microblaze_dct1 address 0x24000000
dp_dct1_vlc1, dpram, 8k, left microblaze_dct1 address 0x25000000, right microblaze_vlc1 address 0x25000000
dp_vlc1_m, dpram, 8k, left microblaze_vlc1 address 0x26000000, right microblaze_m address 0x26000000
ddr, on-board ddr sdram, 256m address 0x30000000
The first six lines define the six processors and their parameters. The next seven lines describe the dual-port memories that act as interfaces between them. The last line defines the shared memory. Use BlazeCluster to generate the MHS, MSS and UCF files, copy them into an empty project, then synthesize and compile. Finally it works!
The code is all written in Perl. Basically it does some translation: it reads the description file, converts it into an internal data structure and then generates the MHS, MSS and UCF files. The code is straightforward and easy to understand. The first version supports only the Xilinx XUPV2P board and only MicroBlaze processors.
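The first half of that translation, from description file to internal data structure, can be illustrated with a minimal Python sketch. This is not BlazeCluster's code (which is Perl); the function names `parse_description` and `undefined_processors`, the dict layout and the consistency check are all assumptions made for illustration, and the sketch only interprets the name, the type and the left/right ports of dpram lines.

```python
import re

def parse_description(text):
    """Parse a BlazeCluster-style description into a list of component dicts.

    Hypothetical layout: {"name", "type", "params", "ports"}; the real tool's
    internal data structure is not described in this post.
    """
    components = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        comp = {"name": fields[0], "type": fields[1],
                "params": fields[2:], "ports": {}}
        if comp["type"] == "dpram":
            # e.g. "left microblaze_m address 0x20000000"
            for param in fields[2:]:
                m = re.fullmatch(
                    r"(left|right) (\S+) address (0x[0-9a-fA-F]+)", param)
                if m:
                    side, cpu, addr = m.groups()
                    comp["ports"][side] = (cpu, int(addr, 16))
        components.append(comp)
    return components

def undefined_processors(components):
    """Return dpram port targets that are not declared as microblaze components."""
    cpus = {c["name"] for c in components if c["type"] == "microblaze"}
    return sorted({cpu for c in components
                   for cpu, _ in c["ports"].values() if cpu not in cpus})

# A shortened excerpt of the script above.
SCRIPT = """\
microblaze_m, microblaze, opb-master, 64k on-chip ram, jtag
microblaze_cc, microblaze, 8k on-chip ram
dp_m_cc, dpram, 8k, left microblaze_m address 0x20000000, right microblaze_cc address 0x20000000
"""

parsed = parse_description(SCRIPT)
```

A check like `undefined_processors(parsed)` catches a misspelled processor name in a dpram line before any MHS file is generated, which is the kind of writing mistake that is hard to find on the board.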
Sunday, May 6, 2007
Saturday, April 14, 2007
Six-processor implementation
I continue to parallelize the system. The previous three-processor implementation created a three-stage pipeline. Another approach is data parallelism: I can use six processors to encode two pictures simultaneously. The topology is as follows.
                            /-> dct0 -> vlc0 -\
master -> color conversion -                    -> master
                            \-> dct1 -> vlc1 -/
Profiling of the one-processor implementation shows that dct and vlc take twice as long as color conversion. So it's logical for the color conversion of both channels to share one processor. The master processor reads data from external memory, sends it into the pipeline and writes the result back to external memory. I expect to achieve double the speed of the three-processor implementation.
The interesting thing is that it really does get double the speed :) according to the profiling result. It takes the same amount of time to compress two pictures as the three-processor implementation takes for one. Compared to the original one-processor design, it already achieves a 10X performance gain.
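The reasoning behind the expected 2x can be written down as a small steady-state model. This is a sketch under stated assumptions, not a measurement: the relative stage costs (dct and vlc each twice the cost of color conversion) are one reading of the profiling result, the function `block_interval` is hypothetical, and the master processor and all communication overhead are ignored.

```python
def block_interval(stage_times, channels=1, shared=("cc",)):
    """Steady-state time per macroblock for a pipeline with parallel lanes.

    Stages named in `shared` run on one processor that serves every lane;
    all other stages are replicated once per lane.
    """
    # Time for one "round" in which each lane advances by one macroblock:
    # a shared stage must process `channels` blocks in that round.
    round_time = max(
        t * channels if name in shared else t
        for name, t in stage_times.items()
    )
    return round_time / channels

# Assumed relative costs: dct and vlc each twice the cost of color conversion.
STAGES = {"cc": 1.0, "dct": 2.0, "vlc": 2.0}

three_proc = block_interval(STAGES, channels=1)   # cc -> dct -> vlc
six_proc = block_interval(STAGES, channels=2)     # one cc feeding two lanes
speedup = three_proc / six_proc
```

With these numbers the shared color conversion processor is exactly as busy as each dct and vlc processor, so every stage is balanced and the model gives a speedup of 2.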
The design process is not complex; it is almost a repeat of what I did in the three-processor implementation, apart from setting the channel number and double buffering. The only bug I hit was that I used ilmb instead of dlmb in the settings of processors 4 and 5. It took a while to find because the system is still difficult to debug.
The process of creating a multiprocessor on an FPGA can actually be automated by a simple script. From my design, it's clear that multiprocessor systems share a common structure. Implementations differ only slightly in the processors, interconnection and communication library. Meanwhile, as the number of processors grows, it's easy to make a simple typing mistake, just like I did, and difficult to find it. I think I should write a script to do that.
The multiprocessor approach looks quite convincing so far. To get higher performance, I think I should try a heterogeneous design instead of the current homogeneous implementation, because almost every processor is dedicated to one task. A simple accelerator could improve the performance a lot.
Meanwhile, it looks like I need some other components to support the multiprocessor, such as a DMA controller, especially one between external and internal memory. It may also be useful to design a hardware message interface. Currently messages are sent through dual-port memory as data blocks. That's neither efficient nor scalable.
Sunday, March 25, 2007
Running Concurrently on Two Processors
Since dct() takes most of the time in the one-processor implementation, it's logical to move it onto another processor as my first step toward a soft multiprocessor. That's similar to what I did before. I create one more MicroBlaze with its local bus, local memory and communication memory. The two MicroBlaze processors talk to each other via an on-chip 8 KB dual-port memory.
I choose dual-port memory because it's efficient for large-volume data communication. In fact, the color conversion function can write its output directly into dual-port memory and dct can read its input from there as well. The same goes for the dct output. No additional copy is involved. It's a stable design and works well.
The software design, however, is a bit tricky. At first, I implement the system in RPC mode. Basically MicroBlaze 0 writes the dct input into dual-port memory, waits for dct to finish and then continues with zzq and vlc encoding. Not surprisingly, the profiling result shows that this design is actually slower than the one-MicroBlaze implementation.
The reason is that the processors don't run concurrently in RPC mode. RPC only makes sense if one processor is much faster than the others at that 'procedure'. That's not my case.
To get a concurrent design, the software must be modified to run concurrently. I partition the main loop on processor 0 into two tasks: one for reading external memory and color conversion, the other for zzq, vlc and writing back to external memory.
The best way to run two tasks concurrently is an RTOS, but it's too complex to port one at this moment. I choose an easy way: task one runs in the main loop and always checks whether task two is ready. If it is, task two gets the CPU cycles.
That leads to another problem: more buffers are needed. There was only one buffer available for each task, but that's not enough. Suppose dct is slower than color conversion: processor 0 can't do anything after color conversion because task two can't run before dct is finished. In that case, processor 0 should continue to run task one, the color conversion for the next macroblock.
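The polling scheme described above can be sketched as a small simulation. It is a model, not the code running on the MicroBlaze: `run_processor0` is a hypothetical name, the dct processor is reduced to a fixed latency counted in loop iterations, and the number of blocks in flight is left unbounded for simplicity.

```python
from collections import deque

def run_processor0(num_blocks, dct_latency=2):
    """Simulate processor 0's polling loop; returns the order of executed tasks.

    `dct_latency` (in loop iterations) stands in for the other processor.
    """
    in_dct = deque()   # (block, iteration at which its dct result is ready)
    trace = []
    next_block = 0
    done = 0
    tick = 0
    while done < num_blocks:
        if in_dct and in_dct[0][1] <= tick:
            # Task two is ready: a dct result is waiting, so it gets the CPU.
            block, _ = in_dct.popleft()
            trace.append(("task2", block))       # zzq, vlc, write back
            done += 1
        elif next_block < num_blocks:
            # Otherwise keep running task one on the next macroblock.
            trace.append(("task1", next_block))  # read + color conversion
            in_dct.append((next_block, tick + dct_latency))
            next_block += 1
        else:
            trace.append(("idle", None))         # only waiting for dct remains
        tick += 1
    return trace

trace = run_processor0(3)
```

In the trace, task one runs for blocks 0 and 1 before the first dct result arrives, which is exactly the situation that requires more than one buffer per task.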
I design a linked buffer list to replace the static buffer. Every time color conversion starts a new macroblock, it allocates a new dynamic buffer. The buffer is freed by processor 1 after it finishes the dct on it. Processor 1 likewise allocates a new dynamic buffer when it starts the dct for a new macroblock, and that buffer is freed after processor 0 finishes vlc on it. All buffers are located in dual-port memory, so, as before, there is no additional copy.
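One common way to build such a list is a pool of fixed-size buffers chained into a free list. The sketch below models that idea in Python; it is an assumption about the general shape, not the actual implementation. The class name `BufferPool`, the 512-byte buffer size and the index-based links are all hypothetical, and the synchronization that two real processors sharing a dual-port memory would need is left out.

```python
class BufferPool:
    """Fixed-size buffers carved out of one memory region, linked in a free list."""

    def __init__(self, region_size=8 * 1024, buf_size=512):
        self.mem = bytearray(region_size)   # stands in for the dual-port RAM
        self.buf_size = buf_size
        count = region_size // buf_size
        # next_free[i] is the index of the next free buffer, or -1 at the end.
        self.next_free = [i + 1 for i in range(count)]
        self.next_free[-1] = -1
        self.head = 0

    def alloc(self):
        """Unlink and return the first free buffer index, or None if exhausted."""
        if self.head == -1:
            return None
        idx = self.head
        self.head = self.next_free[idx]
        return idx

    def free(self, idx):
        """Push a buffer back onto the free list (done by the consumer)."""
        self.next_free[idx] = self.head
        self.head = idx

    def view(self, idx):
        """Writable window onto the buffer, so data is produced in place."""
        start = idx * self.buf_size
        return memoryview(self.mem)[start:start + self.buf_size]
```

The producer calls `alloc()` and writes through `view()`, and the consumer reads the same bytes and then calls `free()`, so the data never has to be copied out of the shared region.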
After this work, something interesting happens. I can easily notice that it gets faster. The profiling result shows that the total time on processor 0 is 2.2s less than in the one-processor implementation. That's proof that they run concurrently! It's the first time I see a performance improvement from a soft multiprocessor, although I already knew it was possible.
Later I make two additional improvements. The first is to move zzq() onto processor 1 because the load on processor 1 is much lower than on processor 0. The second is to add one more MicroBlaze processor for zzq and vlc. Both give a noticeable improvement.
From this exercise, it's clear that the programming model and communication for a multiprocessor are quite different from those for a single processor. To implement more processors on a chip, an efficient and robust linked buffer list is essential. Fortunately it doesn't look too difficult at this moment.
The workload for buffer management and processor management can grow as we add more processors. PowerPC is better suited than MicroBlaze for this job. It would also be better to replace the CF card with a network connection.
(By the way, the Ethernet driver in EDK is not free. Xilinx says that EDK users can evaluate that IP for one year, but I can't find out how to activate the evaluation license.)
In the long run, we probably need both dual-port memory and a message-passing mechanism. Dual-port memory (or DMA) can be used for bulk data, while short messages are more flexible and cheaper.
It looks like Xilinx doesn't offer many tools for multiprocessor design. In my design, processor 1 is almost a black box to me; I only read some statistics from dual-port memory. To fine-tune the system, I need accurate timing information. The current software profiling is not accurate enough either.
A six-processor implementation (as below) could be interesting to try. But before that, I probably need to design some tools to ease design and profiling.
                               /----> dct1(2) -> vlc1(4) ---\
getMB(0) -> ColorConversion(1)                               ----> Writeback(0)
                               \----> dct2(3) -> vlc2(5) ---/
Sunday, March 4, 2007
Profiling code on one processor
Before starting on the multiprocessor, I would like to do some general optimization and profiling of the current JPEG code running on one processor. It's better to do it now than after partitioning. Basically I did the following:
1) Adding a barrel shifter to the processor. The barrel shifter significantly improves performance and code size because there is a lot of shifting in the DCT, quantizer and color conversion. Now it takes only one instruction to shift a register by any number of positions, whereas before it took several instructions, one per bit. dct() drops from 3.91s to 2.26s and zzq() from 3.31s to 2.05s.
2) Using 32-bit local variables as much as possible. A RISC processor is most efficient with 32-bit variables. In fact, on most RISC processors every manipulation of 16-bit and 8-bit variables costs extra instructions to shift the value left and right. This reduces dct() from 6s to 3.91s.
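To put the two steps above in proportion, the quoted timings can be turned into speedup factors. The snippet only does arithmetic on the numbers given in this post; the dictionary labels are mine.

```python
# Timings quoted in the two steps above, in seconds: (before, after).
measurements = {
    "dct: 32-bit locals": (6.0, 3.91),
    "dct: barrel shifter": (3.91, 2.26),
    "zzq: barrel shifter": (3.31, 2.05),
}

speedups = {name: before / after
            for name, (before, after) in measurements.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")

# The two dct() optimizations compound: 6 s down to 2.26 s overall.
dct_total = 6.0 / 2.26
print(f"dct overall: {dct_total:.2f}x")
```

The barrel shifter is worth roughly 1.7x on dct() and 1.6x on zzq(), and together with the 32-bit variables dct() ends up a bit more than 2.6 times faster.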
Besides that, I also did some work to simplify the code, which also improves the speed.
1) Simplify VLC and remove unnecessary functions.
2) Use the fast DCT algorithm from the well-known open-source Telenor H.263 codec to replace the original code.
3) Add a visual progress indication so that I can see the result of an optimization more easily.
4) Move the code that writes the result to the CF card out of VLC. VLC simply writes its result into a buffer in external memory, which is written to the CF card after encoding is finished. That way the time consumed by VLC can be determined more precisely.
Xilinx provides a tool for profiling, mb-gprof. Its use is described in the Xilinx "Platform Studio User Guide", which has a detailed description of how to profile on an FPGA board. Basically we need to:
1) Add a profile timer core. You can use opb_timer_0. Set the correct address and other signals for the timer, and connect the timer's interrupt to the processor.
2) Enable sw_intrusive_profiling and set profile_timer.
3) Rebuild the bitstream and software libraries.
4) Compile the code with -pg and download it.
If you just download it, the code with profiling doesn't always work like normal code. You need to use XMD. The steps are:
1) Start XMD.
2) 'connect mb mdm'.
3) 'profile -config sampling_freq_hz 10000 binsize 4 profile_mem 0x3e000000' to set profiler parameters.
4) 'dow mb-bmp2jpg/executable.elf' to download the code.
5) 'bps exit' to set a breakpoint at exit.
6) 'con' to run.
7) When it stops at exit, use 'profile' to read the data back from the target and save the profiling information to 'gmon.out' in the project directory.
8) Use 'mb-gprof mb-bmp2jpg/executable.elf gmon.out > profile.txt' to save the profiling result.
It's quite interesting to see the profiling result of every optimization I did because it's quantitative feedback. However, the result might be inaccurate, because I can easily notice that code with profiling runs much slower than code without it. Meanwhile, it relies on opb_timer and external memory to read and store timing information, but the software profiling tool may not be able to measure the delay on the OPB bus and external memory.
It's worth exploring the accuracy of the software profiler. On the other hand, for further work, I think I will probably need a better hardware profiler. It needs to be precise and predictable, and it needs to support multiprocessor profiling. In the meantime, I can start to partition the code across two processors and measure the improvement.
Some hints:
1) For disassembly I use mb-objdump -D -S *.elf. Checking the generated assembly code is very important for optimizing.
2) The CF card driver from Xilinx doesn't seem stable. First, it doesn't recognize a CF card formatted by WinXP; the card must be formatted by a camera or under Linux (mkdosfs /dev/sdb1). Second, if a CF card read or write is interrupted (for instance by starting XMD before reading finishes), I sometimes need to reformat the CF card. I think a network with tftp is a good alternative.
3) It can be useful to measure the impact of using external memory as a buffer and of the cache. Currently the accesses to external memory aren't aligned to cache lines, I suppose.
4) Now I use a 1600x1200 24-bit BMP file as the baseline. Meanwhile I removed color subsampling because that code doesn't seem stable. This in fact doubles the computation required.
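The doubling in hint 4 can be checked by counting 8x8 blocks. The sketch assumes the removed subsampling was 4:2:0 (both chroma components halved horizontally and vertically); the post does not name the scheme, so that is an assumption, and `blocks_8x8` is a hypothetical helper.

```python
def blocks_8x8(width, height, h_div=1, v_div=1):
    """Number of 8x8 blocks in one component, optionally subsampled.

    All divisions round up, so partial blocks at the edges are counted.
    """
    w = -(-width // h_div)
    h = -(-height // v_div)
    return (-(-w // 8)) * (-(-h // 8))

W, H = 1600, 1200
luma = blocks_8x8(W, H)
full_chroma = 2 * blocks_8x8(W, H)        # Cb and Cr at full resolution
sub_chroma = 2 * blocks_8x8(W, H, 2, 2)   # assumed 4:2:0 subsampling

no_subsampling = luma + full_chroma
with_subsampling = luma + sub_chroma
ratio = no_subsampling / with_subsampling
```

Under that assumption the baseline image has 90000 blocks to transform and encode without subsampling against 45000 with it, which matches the factor of two.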
Sunday, February 11, 2007
Game Starts!
During my master's study and thesis project, it became clear to me that an FPGA can achieve a few orders of magnitude higher performance if designed smartly, while keeping flexibility at the same time. A good approach is a soft multiprocessor on an FPGA.
My master's project is in that direction: implementing a four-processor system on an FPGA that compresses a BMP image on a CF card and writes it back afterwards. During the project, I realized that there is large potential in such a system and that my design is far from optimized.
Fortunately, my professor at TU/e, Eindhoven kindly lent me an FPGA board. Now I can continue to play with it. Let's start!