Since dct() takes most of the time in the one-processor implementation, it's logical to move it onto another processor as my first step toward a soft multiprocessor. That's similar to what I did before. I create one more MicroBlaze with its own local bus, local memory and communication memory. The two MicroBlaze processors talk to each other via an on-chip 8 KB dual-port memory.
I choose dual-port memory because it's efficient for large-volume data communication. In fact, the color conversion function can write its output directly into dual-port memory and dct can read its input from there as well. The same goes for the dct output, so no additional copy is involved. It's a stable design and works well.
The software design, however, is a bit tricky. At first, I implement the system in RPC mode. Basically, MicroBlaze 0 writes the dct input into dual-port memory, waits for dct to finish, and then continues with zzq and vlc encoding. Not surprisingly, the profiling result shows that this design is actually slower than the one-MicroBlaze implementation.
The reason is that the processors don't run concurrently in RPC mode: processor 0 simply idles while processor 1 works. RPC only makes sense if one processor is much faster than the others at that 'procedure'. That's not my case.
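A back-of-the-envelope model makes the slowdown concrete. The sketch below is not measured data: the cycle counts, the `handoff` cost and all function names are hypothetical, and it assumes dct takes the same time on either processor. It only shows that in RPC mode the stage times still add up (plus the hand-over cost), whereas with real concurrency the busier processor sets the pace.

```c
#include <assert.h>

/* Hypothetical per-macroblock costs in cycles; none of these are measured. */
typedef struct {
    unsigned color_conv; /* external memory read + color conversion */
    unsigned dct;        /* dct, assumed equally fast on either processor */
    unsigned zzq_vlc;    /* zzq + vlc + write-back */
    unsigned handoff;    /* cost of one hand-over through dual-port memory */
} stage_cost;

/* One processor does everything; no hand-over is needed. */
unsigned single_cycles(const stage_cost *c)
{
    assert(c != 0);
    return c->color_conv + c->dct + c->zzq_vlc;
}

/* RPC mode: processor 0 waits while processor 1 runs dct, so the
 * stage times still add up, plus one hand-over in each direction. */
unsigned rpc_cycles(const stage_cost *c)
{
    assert(c != 0);
    return c->color_conv + c->handoff + c->dct + c->handoff + c->zzq_vlc;
}

/* Concurrent mode in steady state: both processors work at the same
 * time, so the busier one determines the time per macroblock. */
unsigned concurrent_cycles(const stage_cost *c)
{
    assert(c != 0);
    unsigned p0 = c->color_conv + c->zzq_vlc + 2 * c->handoff;
    unsigned p1 = c->dct + 2 * c->handoff;
    return p0 > p1 ? p0 : p1;
}
```

With any positive `handoff`, `rpc_cycles()` is always larger than `single_cycles()`, which matches what the profiler showed.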
To get a concurrent design, the software must be modified to run concurrently. I partition the main loop on processor 0 into two tasks: one for external memory reading and color conversion, the other for zzq, vlc and writing back into external memory.
The best way to run two tasks concurrently is an RTOS, but porting one is too complex at this moment, so I choose an easy way. Task one lives in the main loop and always checks whether task two is ready. If it is, task two gets the CPU cycles.
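The polling scheme can be sketched roughly as follows. This is a host-side model, not the actual encoder code: the `loop_state` counters, `task_two_ready()` and `main_loop_step()` are hypothetical names, the real tasks are reduced to counter increments, and buffering is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical progress counters, all counted in macroblocks. */
typedef struct {
    int converted; /* color-converted by task one */
    int dct_done;  /* finished by the dct processor (would live in dual-port memory) */
    int encoded;   /* passed through zzq + vlc by task two */
} loop_state;

/* Task two has work whenever a dct result is waiting to be encoded. */
static bool task_two_ready(const loop_state *s)
{
    return s->encoded < s->dct_done;
}

/* One pass of the main loop on processor 0.
 * Returns true while macroblocks remain to be encoded. */
bool main_loop_step(loop_state *s, int total)
{
    assert(s != 0 && total >= 0);
    if (task_two_ready(s)) {
        s->encoded++;            /* task two gets the CPU: zzq, vlc, write-back */
    } else if (s->converted < total) {
        s->converted++;          /* task one: read memory, color conversion */
    }
    return s->encoded < total;
}
```

Each task runs to completion for one macroblock before the loop polls again, so there is no preemption and no need for per-task stacks, which is what makes this cheaper than porting an RTOS.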
That leads to another problem: more buffers are needed. There was only one buffer available for each task, but that's not enough. Suppose dct is slower than color conversion: processor 0 can't do anything after color conversion, because task two can't run before dct is done. In that case, processor 0 should continue running task one, that is, color conversion for the next macro block.
I design a linked buffer list to replace the static buffers. Every time color conversion starts a new macro block, it allocates a new dynamic buffer. That buffer is freed by processor 1 after it finishes the dct on it. Processor 1 likewise allocates a new dynamic buffer when it starts the dct for a new macro block, and that buffer is freed after processor 0 finishes vlc on it. All buffers are located in dual-port memory, so, as before, no additional copy is needed.
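A minimal sketch of such a buffer list is shown below. It is not the code from this project: `NUM_BUFS`, `MB_WORDS` and every identifier are hypothetical, and the buffer size is a placeholder. Buffers are linked by index rather than by pointer, so the list stays valid even if the two processors map the dual-port memory at different addresses. The sketch is single-threaded; on real hardware the free list would need some form of mutual exclusion, or one list per direction, which is left out here.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BUFS 4    /* hypothetical: buffers that fit in the shared memory */
#define MB_WORDS 64   /* hypothetical: 16-bit words per buffer */
#define NIL      (-1) /* end-of-list marker */

typedef struct {
    int16_t next;            /* index of the next free buffer, or NIL */
    int16_t data[MB_WORDS];  /* payload written by one stage, read by the next */
} mb_buf;

typedef struct {
    int16_t free_head;       /* index of the first free buffer, or NIL */
    mb_buf  bufs[NUM_BUFS];
} buf_pool;

/* Chain all buffers into the free list. */
void pool_init(buf_pool *p)
{
    for (int i = 0; i < NUM_BUFS; i++)
        p->bufs[i].next = (int16_t)((i + 1 < NUM_BUFS) ? i + 1 : NIL);
    p->free_head = 0;
}

/* Take a buffer off the free list; returns its index, or NIL if none is left. */
int pool_alloc(buf_pool *p)
{
    int i = p->free_head;
    if (i != NIL) {
        p->free_head = p->bufs[i].next;
        p->bufs[i].next = NIL;
    }
    return i;
}

/* Put a buffer back at the head of the free list. */
void pool_free(buf_pool *p, int i)
{
    assert(i >= 0 && i < NUM_BUFS);
    p->bufs[i].next = p->free_head;
    p->free_head = (int16_t)i;
}
```

When `pool_alloc()` returns `NIL`, the producer has run as far ahead as the memory allows and has to wait for the consumer to free a buffer.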
After this work, something interesting happens. I can easily notice that it gets faster. The profiling result shows that the total time on processor 0 is 2.2 s less than in the one-processor implementation. That's proof that the processors run concurrently! It's the first time I have seen a performance improvement from a soft multiprocessor, although I already knew it was possible.
Later I make two additional improvements. The first is to move zzq() onto processor 1, because the load on processor 1 is much lower than on processor 0. The second is to add one more MicroBlaze processor for zzq and vlc. Both give a noticeable improvement.
From this exercise, it's clear that the programming model and communication for a multiprocessor are quite different from those for a single processor. To implement more processors on a chip, an efficient and robust linked buffer list is essential. Fortunately, it doesn't look too difficult at this moment.
The workload for buffer management and processor management can grow if we put more processors on the chip. PowerPC is better than MicroBlaze for this job. It would also be better to replace the CF card with a network connection.
(By the way, the Ethernet driver in EDK is not free. Xilinx said that EDK users can evaluate that IP for one year, but I can't find how to activate the evaluation license.)
In the long run, we probably need both dual-port memory and a message-passing mechanism. Dual-port memory (or DMA) can be used for bulk data, while short messages are more flexible and cheaper.
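One way the two mechanisms could fit together is to keep the bulk data in dual-port memory and pass only a small message that names the buffer. The sketch below is an assumption about how that might look, not something implemented in this design: the `msg` layout, `RING_SLOTS`, the `MSG_BUF_READY` type and the ring functions are all hypothetical. It is a single-producer, single-consumer ring, so only one side writes `head` and only the other writes `tail`; any memory-ordering requirements of real hardware are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u            /* hypothetical; must be a power of two */

enum { MSG_BUF_READY = 1 };      /* hypothetical message type */

typedef struct {
    uint16_t type;               /* what happened, e.g. MSG_BUF_READY */
    uint16_t buf_index;          /* which shared buffer holds the bulk data */
} msg;

typedef struct {
    volatile uint32_t head;      /* written only by the sender */
    volatile uint32_t tail;      /* written only by the receiver */
    msg slots[RING_SLOTS];
} msg_ring;

/* Sender side: returns false if the ring is full. */
bool ring_send(msg_ring *r, msg m)
{
    assert(r != 0);
    if (r->head - r->tail == RING_SLOTS)
        return false;
    r->slots[r->head % RING_SLOTS] = m;
    r->head = r->head + 1;       /* publish after the slot is written */
    return true;
}

/* Receiver side: returns false if no message is waiting. */
bool ring_recv(msg_ring *r, msg *out)
{
    assert(r != 0 && out != 0);
    if (r->head == r->tail)
        return false;
    *out = r->slots[r->tail % RING_SLOTS];
    r->tail = r->tail + 1;       /* release the slot back to the sender */
    return true;
}
```

A message here is only four bytes, so sending one costs far less than moving a macro block, while the macro block itself never leaves the shared memory.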
It looks like Xilinx doesn't offer many tools for multiprocessor design. In my design, processor 1 is almost a black box to me; I can only read some statistics from dual-port memory. To fine-tune the system, I need accurate timing information. The current software profiling is not accurate enough either.
A six-processor implementation (as below) could be interesting to try. But before that, I probably need to design some tools to ease design and profiling.
                               /----> dct1(2) -> vlc1(4) ---\
getMB(0) -> ColorConversion(1)                               ----> Writeback(0)
                               \----> dct2(3) -> vlc2(5) ---/
Sunday, March 25, 2007