Saturday, April 14, 2007

Six-processor implementation

I continue to parallelize the system. In the previous three-processor implementation, I created a three-stage pipeline. Another approach is data parallelism: I can use six processors to encode two pictures simultaneously. The topology is as follows.
                            -> dct0 -> vlc0 -\
master -> color conversion /                  master
                           \                 /
                            -> dct1 -> vlc1 -
(The topology may be displayed incorrectly because Blogger removes the spaces in between. I don't know how to solve that yet, but I think you can understand it :)

Profiling the one-processor implementation shows that DCT and VLC take roughly twice as long as color conversion, so it's logical for the color conversion of both channels to share one processor. The master processor reads data from external memory, sends it out, and writes the results back to external memory. I expect to achieve double the speed of the three-processor implementation.
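The expectation of double speed can be checked with a rough back-of-the-envelope model. The sketch below is only an idealization: the relative stage costs (color conversion = 1, DCT = 2, VLC = 2) are assumed from the "roughly twice as long" observation, not measured values, and the master processor and all communication costs are ignored. The function names are made up for this illustration.

```python
# Idealized steady-state throughput model.
# ASSUMPTION: relative cost per picture of each stage (not measured numbers).
CC, DCT, VLC = 1.0, 2.0, 2.0


def three_processor_period():
    """Time per picture for the cc -> dct -> vlc pipeline."""
    # A pipeline delivers one picture per period of its slowest stage.
    return max(CC, DCT, VLC)


def six_processor_period():
    """Time per picture when one cc processor feeds two dct/vlc channels."""
    # For each pair of pictures the shared cc processor does 2 * CC of work,
    # while every dct and vlc processor handles just one picture of the pair.
    pair_period = max(2 * CC, DCT, VLC)
    return pair_period / 2


speedup = three_processor_period() / six_processor_period()
print(f"six-processor vs three-processor speedup: {speedup}")
```

Under these assumed costs every processor is busy for the same amount of time per pair of pictures, which is why sharing the color conversion processor does not create a new bottleneck.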

The interesting thing is that it really does double the speed :) according to the profiling results. Compressing two pictures takes the same amount of time that the three-processor implementation needs for one. Compared to the original one-processor design, it already achieves a 10X performance gain.

The design process is not complex; it is almost a repeat of what I did in the three-processor implementation, except for setting the channel numbers and the double buffering. The only bug I hit was that I used ilmb instead of dlmb in the settings of processors 4 and 5. It took a while to find because debugging is still difficult.

The process of creating a multiprocessor system on an FPGA can actually be automated with a simple script. From my design, it's clear that multiprocessor systems share a common structure; there are only small differences between implementations in the processors, interconnect, and communication library. Meanwhile, as the number of processors grows, it's easy to make a simple typing mistake, just like I did, and hard to find it. I think I should write a script to do that.
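A minimal sketch of what such a script might look like, in Python. Everything here is hypothetical: the output is a made-up text format and not the real syntax of any FPGA tool, the assignment of roles to processor indices is assumed, and the names `Processor`, `build_system`, `render` and `buses_are_unique` exist only for this illustration. The point is that bus names are derived from the processor index in one place, so an ilmb/dlmb mix-up cannot be typed by hand.

```python
from dataclasses import dataclass

# ASSUMPTION: which role runs on which processor index is invented here.
ROLES = ["master", "color_conversion", "dct0", "vlc0", "dct1", "vlc1"]


@dataclass(frozen=True)
class Processor:
    index: int
    role: str

    @property
    def ilmb(self):
        """Name of the instruction-side local memory bus."""
        return f"ilmb_{self.index}"

    @property
    def dlmb(self):
        """Name of the data-side local memory bus."""
        return f"dlmb_{self.index}"


def build_system(roles):
    """Create one processor description per role, numbered from 0."""
    return [Processor(i, role) for i, role in enumerate(roles)]


def render(proc):
    """Emit a block in a made-up format (NOT a real tool's file syntax)."""
    return "\n".join([
        f"processor cpu_{proc.index}  # role: {proc.role}",
        f"  instruction_bus = {proc.ilmb}",
        f"  data_bus = {proc.dlmb}",
    ])


def buses_are_unique(procs):
    """Return True if no bus name is used by more than one port."""
    names = [name for p in procs for name in (p.ilmb, p.dlmb)]
    return len(names) == len(set(names))


procs = build_system(ROLES)
assert buses_are_unique(procs)
print("\n\n".join(render(p) for p in procs))
```

Adding a third channel would then be a matter of appending two entries to `ROLES` rather than copying and editing another block by hand.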

The multiprocessor approach looks quite convincing so far. To get higher performance, I think I should try a heterogeneous design instead of the current homogeneous implementation, because almost every processor is dedicated to one task. A simple accelerator could improve the performance a lot.

Meanwhile, it looks like I need some other components to support the multiprocessor system, such as a DMA controller, especially one between external and internal memory. It may also be useful to design a hardware message interface. Currently messages are sent through dual-port memory as data blocks, which is neither efficient nor scalable.