Saturday, August 25, 2007

Mandelbrot set on Intel Core Duo with TBB

A few weeks ago I learned that Intel was going to open-source its TBB (Threading Building Blocks) library. My research was on soft multiprocessors on FPGAs, which is parallel computing as well, so I thought it would be very interesting to try the same thing with TBB on a dual-core PC and compare the results.

After a few days, it worked! What impressed me is how easy TBB is to use. Most of my time was actually spent porting the Mandelbrot application to Windows; it was ported from my FPGA code, which in turn was ported from QuickMAN by Paul Gentieu. Once it worked on one core, I read the TBB tutorial and applied parallel_for() to the loop. In less than thirty minutes, for the first time, I saw my laptop stay at 100% CPU usage for a while. The runtime was around 30% faster, and I was very happy with that.
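
Something along these lines (a minimal sketch with the 2007-era TBB API; the image size, iteration limit, viewing window, grain size and pixel buffer below are just placeholders, not the QuickMAN-derived code):

```cpp
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Placeholder image and iteration parameters -- not the QuickMAN values.
const int WIDTH = 512;
const int HEIGHT = 512;
const int MAX_ITER = 256;
static unsigned char pixels[HEIGHT][WIDTH];

// Escape-time iteration count for one point c = cr + ci*i.
static int mandel(double cr, double ci) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (zr * zr + zi * zi < 4.0 && n < MAX_ITER) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Body for the inner x loop: each task computes only a slice of one row,
// so each grain carries very little work.
class RowSlice {
    int y;
public:
    explicit RowSlice(int row) : y(row) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int x = r.begin(); x != r.end(); ++x) {
            double cr = -2.0 + 3.0 * x / WIDTH;    // assumed viewing window
            double ci = -1.5 + 3.0 * y / HEIGHT;
            pixels[y][x] = (unsigned char)mandel(cr, ci);
        }
    }
};

int main() {
    tbb::task_scheduler_init init;    // required by the 2007-era TBB API
    for (int y = 0; y < HEIGHT; ++y)  // the y loop is still serial here
        tbb::parallel_for(tbb::blocked_range<int>(0, WIDTH, 16), RowSlice(y));
    return 0;
}
```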

As I continued reading the tutorial, I realized that 30% is not enough. For a scalable application like Mandelbrot, the speed should roughly double on two cores. The reason was grain size: I had applied parallel_for() to the x loop, the inner loop, so each grain was too small and the parallelism overhead was large. I modified the code to apply parallel_for() to the y loop, the outer loop, and the result was exciting: the runtime was indeed half of that without TBB. Have a look at the results for details.
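
A sketch of the reworked loop, assuming the same placeholder constants, pixel buffer and mandel() as in the sketch above (the grain size of 16 rows is only illustrative):

```cpp
// Drop-in replacement for RowSlice and the loop in main() from the
// previous sketch; WIDTH, HEIGHT, MAX_ITER, pixels and mandel() are the
// same placeholders as before.
class RowRange {
public:
    void operator()(const tbb::blocked_range<int>& r) const {
        // Each task now computes whole rows, so every grain carries
        // WIDTH pixels' worth of work instead of a tiny slice of a row.
        for (int y = r.begin(); y != r.end(); ++y)
            for (int x = 0; x < WIDTH; ++x) {
                double cr = -2.0 + 3.0 * x / WIDTH;
                double ci = -1.5 + 3.0 * y / HEIGHT;
                pixels[y][x] = (unsigned char)mandel(cr, ci);
            }
    }
};

int main() {
    tbb::task_scheduler_init init;
    // One parallel_for over the whole image; 16 rows per grain is only
    // an illustrative value.
    tbb::parallel_for(tbb::blocked_range<int>(0, HEIGHT, 16), RowRange());
    return 0;
}
```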

As a curious engineer :), I wanted to explore what happens inside. Kevin Farnham's blog shows how to get some information about the dynamic mapping: range.begin() can be used as a unique ID for the sub-range handed to each thread, while range.end() may change on the fly as ranges are split. I traced the different threads and gave the pixels computed by each one a blue or red background color. It appears that the splitting algorithm inside TBB is smart. For instance, with an iteration range of 512 and the grain size set to 384, the chunks are still cut down to 256 at runtime.
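
A sketch of that tracing trick, again with the placeholders from above: each invocation of the body logs the sub-range it received (its begin() acting as the ID) and tints the background of those rows, so rows handled by different sub-ranges, and hence different threads, show up in different colors. The tint values and log format are made up for illustration, not Kevin Farnham's code:

```cpp
#include <cstdio>

// Variation of RowRange above that traces the dynamic mapping.
class TracedRowRange {
public:
    void operator()(const tbb::blocked_range<int>& r) const {
        // r.begin() uniquely identifies the sub-range this invocation
        // received; r.end() depends on how TBB split the full range.
        std::printf("sub-range [%d, %d)\n", r.begin(), r.end());
        unsigned char tint = ((r.begin() / 16) & 1) ? 0x40 : 0x80;  // "blue"/"red" stand-ins
        for (int y = r.begin(); y != r.end(); ++y)
            for (int x = 0; x < WIDTH; ++x) {
                int n = mandel(-2.0 + 3.0 * x / WIDTH, -1.5 + 3.0 * y / HEIGHT);
                // Interior points get the per-sub-range background tint,
                // so each thread's share of the image is visible.
                pixels[y][x] = (unsigned char)(n == MAX_ITER ? tint : n);
            }
    }
};
```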

However, this only shows how the work maps onto threads, not how the threads map onto cores. To look into that, you need the Intel Thread Profiler. Check out more at http://quickwayne.googlepages.com/mandelbrotsetonintelduocore

Sunday, August 12, 2007

Area Estimation and an Eight-MicroBlaze Implementation

Area estimation is helpful as we approach the capacity limit of the chip: we can know whether a design fits before running the time-consuming mapping process. In BlazeCluster, I can simply sum up every component's area to get an estimate.

The area of each component is supposed to be available in its documentation. However, it's not always there, and it's time-consuming to dig through the documents. A better way is to scan the synthesis reports, since the area information is always in them. A Perl script is perfect for that, and it works.
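
The real script was Perl; below is only a sketch of the idea in C++, with the report file names and the "Number of Slices" line format assumed from typical XST reports, not the actual BlazeCluster flow:

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Pull the slice count out of one synthesis report: look for a line of
// the form "Number of Slices: <n> out of ..." and read the first number.
static int slicesFromReport(const std::string& path) {
    const std::string label = "Number of Slices:";
    std::ifstream in(path.c_str());
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type pos = line.find(label);
        if (pos != std::string::npos) {
            std::istringstream rest(line.substr(pos + label.size()));
            int n = 0;
            rest >> n;
            return n;
        }
    }
    return 0;  // label not found in this report
}

int main() {
    // Hypothetical per-component report names; the real flow would list
    // every component in the system.
    std::vector<std::string> reports;
    reports.push_back("microblaze_0.syr");
    reports.push_back("fpu_0.syr");

    int total = 0;
    for (std::size_t i = 0; i < reports.size(); ++i)
        total += slicesFromReport(reports[i]);
    std::cout << "Estimated slices: " << total << std::endl;
    return 0;
}
```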

Guided by the estimation, I put eight MicroBlaze processors with FPUs, together with the PowerPC, onto one chip. It went ultra-fast and didn't take much time: the design is very scalable, so there was little to rework.

After a while, however, I was surprised to find a big difference between the estimate and the synthesis result, as large as 15%-20%. The reason is probably ISE optimization. The estimates for the multipliers and BRAMs are accurate, though.