Saturday, October 6, 2007

Performance Counter for Microblaze

To know how much time it spend exactly in calculation is not straightforward. In EDK, there is only software profiling which is not precise enough and probably doesn't work on multiprocessor without shared memory. We need something else.

Fortunately there are already examples. In most of 'big' processors, there are performance counters available which are basically small hardware counters running independent of software. Therefore it's more accurate. These counters can be configured to count lots of processor internal status, like cache hit, etc. It's common to use VTune or PAPI to read these counters on PC and then tune performance. So it's a good idea to make a performance counter for Microblaze if we need to tune it.

I choose FSL bus as the interface from counter to Microblaze. It's because read/start/stop counter op should be as light as possible. FSL is low-overhead and predictable, almost ideal candidate. The counter is written in VHDL, following template generated by EDK. It's available at http://www.opencores.org/projects.cgi/web/performance_counter/overview

There are three parts (process), bus interface, counter and overflow detector. Bus interface get command, like start, stop, reset, from processor and send counter value back. If there is an overflow, the overflow detector shall find. Every time processor read counter value, it should check if there is an overflow in between.

The performance counter is later integrated into BlazeCluster. If set
mb0, microblaze, 8k on-chip ram, barrel-shifter, fpu, perfcnt

a performance counter, FPU and barrel-shifter is instantiated and properly connected. perfcounter.c is software driver. Inside there are functions

1) reset_and_start_counter()
2) reset_and_stop_counter()
3) start_counter()
4) stop_counter()
5) read_counter()

The function names are self-explanatory. In read_counter() it reads from FSL twice and use the latter value. It's because there is one level FIFO in FSL bus. The first word read is stale value cached in FIFO while the second is the right one.

Source code

-----------------------------------------------------------------------------
-- perfcounter - entity/architecture pair
------------------------------------------------------------------------------

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-------------------------------------------------------------------------------------
--
--
-- Definition of Ports
-- FSL_Clk : Synchronous clock
-- FSL_Rst : System reset, should always come from FSL bus
-- FSL_S_Clk : Slave asynchronous clock
-- FSL_S_Read : Read signal, requiring next available input to be read
-- FSL_S_Data : Input data
-- FSL_S_CONTROL : Control Bit, indicating the input data are control word
-- FSL_S_Exists : Data Exist Bit, indicating data exist in the input FSL bus
-- FSL_M_Clk : Master asynchronous clock
-- FSL_M_Write : Write signal, enabling writing to output FSL bus
-- FSL_M_Data : Output data
-- FSL_M_Control : Control Bit, indicating the output data are contol word
-- FSL_M_Full : Full Bit, indicating output FSL bus is full
--
-------------------------------------------------------------------------------

entity perfcounter is
port
(
-- DO NOT EDIT BELOW THIS LINE ---------------------
-- Bus protocol ports, do not add or delete.
FSL_Clk : in std_logic;
FSL_Rst : in std_logic;
FSL_S_Clk : out std_logic;
FSL_S_Read : out std_logic;
FSL_S_Data : in std_logic_vector(0 to 31);
FSL_S_Control : in std_logic;
FSL_S_Exists : in std_logic;
FSL_M_Clk : out std_logic;
FSL_M_Write : out std_logic;
FSL_M_Data : out std_logic_vector(0 to 31);
FSL_M_Control : out std_logic;
FSL_M_Full : in std_logic
-- DO NOT EDIT ABOVE THIS LINE ---------------------
);

attribute SIGIS : string;
attribute SIGIS of FSL_Clk : signal is "Clk";
attribute SIGIS of FSL_S_Clk : signal is "Clk";
attribute SIGIS of FSL_M_Clk : signal is "Clk";

end perfcounter;

architecture EXAMPLE of perfcounter is

-- cmd - command to counters, b0 enable b1 rst
-- counter - performance counter

signal cmd : std_logic_vector(0 to 3);
signal counter : std_logic_vector(0 to 31);
signal overflow : std_logic;

begin

FSL_S_Read <= FSL_S_Exists when FSL_Rst = '0' else '0';
FSL_M_Write <= not FSL_M_Full when FSL_Rst = '0' else '0';

FSL_M_Data <= counter;
FSL_M_Control <= overflow;

INPUT : process(FSL_CLK) is
begin
if FSL_Clk'event and FSL_Clk = '1' then
if FSL_Rst = '1' then
cmd <= X"0";
else
if (FSL_S_Exists = '1') then
cmd <= FSL_S_Data(24 to 27);
end if;
end if;
end if;
end process INPUT;

CNT : process(FSL_CLK) is
begin
if FSL_Clk'event and FSL_Clk = '1' then
if FSL_S_Data(25) = '1' and FSL_S_Exists = '1' then
counter <= X"00000000";
overflow <= '0';
else
if (cmd(0) = '1') then
counter <= std_logic_vector(unsigned(counter) + 1);
end if;
end if;
end if;
end process CNT;

OV : process(counter(31)) is
begin
if counter(31)'event and counter(31) = '0' then
overflow <= '1';
end if;
end process OV;

end architecture EXAMPLE;

Saturday, August 25, 2007

Mandelbrot set on Intel Core Duo with TBB

A few weeks ago I got to know that Intel is going to open TBB library. My research was soft multiprocessor on FPGA which is parallel computing as well. I think it's very interesting to do it with TBB on dual core PC to compare the results.

After a few days, it works! What impress me is how easy to use TBB. Most of my time is actually spend on porting Mandelbrot application to Windows. It's ported from code on FPGA, which was ported from QuickMAN by Paul Gentieu. After it worked on one core, I read the TBB tutorial and tried to apply parallel_for() on the loop. Within less than thirty minutes, it's my first time to see my laptop kept on working on 100% CPU usage for a while. The runtime is around 30% faster and I am very happy with that.

When I continued reading tutorial, I realized that 30% is not enough. For a scalable application like Mandelbrot, it's supposed to just double the speed. The reason is grain size. I applied parallel_for() for x loop which is internal loop. Therefore the grain size is too small and parallelism overhead would be big. I modified to code to apply parallel_for() for the y loop, external loop. The result is then exciting. The runtime is indeed half of that without TBB. Have a look of result for detail.

As a curious engineer :), I would like to explore more inside. Kevin Farnham's blog shows how to get some information about dynamic mapping. range.begin() can be used as a unique ID for threads while range.end() might be modified on-the-fly. I traced different thread and applied blue or red color background color for different thread. It appears that the algorithm inside TBB is smart. For instance, assuming iteration range is 512, set to 384, grain size modified to 256 at runtime.

However the mapping is only for mapping on threads but not mapping on cores. To look into this, you need Thread Profiler. Check out more at http://quickwayne.googlepages.com/mandelbrotsetonintelduocore

Sunday, August 12, 2007

Area Estimation and Eight Microblaze Implementation

Area estimation is helpful as we are approaching the limit of the chip. We can know if the design fits the chip before the time-consuming mapping process. In BlazeCluster, I can just sum up every component's area to get estimation.

The area of components is supposed to be available in their respective documents. However it's not always there and it's time consuming to look into them. A better way can be, because the area information is in the synthesize report, I can simply scan them. Perl script is perfect to do that. And it works.

With the guidance of estimation, I put eight microblaze processors with FPU with PowerPC onto a chip. It's ultra-fast and doesn't take too much time. The design is very scalable. There is little to rework.

After a while, however, I am surprised to find that there is a big difference between estimation and synthesize result, as big as 15%-20%. The reason is probably due to ISE optimization. The estimation of multiplier and BRAM is accurate.

Sunday, June 10, 2007

Mandelbrot Set on FPGA

Mandelbrot set is a set of points in the complex plane that forms a fractal. It's popular outside of mathematics because of its aesthetic appeal. It's also a typical parallel computing application implemented on a variety of platforms. My professor suggested me to try it on soft multiprocessor on FPGA.

The reference code for algorithm is from sf.net. The most impressive design is QuickMAN by Paul Gentieu with core algorithm implemented in assembly code. It's incredibly fast. The disadvantage is that it works on Windows only and doesn't support multicore yet.

The reference code for FPGA is from Xilinx. First it's implemented on one-PowerPC architecture on FPGA. However, it's very slow because there is no FPU. Next, I added a Microblaze processor with FPU. It gets much faster. In the third step, I added four Microblaze processors as following shows.


There is still some area on the chip and I can add more processors to accelerate. I updated BlazeCluster to add PowerPC support and I can use it to add processors easily. All source code and additional document are available at http://quickwayne.googlepages.com/

This implementation, however, is still much slower than QuickMAN running on a normal PC. The major reason is probably: 1) low CPU speed. Currently all microblaze runs at 100Mhz; 2) Inefficient floating point libarary. I use the floating library from Xilinx. It's efficient in area but probably not best in performance; 3) slow external memory access. The reason I can't make it more detail is that I can't make profiling on PowerPC work. I did what's in the manual and got a 250MB profiling file which is wrong. Even it works, it's still difficult to look into it because software-intrusive profiling is not accurate enough. It's better to have some on-chip profiling counters as Pentium or Athlon has.

Below is some tricks of the code.

1) Source code on all microblaze processors are identical. The only difference is actually compile options. There is a macro for software to know on which processor it's running. The only difference in binary code is communication dual port memory base address. In other words, the macro decides the address during compiling.

2) I use single-precision floating point because the FPU on microblaze doesn't support double-precision. Meanwhile, in case of double precision, it doesn't use FPU at all. It just emulate it.

Sunday, May 6, 2007

BlazeCluster, to create cluster of microblaze

After I finished six-processor implementation, I realize that it's necessary to design a tool to generate multiprocessor architecture instead of manual design before I go to more microblaze processors. The architecture description file should be simple, straightforward and, as my preference, similar to natural language. And for future, the description can be an SVG graph.

Now the first version of BlazeCluster is done. It's available free at http://www.opencores.org/projects.cgi/web/mpdma/overview. The last six-microblaze architecture can be generated from such a simple script:

microblaze_m, microblaze, opb-master, 64k on-chip ram, jtag, opb-uartlite baudrate 9600 stdio, opb-cf-card readwrite
microblaze_cc, microblaze, 8k on-chip ram
microblaze_dct0, microblaze, 8k on-chip ram
microblaze_dct1, microblaze, 8k on-chip ram
microblaze_vlc0, microblaze, 8k on-chip ram
microblaze_vlc1, microblaze, 8k on-chip ram
dp_m_cc, dpram, 8k, left microblaze_m address 0x20000000, right microblaze_cc address 0x20000000
dp_cc_dct0, dpram, 8k, left microblaze_cc address 0x21000000, right microblaze_dct0 address 0x21000000
dp_dct0_vlc0, dpram, 8k, left microblaze_dct0 address 0x22000000, right microblaze_vlc0 address 0x22000000
dp_vlc0_m, dpram, 8k, left microblaze_vlc0 address 0x23000000, right microblaze_m address 0x23000000
dp_cc_dct1, dpram, 8k, left microblaze_cc address 0x24000000, right microblaze_dct1 address 0x24000000
dp_dct1_vlc1, dpram, 8k, left microblaze_dct1 address 0x25000000, right microblaze_vlc1 address 0x25000000
dp_vlc1_m, dpram, 8k, left microblaze_vlc1 address 0x26000000, right microblaze_m address 0x26000000
ddr, on-board ddr sdram, 256m address 0x30000000

The first six lines define six processors and their parameters. The next seven lines is about the dual port memories as interface between them. The last line define the shared memory. Use BlazeCluster to generate MHS, MSS and UCF file. Copy them into an empty project, synthesize and compile. Finally it works!

The code is all written in Perl. Basically it does some translations. First it reads the description file, convert it into an internal data structure and then generate MHS, MSS and UCF file. The code is straightforward and easy to understand. The first version supports only Xilinx XUPV2P board and microblaze only.




Saturday, April 14, 2007

Six-processor implementation

I continue to parallelize the system. In last three-processor implementation, three-stage pipeline is created. Another approach can be data parallelization. I can use six processors to encode two pictures simultaneously. Following is the topology.
-> dct0 -> vlc0 -\
master->color conversion / master
\ /
-> dct1 -> vlc1 -
(the topology may shown incorrectly because blogger removes space in between. I don't know how to solve that yet but I think you can understand it :)

From the profiling of one-processor implementation, it shows that dct and vlc takes two times more time than color conversion. So it's logical that color conversions for two channels share one processor. The master processor read data from external memory, shoot it and write the result back to external memory. I expect to achieve double speed than three processor implementation.

The interesting thing is that it really get double speed, :) according to profiling result. It takes the same amount time to compress two pictures. Compared to the original one-processor design, it already achieve 10X performance gain.

The design process is not complex, almost just repeat of what I did in three processor implementation except to set channel number and double buffering. The only bug I met is that I use ilmb instead of dlmb in the setting of processor 4 and 5. It takes a while to find it out as it's still difficult to debug.

The process the create multiprocessor on FPGA can actually be automated by some simple script. From my design, it's clear that there is a common structure for multiprocessor system. There is only a little difference between different implementations of processor, interconnection and communication library. Meanwhile, as processors get more and more, it's easy to make mistake simply in writing, just like what I did, and difficult to find it out. I think I should write a script to do that.

Multiprocess approach looks quite convincing until now. The get higher performance, I think I should try heterogenius design instead of current homogenius implementation. It's because almost every processor is dedicated to one task. A simple accelerator can improve the performance a lot.

Meanwhile it looks that I need some other components to facilitate multiprocessor, like DMA controller, especially DMA controller between external memory and internal memory. It may be also useful to design a hardware message interface. Currently message is send via dual port memory as data block. That's not efficient and scalable.

Sunday, March 25, 2007

Run Concurrently on Two-Processor

Since dct() takes most of time on one-processor implementation, it's logical to move it onto another processor as my first step to soft multiprocessor. That's similar to what I did before. I create one more microblaze, its local bus and local memory and communication memory. Two microblaze processors can talk to each other via on-chip 8Kbyte dual port memory.

I choose dual port memory is because it's efficient for large volume data communication. In fact, color conversion function can write its output directly into dual port memory and dct can get its input from there as well. The same to dct output. There is no additional copy involved. It's a stable design and works well.

The software design, however, is a bit tricky. At first, I implement the system in RPC mode. Basically microblaze 0 writes dct input into dual port memory, waits dct to finish and afterwards continues to zzq and vlc encoding. Not surprising, profiling result shows that this design is actually slower than one-microblaze implementation.

The reason is that processors doesn't run concurrently in RPC mode. It only make sense if one processor is much faster than others for that 'procedure'. That's not my case.

To get a concurrent design, software must be modified to run concurrently. I partition main loop on processor 0 into two tasks, one for external memory reading and color conversion while the other one for zzq, vlc and writing back into external memory.

The best way to run two tasks concurrently is RTOS. But it's too complex to port an RTOS at this moment. I choose an easy way. Task one is in main loop and it always check if task two is ready. If it's ready then task two get CPU cycles.

That results to another problem. More buffers are need. There was only one buffer available for each task but that's not enough. Suppose when dct is slower than color conversion, processor 0 can't do anything after color conversion because task2 can't run before dct is ready. In that case, processor 0 should continue to run task1, color conversion for next macro block.

I design a linked buffer list to replace static buffer. Every time color conversion starts a new macro block, it allocates a new dynamic buffer. The buffer is freed by processor 1 after it finishes dct conversion. Processor 1 allocates a new dynamic buffer when it starts a dct conversion for a new macro block as well. It's freed after processor 0 finishes vlc on it. All buffers are located on dual port memory so no additional copy as it was.

After these work, something interesting happens. I can easily notice that it gets faster. The profiling result shows that total time on processor 0 is 2.2s less than one-processor implementation. That's a proof that they run concurrently! It's first time that I see the performance improvement from soft multiprocessor although I already know it can.

Later I do two additional improvements. The first one is to move zzq() onto processor 1 because the load on processor 1 is much lower than processor 0. The second is to add one more microblaze processor for zzq and vlc. You can notice the improvement from both.


From this exercise, it's clear that the programming model and communication for multiprocessor is quite different to that for single-processor. To implement more processors on a chip, an efficient and robust linked buffer list is essential. Fortunately it looks not too difficult at this moment.

The work load for buffer management and processor management can get larger if we have more processors onto it. PowerPC is better than microblaze in term of this job. It also better to replace CF card to network.

(By the way, the ethernet driver in EDK is not free. Xilinx said that EDK users can evaluate that IP for one year but I can't find how to activate the evaluation license.)

In long run, probably we need both dual port memory and message passing mechanism. Dual port memory (or DMA) can be used for large data bulks while short message is more flexible and cheap.

It looks that Xilinx doesn't offer much tool for multiprocessor design. In my design, processor 1 is almost a black box to me. I only read some statistics from dual port memory. To fine tune the system, I need accurate timing information. The current software profiling is not accurate enough either.

A six-processor implementation (as below) can be interesting to try. But before that, I probably need to design some tools to ease design and profile.


/----> dct1(2) -> vlc1(4) ---\
getMB(0) -> ColorConversion(1) ----> Writeback(0)
\----> dct2(3) -> vlc2(5) ---/

Sunday, March 4, 2007

Profiling code on one processor

Before starting multiprocessor, I would like to do some general optimization and profiling for the current JPEG code running on one processor. It's better to do it now than after partitioning. Basically I did,

1) adding Barrel Shifter in processor. The barrel shifter can significantly improve the performance and code size because there are lots of shifting in DCT, quantizer and color conversion. Now it takes only one instruction to shift the register to any position while it takes several instructions for a shifting, one bit one instruction. For dct(), it reduce from 3.91s to 2.26s, for zzq() it reduced from 3.31s to 2.05s.

2) Using 32-bit local variables as much as possible. The RISC processor is most efficient to deal with 32-bit variables. In fact, most RISC processors shift left and right for every manipulation of 16-bit and 8-bit variables. It reduce dct() for 6s to 3.91s.

Besides that, I also did some work to simplify the code. That also improve the speed.

1) Simplify VLC and remove unnecessary functions.
2) Use Fast DCT algorithm from the famous open source Telenor H.263 codec to replace orignial code.
3) Add visual progress indication that I can see the result of optimization more easier.
4) Move the code to write result to CF card out of VLC. VLC simply write the result into a buffer in external memory and it's written to CF card after the encoding is finished. So the time consumption of VLC can be more precisely determined.

Xilinx provides tools for profiling, mb-gprof. The use of it can be found in Xilinx "Platform Studio User Guide". There is a detailed description about how to profile on FPGA board. Basically we need
1) add a profile timer core. You can use opb_timer_0. Set correct address and other signals for timer. Connect interrupt of timer to processor.
2) enable sw_intrusive_profiling and set profile_timer.
3) rebuild bitstream and software libraries.
4) compile code with -pg and download.

If you just download it, the code with profiling doesn't always work like normal code. You need to use XDM. The steps are:
1) start XDM.
2) 'connect mb mdm'.
3) 'profile -config sampling_freq_hz 10000 binsize 4 profile_mem 0x3e000000' to set profiler parameters.
4) 'dow mb-bmp2jpg/executable.elf' to download the code.
5) 'bps exit' to set a breakpoint at exit.
6) 'con' to run.
7) when it stops at exit, use 'profile' to read back from target and save profiling information in 'gmon.out' in project directory.
8) use 'mb-gprof mb-bmp2jpg/executable.elf gmon.out > profile.txt' to save the profiling result.

It's quite interesting to see the profiling result of every optimization I did because it's quantized feedback. However, the result might be inaccurate because I can easily notice that code with profiling runs much slower than code without profiling. Meanwhile, it relies on opb_timer and external memory to read the store timing information but software profiling tool may not be able to measusre the delay on OPB bus and external memory.

It's worthy to explore the accuracy of software profiler. On the other hand, for the further work, I think I would probably need a better hardware profiler. It need to be precise and predictable. Also it needs support multiprocessor profiling. But at the same time, I can start to partition the code on two processors and measure the improvement.

Some hints:
1) For disassemble I use mb-objdump -D -S *.elf. Checking assembly code generated is very important for optimizing.
2) CF card driver from Xilinx seems not stable. First it doesn't recognize CF card formatted by WinXP. It must be format by some camera or under Linux (mkdosfs /dev/sdb1). Second if a CF card reading or writing is interrupted (for instance start xdm before reading finishes), I sometimes need to format CF card. I think network with tftp is a good alternative.
3) It can be useful to measure the impact of external memory as buffer and cache. Currently the access to external memory doesn't align to cache line, I suppose.
4) Now I use a 1600x1200 24bit BMP file as baseline. Meanwhile I remove color subsampling becaue these code sounds not stable. This in fact doubles computation required.

Sunday, February 11, 2007

Game Starts!

During my master study and thesis project, it's getting clear to me that FPGA can achieve a few magnitudes higher performance if smart designed while keep flexibility at the same time. A good approach is soft multiprocessor on FPGA.

My master project is in that direction. It's to implement a four-processor system on FPGA to compress a BMP image on CF card and write it back afterwards. During the project, I realized that there is a large potential for such a system and my design is far away from the optimized.

Fortunately, my professor in TU/e, Eindhoven kindly lend an FPGA board to me. Now I can continue to play with it. Let's start!