The reference code for algorithm is from sf.net. The most impressive design is QuickMAN by Paul Gentieu with core algorithm implemented in assembly code. It's incredibly fast. The disadvantage is that it works on Windows only and doesn't support multicore yet.
The reference code for FPGA is from Xilinx. First it's implemented on one-PowerPC architecture on FPGA. However, it's very slow because there is no FPU. Next, I added a Microblaze processor with FPU. It gets much faster. In the third step, I added four Microblaze processors as following shows.
This implementation, however, is still much slower than QuickMAN running on a normal PC. The major reason is probably: 1) low CPU speed. Currently all microblaze runs at 100Mhz; 2) Inefficient floating point libarary. I use the floating library from Xilinx. It's efficient in area but probably not best in performance; 3) slow external memory access. The reason I can't make it more detail is that I can't make profiling on PowerPC work. I did what's in the manual and got a 250MB profiling file which is wrong. Even it works, it's still difficult to look into it because software-intrusive profiling is not accurate enough. It's better to have some on-chip profiling counters as Pentium or Athlon has.
Below is some tricks of the code.
1) Source code on all microblaze processors are identical. The only difference is actually compile options. There is a macro for software to know on which processor it's running. The only difference in binary code is communication dual port memory base address. In other words, the macro decides the address during compiling.
2) I use single-precision floating point because the FPU on microblaze doesn't support double-precision. Meanwhile, in case of double precision, it doesn't use FPU at all. It just emulate it.