CPU algorithms take time

If you are looking for raw computational speed, FPGAs are a well-known solution.

Unlike FPGAs, a CPU's speed is limited by the fact that it must execute its instructions one at a time, even when the algorithm could be parallelized. The CPU's general-purpose nature simply slows it down compared to what can be done within a dedicated electronic circuit.

While many of the more modern CPUs have eight or more separate processing cores within a single chip, this still doesn’t compare to the sheer parallelism offered within an FPGA.

FPGAs run algorithms in parallel

Not only can FPGAs run their algorithms in parallel, but they can also tailor their silicon to just your algorithm. Modern CPUs support divides, floating point operations, and SIMD instructions, many of which may have nothing to do with what you are trying to accomplish. The FPGA, on the other hand, makes the majority of its logic available for you to configure. So, just how many copies of your algorithm do you wish to configure onto your FPGA solution?
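
To make that trade concrete, here's a back-of-the-envelope sketch in C. Every number in it is a hypothetical placeholder, not real device data; the point is only the shape of the calculation:

    /* Back-of-the-envelope: how many parallel copies of an algorithm
     * fit on an FPGA?  All figures are hypothetical placeholders. */
    #include <stdio.h>

    int main(void) {
        const int fpga_luts     = 50000; /* logic budget of the part (hypothetical) */
        const int overhead_luts = 6000;  /* bus, control, I/O glue (hypothetical) */
        const int luts_per_copy = 4000;  /* cost of one algorithm instance (hypothetical) */

        printf("Roughly %d parallel copies fit\n",
            (fpga_luts - overhead_luts) / luts_per_copy);
        return 0;
    }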

For these same reasons, FPGAs can be faster than GPUs as well, provided the algorithm in question isn't the sort of video algorithm the GPU was optimized for.

While you might consider using an ASIC to run your algorithm even faster, FPGAs tend to be a lot cheaper to debug and manufacture than ASICs are. For example, a mistake on an FPGA usually costs only your time to fix, whereas a mistake within an ASIC design may cost you many millions of dollars.

For all these reasons, if your goal is speed, you may find yourself considering an FPGA as your solution. They are hands down the most economical means of running an algorithm faster than a CPU can.

Just don’t neglect the speed of your interface while you consider engineering an FPGA solution.

It’s the interfaces, Sir

I've noticed, as I've personally worked with FPGAs, that the FPGA fabric has rarely limited my algorithm's speed and performance needs. That's always been the easy part of the task.

Interfaces are an FPGA's Achilles' heel

Want to calculate a Haar wavelet transform on an image? FPGAs have enough logic within them to run the algorithm many times over! You can run the transform horizontally, vertically, no problem; the raw computation is easy for an FPGA.

That’s not the hard part.

The hard part is feeding the algorithm with data, and getting the results back out fast enough to be competitive with the alternatives.

Perhaps an example will help explain what I’m talking about.

Years ago, I had the opportunity to work on a really neat GPS processing algorithm. If you are familiar with GPS processing, you’ll know that the success of a GPS processing algorithm is based upon how many correlations you can do and how fast you can do them.

In my case, I was starting with a special algorithm that had already been demonstrated and proven in software on a general-purpose CPU. My problem was that the algorithm ran slower than pond water. My team needed speed. For other reasons, there was an FPGA connected to our CPU. So we asked ourselves, why not use the FPGA to speed up the processing?

Offloading an FFT process from a CPU onto an FPGA

Specifically, I was interfacing an ARM-based PXA271 XScale CPU, built by Intel at the time, with a Spartan 3 FPGA from Xilinx. The GPS algorithm's performance was dominated by the number of FFTs the CPU had to accomplish. Why not do those FFTs within the FPGA, and run them that much faster?

If you look at Xilinx's FFT IP core, you'll find a pipelined FFT that can process one sample per clock, just as it could when I was working on this problem. Hence, an N-point FFT costs N clocks to ingest, and after some (short) processing delay it takes another N clocks to get the data back out. How much faster could you get? The ARM CPU took many more clocks than that to process the FFT, so this should be much faster, right?
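
On paper, the clock-count arithmetic certainly supported that hope. Here's a quick sketch of it; the pipeline latency below is a made-up placeholder, not a datasheet figure:

    /* Clock-count sketch for a pipelined, one-sample-per-clock FFT.
     * The pipeline latency is a hypothetical placeholder. */
    #include <stdio.h>

    int main(void) {
        const int N = 1024;        /* FFT length in samples */
        const int latency = 3 * N; /* core processing delay (hypothetical) */

        /* N clocks in, the pipeline delay, then N clocks back out */
        printf("%d-point FFT: about %d clocks end to end\n",
            N, N + latency + N);
        return 0;
    }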

So I built it.

The ARM CPU and the FPGA were both attached to the same physical bus. That meant I could have the CPU send the FPGA the FFT data over the bus and then read the results back when finished.
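
In rough outline, the software side of such an offload looks something like the sketch below. The base address, register layout, and DONE bit here are hypothetical stand-ins, not the actual hardware map we used:

    /* Hypothetical memory-mapped FFT offload.  FFT_BASE, the register
     * offsets, and the DONE bit are illustrative stand-ins only. */
    #include <stdint.h>

    #define FFT_BASE   0x0C000000u  /* hypothetical FPGA bus address */
    #define FFT_DATA   ((volatile uint32_t *)(FFT_BASE))
    #define FFT_STATUS ((volatile uint32_t *)(FFT_BASE + 0x1000u))
    #define FFT_DONE   0x1u

    void fft_offload(const uint32_t *in, uint32_t *out, int n) {
        /* Each of these bus writes costs many CPU clocks.  As we'll
         * see, this loop, not the FFT core, dominated the runtime. */
        for (int i = 0; i < n; i++)
            FFT_DATA[i] = in[i];

        while (!(*FFT_STATUS & FFT_DONE))
            ; /* spin until the core reports completion */

        for (int i = 0; i < n; i++)
            out[i] = FFT_DATA[i];
    }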

With a little bit of debugging, I managed to get it all to work. It wasn't all that hard to build, technically, and speed was very important to our application. So, we ran the algorithm with a stopwatch, anxiously waiting to see how much faster it would run.

The interfaces are the bottleneck

Much to my displeasure, the new FPGA-enhanced FFT algorithm didn't run any faster. Indeed, as I recall, it even ran slower. I was shocked. This wasn't what I was expecting.

That's when I learned the painful lesson that an algorithm's speed depends upon the speed of the interface that feeds it. In our case, the interface was so slow that just transferring the data to the FPGA and reading the results back took more time than performing the FFT on the CPU in the first place.
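
A quick back-of-the-envelope calculation shows how lopsided this can get. The clock and bus rates below are illustrative placeholders, not measurements from our board:

    /* Transfer time vs. compute time, using hypothetical figures. */
    #include <stdio.h>

    int main(void) {
        const double N        = 1024;  /* FFT points */
        const double fpga_clk = 100e6; /* FPGA clock, Hz (hypothetical) */
        const double bus_rate = 2e6;   /* bus words/sec (hypothetical) */

        double fft_us = (2*N + 3*N) / fpga_clk * 1e6; /* in + latency + out */
        double bus_us = (2*N) / bus_rate * 1e6;       /* write N, read N back */

        printf("FFT core: %8.1f us\n", fft_us); /* ~51 us */
        printf("Bus xfer: %8.1f us\n", bus_us); /* ~1024 us: 20x slower */
        return 0;
    }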

Learn the interfaces!

This is one of the reasons why the study of FPGA design needs to include a study of interfaces and how to interact with them. Indeed, if you look at one of my favorite FPGA websites, fpga4fun.com, you’ll find a lot of discussions about how to build interfaces. They discuss serial ports, I2C, SPI, JTAG, simple video ports (play Pong!), HDMI, and more. All of these interfaces have their purpose, and the FPGA student is well served by studying how to interact with them.

Sadly, none of them would’ve been fast enough to rescue my FFT processing needs above. (Although, … using the DMA controller on the XScale CPU might’ve helped …)
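
If you're curious just how short those hobby-grade interfaces fall, here's a rough comparison. The workload (1024-point, 32-bit FFTs at a thousand per second) and the per-byte framing overheads are assumptions of mine, and the link rates are common ballpark figures, not guarantees for any particular device:

    /* Ballpark interface throughput vs. an assumed FFT workload. */
    #include <stdio.h>

    int main(void) {
        /* Assumed need: 1024 points x 4 bytes, in and out, 1000 times/sec */
        const double need = 1024.0 * 4 * 2 * 1000;

        struct { const char *name; double bytes_per_sec; } ifc[] = {
            { "UART @ 115200 baud", 115200.0 / 10 }, /* ~10 bits/byte framed */
            { "I2C fast mode",      400e3 / 9 },     /* ~9 clocks/byte */
            { "SPI @ 25 MHz",       25e6 / 8 },
        };

        printf("Needed: %.0f bytes/sec\n", need);
        for (int i = 0; i < 3; i++)
            printf("%-20s %9.0f bytes/sec %s\n", ifc[i].name,
                ifc[i].bytes_per_sec,
                ifc[i].bytes_per_sec >= need ? "(enough)" : "(too slow)");
        return 0;
    }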

So, for this reason, let me recommend that before you spend your whole dime making your FPGA run super fast, with multiple copies of your algorithm all running in parallel, you spend at least as much (or more) of that dime guaranteeing that the FPGA can read and write your data fast enough to keep that super-algorithm busy.