A Cheaper Fast FIR Filter
Since we last put a generic high speed FIR together, a friend showed me another FPGA structure which can be used to create a highspeed FIR filter.
As you may recall from our last discussion, a filter is typically represented by the logic illustrated in Fig 1.
In our last presentation, we noted that the accumulation step at the bottom of Fig 1 requires a number of clocks to be successful. Hence, every addition in that presentation required a clock. We then lined everything up by placing an extra delay between the input data stages.
Today, let’s take a look at a simpler structure for FIR evaluation. By simpler, I mean one that doesn’t require any of the doubledelay structures we used in our last filter implementation.
Simpler Structure
This new simpler structure works by taking the delay line from the input samples, and places it on the output accumulation line. This concept is notionally shown in Fig 2.
Unlike Fig 1, there is no input delay line. That delay line was moved to the
accumulation structure. Another difference is that the taps are in a
reverse order. Instead of running from h[0]
to h[N1]
in Fig 1, the
taps in Fig 1 run from h[N1]
to h[0]
.
The code for this simpler structure is almost identical to the structure we used before. Indeed, the two are so similar, we could almost present the code difference as a simple diff. We’ll choose to maintain a touch more context, though, and encourage you to try a diff between the two components, old and new, yourself.
The first difference is that this simpler
filter
needs the same sample
value sent to all of the multiplies at once. Hence, we’ll send
sample[0]
to all of the
tap structures.
We’ll depend upon the synthesis tool to
clean up the unused delayed sample output still found within the
tap structures.
This will generate some warnings, but should still work well.
The second difference is that the taps of the
prior generic FIR,
needed to be given to it in reverse order. For this
new filter,
the taps are fed into
the filter
from h[0]
to h[N_k1]
.
I’ve placed exclamation points inline following each of the changed lines in the code above, so you can see where the changes are.
You can find the code for this modified filter here next to the original filter
Cost Comparison
The cost of this filter is almost exactly one DSP per tap–plus one flipflop per stage per bit.
There are two aspects of this design keeping this from being one DSP per tap.
The first is that the tap component
module
separates the addition from the multiplication. This addition could have
been subsumed, together with the multiply, into the DSP to make a multiply
and accumulate structure. Instead, each of the (N1)
additions still require
(roughly) two
3LUTs per output
bit. The inputs to these
3LUTs are the
two inputs for the addition, plus the carry bit. The output of the
first LUT
is the addition output, and the second
LUT the carry output.
Using Xilinx
7Series
devices, this addition can be accomplished in one 6LUT
per bit per tap.
The second item keeping this from costing a single DSP per tap is the
flipflop
structure containing the tap values, h[k]
.
When compared to the generic filter we presented last time, the last filter included two flipflops per stage per bit in the input as well in addition to the rest of the logic that remains in the newer implementation. These extra flipflops are gone in this updated implementation, rendering the updated implementation cheaper than the last one.
You might still choose to use the previous implementation, though. In particular, the prior implementation doesn’t struggle with the fanout issue this newer implementation has. As a result, the previous implementation may be able to run at higher speeds.
Next Steps
As I mentioned in the last filtering post, there are many ways to implement filters within FPGA’s and there are many ways to simplify even this filter implementation. Should your application permit it, sharper filters (i.e. filters with more taps) may be possible. For example:

Many filters are symmetric (linearphase). A carefully built symmetric filter will drop the number of required hardware multiplies in half.

Halfband filters can be built using only every other tap, dropping the number of hardware multiplies required in half again.

Hilbert transforms, are another special type of filter. Both the halfband filter and the symmetric filter optimizations apply to Hilbert transforms, even though Hilbert transforms are neither halfband nor symmetric filters.

Filters combined with downsamplers can also be accomplished with fewer hardware multiplies as well.
Once we finish or CORDIC test bench, we’ll start building a generic filtering test bench, prove these generic filtering structures, and then come back and continue looking at some of the optimizations outlined above.
And a stranger will they not follow, but will flee from him: for they know not the voice of strangers. (John 10:5)