In our last presentation, we noted that the accumulation step at the bottom of Fig 1 requires a number of clocks to be successful. Hence, every addition in that presentation required a clock. We then lined everything up by placing an extra delay between the input data stages.
This new simpler structure works by taking the delay line from the input samples, and places it on the output accumulation line. This concept is notionally shown in Fig 2.
Unlike Fig 1, there is no input delay line. That delay line was moved to the
accumulation structure. Another difference is that the taps are in a
reverse order. Instead of running from
h[N-1] in Fig 1, the
taps in Fig 1 run from
The code for this simpler structure is almost identical to the structure we used before. Indeed, the two are so similar, we could almost present the code difference as a simple diff. We’ll choose to maintain a touch more context, though, and encourage you to try a diff between the two components, old and new, yourself.
The first difference is that this simpler
needs the same sample
value sent to all of the multiplies at once. Hence, we’ll send
sample to all of the
We’ll depend upon the synthesis tool to
clean up the unused delayed sample output still found within the
This will generate some warnings, but should still work well.
I’ve placed exclamation points in-line following each of the changed lines in the code above, so you can see where the changes are.
There are two aspects of this design keeping this from being one DSP per tap.
The first is that the tap component
separates the addition from the multiplication. This addition could have
been subsumed, together with the multiply, into the DSP to make a multiply
and accumulate structure. Instead, each of the
(N-1) additions still require
3-LUTs per output
bit. The inputs to these
3-LUTs are the
two inputs for the addition, plus the carry bit. The output of the
is the addition output, and the second
LUT the carry output.
devices, this addition can be accomplished in one 6-LUT
per bit per tap.
The second item keeping this from costing a single DSP per tap is the
structure containing the tap values,
When compared to the generic filter we presented last time, the last filter included two flip-flops per stage per bit in the input as well in addition to the rest of the logic that remains in the newer implementation. These extra flip-flops are gone in this updated implementation, rendering the updated implementation cheaper than the last one.
You might still choose to use the previous implementation, though. In particular, the prior implementation doesn’t struggle with the fan-out issue this newer implementation has. As a result, the previous implementation may be able to run at higher speeds.
As I mentioned in the last filtering post, there are many ways to implement filters within FPGAs and there are many ways to simplify even this filter implementation. Should your application permit it, sharper filters (i.e. filters with more taps) may be possible. For example:
Half-band filters can be built using only every other tap, dropping the number of hardware multiplies required in half again.
Hilbert transforms, are another special type of filter. Both the half-band filter and the symmetric filter optimizations apply to Hilbert transforms, even though Hilbert transforms are neither half-band nor symmetric filters.
Once we finish or CORDIC test bench, we’ll start building a generic filtering test bench to prove these generic filtering structures work, and then come back and continue looking at some of the optimizations outlined above.
And a stranger will they not follow, but will flee from him: for they know not the voice of strangers. (John 10:5)