A Cheaper Fast FIR Filter

Since we last put a generic high speed FIR together, a friend showed me another FPGA structure which can be used to create a high-speed FIR filter.

As you may recall from our last discussion, a filter is typically represented by the logic illustrated in Fig 1.

Fig 1: Traditional Filter Implementation

In our last presentation, we noted that the accumulation step at the bottom of Fig 1 requires a number of clocks to be successful. Hence, every addition in that presentation required a clock. We then lined everything up by placing an extra delay between the input data stages.

Today, let’s take a look at a simpler structure for FIR evaluation. By simpler, I mean one that doesn’t require any of the double-delay structures we used in our last filter implementation.

Simpler Structure

This new simpler structure works by taking the delay line from the input samples, and places it on the output accumulation line. This concept is notionally shown in Fig 2.

Fig 2: Simpler Filter Implementation

Unlike Fig 1, there is no input delay line. That delay line was moved to the accumulation structure. Another difference is that the taps are in a reverse order. Instead of running from h[0] to h[N-1] in Fig 1, the taps in Fig 1 run from h[N-1] to h[0].

The code for this simpler structure is almost identical to the structure we used before. Indeed, the two are so similar, we could almost present the code difference as a simple diff. We’ll choose to maintain a touch more context, though, and encourage you to try a diff between the two components, old and new, yourself.

The first difference is that this simpler filter needs the same sample value sent to all of the multiplies at once. Hence, we’ll send sample[0] to all of the tap structures. We’ll depend upon the synthesis tool to clean up the unused delayed sample output still found within the tap structures. This will generate some warnings, but should still work well.

The second difference is that the taps of the prior generic FIR, needed to be given to it in reverse order. For this new filter, the taps are fed into the filter from h[0] to h[N_k-1].

for(k=0; k<NTAPS; k=k+1)
begin: FILTER

	firtap #(.FIXED_TAPS(FIXED_TAPS),
			.IW(IW), .OW(OW), .TW(TW),
			.INITIAL_VALUE(0))
		tapk(i_clk, i_reset,
			// Tap update circuitry
			tap_wr, tap[k], tapout[k+1], // !!!!!
			// Sample delay line
			// Well let the optimizer trim away sample[k+1]
			i_ce, sample[0], sample[k+1], // !!!!!
			// The output accumulator
			result[k], result[k+1]);

	if (!FIXED_TAPS)
		assign	tap[k+1] = tapout[k+1]; // !!!!!

	// ...
end endgenerate

I’ve placed exclamation points in-line following each of the changed lines in the code above, so you can see where the changes are.

You can find the code for this modified filter here next to the original filter

Cost Comparison

The cost of this filter is almost exactly one DSP per tap–plus one flip-flop per stage per bit.

There are two aspects of this design keeping this from being one DSP per tap.

The first is that the tap component module separates the addition from the multiplication. This addition could have been subsumed, together with the multiply, into the DSP to make a multiply and accumulate structure. Instead, each of the (N-1) additions still require (roughly) two 3-LUTs per output bit. The inputs to these 3-LUTs are the two inputs for the addition, plus the carry bit. The output of the first LUT is the addition output, and the second LUT the carry output. Using Xilinx 7-Series devices, this addition can be accomplished in one 6-LUT per bit per tap.

The second item keeping this from costing a single DSP per tap is the flip-flop structure containing the tap values, h[k].

When compared to the generic filter we presented last time, the last filter included two flip-flops per stage per bit in the input as well in addition to the rest of the logic that remains in the newer implementation. These extra flip-flops are gone in this updated implementation, rendering the updated implementation cheaper than the last one.

You might still choose to use the previous implementation, though. In particular, the prior implementation doesn’t struggle with the fan-out issue this newer implementation has. As a result, the previous implementation may be able to run at higher speeds.

Next Steps

As I mentioned in the last filtering post, there are many ways to implement filters within FPGAs and there are many ways to simplify even this filter implementation. Should your application permit it, sharper filters (i.e. filters with more taps) may be possible. For example:

Many filters are symmetric (linear-phase). A carefully built symmetric filter will drop the number of required hardware multiplies in half.
Half-band filters can be built using only every other tap, dropping the number of hardware multiplies required in half again.
Hilbert transforms, are another special type of filter. Both the half-band filter and the symmetric filter optimizations apply to Hilbert transforms, even though Hilbert transforms are neither half-band nor symmetric filters.
Filters combined with down-samplers can also be accomplished with fewer hardware multiplies as well.

Once we finish or CORDIC test bench, we’ll start building a generic filtering test bench to prove these generic filtering structures work, and then come back and continue looking at some of the optimizations outlined above.