It’s been some time since we’ve discussed digital filtering on the ZipCPU blog. When we last left the topic, we had several filters left to present. Today, let’s pick up the symmetric or linear phase filter, and demonstrate a block RAM based implementation of it. I’ll call this a slow filter, similar to our last slow filter, simply because this filter won’t be able to handle a new sample every clock tick. Instead, what makes this filter special is that it only requires one dedicated hardware multiply. Better yet, as with the last slow filter implementation, making this filter adjustable won’t require a lot of resources. Unlike the last filter, though, this one will offer nearly twice the performance for nearly the same amount of resources.
But we’ll come back to this in a moment. In the meantime, let’s try to catch up some of our readers who may be starting in the middle of this discussion. Basically, we’ve been slowly working through hardware implementations of all of the basic FIR filtering types. We’ve already laid a lot of ground work, ground work you might wish to review should you find yourself coming in the middle of this discussion.
Our first post on the topic outlined what a digital filter was, and how a filter could be built that operated at the full clock rate of an FPGA. This initial post was quickly followed by an updated implementation that used fewer flip-flops while achieving the same performance.
This initial filtering article was followed by a proposed abstraction that could be applied to every filtering implementation. As long as any filter we built, therefore, had the I/O interface of this generic filter, we could re-use a test harness, to test multiple different filters.
We then demonstrated how these methods could be applied to test our generic FIR filter.
We then came back later to make the pairwise averager into a more generic moving (block) average filter.
The last post on this topic broke the mold of a filter that accepted one input and produced one output on every clock tick. Instead, that post presented a filter that would accept one input every
Nclocks, allowing the filter to accomplish all of its multiplies using a single hardware multiply alone.
Since this last post, I’ve focused on interpolation for a while, by first proving that interpolation is a convolution, and then showing how that knowledge could be used to create a very useful quadratic interpolation method with some amazing out-of-band performance.
In other words, it’s been some time since we’ve discussed
and there remain many types of
and filtering implementations for us still to discuss. For example, we
or half-band filters.
Neither have we discussed how to implement a
Hilbert transform. Another
fun filtering topic we could discuss would be how to string multiple
together to handle the case where there may be
N clocks between
samples, but the filter has more than
N coefficients. There’s also the
the one that
an incoming sample stream to
(just about) any arbitrary low-bandwidth for only about
24 taps–independent of the actual
Then there’s the adaptive filters that are commonly used in the equalizers
within digital communications receivers. Finally, there’s the grandaddy filter
of them all: the
When implemented properly, an
based filter is not only quite configurable, it’s also very easy to
Seems like I could keep discussing filtering for quite some time.
To start off the discussion, consider that I recently counseled someone who was studying aircraft engine noise and trying build a filter that would grab only 180Hz to 6300Hz of an audio signal sampled at 44.1kHz. He was disappointed that his Butterworth filter design wasn’t quite meeting his need. Given that his design only had a 3dB stopband rejection, I’m sure you can understand why not. But let’s just consider this specification for a moment.
Suppose you wanted to design a 180Hz to 6300Hz bandpass filter. How many taps would you need? Let’s say you wanted a 40Hz transition band. You would then need 4095 taps to achieve a 65dB stopband. If you used the generic filtering approach, you would need to find an FPGA with 4095 multiplies on board. While I don’t know about your budget, I certainly couldn’t afford an FPGA with that many multiplies. On the other hand, if you used the slow filter from our last discussion, you would only be able to implement a filter 2267 taps long. This would leave you with the choice of either loosening your transition bandwidth or your stopband criteria.
On the other hand, if you could exploit the symmetry inherent in most FIR filter designs, you’d have more than enough logic to implement your filter design on a cheap FPGA board using only one hardware multiply.
To design a filter, the engineer must first determine a range of frequencies he wants the filter to “pass” (i.e. the passband), and another range of frequencies he wants the filter to “stop”. These two ranges tend to be disjoint, with a “transition band” between them. Within the stopband, the engineer must determine the “depth”, also known as how much attenuation the filter is to provide in this region. This can also be used as a criteria on how much (or little) distortion is allowed within the passband).
All of these criteria are easily illustrated in Fig 1 below.
Given this criteria, the Parks and McClellan algorithm is well known for generating “optimal” filters. This filter design method produces filters that come closest to the filter design specification, as measured by the maximum deviation from the specification. There are two other realities with using this method. First, the filters designed by it all have an odd number of coefficients. Second, as designed they are all symmetric and non-causal.
These two characteristics follow the fact that the filter is specified with real, as opposed to complex or imaginary, criteria. That is, the desired filter’s frequency response is expressed as a real function of frequency.
This operation is called a
discrete convolution, and it
defines the operation of any
Fig 2 shows this operation pictorially, illustrating how the
can be placed into a tapped delay line, multiplied by their respective
coefficient, and then summed together to create the output
Further, if you apply this operation to a complex exponential, such as when
This complex multiplier is called the frequency response of the filter. It is defined by,
This is also the function we were specifying earlier in Fig 1 above. This should be familiar to you so far, as we have already discussed the importance of a filter’s frequency response.
Now consider what happens if we insist that this
frequency response is a
function of frequency, just like we specified it above. We’ll use
Hs to describe this constrained filter. If this filter has a
function must be equal to its conjugate. We’ll start our proof there.
We can then replace
Hs with its definition from above and shown in Fig 1,
and then work the conjugate through the summation.
Here, we note two things. First,
hs[k] is real and so
hs[k] is also equal
to its own conjugate. Second, if we reverse the summation on the right via a
p=-k, then we have
We’ll switch back to a summation over
k for convenience of notation, although
k value is just a dummy variable bearing no reference to the
k three equations above. (Yes, I did get marked off for doing this by
my instructor years ago.)
Now reflect on the fact that this relationship must be true for all
f. That can only happen if
Of course, this isn’t very useful to us, since any
having non-zero values of
non-causal and as such
cannot be implemented: it depends upon the knowledge of future values of
x[n]. The easy way to deal with this problem is to take the
filter we just designed, and shift it so that it’s first non-zero coefficient
Hence, we now have a filter with
N non-zero coefficients, and where
This does nothing more than delay the operation of our filter in time–something the designer may not care about.
The difference between the frequency response of the two filters, the one symmetric about zero and the offset but now causal filter, is a phase term that is linear in frequency. For this reason, filters of this type are often called linear phase filters.
then we can rewrite our convolution so it uses fewer multiplies.
This is a big deal! It’s a big deal because we now have half as many
multiplies as we had before! We can even scale our
such that the middle point of
h[M], has some sort of “simple”
(2^(K-1))-1 for a
K bit word size, for one fewer multiply.
Put together, we’ve just about doubled the capability of any
we might wish to implement for only a minimal cost in additional logic.
You can see how this affects our operation in Fig 3. Notice how the tapped delay line containing the incoming signal is now folded in half. Further, before the multiply, there’s an addition where we add the pairs of signal data points together before multiplying by the common filter coefficient. You might also notice the single value on the right. This represents the coefficient in the middle–the one we’ll multiply by a constant value.
Now compare Fig 3 to Fig 2 and count the multiplies. See the difference?
N multiplies, there are now
(N-1)/2, or nearly half as many
as we had before.
Now let’s take a look at how we might optimize our filter’s implementation to take advantage of this property.
Although this is an FPGA focused blog, sometimes it helps to consider an algorithm in a higher level language to understand it. So let’s spend a moment and review the C++ code describing our last implementation, and then show how this would change to exploit the symmetry inherent in most FIR filters.
Then the filtering algorithm stepped through each filter coefficient, multiplying it by a corresponding data sample.
Notice that the two indices go in opposite directions. This is an important feature of any filtering, implementation where the filter might not be symmetric. For symmetric filters, the direction the coefficients are read doesn’t really matter all that much.
The algorithm was made just a touch confusing by the fact that the data is kept in a circular buffer. As a result, the relevant data might cross the boundary of the circular buffer, from the right half to the left. If you split the loop into two parts, you can avoid checking for the buffer split within the loop itself.
Finally, the algorithm ended by returning the accumulated sum of products.
That’s the basic filtering algorithm that you can apply to implement any FIR filter. (Remember, for FIR filters longer than about 64-taps or so, FFT methods are faster/better/cheaper in software than the direct form presented above.)
We’d start out exactly as before, by adding the new data sample to our input sample memory.
However, that’s about as far as we can go in common with the previous algorithm.
The next step is to calculate two pointers into the data–something we didn’t
need to do before. The first,
will be a pointer to the most recent data that we just added into our buffer,
while the second,
dpold, will point to the oldest data in our buffer. Since
the buffer has only as many samples within it as we have coefficients in our
we only need to check for wrap around once.
We’ll also set a pointer to our coefficient memory.
At each point through the summation, we’ll read two values from our
data memory, one older
*dpold and one newer
*dpnew. We’ll then add these
two values together, and multiply the result by the
So far this is all very straight forward.
Where this C++ version gets difficult is when we try to handle pointer wrapping in our circular buffer. Unlike before, when we only had one pointer to check, we now have two pointers that might wrap as we work our way through memory. Rather than work through the math of separating this loop into three parts, I’ll just add a wrap check inside the loop.
This loop will end before we get to the multiply in the middle of the filter. So, let’s handle it now and then return our result.
For those who might ask, yes, I do like redundant parentheses. That’s really beside the point, however.
The point here is that for an
N is odd, we’ve just calculated the result using only
1+(N-1)/2 multiplies, instead of the
N multiplies it would have cost us
This is the basic algorithm we’ll code up in Verilog in the next section. However, in Verilog we’ll use a memory in a circular addressing configuration. That way we don’t have to worry (much) about wrapping the pointers around. We will, however, need to worry about timing and pipeline scheduling.
This diagram shows data coming in from the left, and going through two tapped delay lines. A selector then walks through each of the samples in the tapped delay line picking a data value. At the same time, a separate selector picks a value from the filter’s impulse response coefficients. These two values are shown multiplied together, accumulated, and then output.
To build this symmetric filter, we’re going to first break the tapped delay line structure into three parts, as shown in Fig 3.
The parts on the left and right will be implemented by block RAMs, and data will appear to “move” through the RAMs by just adjusting the indices used. The data point in the middle is point of symmetry for the filter. This point will not reside within either of the block RAMs. Instead, we’ll make this the last item read from the left block RAM, and place this item into the second block RAM on any new sample.
When a sample is provided, all the data will shift right by one. That will be our first step.
Using this index we’ll write the new sample to the left data memory.
We’ll read the mid-point sample from the last value read from this memory.
Finally, we’ll write that mid-point sample to the right half of memory.
Notice that we are using the same write index for both halves of memory. We’ll have to deal with this a bit in our next step, since we are writing to the newest memory on the left, and half way through the memory on the right.
That will handle data movement, what about reading the data?
To read the data, we’ll set two data pointers–one to each block RAM–when any new sample comes in. These will initially point to the extreme locations in the memory, both the oldest and the most recent. These pointers will then walk, in unison, towards the center data point, as shown in Fig 5 above.
The next steps with the data are fairly straight forward. There’s not much magic in them. First we’ll read the data,
then add the two values together,
then multiply them by the filter coefficient, filter coefficient,
and add the result to an accumulator.
You may notice that we didn’t clear the accumulator,
acc. We’ll have to
come back to that. You may also notice we skipped some steps along the way,
although this is the basic algorithm. So, let’s go back a bit.
We’re going to need to read the filter coefficient before we use it. This will involve resetting the index to the beginning of the coefficient memory,
and then reading the next filter coefficient on every clock subsequent clock until we’re done.
At this point, our algorithm roughly looks like Fig 6 below.
We read from each data block in opposite directions and added the two values together. We also read from our coefficient memory, and then multiplied the filter coefficient by the data sum. Finally, we accumulated the products together to create an output.
We’re still missing a couple items, though. For example, we need to
multiply our mid-point by some value. Let’s
fix that value to be
2^(M-1)-1, also known as the maximum positive integer
that can be represented in
M signed bits. This works because this sample
value is usually (always?) the largest value in the
At least, that’s the basic idea of how we want the filter to work. Sadly, this simplistic approach to the algorithm is going to give us some headaches when we actually attempt to implement it in the next section. Why? Although we have a conceptual idea of what we wish to accomplish, the devil in this case lies in the details of how we handle the pipeline scheduling.
When I first set out to implement this filter, I thought I might just quickly modify the generic slow filter I had presented earlier. This is basically what I just presented above in the last section. I’m mean, all that’s required is two memory reads, a sum and then the filter is identical to what it was before. It’s that simple, right?
No. It isn’t.
As a measure of difficulty, consider this: I’ve gotten to the point where implementing a basic DSP component like this has become fairly easy and routine. I therefore gave myself two days to do the task, and even then I didn’t work on it full time for both days. Much to my surprise, I almost didn’t finish the task within the two days I’d given to myself.
It wasn’t that the filter was really all that hard to implement, but rather the problem was scheduling the pipeline. To understand this, let’s work from the fixed points in the schedule.
The first fixed points are the memory reads. In order to make certain the design tools place the data and filter coefficients into block RAM, they can only be accessed in the simplest of ways. Specifically, each RAM must takes one clock cycle to read where you do nothing else with the value read. Yet moving the time the data were valid from the clock in the slow filter, to the next (i.e. following the summation) made for all kinds of havoc within my design.
In a humble admission, I’ll admit that I almost pulled out the formal methods to formally verify the design after I struggled so hard getting this to work within the test bench. It seems I’ve gotten hooked on how easy it is to get a design to work using formal methods. (Yes, it is possible to formally verify a digital filter using formal methods. The problem is that the multiply makes the application of formal methods difficult. I’ll leave this concept for another day, though, specifically for some time after I discuss the concept of abstraction in formal verification.)
In the end, in order to get the following design to work I had to work through the design and its test bench one clock at a time, verifying every value along the way until all of a sudden the design passed. While I was successful, I do have to admit that my success took more work than I expected. Shall we just say my performance at the task just wasn’t as graceful as I might have liked?
The key to the success of this design lies in the pipeline schedule. So, as I was building this algorithm, I scribbled out the pipeline schedule diagram shown in Fig 7.
If you remember from our prior discussions of these charts, values are
shown in the column in which they are valid. Hence
i_ce is true on the
first clock of our cycle. At that same time, the sample value,
the data write index (
dwidx), mid-point sample from the last time through
mid_sample), and the final data value from the left block RAM (
are also valid. On the next clock to the right, the data values may now
be read from memory, hence
dmem1[dwidx] is valid and so forth. On this
clock also I’ll raise an
m_ce signaling flag so I know where I am in
this portion of the pipeline.
I’ve taken the time to show particular anchor points to this diagram in
red. These are the points we worked out above in the last section. They are
immovable in the pipeline. Hence, following the
m_ce (first memory read
valid) cycle there’s a first data valid cycle where
d_ce is high. This
will be the first of many such cycles. This is followed by a summation
s_ce, when the first data sum is true. This is then followed
by a product cycle,
p_ce, when the first data product is true and so on.
The items in black are the ones I really struggled to get right. In
first filter coefficient
value doesn’t have to be valid until
That means its index must be valid the clock prior and not the clock
prior to that. This one little difference marked the root of my problems.
But let’s get to these difficulties in order.
The next order of business is to move our data pipeline. This happens any
time a new sample is sent to
as indicated by
i_ce being high.
As part of this new sample routine, we’ll increment a data write index on every new sample.
There is no reset on this index, despite the initial value. This lack of a
to be cleared and flushed by holding the
i_ce lines high, while also forcing
i_sample to be zero for as
many clock cycles as
We’ll also write our new sample to the first memory bank,
our midpoint sample to the second memory bank,
and we’ll set our midpoint sample to be the last sample read from the first
memory bank, or
dleft as I’m going to call it with reference to either
of Figs 5 or 6 above.
mid_sample doesn’t need to be constrained to ultra-simple memory
logic, it can be sensitive to the reset. This will halve the number of
clock cycles it will take to clear
during a reset, since these zeros will then feed the second half of memory.
The next signal,
pre_acc_ce, is the one we are going to use to control
whether or not to accumulate a new product into our accumulator. This is
the control value that will also keep us from accumulating past the last
valid coefficient in
For this reason, we start out by counting how many coefficients are remaining.
We’ll also calculate a flag to tell us whether or not we are at the last data index.
Both of these values are referenced to the pipeline cycle where the index
is valid. What that means, though, is that this measurement needs to be
propagated to through the pipeline. This is the purpose of
the “pre” calculation of whether the accumulator should be updated.
On any new data,
pre_acc_ce gets set to
1'b1. This starts the beginning
of the accumulator pipeline.
It then stays at one until the index for the last coefficient.
Here we get to one of the surprises of the algorithm. Because the coefficient
index and the data index are off by one cycle, this
isn’t a sufficient indicator of when to turn the
pre_acc_ce value off.
Instead, we’ll turn it off as long as we aren’t starting around again on
the next incoming sample.
pre_acc_ce value now needs to be propagated through our pipeline.
We’ll use a simple shift register structure for this purpose.
We can now interpret
pre_acc_ce as follows. If
pre_acc_ce is true,
the data index is valid. If
pre_acc_ce is true, the data values are
pre_acc_ce is true, the sum is valid and if
is true then the product is valid and may be accumulated.
You’ll see how this works more in a moment.
We’ll next focus on the data read indices. These will continue to follow the
left and right convention from Figs 5 and 6 above. Hence we’ll have a left
lidx, and a right index
ridx. On any reset, these are set to zero,
although this is really more for form than anything else.
Then, on any new sample, the left data index,
lidx is set to point to the
newest value just written to memory. The right half data index is a touch
more complicated–and took me many trials to get right. This index is
given by back tracking from the oldest value by half our
filter length. This
would’ve been the entire size of the second memory, save that we are forcing the
memory to have a power of two size, while the
NTAPS can be any arbitrary
As a last step, we’ll step the indices as long as this isn’t the last data index. The left index steps backwards from the newest value to older values, while the right index steps forwards from the oldest index to the newer index.
Keep in mind, the
last_data_index criteria here is key. In particular,
we are about to read the left data value,
dleft, from the left memory
based upon this index. This will turn into the mid point value on the
next sample. For this reason, the indices cannot be allowed to just free
The second remarkable piece of this logic is the
m_ce criteria. Our
last_data_index flag depends upon the cofficient index, not the data index.
This coefficient index isn’t (yet) valid on the
m_ce cycle. Hence, we need
to continue to walk the index forward even before the first tap index is
valid. (Yes, this was another one of those details that surprised me as
I built this.)
A very similar piece of logic is used to determine the filter coefficient index.
While it might look like we might just let this value run as well, keep in mind that we are using the filter coefficient index as an indicator of when to stop accumulating. For this reason, this index also stops at the end of the coefficient memory.
That’s the data and coefficient index calculation. Let’s move on to the
memory read cycle(s). We’ll mark this cycle with an
m_ce flag, as shown
in our pipeline schedule in Fig 7 above, indicating that this is the first
cycle the memory data is valid.
On this cycle we’ll read from the two block RAMs of data.
Following the outline in Fig 7, the next pipeline cycle and flag is the
data valid cycle,
d_ce. This follows the
flag, and like it this is only valid for one clock cycle.
Now we can read the coefficient index,
and add the data values together.
We’ll use the
s_ce signal to mark that the first data summation is valid.
The data sum times the filter coefficient is our product value. If this looks familiar, it should–this matches the slow filter code, and was drawn from it. This should also matches our discusion in the last section.
midprod value is different.
midprod is the product of the mid-point
mid_sample times the maximum positive integer value that fits
TW bits, or
That then leads us right into the accumulator. There are three parts to this. First, we’ll set it to zero on any reset.
Second, on the
s_ce cycle (see Fig 7 above), we’ll initialize this
accumulator to be our midpoint product.
Third, on every subsequent sample with a valid product, we’ll add the product to our accumulator.
Finally, on the same sample that we initialize our accumulator with the mid-point product, we’ll set our result to be the last value that had been in the accumulator.
In a similar fashion, we’ll set our output clock enable strobe, so that the rest of the signal processing pipeline can continue to follow the global CE strategy.
That’s all it takes to build a
N times slower than the system clock.
Perhaps the biggest lesson learned to draw from this discussion is how
dependent I was on the
to get particular values right.
For example, did you notice the strange logic on
pre_acc_ce? Or how about
the fact that the
last_data_index compared against two instead of zero or
one? Or perhaps that the
taps_left calculation didn’t depend upon the
total number of taps, but rather
N, the number of
coefficients, must be an odd number.) All of these values were determined
with the help of running many simulations.
Test Bench Performance
When it comes to building the test bench for this system, there wasn’t a lot of work to be done–thanks to the work we did building a generic filtering test bench some time earlier. Even better, if you meld this test bench driver against the test bench driver for the slow filter,
you’ll see only a very small, limited number of (key) differences. While most
of these are concerned with adjusting
length related parameters, there is one particular difference worth commenting
on. That difference is in the
As you may recall from our discussion of the
generic filtering test
method allows us to load coefficients into the
and then test the
of the filter to insure those coefficients were loaded correctly. In our
case, this also gives us a chance to verify that the filter’s response is
as we had intended.
If the coefficient comes before the midpoint, it should match the data
given to this function and passed to
Likewise if this is the midpoint, then the coefficient should be equal to
the fixed value,
TW remains the maximum number of
bits in any coefficient.
On the other side of the midpoint, the coefficients should read back in reverse order. In other words, here is where we check the symmetry.
Finally, anything beyond the number of taps in the filter should return zero.
Once this is accomplished, we can now use our test bench structure to measure the filter’s performance against the lowpass filter configuration we’ve been testing against. The result is a stopband rejection roughly between -55 and -54 dB, which is just what we were expecting.
That’s not that bad, especially if you consider that we only used half as many clock cycles as the last slow filter we studied.
This filter implementation marks one of those times where a little bit of engineering up front will spare you a lot of design resources later. That’s why we added a touch more mathematics above to what we had done before. Sure, our implementation is a touch more complicated than the last slow filter. But look at what we gained: twice the performance! That’s what makes this filtering implementation valuable. How applicable is it? This approach applies to all filters designed via the optimal Parks and McClellan filter design algorithm. In other words, this is almost a universally applicable optimization among FIR filter implementations.
As I mentioned above, we have yet to discuss many other types of filtering implementations. Perhaps this simplest filter to build next would be a half band filter. For certain filters, a half band filter implementation can be built for half again as many resources as we just used.
We’ll have to come back again on another day, though, in order to demonstrate and discuss some of these other filters.
There is one glory of the sun, and another glory of the moon, and another glory of the stars: for one star differeth from another star in glory. (1Cor 15:41)