When we first examined filtering, we looked at the two simplest digital filters I knew of: a filter that averages adjacent values, and a filter that recursively averages multiple numbers together. These two simple filters required only a few FPGA resources, and so they were easy to implement. Sadly, they weren’t very configurable and so their filtering capability was quite limited.
We then came back to the topic and discussed how to create a generic Finite Impulse Response (FIR) filter. Such a generic FIR filter is very configurable. Using the approach I presented, you can describe any FIR digital filter in logic. We then came back to the topic a bit later, and discussed how to create a filter that required fewer flip-flops, while still maintaining the complete configurability of any generic FIR filter.
The problem with both of these generic approaches is their cost. As with most digital filters, this cost is often measured in the number of multiplies (i.e. the number of taps). While the number of logic blocks and flip-flops is also of interest, even this logic scales with the number of filter taps required.
As an example of this problem, suppose you wanted to select an FM broadcast signal (200 kHz) from somewhere within the FM broadcast band (87-108MHz RF in the US). Now suppose you also wanted to do all of this processing within your FPGA, and that you could afford a $99 Arty (plus an FM antenna, pre-amp, ADC, etc.). The Arty contains an Artix-7/35T Xilinx FPGA with 90 DSP slices. That means you can implement 90 multiplies within your logic on any given clock tick. This would allow you to create a generic FIR filter with 89 taps. (Most FIR filters have an odd number of taps.) With an 89 tap generic filter, you’d only be able to get about a 6dB separation between your channel and any other FM channel, and even the rest of your ADCs passband. Such performance is pitiful. It’s a far cry from the 70dB that I was taught to design to. Indeed, it would take an FIR filter of roughly ten thousand taps to provide a required 70dB separation.
Not only does the Artix 7/35T ($35) on the $99 Arty not have this many taps, none of the Xilinx 7-series parts has enough multiplies (DSP blocks) to implement a filter this large. The closest is the heftiest Virtex-7, which has 2,820 DSP elements. I couldn’t find this on Digikey today, though. The closest Virtex 7 that I can find today on Digikey is the Virtex 7/2000T with 2,160 DSP blocks for $18,000. In other words, money won’t buy you out of this problem.
On the other hand, if you could filter your signal down to less than 2MHz, without using any multiplies, you can then save your multiplies for a later step when you could share a single multiply between multiple taps. Indeed, if you could do that then you might be able to select an FM channel for the cost of only a pair of multiplies–sparing your other 88 multiplies for some other purpose.
This is where the
comes into play.
A moving average
requires no multiplies, only two additions, two incrementing pointers, and
some block RAM. Although the
-13 dB stopband, applying the
in a cascaded fashion
N times would give you a
-13 * N dB
Six rounds of
such a filter
may well be sufficient, especially when each
round uses only a
minimum amount of
So, let’s take a look at what it takes to implement a moving average filter. (I’ll call it a boxcar filter, based upon the fact that the impulse response of this filter is a boxcar function.) We’ll start by examining how to build a moving average filter in general, and then discuss an initial (broken) implementation of such a filter. We’ll then simplify the basic idea a bit more, and show an example of how a “boxcar filter” might be implemented. We’ll then round out the discussion with a discussion on performance, explaining what sort of response you might expect from this filter.
Since what we’re doing might not look so clear when we dig into the code
pause for a moment first to discuss what we are intending to do.
Our goal is to create a
that adds an adjacent
N samples together. As time progresses, the values
that will get averaged together will also rotate through our window as well.
This is why the operation is called a
“moving average”: because
the choice of which samples get averaged together moves with time.
Let’s back up a small step first, though. If you recall from before, a generic FIR filter has the form shown in Fig 1. In that figure, you can see how each incoming input sample goes into a delay line (at the top) and, at each stage of the delay line, gets multiplied by a constant. (The constant isn’t shown.) All of the multiplication products are then added together to form the output.
Formally, we might write the operation of this filter as,
The moving average
filter fits this form as well,
with the unique feature that all the
h[k] are all ones. This means that the multiplies are all
by one, and so they they can be removed from the
Hence, you might draw this simplified filter as shown in Fig 2, without the multiplies.
Formally, we might write this as,
With this simplification, the filter has turned into nothing more than a summation of a set of input values made on every clock tick.
Even in this form, we’re not ready to implement this
Instead, consider what would happen if we subtracted two of these
values from each other. You can see this conceptually in Fig 3.
Did you catch that? The difference between two adjacent outputs of our
y[n]-y[n-1] is given by the difference between two input samples,
Mathematically, we could write this as,
This subtle change in formula, from the generic formula for a convolution to the one above, will spare us a lot of logic–allowing us to use a running accumulator and a block RAM instead of hardware multiplies and a lot of flip-flops.
Further, despite this formula referencing
y[n-1], it is not an
Careful design will keep it from becoming one.
The Basic Equations
A very quick, though incomplete, draft of this algorithm might look like:
Walking through this implementation, our first step was to write this
new sample to memory, and update our write pointer. At the same time, we
read the value out from
N samples ago, and updated our read pointer.
We then subtracted the value we read from memory,
memval also known as
x[n-N], from the new value,
x[n], we just received. The
result of that subtraction was then added into our accumulator,
like we discussed in the last section. The final step was to
the output to the desired number of bits and we were done.
You’ll find by the end of this article that our finished algorithm, is not going to be all that much more complex than the algorithm above, although you may also find that it doesn’t look much like the algorithm above.
The first problem with our first attempt is making certain that the
has a proper initial value. One mistake in the accumulator, one mistake
y[n] wrong, and
might be turned into an
IIR filter with
an unwanted DC offset. To keep that from happening, let’s create a reset
capability so that we can guarantee the filter starts in a known configuration.
Specifically, at the time of reset, the accumulator must be set to zero,
as with any intermediate calculation values, such as
The other thing we’ll need to pay attention to is the memory. One difficulty when using block RAM, as with all memories I know of, is that there’s no circuitry for initializing all of our memory at once. Instead, we’ll “pretend” the memory is zero for a number of clocks following a reset, and then we’ll use the memory as normal.
Hence, we’ll need to modify our code such as,
This creates a new value,
full, which we can use to determine whether the
memory value is valid or not.
These are the big broad-brush differences between the quick draft above, and what we’re about to present below. Our next step, then, will be to build our final algorithm.
The Actual Algorithm
As a step number one, we’ll make this
as generic as we can. To do that, we’ll parameterize our input width,
IW, and output width,
OW. Further, we’ll parameterize the number of
averages allowable, which we shall controll by the log (based two) of the
maximum number of averages,
LGMEM. This will allow us to average by
any amount between
Hence, if you want to average fifty-five 16-bit values together, you’d set
to 16 and
LGMEM to 6. If you want an output without any
then the output width,
needs to be set to
the input width plus the log of the number of averages. In the example of
averaging fifty-five 16-bit items together, this means we’d need an output
OW=22. Any fewer output bits than that will require
the internal result to the desired number of output bits.
We’ll also allow the number of averages to be configurable as well–or not,
FIXED_NAVG parameter is set. If
FIXED_NAVG is set, then the number
of averages will be fixed, and set by an
FIXED_NAVG is not set, we’ll allow the user to set the number of averages
they want. However, because of the dependence of the feedback relationship,
y[n-1], we’ll insist that the number of averages must not change
except on an
As we’ll see later, the only part of our algorithm that depends upon this number of averages is the initial/reset value of the memory read address. For this reason, we won’t store the value in a register. We’ll come back to this later when we discuss the read address.
That brings us to the logic required for accessing memory–both writing and then reading. Of these two, the write address is simple: we’ll start writing to the first address of memory (address zero) on our first data sample, and then rotate through memory locations from there.
What that means, though, is that we have to implement the time difference
between the initial value and the value
N samples ago using the read
memory address. We’ll do so by initializing the read address to the negative
number of averages.
As with the write address, we increment the read address on every sample clock.
Following the address calculation, we’ll write out incoming sample to memory, and read our delayed sample from memory as well.
So far, this is all straightforward. Other than initial values, we haven’t
really deviated from our initial draft above. However, things get a little
trickier when adding the input sample to this logic. In particular, if we
want our input value to be aligned with the output of the memory read,
x[n-N], then we’ll need to delay the input by one sample.
It’s not really that obvious why this would be so. Why does the input need to be delayed by a sample? The answer has to do with pipeline scheduling. So, let’s look at how the internal values within our algorithm get set on subsequent clocks, as shown in Fig 4.
This figure shows a list of all of our internal registers, binned within the
clocks they are set within–starting on the clock before a particular sample,
i_sample[t], is provided,
t-1 until four clocks later at
t+4. Hence, since
memval depends upon
shows in the clock following the one when
rdaddr gets set. Likewise, since
the memory is set following the write address being set,
While our presentation through the
is going to be in chronological order, from time
t+4, the chart
above was built/scheduled backwards.
o_result is the
acc. It took
one clock to calculate.
acc is the sum of the last accumulator and the
result of the subtraction,
sub. Further, we know that we want to subtract
our new input value from the last memory value,
So why do we have the register
preval in this pipeline chart? It doesn’t
seem to do anything, so why is it there?
The answer is simple: we needed to delay the input by one clock in order to get it to line up with the memory that was just written and then read.
For example, let’s suppose we only wished to average one
element–a pass-through filter. Hence, we’d want to add
our new value,
x[n], and subtract the prior value,
x[n-1], delayed by only
one clock. Given that’s what we want to do, let’s follow that new value through
this pipeline schedule in Fig 5 below.
i_sample[t-1] shows up at time
t-1, and gets written into memory,
mem[wraddr], at time
t. It can then be read at time
memval, and then subtracted
from the new value at
t+2 to create
sub. This time,
t+2, is the earliest
the last value can be read back from memory. This is also the time when we
need our new value,
x[n]–one clock after it shows up. To get the new sample
value from when it is given to us into this clock period, we need to delay it
by a single cycle, placing it into
preval for that purpose.
The logic necessary to do this is trivial–unlike the reasoning behind it above.
The other thing we’re going to need to know, at the same time we want to
know the value we just read from memory, is whether we’ve written enough
times to the memory for the values read out of the memory to be valid. I’ve
chosen to call this
full, to indicate that the tapped-delay line memory has
Since we initialized the write address at zero, and then wrote to the zero address in our first clock, we know that the value read from memory will be valid as soon as we read from the zero address.
We’ve now set all of the values shown in the time
t+1 column from Figs 4 and 5
above, so we’ll move on to the next clock. In this clock, we’ll subtract
the sample falling off the end of our average list from our new sample. We’ll
place this result into
The rest is simple. We add this difference to our accumulated value, creating
While we might’ve skipped this delay if we didn’t need to drop any bits, doing so would cause our filter to have a different delay depending on how it was configured. Rather than deal with that maintenance headache, the result is always delayed (rounded or not) by one clock here.
If you examine this response, it’s really not that great. At best, you can get a stopband of -13 dB from this filter. That’s better than the -6dB stopband from our FM example above, but still a far cry from the -70dB we might like.
On the other hand, if you cascade
of this type together, you can get a much deeper
To illustrate this,
Fig 6 below shows, in blue, the response of a
that averages five adjacent values together. The other lines on the chart show
what happens after cascading
with itself two, three, four, five, or six times.
There are two important things to notice from this figure. First, the
of the filter, when cascaded, can get very deep. After cascading six of
these together, the
-78 dB–pretty impressive. The second thing to notice is that the
is no longer flat. As a result, we’ll need to follow
with another one to clean up any distortion of our signal of interest.
While a moving average filter is a far cry from a well-designed low-pass filter, it’s also a very simple filter to implement. From an engineering trade-off standpoint, this simplicity makes it a very attractive component for dealing with many DSP requirements on an FPGA.
We’re not done with our lessons on digital filtering, though. I’ve still got plans to discuss symmetric filters, half-band filters, Hilbert transforms, and more. Further, I’d like to discuss not only their high speed implementations, but also some slower implementations that would be appropriate for those designs where there are many clocks between input samples.
Even at that, we’re not done, since the filter above, when cascaded with itself any number of times and followed by a downsampler, can be built without needing a block RAM.
We are also going to want to discuss how to go about proving that these filters work. What sort of test bench is appropriate for testing digital filters?
These, however, are lessons still to come.
Lo, this only have I found, that God hath made man upright; but they have sought out many inventions. (Eccl 7:29)