Every time I’ve built a signal processing system, I’ve struggled with bit-width. Bit width grows when you apply a filter. It grows when you add two streams together. It grows when you multiply your signal by an sine wave or a complex exponential. Indeed, bit growth alone can be a challenge to keep track of.

If you are not careful, you will make the problem worse like I first did when you try to deal with it.

You see, on my first FPGA Digital Signal Processing (DSP) project, I just got rid of the bits I didn’t need by simply dropping them. I’d take the result of whatever arithmetic I had accomplished, and I’d only keep the top N bits. Problem solved, right?

Not quite. While that approach kept the number of bits I was using in bounds, it also created artifacts at DC that confused both myself and the team of individuals I was working with. Did the system really have an uncontrolled DC bias? Where did this bias come from? Did we need better analog filtering? Better bias control up front?

The answer was none of the above. The problem was that I had not properly handled dropping bits, although it took building my own Fast Fourier Transform (FFT) to realize I had a problem.

So let’s examine how to round numbers within an FPGA. We’ll examine several forms of rounding, and look at how each of these forms biases the result one way or another from true. Along the way, we’ll use consider diagrams, similar Fig 1 below, to explain what’s happening.

Fig 1: The need for Rounding

This figure shows a series of numbers, and a series of boxes. The boxes represent the incoming values, and the numbers above represent the possible output values. They are placed above the box where the operation would be exact.

If you look closer, you’ll notice that there are eight boxes per output integer. The boxes that line up exactly are under the arrows, and they are also shown in color, to representing the integer they are assigned to. This leaves seven boxes between exact results that need to be determined.

This is the issue of rounding. When given a value (represented by one of these boxes) between two integers, how shall it be mapped to one of the integers above? We’ll represent individual chosen mappings via color below, so you can see and get a feel the concept.

With this background, we can return to our question: how to go from too many bits, to just the right number?

Go for easy: truncation

The first way of dropping bits may be the easiest … just get rid of any excess bits. This was what I had done originally. Consider, if you have a data values i_data coming in with IWID (input width) bits, and you want to create a data value with OWID bits, why not just grab the top OWID bits?

assign o_truncate = i_data[(IWID-1):(IWID-OWID)];

If you just want something quick and simple, this approach works.

Well, sort of.

Maybe not really.

It’s what I used before I started working on an FFT algorithm. When I built my testbench for the FFT, one of my inputs would be a sine wave. I expected as an output a single non-zero value at or near the frequency of the sine wave. While I got the value I was looking for at the frequency of the sine wave, I also kept getting values at DC that didn’t make sense to me.

When I dug into this, I discovered that dropping bits in this fashion biases the result by about a half of a bit in the negative direction. If you consider Fig 1 below, you can see this effect since it shows how the area covered by each number is biased to the right of the number.

Fig 2: Truncation effects

So it was working on this FFT algorithm that sent me looking for alternatives. The best explanation I found of the various alternatives was on wikipedia’s rounding page. Here, we’ll just explain the types of rounding discussed there, and show some Verilog examples of how to do each.

Basic rounding: Round half up

My next approach to dropping bits was to add a half. This was the type of rounding I learned to do as a child in grade school. Wikipedia calls this “round half up”.

The Verilog code for doing this is nice, in that it only needs OWID+1 bits to work on.

assign	w_halfup = i_data[(IWID-1):0]
		+ { {(OWID){1'b0}}, 1'b1, {(IWID-OWID-1){1'b0}} };
always @(posedge i_clk)
	o_halfup <= w_halfup[(IWID-1):(IWID-OWID)];

Many of the other approaches we’ll discuss will require adding numbers to all IWID bits, so this approach turns out to be simpler than the other rounding approaches below.

While this approach worked a lot better, it still leaves a bias within the data values.

Fig 3: Round Half-Up

To understand what the problem is, consider the bins around our output numbers, as shown in Fig 4 below:

Fig 4: Illustrating the problem with rounding

We might argue that every bin should be attached to the output value “nearest” to it.

That leaves the bins or boxes in the exact center, exactly located between any two integer values.

If you connect all those between bins to the integers on the right, you’ll create the bias we saw in Fig 3. Sure, it’s less than the bias in Fig 2, bit its still a bias.

On the other hand, if you connect all these in-between bins to the integer values on the left (round half down), you’ll create a similar bias–only in the other direction.

Several types of rounding will fix this.

Round towards zero

One unbiased method of rounding is to round towards zero, shown in Fig 5 below.

Fig 5: Rounding towards zero

In this case, we select the integer nearest zero for our midpoint.

Doing this in Verilog is simple, too:

assign	w_tozero = i_data[(IWID-1):0] + { {(OWID){1'b0}}, i_data[(IWID-1)],
			{(IWID-OWID-1){!i_data[(IWID-1)]}}};
always @(posedge i_clk)
	o_tozero <= w_tozero[(IWID-1):(IWID-OWID)];

An almost identical method is to round away from zero. In this case, the mid point values are assigned to the integer nearest them that is farthest from zero.

assign	w_fromzero = i_data[(IWID-1):0] + { {(OWID){1'b0}}, !i_data[(IWID-1)],
			{(IWID-OWID-1){i_data[(IWID-1)]}}};
always @(posedge i_clk)
	o_fromzero <= w_fromzero[(IWID-1):(IWID-OWID)];

The problem with these two approaches is that they will subtly change the amplitude of your signal.

Is there a better approach?

Convergent rounding: Round half to even

Thankfully, I didn’t need to invent a better approach. Several solutions to the rounding problem were already listed for me in a Wikipedia article.

The better approach that I have found is called convergent rounding, although Wikipedia gives it many other names. Perhaps “Round half to even” is the most descriptive. Convergent rounding starts out in an identical fashion to the more traditional rounding (round half up), with the exception of when you are at the exact half way point. In that case, convergent rounding rounds towards the nearest even integer, as shown in Fig 6 below.

Fig 6: Convergent Rounding

If you count the number of input bins assigned to each output integer, you’ll notice that 0 gets 9 bins, 1 gets 7, 2 gets 9, etc.

Likewise, the Verilog code to perform this operation is no more difficult than rounding towards (or away from) zero.

assign	w_convergent = i_data[(IWID-1):0]
			+ { {(OWID){1'b0}},
				i_data[(IWID-OWID)],
				{(IWID-OWID-1){!i_data[(IWID-OWID)]}}};
always @(posedge i_clk)
	o_convergent <= w_convergent[(IWID-1):(IWID-OWID)];

You may notice that these last several forms of rounding required a single clock delay to calculate the answer. That extra clock is required to perform a full word’s worth of addition. This is the addition required to get the carry to propagate through the entire input number.

If you can’t afford the extra clock, consider just dropping the extra bit, as in our first example.

By the way, I’m not the only one who likes this approach. According to the Wikipedia article, it is the “default rounding mode used in IEEE 754 computing functions”. Recognize that standard? IEEE 754 is the standard that defines how floating point is done within your computer.

The DblClock FFT

In my motivation above, I mentiond the FFT I had built earlier. This FFT is built a bit differently from a lot of the other Verilog cores I’ve worked with and those I’ve come across. In particularly, the dblclockfft project consists not in a Verilog core, but rather the C++ code necessary to build a Verilog core. This gives it the flexibility to build any FFT size or bit-width you want.

It’s also a bit of a niche product, since it takes two input samples per clock.

In other words, dblclockfft can be configured for arbitrary FFT sizes (subject to the amount of logic on your device), incoming and outgoing bit widths, as well as the number of hardware multiplies used internally. It also allows you (internally) to configure the type of rounding that will be used within the resulting algorithm.

It was while testing this FFT core that I noticed sinewaves going into the core were producing results on the DC bin coming out. The problem, when fully traced out, came down to rounding. Using the C++ core generator program, I then tested and compared multiple rounding approaches against each other, and decided upon the convergent rounding approach discussed above. (That’s why the type of rounding can be configured internally.)

As a side note, with just a touch of care and feeding the dblclockfft could be adjusted to meet anyone’s FFT needs. Want a single sample in per clock? It’s almost there. Want a real (vice complex) FFT? It only needs one additional module. How about running an FFT? in a non-pipelined mode, with only two hardware multiplies required in total, and the size cost turned into a block RAM cost? Most of the pieces are there for that as well. In other words, with a touch of support, that FFT project could be made very generic.

Next steps

This particular post is actually only one in a series of posts building up to a working linear upsampling interpolator. We’ve already discussed the basic upsampling linear interpolator, and bit growth within it. Now that we understand rounding, our next step in building this interpolator will be to discuss how to debug a DSP algorithm.

Stick around, debugging the linear upsampler makes a great example of how to debug a DSP algorithm.