We just started a discussion of linear upsampling. In many ways, the linear upsampling algorithm is much like any DSP algorithm: samples come in, multiplies, adds, and subtracts get applied, and samples go out. Like any other algorithm built from these operations, every add or multiply increases the number of bits required to represent the result. It seems only fitting, then, to look at bit growth, so that we know how many bits to allocate for our internal results.

Since dropping bits, once grown, is even more complicated than just tracking bit growth, we’ll save that discussion for its own post later.

If you’ve never dealt with, or even heard of, bit growth before, it can be confusing. It doesn’t need to be. As we’ll show here, there are really only two basic rules you need to keep track of: one for addition and one for multiplication. Follow those two, and you should be okay. That said, we’ll present three rules today, since long strings of additions deserve a refinement of their own.

My basic approach for understanding bit growth is to look at the extremes of any N bit arithmetic operation, and to see how many bits it takes to represent those extremes. Indeed, if you ever get confused, just come back to this approach and examine the extremes of the values you are calculating within your application.

Looking at the extremes in the representation, an N bit number, a, can be any one of a range of signed or unsigned values. In the case of signed, two’s complement values, the range of an N bit number is given by,

-2^(N-1) <= a <= 2^(N-1) -1

Hence, a four bit number can represent values between -8 and +7 inclusive.

Likewise the range of an unsigned value N bits wide is,

0 <= a <= 2^N - 1

and a four bit unsigned number can represent values from 0 to 15.
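If you’d like to convince yourself of these ranges, here’s a minimal simulation-only sketch (the module name is mine, for illustration only) that prints the four bit extremes:

module range_demo;
	// Simulation-only sketch: print the extremes of a four bit number
	reg	signed	[3:0]	smin, smax;
	reg		[3:0]	umin, umax;

	initial begin
		smin = 4'b1000;	// -8, the most negative 4 bit signed value
		smax = 4'b0111;	// +7, the most positive 4 bit signed value
		umin = 4'b0000;	//  0, the smallest 4 bit unsigned value
		umax = 4'b1111;	// 15, the largest 4 bit unsigned value
		$display("Signed:   %0d to %0d", smin, smax);
		$display("Unsigned: %0d to %0d", umin, umax);
	end
endmodule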

If you add, subtract, or multiply numbers at the ends of these ranges, by how much must N be increased to capture the new result without overflow?

There are a couple of cases you might not consider along the way, and so I’ll note these below and keep this page updated as a reminder of what the corner cases are.

Add one bit on every add

The rule for addition is that adding two N bit numbers together requires an extra bit to represent the result. Let’s test whether this is true:

(2^(N-1)-1) + (2^(N-1)-1) = 2^N - 2 <= 2^N - 1

Since 2^N - 1 is the largest value an (N+1) bit signed number can represent, that certainly works for the positive end of the range. What about the negative end?

(-2^(N-1)) + (-2^(N-1)) = -2^N

So, at the negative end we are good with adding an additional bit as well: -2^N is exactly the most negative value an (N+1) bit signed number can hold. For example, with two 4 bit values, (-8) + (-8) = -16, which just fits in 5 bits.

If you know that the numbers you will be adding will have different signs, then no bit growth is needed, since the absolute value of the result will tend towards zero rather than increasing. I’ve just never found an application that can use this nuance.

Unsigned numbers are roughly the same, with the exception that you need to be careful when subtracting two unsigned numbers: the result might be negative. Unless you can guarantee that the result will not underflow (i.e., go below zero), you will need to add a sign bit. Hence, again, one bit of growth, but this growth comes with a change of representation (to two’s complement) as well.

So the rule of adding one bit per addition (or subtraction) has a special case we’ll need to be careful of, where the data representation changes as well, but otherwise it holds.
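To make that special case concrete, here’s a minimal sketch of an unsigned subtract (module and port names are mine, for illustration only): the result gains one bit, and that result is now signed.

module usub(i_clk, a, b, out);
parameter	WID=16;

input	wire			i_clk;
input	wire	[(WID-1):0]	a, b;
// The difference of two unsigned numbers may be negative, so the
// result grows by one bit and becomes signed two's complement
output	reg	signed	[WID:0]	out;

always @(posedge i_clk)
	// Zero extend both operands, then subtract; any negative
	// result wraps into a valid (WID+1) bit two's complement value
	out <= { 1'b0, a } - { 1'b0, b };

endmodule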

What if you are adding values with different bit widths? Suppose you wish to add an N bit number to an M bit number? As before, the answer can be found by examining the extremes. If you add the two positive extremes:

(2^(N-1)-1) + (2^(M-1)-1) < 2^N - 1

as long as M <= N. The same is true for the negative extreme. Here, then, the rule is that the output must have one more bit than the wider of the two input numbers.

Now, let’s apply this to bits within an FPGA. When building DSP algorithms, you’ll want to be able to adjust them for the appropriate width later. That way you can write one routine that works for multiple bit widths, and that may even work across multiple projects. Hence, you’ll want to declare your various bit widths as parameters. (We’ll use this approach for our linear interpolator as well.) For example, if you want to add two numbers, a and b, together, you might do something like:

module add(i_clk, a, b, out);
parameter	AWIDTH=16, BWIDTH=4;
		// Output width is given by one more than the maximum of A's
		// width and B's width.  Make this a localparam so that
		// it can't be overridden
localparam	OUTWID = (AWIDTH > BWIDTH) ? (AWIDTH + 1) : (BWIDTH+1);

input	wire			i_clk;
input	wire	[(AWIDTH-1):0]	a;
input	wire	[(BWIDTH-1):0]	b;
output	reg	[(OUTWID-1):0]	out;

always @(posedge i_clk)
	out <= a + b;

endmodule

This captures all of the logic we just discussed above: adding one bit on any addition or subtraction.
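As a quick usage sketch, instantiating this module with two different widths might look like the following (the signal and instance names are hypothetical):

wire	[11:0]	x;
wire	[7:0]	y;
wire	[12:0]	sum;	// One more bit than the wider (12 bit) input

add #(.AWIDTH(12), .BWIDTH(8)) sumxy(i_clk, x, y, sum);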

Long strings of addition

Since the fundamental DSP operation is a convolution, and since convolutions require adding lots of numbers together, this creates a special case of bit growth that needs to be discussed.

Imagine you had K words, N bits each, that all needed to be added together. If you applied the rule above once per addition, you might allocate N+K-1 bits to the result: one extra bit for each of the K-1 additions. While that would work, it’s also more bits than you need.

As before, we’ll look at the extremes. What happens when you add K extreme values together? We’ll consider only the signed case, since the unsigned case is roughly the same:

K * (2^(N-1)-1)

Before taking the next step, let’s imagine a number of bits, M, that can hold the value K, such that:

K <= 2^M

Further, we want M to be the smallest number possible so that this inequality holds. Mathematically, we might say that,

M = ceil(log(K)/log(2))

or, equivalently, M is the base two logarithm of K, rounded up to the nearest integer (the ceiling).

Now, using this number, we know that

K <= 2^M

and so that,

K * (2^(N-1)-1) <= (2^M)*(2^(N-1)-1) = 2^(M+N-1) - 2^M <= 2^(M+N-1) - 1

And that’s our defining case: 2^(M+N-1)-1 is the largest value an (N+M) bit signed number can represent. When adding a large number of values together, all having the same bit width, the number of bits required to hold the result is given by the original number of bits plus the (rounded up) log, base two, of the number of elements added together.

We’ll save the Verilog example for when we build an FIR filter on this blog.
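Still, we can preview the width calculation itself now. Verilog’s $clog2 function returns exactly the ceiling of the base two logarithm we just worked out, so a simple running accumulation might be sized as in this sketch (the module and its ports are hypothetical, not part of the upcoming FIR filter, and the per-output reset logic of a real filter is reduced to a single i_reset input):

module acc(i_clk, i_reset, i_ce, i_sample, o_sum);
parameter	IWID=16, K=256;
// Grow by the (rounded up) log, base two, of the number of terms
localparam	OWID = IWID + $clog2(K);

input	wire				i_clk, i_reset, i_ce;
input	wire	signed	[(IWID-1):0]	i_sample;
output	reg	signed	[(OWID-1):0]	o_sum;

always @(posedge i_clk)
if (i_reset)
	o_sum <= 0;
else if (i_ce)
	// Both operands are signed, so i_sample sign extends to OWID bits
	o_sum <= o_sum + i_sample;

endmodule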

Add the number of bits on every multiply

Let’s now look at multiplying two numbers. We’ll start by looking at signed multiplication, but then look at unsigned multiplication before finishing.

Our first step will be to examine two’s complement multiplication. Suppose we multiply a signed N bit number with a signed M bit number. If you multiply the two largest of these numbers together, the result will fit in N+M-1 bits:

(2^(N-1)-1) * (2^(M-1)-1) = 2^(N+M-2) - 2^(N-1) - 2^(M-1) + 1 < 2^(N+M-2) - 1

Don’t forget to check the other extreme! Multiplying the two most negative numbers together yields a positive result, so this is an important check:

(-2^(N-1)) * (-2^(M-1)) = 2^(N+M-2) < 2^(N+M-1)

You may note that 2^(N+M-2) just barely doesn’t fit in N+M-1 two’s complement bits, whose largest value is 2^(N+M-2)-1, so this becomes the defining case: N+M bits are required to hold the output, although results fitting within N+M-1 bits will be more common. For example, with two 4 bit inputs, (-8) * (-8) = 64, one more than the 7 bit signed maximum of 63, and so 8 bits are required.

In Verilog, multiplying two signed numbers together would look like:

module multiply(i_clk, a, b, out);
parameter	AWIDTH=16, BWIDTH=4;
		// Output width is given by the sum of both A's
		// width and B's width
localparam	OUTWID = (AWIDTH + BWIDTH);

input	wire				i_clk;
input	wire	signed	[(AWIDTH-1):0]	a;
input	wire	signed	[(BWIDTH-1):0]	b;
output	reg	signed	[(OUTWID-1):0]	out;

always @(posedge i_clk)
	out <= a * b;

endmodule

Notice again that, just like the add, we allow the bit widths to be specified (and overridden) by parameters. This allows us to adjust the bit widths later.

Now let’s consider unsigned multiplication. In this case, we only need to check the positive edge of the interval,

(2^N - 1) * (2^M - 1) = 2^(N+M) - 2^N - 2^M + 1 < 2^(N+M) - 1

As before, N+M bits are required to hold the product of an N bit number with an M bit number. Indeed, the Verilog code for doing this is the same as before, with the only exception that the signed tag needs to be removed to do unsigned multiplication.

Rules to Live By

Let’s summarize our rules here:

  1. When adding two numbers, increase the number of bits in the result by one.

    Be careful when subtracting unsigned numbers: the new bit may change your representation from unsigned to signed two’s complement.

  2. When adding long strings of numbers, each with an identical number of bits, such as you would when implementing a convolution, the bit growth goes as the (rounded up) log, base two, of the length of the string, rather than one bit per addition.

  3. When multiplying two numbers, store the output in a register having as many bits within it as the two multiplicands had combined.
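Put together, and assuming N and M bit inputs plus a K term sum (the parameter names below are just for illustration), these rules reduce to three width calculations:

parameter	N=16, M=12, K=256;
// Rule 1: one bit per add or subtract
localparam	SUMWID  = ((N > M) ? N : M) + 1;
// Rule 2: grow by the log, base two (rounded up), of the number of terms
localparam	ACCWID  = N + $clog2(K);
// Rule 3: the sum of the two input widths
localparam	PRODWID = N + M;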

We’re not quite ready to finish our lesson on Linear Interpolation, though. We still need to discuss

  • How to drop bits when processing signals, and

  • How to debug DSP algorithms.

    The approach you will need to debug a DSP based algorithm is distinctly different from the tools necessary to debug the rest of your logic.

Then we can come back and finish building our linear interpolator.

If you see any mistakes in the rules above, or any rules regarding bit growth that I’ve missed, please feel free to comment and note them below.