One of the basic purposes of FPGAs is to run algorithms, and to run them fast: serious number crunching applications. While it's not their only purpose, number crunching is certainly one of the basic ones. Many modern mathematical algorithms, such as my favorite DSP filtering algorithms, require multiplies. Indeed, I might argue that all of the good signal processing algorithms require multiplies: interpolators, filters, Fourier transforms, gain control, control systems, and more.
Want to build your own CPU? If you build anything more than a basic CPU, you’ll want to implement a multiply.
Here’s the bad news: Multiplies are hard to do in hardware.
The good news? Many FPGAs, such as those from Xilinx and Intel, contain a limited number of hardware multiply units, often called DSPs for short, internal to their fabric. For these FPGAs, performing a multiply is as simple as writing a * b and letting the synthesis tool map the operation onto a DSP.
If you have enough of these multiplies on your FPGA, you need not read any further.
If you don’t have enough of these multiplies, then you can often multiplex two or more multiplies together to use only one DSP. If this works for you, life is good. You need not read any further.
However, what if you are building a design for an FPGA that doesn't have any hardware accelerated multiplies within it? Perhaps you are building for an icoBoard or a TinyFPGA BX. If this is the case, you'll need to know how to build a multiply that not only uses a minimum of logic, but also doesn't seriously slow down your clock frequency.
Let’s examine how we might do this.
If you look up how to perform a binary multiply on Wikipedia, you'll find some very fascinating references to Wallace trees, Kochanski multiplication, and Booth multipliers. With a little more digging, you can find Wikipedia's articles on multiplication algorithms: long multiplication, lattice multiplication, peasant multiplication, Karatsuba's algorithm, and even Fourier transform methods of doing multiplication! For our purpose today, let's start with a simple shift-and-add multiplier, and then we'll decrease its cost by a factor of two.
If you've never heard the term "shift-and-add multiplier", then relax. It's the same basic long multiplication algorithm that you learned in grade school, only this time we are going to accomplish it using binary numbers. The algorithm itself is straightforward. Imagine you had two six-bit numbers, a and b, that you wanted to multiply together. You'd start by walking through all of the digits (i.e. bits) in b, from the least significant to the most significant. If the digit is a 1, you'll copy a to a table row, shifted left by that bit's position within b.
Perhaps this sounds harder than it is. Here’s a figure showing what I’m talking about.
In this figure, the six p0* digits represent the multiplication of a by b[0]: if b[0] is a one, p0* will be equal to a, otherwise zero. Likewise, the p1* digits represent the multiplication of a by b[1]: if b[1] is a one, p1* will be equal to a. Notice from the figure, though, that the p1* row is shifted left from the p0* row by one column. p2* is calculated the same way, only it gets shifted over one more column, and so forth.
Again, this should look very much like the long-multiplication algorithm you are already familiar with, with the only difference being that we’re now looking at binary digits instead of decimal ones.
Once all the partial product rows (that is, the pnm rows, where n is the row number and m is the position within the row) have been generated, they are all added together to yield the result.
This is a basic "shift-and-add" multiplication algorithm. We've taken a and shifted it to the left one bit (digit) at a time, and added it to our result accumulator any time the respective bit in b was a one.
If this still sounds confusing, let me try explaining the idea one more time, in C++ code.
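A minimal sketch might run as follows (mpy_u64 is a name I've made up for illustration):

```cpp
#include <cstdint>
#include <cassert>

// Shift-and-add multiply: for each set bit of b, add a shifted copy of a
// into the accumulator.  (A software model of the hardware algorithm;
// mpy_u64 is a name chosen for illustration.)
uint64_t mpy_u64(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    uint64_t pa = a;               // the shifted copy of a
    for (int k = 0; k < 32; k++) {
        if ((b >> k) & 1)
            acc += pa;             // add a << k when bit k of b is set
        pa <<= 1;
    }
    return acc;
}
```

Each pass of the loop produces one row of the table in the figure above, with the running sum taking the place of writing all of the rows down first.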
Signed multiplication will take a bit more work, so let’s just focus on unsigned multiplication for now.
This form of multiplication makes sense. It's nearly the same as the multiplication we all learned in grade school, modified only for binary arithmetic. Indeed, the core of the basic algorithm is easy to implement in Verilog as a clocked shift, bit test, and conditional add.
Sadly, this is a really inefficient implementation. If you count the LUT4s used on an iCE40, you’ll get 307 LUT4s required to implement a 32x32 bit multiply.
Part of the reason why this multiplication implementation is so expensive can be seen by evaluating the logic for each bit in the accumulator:

- The bit must be reset on any new multiply request.
- The next step depends upon p_b[count]. Calculating this value requires a multiplexer. For 32 bits, that's a 32-bit mux, costing many LUT4s to accomplish.
- If the result of this multiplexer is one, we'll then add to this bit the result of another multiplexer applied to p_a, plus a carry bit.

That's a lot of 32-bit multiplexers, especially since this logic needs to be repeated for all 64 bits in the accumulator.
How much does a 32-bit multiplexer cost? Let's find out. Let's take this little snippet of code and run Yosys on it. With just the three commands, read_verilog, synth_ice40, and then stat, Yosys reveals what this design costs us in LUT4s.
Now consider that we are applying this 32-bit mux not only to p_b, but also to every single bit of p_a by evaluating p_a << count. A quick run of Yosys reveals the shift alone will cost us 209 LUT4s in total. That's expensive.
Suppose we removed these LUT4s. For example, we could shift p_a and p_b on each round, so that p_a was always the upshifted version of our first operand, and bit zero of p_b always contained the count bit from our input, i_b. This gets us from 307 LUT4s down to 141 LUT4s. That's a nice 54% reduction.
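In software terms, that optimization is a small change: always add pa, always test bit zero (again just a sketch; the name is mine):

```cpp
#include <cstdint>
#include <cassert>

// Improved shift-and-add: shift p_a up and p_b down each round, so there's
// no bit-select mux (we always test bit zero) and no variable shifter
// (we always add pa).
uint64_t mpy_shift(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    uint64_t pa = a;               // shifts up by one each round
    uint32_t pb = b;               // shifts down by one each round
    for (int k = 0; k < 32; k++) {
        if (pb & 1)                // always bit zero
            acc += pa;             // always pa
        pa <<= 1;
        pb >>= 1;
    }
    return acc;
}
```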
We can do even better.
A Better Multiplication
What keeps us from simplifying this initial shift-add-multiply algorithm is the 64-bit addition (assuming we are multiplying two 32x32 bit numbers). This addition is something of a waste, since if you look at Fig 1., you’ll notice we are never adding more than 32-bits at a time. Every line in our long multiplication accumulation table includes some fixed number of least-significant bits that aren’t changing, as well as some number of more significant bits that remain at zero. Sure, the 32-bits we are adding slowly move across our accumulator, but it’s never using any more than 32-bits of that 64-bit accumulator.
Why should we be paying for all the logic it takes to add nothing?
What if we instead shifted our accumulator register to the right, so that we were always adding to the same physical 32 bits? Then, after every add, we'd shift one accumulator bit off to the right. So, instead of shifting our copy of the multiplicand left on every step as we did above, we'll shift the accumulator to the right on every step.

Let's work through what this might look like.
Let’s walk through this slowly.
First, we are shifting p_b to the right on every clock tick. This gives us access to the next bit of p_b at every clock tick. Unlike our first approach, we don't need a multiplexer to find the correct bit within p_b: it's always going to be bit zero.
This is essentially what we did in our “better” algorithm above.
Next, we shift our partial sum to the right. By itself, this is essentially a no-cost operation. Sure, it costs us some flip-flops, 64 to be exact–but we had to use those anyway. What’s different is that once these bits are shifted, there’s no more logic used to generate them: no 32-bit multiplexers, no adds, nothing–just a basic shift.
We then add p_a to the upper 32 bits, the active bits, of our result accumulator.
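The three steps above can be sketched in software as follows (names are mine; the hardware performs the shift and add together in one clocked step):

```cpp
#include <cstdint>
#include <cassert>

// Right-shifting accumulator: shift the 64-bit partial result right once
// per round, and add a into its upper 32 bits whenever the low bit of b
// is set.  (A sketch of the approach; mpy_rshift is a made-up name.)
uint64_t mpy_rshift(uint32_t a, uint32_t b) {
    uint64_t partial = 0;
    uint32_t pb = b;
    for (int k = 0; k < 32; k++) {
        partial >>= 1;                     // nearly free in hardware
        if (pb & 1)
            partial += (uint64_t)a << 31;  // add into the active upper bits
        pb >>= 1;                          // expose the next bit of b
    }
    return partial;
}
```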
That’s it. All we’ve done is move the bits we are accumulating to the right as we do our summation. Now, consider this from the perspective of the bits in the result:
For the lower 32-bits, we only need a single LUT4: on !o_busy we set them to zero, otherwise they are set based upon the bit to the left of them.
For the upper 32-bits, we perform an addition. In every case, the addition is from the bit to the left plus a corresponding bit from p_a, plus a carry. Let's count those inputs: 1) the carry, 2) the bit to the left, 3) a bit from p_a, and 4) p_b[0], which determines whether we'll add or not. By that count alone, it looks like this fits within a single LUT4.
It doesn’t. Don’t forget that we need to set our value to zero if
!o_busy, and we also need to calculate a carry bit, but it at least comes close.
How much does this logic require per bit to calculate? Let's find out. In this case, we can use another very simple Verilog file and run Yosys again using essentially the same commands.
In this case, we use only 4 LUT4s per-bit.
Has our algorithm really changed? Not really. In many ways, this is the identical algorithm we had before. We're still doing a shift-and-add multiply. The big difference is that we are now shifting our result register, rather than our input multiplicand.
Even better, since we are no longer using the entire 64-bit accumulator in this modified 32x32-bit multiply, our carry chain just got shorter by a factor of two. Perhaps this multiplier is faster?
Let's take a look at the cost of this algorithm. Remember how the algorithm cost 141 LUTs before? We're now down to 112 LUT4s.
When I built my first soft-multiply in Verilog, I didn't know how to handle two's complement numbers. My approach was instead to take the absolute magnitude of both inputs, record the incoming signs, multiply the two unsigned numbers, and then negate the result if necessary. This is painful. A basic NxN shift-add multiply requires N clocks, whereas this signed version needed additional clocks and logic just to handle the signs.
Then I found this Wikipedia page. According to the page, I can make some simple adjustments to my original algorithm and get two's complement multiplication. Here's what the adjustments look like:
Wow, that’s easy, I can do that! Even better, since it (almost) follows from the same long-multiply structure as before, I don’t need to change my algorithm (much).
Ok, it's not quite as straightforward as it looks. I had a bit of a misunderstanding implementing it at first. Specifically, when a bit of b is zero, the corresponding row is no longer zero: even if you include the extra 1 in the first row, a complemented bit such as p05 will be a one whenever the underlying product bit is zero.
This was my first "discovery" while attempting to implement this algorithm. My next discovery really dampened my enthusiasm for it.
See, what the Wikipedia page doesn't tell you is that the algorithm only works on an NxN multiplier. It doesn't work on an NxM multiplier where N != M. When I first discovered this, I felt so burned I actually edited the Wikipedia page to note this limitation. Someone removed my edits later. Who knows why. Maybe it's been edited again since; I haven't looked in a while.
I've since tried to rederive the result shown in Fig 2. After all my work, all I can tell you now is that I know it works. I had been hoping to rederive it in order to know how to modify it so that it applies to an NxM multiply. I have yet to be successful, and not for a lack of trying.
Sorry, this part of the story doesn't have a happy ending (yet; I'm still hoping). I may need to come back and discuss this again later, but for now I'm simply accepting this two's complement multiplication approach as something that "just works". I've also had to file it under the "I'm not really sure why this works" file.
Feel free to verify it with me, and double-check my own work on this.
For now, let’s just add an
OPT_SIGNED parameter to our multiply,
and adjust our algorithm for both signed and unsigned.
A second block then adds the two
1s to our partial product in order to
create the result.
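As a sanity check on those adjustments, here's a software sketch of the adjusted rows for an 8x8 signed multiply (my reconstruction of the scheme in Fig. 2; treat the details as assumptions to verify against the figure):

```cpp
#include <cstdint>
#include <cassert>

// Adjusted shift-and-add rows for an NxN two's complement multiply, N = 8.
// Rows 0..N-2 invert their high-order bit; the last row inverts everything
// *except* its high-order bit; then two extra 1s are added, at bit
// positions N and 2N-1.  (My reconstruction; signed_mpy8 is a made-up name.)
int16_t signed_mpy8(int8_t a, int8_t b) {
    const int N = 8;
    uint8_t  ua = (uint8_t)a, ub = (uint8_t)b;
    uint16_t acc = 0;
    for (int j = 0; j < N; j++) {
        uint16_t row = ((ub >> j) & 1) ? ua : 0;     // a * b[j], N bits wide
        if (j < N - 1)
            row ^= (uint16_t)1 << (N - 1);           // invert the high bit
        else
            row ^= ((uint16_t)1 << (N - 1)) - 1;     // last row: invert the rest
        acc += (uint16_t)(row << j);
    }
    acc += ((uint16_t)1 << N) | ((uint16_t)1 << (2 * N - 1));  // the two 1s
    return (int16_t)acc;                             // result mod 2^(2N)
}
```

Note how a row is never simply zero: even when the corresponding bit of b is clear, the complemented bit positions still contribute ones, which is exactly the misunderstanding described above.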
This is now the algorithm that we'll implement below. At 243 LUT4s, it's not nearly as good as the 112 LUT4 option from the last section, so I may yet need to come back and "optimize" this implementation again. Indeed, it's worse than that, since it feels like I just slapped a two's complement band-aid onto an awesome algorithm.

Yes, I will need to come back and optimize this signed option. For now, though, the algorithm has an awesome LUT usage count for unsigned multiplication.
Now that you know the algorithm, it's time to start taking a look at the source code we'll be using to implement it.
In many ways, the ports are much like what you'd expect: there's a clock and a reset line, an i_stb line to request that a multiply be accomplished, and an o_busy line to say the process is ongoing. I've also added o_done, a signal which will be high on the first clock the output is available.
The basic multiply operands themselves arrive via the inputs i_a and i_b. i_aux is just an auxiliary bit that will be kept with the data, and returned (unchanged) with the product. This will make a traveling valid pipeline possible if desired.
Ideally, I'd like this core to work on NA bits from i_a and NB bits from i_b, but like I've said before, I have yet to figure out how to extend this generic approach to dissimilar bit widths.
As you may remember, I dislike reasoning about multiple bits at a time within
the cascaded if structure of a control loop.
almost_done helps me avoid this, by capturing whether we need to set the o_done bit on the next cycle. When almost_done is true, we're on the last cycle of our logic.
The state machine control logic itself is shown below.
Fig. 3 below shows how these signals play out in practice.
Here you can see the multiply being requested by i_stb. Once the multiply unit receives the i_stb request, it sets o_busy high and starts counting down its cycles from NA-1. Once the count reaches zero, almost_done is set on the next cycle, and o_done on the final cycle. This clock cycle, with o_done true and o_busy false, is the first clock cycle when the next request can be made of the core.
The handshaking logic above is separated from the main multiply calculation, below, simply because these are the only registers in this algorithm that require a reset. Why generate reset logic if you don’t need it?
The next item of interest is the current partial product. This is the product of one of the digits of i_b with all of i_a, and should be familiar from long multiplication. We could even write this as pwire = p_a * p_b[0]. I want to separate this from the logic below, however, because the algorithm above requires toggling particular bits in this word. Separating this into two steps seems to make more sense to me.
With all that behind us, it’s now time to code up the multiply proper.
The first step in the multiply is to copy the multiplicands from the input.
This not only gives us scratch values to work with, but it also allows the
module feeding us to change these values while we are in the middle of our
calculation. All any external logic needs to know, therefore, is that when
i_stb && !o_busy a request is accepted. New data may then be placed on
the input ports.
When I first started coding handshakes like this, I'd load these registers on the condition i_stb && !o_busy. In many of my designs, such as this one, I've dropped the check for i_stb: whenever !i_stb && !o_busy, the values above are don't care values. They could be anything. By dropping the check for i_stb, the total logic for each of these elements simplifies.
However, there's a power cost every time a flip-flop changes. By skipping the check for i_stb, we've not only lowered our logic usage but also increased our power usage. Adding a quick check for i_stb, but only if we are implementing a low power design, can keep these extra flip-flops from toggling.
Now comes the real work, as we cycle through each step of this algorithm.
Now that we are busy, our first task will be to shift
p_b, our copy of the
second multiplicand, to the right. That way we can multiply by the next bit
on the next cycle.
We're keeping our copy of the partial product in partial. Much like the pseudocode above, we first shift all of the bits right by one.
The next action depends upon whether this is a signed multiply or not.
If this is a signed multiply, and if we are on the last row, then we need to treat it special, as shown in Fig. 2 above.
Otherwise, if this is just a normal row of the signed multiply, we add the partial product, pwire, to our accumulator while inverting its high order bit. Again, this follows from Fig. 2 above.
Finally, we have our example partial product for the case where our operands were unsigned. This is the lowest logic path through the code.
At each step through this algorithm, we’ll also drop our state machine counter,
count, by one. This is the same count used in our state machine
logic to know when we are done. This count starts at
NA-1 and counts down
to zero, essentially counting each of the
NA bits of our operands–once
for each bit in the multiply.
Right after the clock cycle where
count == 0, we’ll want to copy our data
to the output. We’ll also add all of those extra
1 bits here in the case
of the signed product.
o_aux is just a copy of what i_aux was when the multiply began. Here we just complete that copy.
While I suppose these output registers could’ve been controlled in the logic block before this one, placing them in their own block helps convince me that their outputs will only change when the product is complete.
If you’ve read this blog much, you’ll know that I love formal verification, and for the last year I’ve tried to formally verify every core I’ve presented here.
Not this time.
Why not? Because formal verification struggles to handle multiplies. That's just one of the (current) weaknesses of formal methods.
What does that mean? “Formal Verification struggles to handle multiplies”? It doesn’t mean that you cannot formally describe a multiplication algorithm. It doesn’t mean that you cannot create appropriate formal properties.
Perhaps I can explain what I mean best by an example. I once built a 12x12 multiplication algorithm, and a set of formal properties to capture it. I started the formal proof on a Monday morning. On Tuesday, Clifford (who now goes by Claire) warned me that the proof might not finish successfully. On Thursday, I got frustrated and built an exhaustive Verilator test. When the exhaustive Verilator test finished in less than 15 minutes, I killed the formal proof that I had started three days earlier.
This is what I mean by “Formal Verification struggles to handle multiplies.” The solvers just don’t know how to handle them (yet).
Therefore, we’ll build a quick Verilator script to verify that our multiply works and properly produces the right answer.
I'd much rather self-discover these parameters from the code itself, but at the time I wrote this I hadn't yet found a way to do so that I liked. Often, when I create something like this from a core generator, I'll have the core generator create a header file defining the choices for the parameters. More recently, I've started setting output values from the core based upon the parameters within the design. This test bench does neither, meaning that I'll need to change both Verilog and C++ code any time I want to change the parameters within the core.
Two C++ helper functions have served me well for some time when working with Verilator. They convert from a limited bit representation to an integer representation more appropriate for working with word-sized integers in software. The first is sbits, the next ubits, for working with signed and unsigned numbers respectively. sbits takes a value of a given bit width, and sign extends it to the full width of the local architecture. ubits is a similar function, but instead of sign extending it just drops the top several bits.
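For reference, here's what such a pair of helpers might look like (my sketch, assuming results fit in 64 bits and 0 < b < 64):

```cpp
#include <cstdint>
#include <cassert>

// sbits: sign extend the low b bits of val to a full 64-bit signed value.
// ubits: zero extend by masking off everything above the low b bits.
// (Sketches of the helpers described above; assumes 0 < b < 64.)
int64_t sbits(uint64_t val, int b) {
    // Push the value's sign bit to the top, then arithmetic shift back down.
    return ((int64_t)(val << (64 - b))) >> (64 - b);
}

uint64_t ubits(uint64_t val, int b) {
    return val & (((uint64_t)1 << b) - 1);   // just drop the top bits
}
```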
I also like to capture the device under test in a class of its own when working with Verilator. In this case, we'll call it SLOWMPYTB. The actual Verilated core is given by m_slow within this class. This also helps me handle the boilerplate associated with running the core and capturing traces from within it.
Just to make sure, we'll double check that the number of bits in each incoming value, summed together, will fit in the result. That is, for NA incoming bits apiece, we'll have 2*NA output bits, and these need to fit within whatever our word size is.
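In C++, that guard can live at compile time (NA here is a stand-in for the core's parameter):

```cpp
#include <cstdint>
#include <cassert>

// Make sure the 2*NA product bits fit within the 64-bit words the test
// bench manipulates.  (NA is a stand-in for the core's parameter.)
constexpr int NA = 32;
static_assert(2 * NA <= 8 * sizeof(uint64_t), "product won't fit in a word");
```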
We've discussed the basics of my tick() function before. Basically, it just advances the clock, but does so in a way that our interaction with the Verilated core, as well as the VCD file generated by it, will remain consistent.
Don't forget to flush the trace before returning from our test. There's been more than one time when I've caught bugs within my design using assert() statements in my test script, but then struggled to figure out what was going on simply because the key information never got written into the trace file. The flush() function above helps to prevent that from happening.
Our reset routine is fairly basic as well. I’ve used
rand() numbers to
try to compensate for the fact that I’m not using formal methods, but as you’ll
see going forward it’s just a token effort that doesn’t really help that much.
The actual key to this C++ test bench is the test(ia, ib) function. I use this function to send the values ia and ib to the core to be multiplied. That way, the test script can just call test() over and over to check various multiplication products and be sure we did them right.
The first step of any test is to set the input values and raise the i_stb request line.
Of course, this entry point only works if the core is not busy. While the test bench should ensure this, I still throw an assertion in here–just because I don’t necessarily trust anything that’s “under test”. Well, that and bugs can be really difficult to find when you aren’t using formal. The assertion helps walk the bug back in time to where I can find it.
We've now requested the multiply, so let's wait for it to be complete. That should take NA+1 clock cycles. During this time, i_stb must be kept low, although we can randomize i_a and i_b to verify that the design will ignore such changes mid-multiply. Finally, we also check that the output signals are exactly what they need to be here.
Once done, busy should be low and the done bit should be high.
We can now print out the two values used to create this product, together with their results to the terminal. This kind of output is really great early on, although it gets kind of verbose the closer you get to a working routine.
The next step is to verify that we received the right answer. For that purpose, we’ll sign extend the result.
Did we get the right answer? For this, we’ll use a second method of multiplying two numbers together and compare the results. In this case, we can (now) use the C multiplication operator. We’ll first sign extend our values, then calculate what the result should’ve been.
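In miniature, that comparison amounts to the following (a self-contained sketch: sext duplicates the sign extension helper, and check_product, NA, and the sample widths are my stand-ins):

```cpp
#include <cstdint>
#include <cassert>

// Sign extend the low b bits of v to a 64-bit signed value (0 < b < 64).
int64_t sext(uint64_t v, int b) {
    return ((int64_t)(v << (64 - b))) >> (64 - b);
}

// True when the core's raw 2*NA-bit output matches the product predicted
// by the C multiplication operator.  (check_product is a made-up name.)
bool check_product(uint64_t ia, uint64_t ib, uint64_t sout, int NA) {
    int64_t expected = sext(ia, NA) * sext(ib, NA);
    return sext(sout, 2 * NA) == expected;
}
```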
The test will succeed if the design's output, sout, matches this predicted result. By stopping and exiting as soon as any failure is encountered, we can limit how far we have to search the trace for a bug, should one be encountered.
We can then close our test case by returning if we have been successful.
That's really all the heavy lifting. The rest of our test bench is the easy stuff. We'll start out by constructing a test bench object and (optionally) starting a VCD file to trace everything.
The next step is to work our way through some basic test cases. I chose to
start with some very basic tests, before trying to get complicated. Why?
Because basic tests, like
0 * 0 or
1 * 0, are a whole lot easier to debug
than more complex tests like
425,682 * 934,255,346.
I now repeat the same basic tests, but this time for signed numbers. Again, I’m starting the test bench tests out nice and easy–just to make debugging simpler if there’s a bug here.
Now let’s try some corner cases, where the inputs are near their maximum values.
That covers the corners, so let's now try multiplying 2^n by one. This should test every bit in a, to make sure each is properly handled. We can repeat this with b, this time multiplying one by 2^n.
Again, these checks are a bit ad-hoc, but they are designed to capture all of the corner cases where 1) I might expect a bug, and 2) where bugs are easy to catch.
Having gotten all the corner cases I expect, I then turn my attention to some random tests.
Now, let me ask, was that good enough? I don’t know. This isn’t a formal proof, so it’s hard to tell. However, if our multiply is small enough, we might be able to check all possible inputs.
On the other hand, if you are trying to multiply two 32-bit numbers together, you might never see the end of that loop.
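The idea scales down nicely, though: at 8 bits, an exhaustive check is trivial (a sketch; mpy8 stands in for whatever multiply implementation is under test):

```cpp
#include <cstdint>
#include <cassert>

// An 8-bit shift-and-add multiply standing in for the design under test.
uint16_t mpy8(uint8_t a, uint8_t b) {
    uint16_t acc = 0, pa = a;
    for (int k = 0; k < 8; k++) {
        if ((b >> k) & 1)
            acc += pa;
        pa <<= 1;
    }
    return acc;
}

// Check all 65,536 input combinations against the C multiply operator.
bool exhaustive_check() {
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            if (mpy8(a, b) != (uint16_t)(a * b))
                return false;
    return true;
}
```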
Still, if we’ve gotten this far, the algorithm works. All that’s left is to close up the test case and declare success.
This article presents the best slow-multiplication algorithm I have yet come across. It’s not fast–nor is it intended to be. However, it can be used cheaply in an FPGA that either doesn’t have any DSPs left, or one that is currently using all of its DSP infrastructure. Indeed, this is the multiply that the ZipCPU uses in these cases.
Without this algorithm, the ZipCPU would’ve never passed muster on an iCE40 HX8K FPGA.
I should also point out that the key to this algorithm, shifting the accumulator so that only the relevant 32 bits of the 64 are added on each cycle, can be used in a divide algorithm as well. So, I tried it. It worked, and it helped; it just didn't help as much as it did for the multiply. It's been a while since I did it, but I recall the improvement was 20% or less. It was good enough to keep, but not nearly enough to write home about.
And God blessed them, saying, Be fruitful, and multiply, and fill the waters in the seas, and let fowl multiply in the earth. (Gen 1:22)