Building a Better Verilog Multiply for the ZipCPU
One of the basic purposes of FPGAs is to run algorithms, and to run them fast. You know, those serious number crunching applications. While it’s not the only purpose, number crunching is certainly one of the basic ones. Many modern mathematical algorithms, such as my favorite DSP filtering algorithms, require multiplies. Indeed, I might argue that all of the good signal processing algorithms require them: interpolators, filters, Fourier transforms, gain control, control systems, and more.
Want to build your own CPU? If you build anything more than a basic CPU, you’ll want to implement a multiply.
Here’s the bad news: Multiplies are hard to do in hardware.
The good news? Many FPGAs, such as those from Xilinx and Intel, contain a limited number of hardware multiply units, often called DSPs for short, within their fabric. For these FPGAs, performing a multiply is as simple as:
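Where these DSP blocks exist, the synthesis tool will infer a hardware multiply directly from the multiplication operator. A minimal sketch (the port names here are illustrative, not taken from any particular design):

```verilog
// The tools will map this multiply onto one (or more) DSP blocks
always @(posedge i_clk)
	o_p <= i_a * i_b;
```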
If you have enough of these multiplies on your FPGA, you need not read any further.
If you don’t have enough of these multiplies, then you can often multiplex two or more multiplies together to use only one DSP. If this works for you, life is good. You need not read any further.
However, what if you are building a design for an FPGA that doesn’t have any hardware accelerated multiplies within it? Perhaps you are building for an ICO board or a TinyFPGA BX. If this is the case, you’ll need to know how to build a multiply that not only uses a minimum of logic, but also one that doesn’t seriously slow down your clock frequency.
Let’s examine how we might do this.
Long Multiplication
If you look up how to perform a binary multiply on Wikipedia, you’ll find some very fascinating references to Wallace trees, Kochanski multiplication and Booth multipliers. With a little more digging, you can find Wikipedia’s article on Multiplication, Multiplication algorithms, long multiplication, Lattice multiplication, Peasant multiplication, Karatsuba’s algorithm and even Fourier transform methods of doing multiplication! For our purpose today, let’s start with a simple shift-and-add multiplier, and then we’ll decrease its cost by a factor of two.
If you’ve never heard the term “shift-and-add multiplier”, then relax. It’s the same basic long multiplication algorithm that you learned in grade school, only this time we are going to accomplish it using binary numbers. The algorithm itself is straightforward. Imagine you had two six-bit numbers, a and b, that you wanted to multiply together. You’d start by walking through all of the digits (i.e. bits) in b, from the least to the greatest. If the digit is a 1, you’ll copy a to a table, only shifting it by the position within b.
Perhaps this sounds harder than it is. Here’s a figure showing what I’m talking about.
(Fig 1: Long multiplication of two six-bit binary numbers, a and b, showing the shifted partial product rows)
In this figure, the six p0* digits represent the multiplication of a by b0. If b0 is 1, p0* will be equal to a, otherwise zero. The six p1* digits represent multiplication of a by b1. If b1 is 1, p1* will be equal to a. Notice from the figure, though, that the p1* row is shifted left from the p0* row. p2* is calculated the same way, only it gets shifted over one more column, and so forth.
Again, this should look very much like the long-multiplication algorithm you are already familiar with, with the only difference being that we’re now looking at binary digits instead of decimal ones.
Once all the partial product rows, that is the pnm rows, where n is the row number and m is the position within the row, have been generated, they are all then added together to yield a result.

This is a basic “shift-and-add” multiplication algorithm. We’ve taken a, shifted it to the left one bit (digit) at a time, and added it to our result accumulator any time the respective bit in b was a 1.
If this still sounds confusing, let me try explaining this idea one more time. In C++ code, this multiply might look like:
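(The listing below is a reconstruction for illustration, handling unsigned values only, rather than the post’s original code.)

```cpp
#include <stdint.h>

// Shift-and-add multiply: walk the bits of b from least to most
// significant, and whenever a bit is set, add a copy of a, shifted up
// into that bit's position, to the running sum.
uint64_t longmpy(uint32_t a, uint32_t b) {
	uint64_t	result = 0;

	for(int k=0; k<32; k++) {
		if (b & (1u << k))
			result += ((uint64_t)a) << k;
	}

	return result;
}
```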
Signed multiplication will take a bit more work, so let’s just focus on unsigned multiplication for now.
This form of multiplication makes sense. It’s nearly the same as the multiplication we all learned in grade school, modified only for binary arithmetic. Indeed, it’s easy to implement in Verilog, where the core of the basic algorithm will look something like:
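(Again, the snippet below is a sketch rather than the post’s original listing; the register names follow the discussion that comes next.)

```verilog
// Naive shift-and-add core, unsigned only.  On each cycle, examine one
// bit of the multiplier and, if it is set, add a shifted copy of p_a
// into the 64-bit accumulator.
always @(posedge i_clk)
if (!o_busy)
begin
	p_a    <= i_a;
	p_b    <= i_b;
	count  <= 0;
	result <= 0;
end else begin
	count <= count + 1;
	if (p_b[count])
		result <= result + ({ 32'h0, p_a } << count);
end
```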
Sadly, this is a really inefficient implementation. If you count the LUT4s used on an iCE40, you’ll get 307 LUT4s required to implement a 32x32 bit multiply.
Part of the reason why this multiplication implementation is so expensive can be seen by evaluating the logic for each bit in result.

- The bit must be reset on !o_busy.
- The next step depends upon p_b[count]. Calculating this value requires a multiplexer. For 32-bits, that’s a 32-bit mux–costing many LUT4s to accomplish.
- If the result of this multiplexer is one, we’ll then add to this bit the result of another multiplexer applied to p_a, plus a carry bit.
That’s a lot of 32-bit multiplexers, especially since this logic needs to be repeated for all 64 bits of the accumulator.
How much does a 32-bit multiplexer cost? Let’s find out, by taking a little snippet of code and running Yosys on it.
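(A guess at the kind of file in question: a hypothetical mux32.v selecting one bit out of a 32-bit word.)

```verilog
module mux32(
		input	wire	[4:0]	i_sel,
		input	wire	[31:0]	i_word,
		output	wire		o_bit);

	assign	o_bit = i_word[i_sel];
endmodule
```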
With just the three commands read_verilog mux32.v, synth_ice40, and then stat, Yosys reveals that this design costs us 25 LUT4s.
Now consider that we are applying this 32-bit mux not only to p_b[count], but also to every single bit of p_a by evaluating p_a << count. A quick run of Yosys reveals the shift alone will cost us 209 LUT4s in total. That’s expensive.
Suppose we removed these LUT4s. For example, we could shift p_a and p_b on each round, so that p_a was always the upshifted version of a and p_b[0] always contained the count bit from our input b.
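A sketch of that adjustment follows. Here p_a has been widened so it can be shifted left without losing bits; the original listing may have arranged this differently:

```verilog
// Shift the operand copies on every round, instead of multiplexing by
// count: p_a marches left, p_b marches right, and only bit zero of
// p_b ever needs to be examined.
always @(posedge i_clk)
if (!o_busy)
begin
	p_a    <= { 32'h0, i_a };	// 64-bit copy, shifted left below
	p_b    <= i_b;			// Shifted right below
	result <= 0;
end else begin
	p_a <= p_a << 1;
	p_b <= p_b >> 1;
	if (p_b[0])
		result <= result + p_a;
end
```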
This gets us from 307 LUT4s down to 141 LUT4s. This is a nice 54% improvement.
We can do even better.
A Better Multiplication
What keeps us from simplifying this initial shift-add-multiply algorithm is the 64-bit addition (assuming we are multiplying two 32x32 bit numbers). This addition is something of a waste, since if you look at Fig. 1, you’ll notice we are never adding more than 32-bits at a time. Every line in our long multiplication accumulation table includes some fixed number of least-significant bits that aren’t changing, as well as some number of more significant bits that remain at zero. Sure, the 32-bits we are adding slowly move across our accumulator, but we never use any more than 32-bits of that 64-bit accumulator.
Why should we be paying for all the logic it takes to add nothing?
What if we instead shifted our accumulator register to the right, so that we were always adding to the same physical 32-bits? Then, after every add, we’d shift one accumulator bit off to the right. So, instead of shifting p_a left on every step as we did above, we’ll shift the result accumulator to the right on every step.
Let’s work through what this might look like.
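Here’s a sketch of the idea. The signal names follow the walkthrough below, though the actual source may differ in its details:

```verilog
// Shift the 64-bit partial sum right on every step, and add p_a (or
// zero) into its upper, active bits.  Only bit zero of p_b is ever
// examined, since p_b also shifts right each step.
always @(posedge i_clk)
if (!o_busy)
begin
	p_a     <= i_a;
	p_b     <= i_b;
	partial <= 0;
end else begin
	p_b <= p_b >> 1;

	partial[30:0]  <= partial[31:1];
	partial[63:31] <= { 1'b0, partial[63:32] }
			+ ((p_b[0]) ? { 1'b0, p_a } : 33'h0);
end
```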
Let’s walk through this slowly.
- First, we are shifting p_b to the right on every clock tick. This gives us access to the next bit of p_b at every clock tick. Unlike our first approach, we don’t need a multiplexer to find the correct bit within p_b–it’s always going to be bit zero. This is essentially what we did in our “better” algorithm above.
- Next, we shift our partial sum to the right. By itself, this is essentially a no-cost operation. Sure, it costs us some flip-flops, 64 to be exact–but we had to use those anyway. What’s different is that once these bits are shifted, there’s no more logic used to generate them: no 32-bit multiplexers, no adds, nothing–just a basic shift.
- We then add p_a to the upper 32 bits, the active bits, of our result accumulator.
That’s it. All we’ve done is move the bits we are accumulating to the right as we do our summation. Now, consider this from the perspective of the bits in the result:
- For the lower 32-bits, we only need a single LUT4: on !o_busy we set them to zero, otherwise they are set based upon the bit to the left of them. That’s simple.
- For the upper 32-bits, we perform an addition. In every case, the addition is from the bit to the left plus a corresponding bit from p_a, plus a carry. Let’s count those inputs: 1) the carry, 2) the previous bit, 3) a bit from p_a, and 4) p_b[0], which determines whether we’ll add or not. From this count alone, it looks like it fits within a single LUT4.

It doesn’t. Don’t forget that we need to set our value to zero if !o_busy, and we also need to calculate a carry bit, but it at least comes close.
How much does this logic require per bit to calculate? Let’s find out. In this case, we can use another very simple Verilog file and run Yosys again, using essentially the same commands.
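(The file below is a hypothetical reconstruction of that experiment, not the post’s original; it implements one step of the shift-and-add update described above.)

```verilog
// One step of the accumulator-shifting multiply: shift the 64-bit
// accumulator right, adding i_a into the active bits whenever i_b0 is
// set, or clearing everything when idle.
module addshift(
		input	wire		i_clk, i_busy, i_b0,
		input	wire	[31:0]	i_a,
		output	reg	[63:0]	o_acc);

	always @(posedge i_clk)
	if (!i_busy)
		o_acc <= 0;
	else begin
		o_acc[30:0]  <= o_acc[31:1];
		o_acc[63:31] <= { 1'b0, o_acc[63:32] }
				+ ((i_b0) ? { 1'b0, i_a } : 33'h0);
	end
endmodule
```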
In this case, we use only 4 LUT4s per bit.
Has our algorithm really changed? Not really. In many ways this is the same identical algorithm we had before. We’re still doing a shift-and-add multiply. The big difference now is that we are shifting our result register, rather than our input multiplicand.
Even better, since we are no longer using the entire 64-bit accumulator in this modified 32x32-bit multiply, our carry chain just got shorter by a factor of two. Perhaps this multiplier is faster?
Let’s take a look at the cost of this algorithm. Remember how the algorithm cost 141 LUT4s before? We’re now down to 112 LUT4s.

Not bad.
Twos Complement
When I built my first soft-multiply in Verilog, I didn’t know how to handle twos complement numbers. My approach was instead to take the absolute magnitude of both inputs, record the incoming signs, multiply the two unsigned numbers, and then negate the result if necessary. This is painful. A basic NxN shift-add multiply requires N clocks, whereas this signed multiply costs N+2 clocks.
Then I found this Wikipedia page. According to the page, I can make a few simple adjustments to my original algorithm and get twos complement multiplication. Here’s what the adjustments look like:
(Fig 2: The adjustments to the partial product rows required for a twos complement, i.e. signed, multiply)
Wow, that’s easy, I can do that! Even better, since it (almost) follows from the same long-multiply structure as before, I don’t need to change my algorithm (much).
Ok, it’s not quite as straightforward as it looks. I had a bit of a misunderstanding implementing it at first. Specifically, if p_b[0] is zero, the corresponding row is no longer zero. Even if you include the 1 in the first row, if p_b[0]==0 then p05 is zero and !p05 is ONE.
This was my first “discovery” while attempting to implement this algorithm. My next discovery really dampened my thoughts on this.
See, what the Wikipedia page doesn’t tell you is that the algorithm only works on an NxN multiplier. It doesn’t work on an NxM multiplier where N != M. When I first discovered this, I felt so burned I actually edited the Wikipedia page to note this fact.
Someone removed my edits later. Who knows why. Maybe it’s been edited again since; I haven’t looked in a while.
I’ve since tried to rederive the result shown in Fig. 2. After all my work, all I can tell you now is that I know it works. I had been hoping to rederive it in order to know how to modify it so that it applies to a generic NxM multiply.

I have yet to be successful, and not for a lack of trying.
Sorry, this part of the story doesn’t have a happy ending (yet–I’m still hoping). I may need to come back and discuss this again later, but for now I’m simply accepting this two’s complement multiplication approach as something that “just works”. I’ve also had to file it under “I’m not really sure why this works”.
Feel free to verify it with me, and double-check my own work on this.
For now, let’s just add an OPT_SIGNED parameter to our multiply, and adjust our algorithm for both signed and unsigned.
A second block then adds the two 1s to our partial product in order to create the result.
This is now the algorithm that we’ll implement below. At 243 LUT4s it’s not nearly as good as the 112 LUT4 option from the last section, so I may yet need to come back and “optimize” this twos complement implementation again. Indeed, it’s worse than that, since it feels like I just slapped a twos-complement band-aid onto an awesome algorithm.
Yes, I will need to come back and optimize this signed option. For now, the algorithm has an awesome LUT usage count for unsigned multiplication.
Source code
Now that you know the algorithm, it’s time to start taking a look at the source code we’ll be using to implement it.
In many ways, the ports are much like what you’d expect: there’s a clock and a reset line, an i_stb line to request that a multiply be accomplished, and an o_busy line to say the process is ongoing. I’ve also added o_done, a signal which will be high on the first clock the output is available.
The basic multiply operands themselves involve inputs i_a and i_b, with the output o_p (product). i_aux is just an auxiliary bit that will be kept with the data, and returned (unchanged) with the product. This will make a traveling valid pipeline signal possible if desired.
Ideally, I’d like this multiply to work on NA bits from i_a times NB bits from i_b, but like I’ve said before, I have yet to figure out how to extend this generic approach to dissimilar bit widths.
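A sketch of the resulting parameter and port list follows. The names match the discussion above, though the post’s actual source may differ in its defaults and declarations:

```verilog
module	slowmpy #(
		parameter		NA = 32,	// Width of both operands
		parameter [0:0]		OPT_SIGNED = 1'b1
	) (
		input	wire			i_clk, i_reset,
		input	wire			i_stb,	// Request a multiply
		input	wire	[NA-1:0]	i_a, i_b,
		input	wire			i_aux,	// Travels with the data
		output	reg			o_busy,	// Calculation on-going
		output	reg			o_done,	// High when o_p is valid
		output	reg	[2*NA-1:0]	o_p,	// The product
		output	reg			o_aux
	);
	// ... internal logic, discussed piece by piece below ...
```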
As you may remember, I dislike reasoning about multiple bits at a time within the cascaded if structure of a control loop. almost_done helps me avoid this, by capturing whether we need to set the o_done bit on the next cycle. Hence, when almost_done is true, we’ll be on the last cycle of our logic.
The state machine control variables themselves are o_done and o_busy, and the logic is shown below.
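(What follows is a sketch of that handshaking logic, reconstructed from the description in the surrounding text rather than copied from the post’s source.)

```verilog
	// The count==0 condition, one cycle before almost_done
	wire	pre_done = (count == 0);

	always @(posedge i_clk)
	if (i_reset)
		{ o_done, o_busy, almost_done } <= 3'b000;
	else if (!o_busy)
	begin
		o_done      <= 1'b0;
		o_busy      <= i_stb;	// Accept a new request
		almost_done <= 1'b0;
	end else begin
		almost_done <= pre_done;
		if (almost_done)
		begin
			o_done <= 1'b1;	// The product is now valid
			o_busy <= 1'b0;
		end
	end
```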
Fig. 3 below shows how these signals play out in practice.
Here you can see the multiply being requested by i_stb. Once the multiply unit receives the i_stb request, it then sets o_busy high and starts counting its cycles down from NA-1. Once count==0, pre_done becomes true, causing the almost_done signal on the next cycle and o_done on the final cycle. This clock cycle, with o_done true and o_busy false, is the first clock cycle when the next request can be made of the core.
The handshaking logic above is separated from the main multiply calculation, below, simply because these are the only registers in this algorithm that require a reset. Why generate reset logic if you don’t need it?
The next item of interest is the current partial product. This is the product of one of the digits of i_b with all of i_a, and should be familiar from long multiplication. We could even write this as pwire = p_a * p_b[0]. I want to separate this from the logic below, however, because the algorithm above requires toggling particular bits in this word. Separating this into two steps seems to make more sense to me.
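(A sketch of that separate step: multiplying by a single bit is really just an AND, or here a mux, against that bit.)

```verilog
	wire	[NA-1:0]	pwire;
	assign	pwire = (p_b[0]) ? p_a : { (NA){1'b0} };
```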
With all that behind us, it’s now time to code up the multiply proper.
The first step in the multiply is to copy the multiplicands from the input. This not only gives us scratch values to work with, but it also allows the module feeding us to change these values while we are in the middle of our calculation. All any external logic needs to know, therefore, is that when i_stb && !o_busy a request is accepted. New data may then be placed on the input ports.
When I first started coding handshakes like this, I’d use the condition of i_stb && !o_busy. In many of my designs, such as this one, I’ve dropped the check for i_stb. If !i_stb && !o_busy, the values above are don’t care values. They could be anything. By dropping the check for i_stb, the total logic for each of these elements simplifies.
However, there’s a power cost every time a flip-flop changes. By skipping the check for i_stb, we’ve not only lowered our logic but also increased our power. Adding a quick check for i_stb, but only if we are implementing a low power design, can keep these extra flip-flops from toggling.
Now comes the real work, as we cycle through each step of this algorithm.

Now that we are busy, our first task will be to shift p_b, our copy of the second multiplicand, to the right. That way we can multiply by the next bit on the next cycle.
We’re keeping our copy of the partial product in partial. Much like the pseudo code above, we first shift all of the bits right by one.
The next action depends upon whether this is a signed multiply or not.
If this is a signed multiply, and if we are on the last row, then we need to treat it special, as shown in Fig. 2 above.
Otherwise, if this is just a normal row for the signed multiply, we add p_b[0]*p_a to our accumulator while negating the high order bit. Again, this follows from Fig. 2 above.
Finally, we have our example partial product for the case where our operands were unsigned. This is the lowest logic path through the code.
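Gathering those pieces together, a sketch of the core’s main always block might look like the following. The names match the discussion above, but the details (and the exact arrangement) of the post’s actual source may differ:

```verilog
	always @(posedge i_clk)
	if (!o_busy)
	begin
		// Accept a new request: copy the operands and aux bit,
		// clear the accumulator, and reload the down-counter
		p_a     <= i_a;
		p_b     <= i_b;
		aux     <= i_aux;
		partial <= 0;
		count   <= NA-1;
	end else begin
		// Expose the next multiplier bit on p_b[0]
		p_b <= (p_b >> 1);

		// Shift the accumulator right by one ...
		partial[NA-2:0] <= partial[NA-1:1];

		// ... then add the current partial product into its
		// active (upper) bits
		if ((OPT_SIGNED)&&(count == 0))
			// Last row of a signed multiply: keep the top
			// product bit, invert all of the others
			partial[2*NA-1:NA-1] <= { 1'b0, partial[2*NA-1:NA] }
				+ { 1'b0, pwire[NA-1], ~pwire[NA-2:0] };
		else if (OPT_SIGNED)
			// Any other row of a signed multiply: invert
			// only the top product bit
			partial[2*NA-1:NA-1] <= { 1'b0, partial[2*NA-1:NA] }
				+ { 1'b0, ~pwire[NA-1], pwire[NA-2:0] };
		else
			// Unsigned: add the partial product unchanged
			partial[2*NA-1:NA-1] <= { 1'b0, partial[2*NA-1:NA] }
				+ { 1'b0, pwire };

		count <= count - 1;
	end
```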
At each step through this algorithm, we’ll also drop our state machine counter, here called count, by one. This is the same count used in our state machine logic to know when we are done. This count starts at NA-1 and counts down to zero, essentially counting each of the NA bits of our operands–once for each bit in the multiply.
Right after the clock cycle where count == 0, we’ll want to copy our data to the output. We’ll also add all of those extra 1 bits here, in the case of the signed product.
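(Another sketch, following the same assumptions as the blocks above: the two correction ones land at bit NA and at bit 2*NA-1 of the product.)

```verilog
	always @(posedge i_clk)
	if (almost_done)
	begin
		if (OPT_SIGNED)
			// Add the two 1s required by the signed algorithm
			o_p <= partial
				+ { 1'b1, {(NA-2){1'b0}}, 1'b1, {(NA){1'b0}} };
		else
			o_p <= partial;
		o_aux <= aux;
	end
```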
o_aux is just a copy of what i_aux was initially. Here we just complete that copy.
While I suppose these output registers could’ve been controlled in the logic block before this one, placing them in their own block helps convince me that their outputs will only change when the product is complete.
Test bench
If you’ve read this blog much, you’ll know that I love formal verification, and for the last year I’ve tried to formally verify every core I’ve presented here.
Not this time.
Why not?
Because formal verification struggles to handle multiplies. That’s just one of the (current) weaknesses of formal methods.
What does that mean? “Formal Verification struggles to handle multiplies”? It doesn’t mean that you cannot formally describe a multiplication algorithm. It doesn’t mean that you cannot create appropriate formal properties.
Perhaps I can explain what I mean best by an example. I once built a 12x12 multiplication algorithm, and a set of formal properties to capture it. I started the formal proof on a Monday morning. On Tuesday, Clifford (who now goes by Claire) warned me that the proof might not finish successfully. On Thursday, I got frustrated and built an exhaustive Verilator test. When the exhaustive Verilator test finished in less than 15 minutes, I then killed the formal proof that I had started three days earlier.
This is what I mean by “Formal Verification struggles to handle multiplies.” The solvers just don’t know how to handle them (yet).
Therefore, we’ll build a quick Verilator script to verify that our multiply works and properly produces the right answer.
Unlike my FFT code, this script includes hardwired values for the parameters the Verilog code was built with.
I’d much rather self-discover these parameters from the code itself, but at the time I wrote this I hadn’t yet found a way to do this that I liked. Often, when I create something like this from a core generator, I’ll have the core generator create a header file defining the choices for the parameters. More recently, I’ve started setting output values from the core based upon the parameters within the design. This test bench does neither, meaning that I’ll need to change both Verilog and C++ code any time I want to change the parameters within the core.
Two C++ snippets of code have served me well for some time when working with Verilator. They convert from a limited bit representation to an integer representation more appropriate for working with word-sized integers in software. The first is sbits, the next ubits, for work with signed and unsigned numbers respectively.
sbits takes a value of bits width, and sign extends it to the full long width of the local architecture.

ubits is a similar function, but instead of sign extending it just drops the top several bits.
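Sketches of both helpers follow; these are reconstructions of the idea rather than the test bench’s exact code:

```cpp
// Sign extend a bits-wide value into a native long
long	sbits(const long val, const int bits) {
	long	r;

	if (bits >= (int)(8*sizeof(long)))
		return val;		// Already full width
	r = val & ((1l << bits) - 1);
	if (r & (1l << (bits-1)))
		r -= (1l << bits);	// Negative: fold back below zero
	return r;
}

// Zero extend: simply mask off everything above the given width
unsigned long	ubits(const long val, const int bits) {
	if (bits >= (int)(8*sizeof(long)))
		return (unsigned long)val;
	return ((unsigned long)val) & ((1ul << bits) - 1);
}
```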
I also like to capture the test device in a class of its own when working with Verilator. In this case, we’ll call it SLOWMPYTB. The actual Verilated core is given by m_slow within this class. This also helps me handle the boilerplate associated with running Verilator and capturing traces from within it.
Just to make sure, we’ll double check that the number of bits in each incoming value, summed together, will fit in the result. That is, for NA incoming bits, we’ll have 2*NA output bits–make sure these will fit within whatever our word size is.
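(One way to write such a check, assuming NA is the hardwired copy of the core’s parameter:)

```cpp
	// The 2*NA-bit product must fit within the long used to hold it
	assert(2*NA <= (int)(8*sizeof(long)));
```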
We’ve discussed the basics of my tick() method before. Basically, it just advances the clock, but does so in a way that our interaction with the core, as well as the VCD file generated by it, will remain consistent.
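A sketch of such a tick() routine is shown below. Here m_slow is the Verilated core, m_trace the (optional) VerilatedVcdC trace, and m_tickcount a simple cycle counter; the post’s actual method may differ:

```cpp
	void	tick(void) {
		m_tickcount++;

		// Settle any combinatorial inputs before the clock edge
		m_slow->i_clk = 0;
		m_slow->eval();
		if (m_trace) m_trace->dump(10*m_tickcount - 2);

		// Rising edge
		m_slow->i_clk = 1;
		m_slow->eval();
		if (m_trace) m_trace->dump(10*m_tickcount);

		// Falling edge
		m_slow->i_clk = 0;
		m_slow->eval();
		if (m_trace) {
			m_trace->dump(10*m_tickcount + 5);
			m_trace->flush();
		}
	}
```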
Don’t forget to flush the VCD file before returning from our tick() routine!
There’s been more than one time when I’ve caught bugs within my design using C++ assert() statements in my test script, but where I then struggled to figure out what was going on simply because the key information never got written into the trace file. The Verilator flush() function above helps to prevent that from happening.
Our reset routine is fairly basic as well. I’ve used rand() numbers to try to compensate for the fact that I’m not using formal methods, but as you’ll see going forward it’s just a token effort that doesn’t really help that much.
The actual key to this C++ test bench code is the test(ia, ib) function. I use this function to send values ia and ib to the core to be multiplied. That way the test script can just call test(ia,ib) over and over to test various multiplication products and be sure we did them right.
The first step of any test is to set the input values and the i_stb value.
Of course, this entry point only works if the core is not busy. While the test bench should ensure this, I still throw an assertion in here–just because I don’t necessarily trust anything that’s “under test”. Well, that and bugs can be really difficult to find when you aren’t using formal. The assertion helps walk the bug back in time to where I can find it.
We’ve now requested the multiply, so let’s wait for it to be complete. That should take NA+1 clock cycles. During this time, i_stb must be kept low, although we can randomize i_a and i_b to try to verify that the design will ignore such changes mid-multiply. Finally, we also check that the output signals are exactly what they need to be here.
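Here’s a sketch of what this first half of test() might look like. The member names follow the discussion above (m_slow, tick(), ubits(), NA); the exact checks in the original script may differ:

```cpp
	void	test(const long ia, const long ib) {
		// The core must be idle before we make a new request
		assert(!m_slow->o_busy);

		m_slow->i_stb = 1;
		m_slow->i_a   = ubits(ia, NA);
		m_slow->i_b   = ubits(ib, NA);
		m_slow->i_aux = 1;
		tick();			// The request is accepted here

		m_slow->i_stb = 0;	// Keep i_stb low while busy ...
		for(int k=0; k<NA+1; k++) {
			// ... and scramble the (now ignored) inputs
			m_slow->i_a = rand();
			m_slow->i_b = rand();
			assert(m_slow->o_busy);
			assert(!m_slow->o_done);
			tick();
		}

		// (the result check, sketched further below, goes here)
	}
```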
Once done, busy should be low and the done bit should be high.
We can now print out the two values used to create this product, together with their results to the terminal. This kind of output is really great early on, although it gets kind of verbose the closer you get to a working routine.
The next step is to verify that we received the right answer. For that purpose, we’ll sign extend the result.
Did we get the right answer? For this, we’ll use a second method of multiplying two numbers together and compare the results. In this case, we can (now) use the C multiplication operator. We’ll first sign extend our values, then calculate what the result should’ve been.
The test will succeed if the design’s output, sout, matches the predicted output, sval, exactly.
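(A sketch of that check, using the sbits() helper from earlier; sout and sval follow the names used in the text, and m_slow->o_p is the assumed name of the Verilated product port.)

```cpp
		long	sout, sval;

		// The design's (sign extended) answer ...
		sout = sbits(m_slow->o_p, 2*NA);
		// ... versus what the C compiler says it should be
		sval = sbits(ia, NA) * sbits(ib, NA);

		printf("%9ld x %9ld = %18ld (expecting %18ld)\n",
			sbits(ia, NA), sbits(ib, NA), sout, sval);

		if (sout != sval) {
			printf("TEST FAILURE!\n");
			exit(EXIT_FAILURE);
		}
```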
By stopping and exiting on any failure as soon as a failure is encountered, we can help limit how far we have to search the trace for a bug–should a bug be encountered.
We can then close our test case by returning if we have been successful.
That’s really all the heavy lifting. The rest of our test bench is the easy stuff. We’ll start out by constructing a test bench object and (optionally) starting a VCD file to trace everything.
The next step is to work our way through some basic test cases. I chose to start with some very basic tests, before trying to get complicated. Why? Because basic tests, like 0 * 0 or 1 * 0, are a whole lot easier to debug than more complex tests like 425,682 * 934,255,346.
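(For example, where tb is the test bench object constructed above, the first few calls might be as simple as:)

```cpp
	tb->test(0, 0);
	tb->test(1, 0);
	tb->test(0, 1);
	tb->test(1, 1);
```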
I now repeat the same basic tests, but this time for signed numbers. Again, I’m starting the test bench tests out nice and easy–just to make debugging simpler if there’s a bug here.
Now let’s try some corner cases, where the inputs are near their maximum values.
That’s the corners, so let’s now try multiplying 2^n times one. This should test every bit in a, to make sure it is properly multiplied.
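(A sketch of that loop, walking a single set bit across the a operand:)

```cpp
	for(int k=0; k<NA; k++)
		tb->test(1l << k, 1);
```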
We can repeat this with b, this time multiplying by 2^15 in a.
Again, these checks are a bit ad-hoc, but they are designed to capture all of the corner cases where 1) I might expect a bug, and 2) where bugs are easy to catch.
Having gotten all the corner cases I expect, I then turn my attention to some random tests.
Now, let me ask, was that good enough? I don’t know. This isn’t a formal proof, so it’s hard to tell. However, if our multiply is small enough, we might be able to check all possible inputs.
On the other hand, if you are trying to multiply two 32-bit numbers together, you might never see the end of that loop.
Still, if we’ve gotten this far, the algorithm works. All that’s left is to close up the test case and declare success.
Conclusion
This article presents the best slow-multiplication algorithm I have yet come across. It’s not fast–nor is it intended to be. However, it can be used cheaply in an FPGA that either doesn’t have any DSPs within it, or one that is currently using all of its DSP infrastructure. Indeed, this is the multiply that the ZipCPU uses in these cases.
Without this algorithm, the ZipCPU would’ve never passed muster on an iCE40 HX8K FPGA.
I should also point out that the key to this algorithm, shifting the accumulator so that only the relevant 32-bits of the 64 are added on each cycle, can be used in a divide algorithm as well. So, I tried it. It worked, and it helped, it just … didn’t help as much as it did the multiply. It’s been a while since I did it, but I recall the improvement was 20% or less. It was good enough to keep, but not nearly enough to write home about.
And God blessed them, saying, Be fruitful, and multiply, and fill the waters in the seas, and let fowl multiply in the earth. (Gen 1:22)