One of the basic purposes of FPGAs is to run algorithms, and to run them fast. You know, those serious number crunching applications. While it's not the only purpose, number crunching is certainly one of the basic ones. Many modern mathematical algorithms, such as my favorite DSP filtering algorithms, require multiplies. Indeed, I might argue that all of the good signal processing algorithms do: interpolators, filters, Fourier transforms, gain control, control systems, and more.

Want to build your own CPU? If you build anything more than a basic CPU, you’ll want to implement a multiply.

Here’s the bad news: Multiplies are hard to do in hardware.

The good news? Several FPGAs, such as many of the Xilinx and Intel parts, contain a limited number of hardware multiply units, often called DSPs for short, internal to their fabric. For these FPGAs, performing a multiply is as simple as writing `a * b` and letting the synthesizer map the operation onto a DSP.

If you have enough of these multiplies on your FPGA, you need not read any further.

If you don’t have enough of these multiplies, then you can often multiplex two or more multiplies onto a single DSP. If this works for you, life is good. You need not read any further.

However, what if you are building a design for an FPGA that doesn’t have any hardware-accelerated multiplies within it? Perhaps you are building for an icoboard or a TinyFPGA BX. If this is the case, you’ll need to know how to build a multiply that not only uses a minimum of logic, but also doesn’t seriously slow down your clock frequency.

Let’s examine how we might do this.

## Long Multiplication

If you look up how to perform a binary multiply on Wikipedia, you’ll find some fascinating references to Wallace trees, Kochanski multiplication, and Booth multipliers. With a little more digging, you can find Wikipedia’s articles on multiplication algorithms, long multiplication, lattice multiplication, peasant multiplication, Karatsuba’s algorithm, and even Fourier transform methods of doing multiplication! For our purpose today, let’s start with a simple shift-and-add multiplier, and then we’ll decrease its cost by roughly a factor of two.

If you’ve never heard the term “shift-and-add multiplier”, then relax. It’s the same basic long multiplication algorithm that you learned in grade school, only this time we are going to accomplish it using binary numbers. The algorithm itself is straightforward. Imagine you had two six-bit numbers, `a` and `b`, that you wanted to multiply together. You’d start by walking through all of the digits (i.e. bits) in `b`, from least significant to most significant. If the digit is a `1`, you’ll copy `a` into a table, shifted left by that bit’s position within `b`.

Perhaps this sounds harder than it is. Here’s a figure showing what I’m talking about. In this figure, the six `p0*` digits represent the multiplication of `a` by `b0`. If `b0` is `1`, `p0*` will be equal to `a`, and zero otherwise. The six `p1*` digits represent the multiplication of `a` by `b1`. If `b1` is `1`, `p1*` will be equal to `a`. Notice from the figure, though, that the `p1*` row is shifted one column left of the `p0*` row. `p2*` is calculated the same way, only it gets shifted over one more column, and so forth.

Again, this should look very much like the long-multiplication algorithm you are already familiar with, with the only difference being that we’re now looking at binary digits instead of decimal ones.

Once all of the partial product rows (the `pnm` rows, where `n` is the row number and `m` is the bit position within the row) have been generated, they are added together to yield the result.

This is a basic “shift-and-add” multiplication algorithm. We’ve taken `a`, and shifted it to the left one bit (digit) at a time, and added it to our result accumulator any time the respective bit in `b` was a `1`.

If this still sounds confusing, let me try explaining this idea one more time. In C++ code, this multiply might look like:
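Here’s a sketch of that loop, unsigned only, and with my own variable names:

```cpp
#include <cstdint>

// Shift-and-add multiply: for each set bit of b, add a shifted copy
// of a into the accumulator.
uint64_t shift_and_add(uint64_t a, uint64_t b, unsigned nbits) {
	uint64_t result = 0;
	for (unsigned k = 0; k < nbits; k++) {
		if (b & (1ull << k))	// If bit k of b is a one,
			result += a << k;	//   add in a, shifted k places
	}
	return result;
}
```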

Signed multiplication will take a bit more work, so let’s just focus on unsigned multiplication for now.

This form of multiplication makes sense. It’s nearly the same as the multiplication we all learned in grade school, modified only for binary arithmetic. Indeed, it’s easy to implement in Verilog, where the core of the basic algorithm will look something like:

Sadly, this is a really inefficient implementation. If you count the LUT4s used on an iCE40, you’ll find this 32x32 bit multiply requires 307 LUT4s.

Part of the reason why this multiplication implementation is so expensive can be seen by evaluating the logic for each bit in `result`.

1. The bit must be reset on `!o_busy`.
2. The next step depends upon `p_b[count]`. Calculating this value requires a multiplexer. For 32 bits, that’s a 32-bit mux, costing many LUT4s to accomplish.
3. If the result of this multiplexer is one, we’ll then add to this bit the result of another multiplexer applied to `p_a`, plus a carry bit.

That’s a lot of 32-bit multiplexers, especially since this logic needs to be repeated for all 64 bits in the accumulator.

How much does a 32-bit multiplexer cost? Let’s find out. Let’s take this little snippet of code,

and run Yosys on it. With just the three commands, `read_verilog mux32.v`, `synth_ice40`, and then `stat`, Yosys reveals that this design costs us 25 LUT4s.

Now consider that we are applying this 32-bit mux not only to `p_b[count]`, but also to every single bit of `p_a` by evaluating `p_a << count`. A quick run of Yosys reveals the shift alone will cost us 209 LUT4s in total. That’s expensive.

Suppose we removed these LUT4s. For example, we could shift `p_a` and `p_b` on each round, so that `p_a` was always the appropriately up-shifted version of `a`, and bit zero of `p_b` always contained bit `count` of our input `b`.

This gets us from `307` LUT4s down to `141` LUT4s, a nice 54% savings.

We can do even better.

## A Better Multiplication

What keeps us from simplifying this initial shift-and-add algorithm further is the 64-bit addition (assuming we are multiplying two 32-bit numbers). This addition is something of a waste, since if you look at Fig. 1, you’ll notice we are never adding more than 32 bits at a time. Every line in our long multiplication accumulation table includes some number of least-significant bits that aren’t changing, as well as some number of more-significant bits that remain at zero. Sure, the 32 bits we are adding slowly move across our accumulator, but we’re never using more than 32 bits of that 64-bit accumulator at once.

Why should we be paying for all the logic it takes to add nothing?

What if we instead shifted our accumulator register to the right, so that we were always adding to the same physical 32-bits? Then, after every add, we’d shift one accumulator bit off to the right. So, instead of shifting `p_b` left on every step as we did above, we’ll shift the `result` accumulator to the right on every step.

Let’s work through what this might look like.

Let’s walk through this slowly.

1. First, we are shifting `p_b` to the right on every clock tick. This gives us access to the next bit of `p_b` at every clock tick. Unlike our first approach, we don’t need a multiplexer to find the correct bit within `p_b`: it’s always going to be bit zero.

This is essentially what we did in our “better” algorithm above.

2. Next, we shift our partial sum to the right. By itself, this is essentially a no-cost operation. Sure, it costs us some flip-flops, 64 to be exact, but we had to use those anyway. What’s different is that once these bits are shifted, there’s no more logic used to generate them: no 32-bit multiplexers, no adds, nothing, just a basic shift.

3. We then add `p_a` to the upper `32` bits, the active bits, of our result accumulator.
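The three steps above can be sketched in C++ (my own names, and treating the 64-bit accumulator as a single variable):

```cpp
#include <cstdint>

// The same shift-and-add multiply, reorganized: rather than shifting
// the multiplicand left, we shift the accumulator right and always
// add into the same upper 32 bits.
uint64_t shift_right_multiply(uint32_t a, uint32_t b) {
	uint64_t acc = 0;
	for (int k = 0; k < 32; k++) {
		uint64_t row = (b & 1) ? (uint64_t)a : 0; // Bit zero of b picks the row
		acc = (acc >> 1) + (row << 31);           // Shift right, add to the top half
		b >>= 1;	// The next bit of b moves into position zero
	}
	return acc;
}
```

Notice that nothing is ever lost off the bottom: every bit shifted toward the right is a finished result bit.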

That’s it. All we’ve done is move the bits we are accumulating to the right as we do our summation. Now, consider this from the perspective of the bits in the result:

1. For the lower 32 bits, we only need a single LUT4 per bit: on `!o_busy` we set them to zero, otherwise each is set from the bit to the left of it.

That’s simple.

2. For the upper 32-bits, we perform an addition. In every case, the addition is from the bit to the left plus a corresponding bit from `p_a`, plus a carry. Let’s count those inputs: 1) the carry, 2) the previous bit, 3) a bit from `p_a`, and 4) `p_b` which determines whether we’ll add or not.

From that count alone, it looks like it should fit within a single LUT4.

It doesn’t. Don’t forget that we need to set our value to zero if `!o_busy`, and we also need to calculate a carry bit, but it at least comes close.

How much logic does this require per bit? Let’s find out. In this case, we can use another very simple Verilog file and run Yosys again using essentially the same commands.

In this case, we use only 4 LUT4s per-bit.

Has our algorithm really changed? Not really. In many ways this is the identical algorithm we had before. We’re still doing a shift-and-add multiply. The big difference now is that we are shifting our result register, rather than our input multiplicand.

Even better, since we are no longer using the entire 64-bit accumulator in this modified 32x32-bit multiply, our carry chain just got shorter by a factor of two. Perhaps this multiplier is faster?

Let’s take a look at the cost of this algorithm. Remember how the algorithm cost `141` LUT4s before? We’re now down to `112` LUT4s.

## Two's Complement

When I built my first soft-multiply in Verilog, I didn’t know how to handle two’s complement numbers. My approach was instead to take the absolute magnitude of both inputs, record the incoming signs, multiply the two unsigned numbers, and then negate the result if necessary. This is painful. A basic NxN shift-add multiply requires `N` clocks, whereas this signed multiply costs `N+2` clocks.
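In C++ terms, that sign-magnitude workaround is simple enough (a sketch, with my own names; in hardware the two fix-up steps are what cost the extra clocks):

```cpp
#include <cstdint>
#include <cstdlib>

// Sign-magnitude multiply: record the signs, strip them off, multiply
// the magnitudes as unsigned numbers, then negate the result if needed.
int64_t signed_via_unsigned(int32_t a, int32_t b) {
	bool negate = (a < 0) != (b < 0);	// Result is negative iff the signs differ
	uint64_t ua = (uint64_t)std::llabs((int64_t)a);	// Absolute magnitudes
	uint64_t ub = (uint64_t)std::llabs((int64_t)b);
	uint64_t up = ua * ub;	// The unsigned multiply proper
	return negate ? -(int64_t)up : (int64_t)up;
}
```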

Then I found this Wikipedia page. According to the page, a few simple adjustments to my original algorithm would get me two’s complement multiplication. Here’s what the adjustments look like:

Wow, that’s easy, I can do that! Even better, since it (almost) follows from the same long-multiply structure as before, I don’t need to change my algorithm (much).

Ok, it’s not quite as straightforward as it looks. I had a bit of a misunderstanding implementing it at first. Specifically, if a bit of `p_b` is zero, the corresponding row is no longer all zeros. Even if you include the `1` in the first row, if `p_b==0` then `p05` is zero and so `!p05` is ONE.

This was my first “discovery” while attempting to implement this algorithm. My next discovery really dampened my enthusiasm.

See, what the Wikipedia page doesn’t tell you is that the algorithm only works on an `NxN` multiplier. It doesn’t work on an `NxM` multiplier where `N != M`. When I first discovered this, I felt so burned I actually edited the Wikipedia page to note this fact.

Someone removed my edits later. Who knows why. Maybe it’s been edited again since; I haven’t looked in a while.

I’ve since tried to rederive the result shown in Fig 2. After all my work, all I can tell you now is that I know it works. I had been hoping to rederive it in order to know how to modify it so that it applies to a generic `NxM` multiply.

I have yet to be successful, and not for a lack of trying.

Sorry, this part of the story doesn’t have a happy ending (yet; I’m still hoping). I may need to come back and discuss this again later, but for now I’m simply accepting this two’s complement multiplication approach as something that “just works”, and filing it under “I’m not really sure why this works”.

Feel free to verify it with me, and to double-check my work on this.

For now, let’s just add an `OPT_SIGNED` parameter to our multiply, and adjust our algorithm for both signed and unsigned.

A second block then adds the two `1`s to our partial product in order to create the result.
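To convince myself the trick works, here’s a C++ model of my reading of the adjustments: every normal row has its top partial-product bit inverted, the final row has every bit *except* the top one inverted, and two constant ones are then added in at columns `N` and `2N-1`. The names are my own, and (consistent with the Wikipedia caveat) this is only for equal operand widths:

```cpp
#include <cstdint>
#include <cassert>

// Adjusted shift-and-add for two's complement inputs.  Normal rows
// invert their top bit, the last row inverts all bits but the top
// one, and two constant ones land at columns N and 2N-1.
// Only valid for NxN multiplies, with N < 32 here.
int64_t signed_shift_add(int64_t ia, int64_t ib, unsigned N) {
	assert(N >= 2 && N < 32);
	uint64_t mask = (1ull << N) - 1, msb = 1ull << (N - 1);
	uint64_t a = ia & mask, b = ib & mask, acc = 0;

	for (unsigned row = 0; row < N; row++) {
		uint64_t p = ((b >> row) & 1) ? a : 0;	// The raw partial product
		if (row == N - 1)
			p ^= (msb - 1);	// Last row: invert all but the top bit
		else
			p ^= msb;	// Normal row: invert only the top bit
		acc += p << row;
	}
	acc += (1ull << N) | (1ull << (2*N - 1));	// The two extra ones

	acc &= (1ull << (2*N)) - 1;	// Keep 2N result bits,
	if (acc & (1ull << (2*N - 1)))	//   then sign extend them
		acc |= ~((1ull << (2*N)) - 1);
	return (int64_t)acc;
}
```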

This is now the algorithm that we’ll implement below. At `243` LUT4s it’s not nearly as good as the `112` LUT4 option from the last section, so I may yet need to come back and optimize this two’s complement implementation. Indeed, it feels like I just slapped a two’s complement band-aid onto an otherwise awesome algorithm.

Yes, I will need to come back and optimize this signed option. For now, the algorithm has an awesome LUT usage count for unsigned multiplication.

## Source code

Now that you know the algorithm, it’s time to start taking a look at the source code we’ll be using to implement it.

In many ways, the ports are much like what you’d expect: there’s a clock and a reset line, an `i_stb` line to request that a multiply be accomplished, and an `o_busy` line to say the process is ongoing. I’ve also added `o_done`, a signal which will be high on the first clock the output is available. The multiply operands themselves arrive on inputs `i_a` and `i_b`, with the output on `o_p` (for product). `i_aux` is just an auxiliary bit that will be kept with the data, and returned (unchanged) with the product. This makes a traveling valid pipeline signal possible if desired.

Ideally, I’d like this multiply to work on `NA` bits from `i_a` times `NB` bits from `i_b`, but like I’ve said before, I have yet to figure out how to extend this generic approach to dissimilar bit widths.

As you may remember, I dislike reasoning about multiple bits at a time within the cascaded if structure of a control loop. `almost_done` helps me avoid this, by capturing whether we need to set the `o_done` bit on the next cycle. Hence, when `almost_done` is true, we’ll be on the last cycle of our logic.

The state machine control variables themselves are `o_done` and `o_busy`, and the logic is shown below.

Fig. 3 below shows how these signals play out in practice. Here you can see the multiply being requested by `i_stb`. Once the multiply unit receives the `i_stb` request, it sets `o_busy` high and starts counting down its cycles from `NA-1`. Once `count == 0`, `pre_done` becomes true, causing the `almost_done` signal on the next cycle and `o_done` on the final cycle. This clock cycle, with `o_done` true and `o_busy` false, is the first clock cycle when the next request can be made of the core.

The handshaking logic above is separated from the main multiply calculation, below, simply because these are the only registers in this algorithm that require a reset. Why generate reset logic if you don’t need it?

The next item of interest is the current partial product. This is the product of one of the digits of `i_b` with all of `i_a`, and should be familiar from long multiplication. We could even write this as `pwire = p_a * p_b`. I want to separate this from the logic below, however, because the algorithm above requires toggling particular bits in this word. Separating this into two steps seems to make more sense to me.

With all that behind us, it’s now time to code up the multiply proper.

The first step in the multiply is to copy the multiplicands from the input. This not only gives us scratch values to work with, but it also allows the module feeding us to change these values while we are in the middle of our calculation. All any external logic needs to know, therefore, is that when `i_stb && !o_busy` a request is accepted. New data may then be placed on the input ports.

When I first started coding handshakes like this, I’d use the condition `i_stb && !o_busy`. In many of my designs, such as this one, I’ve since dropped the check for `i_stb`. If `!i_stb && !o_busy`, the values above are don’t-care values. They could be anything. By dropping the check for `i_stb`, the total logic for each of these elements simplifies.

However, there’s a power cost every time a flip-flop changes. By skipping the check for `i_stb`, we’ve not only lowered our logic count but also increased our power usage. Adding the check for `i_stb` back in, if and only if we are implementing a low power design, keeps these extra flip-flops from toggling.

Now comes the real work, as we cycle through each step of this algorithm.

Now that we are busy, our first task will be to shift `p_b`, our copy of the second multiplicand, to the right. That way we can multiply by the next bit on the next cycle.

We’re keeping our copy of the partial product in `partial`. Much like the pseudocode above, we first shift all of the bits right by one.

The next action depends upon whether this is a signed multiply or not.

If this is a signed multiply, and if we are on the last row, then we need to treat it specially, as shown in Fig. 2 above.

Otherwise, if this is just a normal row of the signed multiply, we add `p_b*p_a` to our accumulator while inverting the high order bit. Again, this follows from Fig. 2 above.

Finally, we have our example partial product for the case where our operands were unsigned. This is the lowest logic path through the code.

At each step through this algorithm, we’ll also drop our state machine counter, here called `count`, by one. This is the same count used in our state machine logic to know when we are done. This count starts at `NA-1` and counts down to zero, once for each of the `NA` bits in the multiply.

Right after the clock cycle where `count == 0`, we’ll want to copy our data to the output. We’ll also add all of those extra `1` bits here in the case of the signed product.

`o_aux` is just a copy of what `i_aux` was initially. Here we just complete that copy.

While I suppose these output registers could’ve been controlled in the logic block before this one, placing them in their own block helps convince me that their outputs will only change when the product is complete.

## Test bench

If you’ve read this blog much, you’ll know that I love formal verification, and for the last year I’ve tried to formally verify every core I’ve presented here.

Not this time.

Why not?

Because formal verification struggles to handle multiplies. That’s just one of the (current) weaknesses of formal methods.

What does that mean? “Formal Verification struggles to handle multiplies”? It doesn’t mean that you cannot formally describe a multiplication algorithm. It doesn’t mean that you cannot create appropriate formal properties.

Perhaps I can explain what I mean best by an example. I once built a 12x12 multiplication algorithm, and a set of formal properties to capture it. I started the formal proof on a Monday morning. On Tuesday, Clifford (now Claire) warned me that the proof might not finish successfully. On Thursday, I got frustrated and built an exhaustive Verilator test. When the exhaustive Verilator test finished in less than 15 minutes, I killed the formal proof that I had started three days earlier.

This is what I mean by “Formal Verification struggles to handle multiplies.” The solvers just don’t know how to handle them (yet).

Therefore, we’ll build a quick Verilator script to verify that our multiply works and properly produces the right answer.

Unlike my FFT code, this script includes hardwired values for the parameters the Verilog code was built with.

I’d much rather self-discover these parameters from the code itself, but at the time I wrote this I hadn’t yet found a way to do so that I liked. Often, when I create something like this from a core generator, I’ll have the core generator create a header file defining the choices made for the parameters. More recently, I’ve started setting output values from the core based upon the parameters within the design. This test bench does neither, meaning that I’ll need to change both the Verilog and the C++ code any time I want to change the parameters within the core.

Two C++ snippets of code have served me well for some time when working with Verilator. They convert from a limited bit representation to an integer representation more appropriate for working with word-sized integers in software. The first is `sbits`, the second `ubits`, for working with signed and unsigned numbers respectively.

`sbits` takes a value of `bits` width, and sign extends it to the full `long` width of the local architecture.

`ubits` is a similar function, but instead of sign extending, it just drops the top several bits.
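Sketches of the two helpers might look like this (my reconstruction, assuming a 64-bit `long` as on most Linux machines):

```cpp
// Sign extend a bits-wide value to the full width of a long
long sbits(long val, int bits) {
	long r = val & ((1l << bits) - 1);	// Keep only the low bits
	if (r & (1l << (bits - 1)))	// If the sign bit is set,
		r |= ~((1l << bits) - 1);	//   extend it upward
	return r;
}

// Mask a value down to bits wide, treating it as unsigned
unsigned long ubits(long val, int bits) {
	return (unsigned long)val & ((1ul << bits) - 1);
}
```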

I also like to capture the device under test in a class of its own when working with Verilator. In this case, we’ll call it `SLOWMPYTB`. The actual Verilated core is held in `m_slow` within this class. This also helps me handle the boilerplate associated with running Verilator and capturing traces from within it.

Just to make sure, we’ll double check that the number of bits in the two incoming values, summed together, will fit in the result. That is, for `NA` incoming bits, we’ll have `2*NA` output bits, and we need to make sure these fit within whatever our word size is.

We’ve discussed the basics of my `tick()` method before. Basically, it just advances the clock, but does so in a way that our interaction with the core as well as the VCD file generated by it will remain consistent.

Don’t forget to flush the VCD file before returning from our `tick()` routine!

There’s been more than one time when I’ve caught bugs within my design using C++ `assert()` statements in my test script, only to then struggle to figure out what was going on, simply because the key information never got written into the trace file. The Verilator `flush()` function above helps to prevent that from happening.

Our reset routine is fairly basic as well. I’ve used `rand()` numbers to try to compensate for the fact that I’m not using formal methods, but as you’ll see going forward it’s just a token effort that doesn’t really help that much.

The actual key to this C++ test bench code is the `test(ia, ib)` function. I use this function to send values `ia` and `ib` to the core to be multiplied. That way the test script can just call `test(ia,ib)` over and over to test various multiplication products and be sure we did them right.

The first step of any test is to set the input values and the `i_stb` value.

Of course, this entry point only works if the core is not busy. While the test bench should ensure this, I still throw an assertion in here, just because I don’t necessarily trust anything that’s “under test”. Well, that and bugs can be really difficult to find when you aren’t using formal. The assertion helps walk the bug back in time to where I can find it.

We’ve now requested the multiply, so let’s wait for it to be complete. That should take `NA+1` clock cycles. During this time, `i_stb` must be kept low, although we can randomize `i_a` and `i_b` to try to verify that the design will ignore such changes mid-multiply. Finally, we also check that the output signals are exactly what they need to be here.

Once done, `o_busy` should be low and the done bit should be high.

We can now print the two values used to create this product, together with their result, to the terminal. This kind of output is really great early on, although it gets kind of verbose the closer you get to a working routine.

The next step is to verify that we received the right answer. For that purpose, we’ll sign extend the result.

Did we get the right answer? For this, we’ll use a second method of multiplying two numbers together and compare the results. In this case, we can (now) use the C multiplication operator. We’ll first sign extend our values, then calculate what the result should’ve been.

The test will succeed if the design’s output, `sout`, matches the predicted output, `sval`, exactly.

By stopping and exiting as soon as any failure is encountered, we can limit how far we have to search the trace for the bug, should one be encountered.

We can then close our test case by returning if we have been successful.

That’s really all the heavy lifting. The rest of our test bench is the easy stuff. We’ll start out by constructing a test bench object and (optionally) starting a VCD file to trace everything.

The next step is to work our way through some basic test cases. I chose to start with some very basic tests, before trying to get complicated. Why? Because basic tests, like `0 * 0` or `1 * 0`, are a whole lot easier to debug than more complex tests like `425,682 * 934,255,346`.

I now repeat the same basic tests, but this time for signed numbers. Again, I’m starting the test bench tests out nice and easy–just to make debugging simpler if there’s a bug here.

Now let’s try some corner cases, where the inputs are near their maximum values.

Those are the corners, so let’s now try multiplying `2^n` times one. This should test every bit in `a`, to make sure each is properly multiplied.

We can repeat this with `b`, this time multiplying by `2^15` in `a`.

Again, these checks are a bit ad-hoc, but they are designed to capture all of the corner cases where 1) I might expect a bug, and 2) where bugs are easy to catch.

Having gotten all the corner cases I expect, I then turn my attention to some random tests.

Now, let me ask, was that good enough? I don’t know. This isn’t a formal proof, so it’s hard to tell. However, if our multiply is small enough, we might be able to check all possible inputs.
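A sketch of such an exhaustive sweep (using a plain C++ shift-and-add model as a stand-in for driving the core through `test()`) might look like:

```cpp
#include <cstdint>

// Exhaustively check every product of two N-bit unsigned values.
// For small N there are only 2^N * 2^N cases, so this finishes
// quickly.  Returns true if every product matches the C multiply.
bool exhaustive_check(unsigned N) {
	for (uint64_t a = 0; a < (1ull << N); a++)
		for (uint64_t b = 0; b < (1ull << N); b++) {
			uint64_t acc = 0;	// Shift-and-add model under test
			for (unsigned k = 0; k < N; k++)
				if (b & (1ull << k))
					acc += a << k;
			if (acc != a * b)
				return false;	// Found a mismatch
		}
	return true;
}
```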

On the other hand, if you are trying to multiply two 32-bit numbers together, you might never see the end of that loop.

Still, if we’ve gotten this far, the algorithm works. All that’s left is to close up the test case and declare success.

## Conclusion

This article presents the best slow-multiplication algorithm I have yet come across. It’s not fast, nor is it intended to be. However, it can be used cheaply in an FPGA that either doesn’t have any DSPs left, or one that is currently using all of its DSP infrastructure. Indeed, this is the multiply that the ZipCPU uses in these cases.

Without this algorithm, the ZipCPU would’ve never passed muster on an iCE40 HX8K FPGA.

I should also point out that the key to this algorithm, shifting the accumulator so that only the relevant 32 bits of the 64 are added on each cycle, can be used in a divide algorithm as well. So, I tried it. It worked, and it helped; it just … didn’t help as much as it did for the multiply. It’s been a while since I did it, but I recall the improvement was 20% or less. It was good enough to keep, but not nearly enough to write home about.