Building a Better Verilog Multiply for the ZipCPU
One of the basic purposes of FPGAs is to run algorithms, and to run them fast. You know, those serious number crunching applications. While it’s not the only purpose, number crunching is certainly one of the basic ones. Many modern mathematical algorithms, such as my favorite DSP filtering algorithms, require multiplies. Indeed, I might argue that all of the good signal processing algorithms require them: interpolators, filters, Fourier transforms, gain control, control systems, and more.
Want to build your own CPU? If you build anything more than a basic CPU, you’ll want to implement a multiply.
Here’s the bad news: Multiplies are hard to do in hardware.
The good news? Many FPGAs, such as those from Xilinx and Intel, contain a limited number of hardware multiply units, often called DSPs for short, within their fabric. For these FPGAs, performing a multiply is as simple as:
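Where these DSP blocks exist, the synthesis tool will infer a hardware multiply directly from the multiplication operator. A minimal sketch (the port names here are illustrative, not taken from any particular design):

```verilog
// The tools will map this multiply onto one (or more) DSP blocks
always @(posedge i_clk)
	o_p <= i_a * i_b;
```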
If you have enough of these multiplies on your FPGA, you need not read any further.
If you don’t have enough of these multiplies, then you can often multiplex two or more multiplies together to use only one DSP. If this works for you, life is good. You need not read any further.
However, what if you are building a design for an FPGA that doesn’t have any hardware accelerated multiplies within it? Perhaps you are building for an ICO board or a TinyFPGA BX. If this is the case, you’ll need to know how to build a multiply that not only uses a minimum of logic, but also one that doesn’t seriously slow down your clock frequency.
Let’s examine how we might do this.
Long Multiplication
If you look up how to perform a binary multiply on Wikipedia, you’ll find some very fascinating references to Wallace trees, Kochanski multiplication and Booth multipliers. With a little more digging, you can find Wikipedia’s article on Multiplication, Multiplication algorithms, long multiplication, Lattice multiplication, Peasant multiplication, Karatsuba’s algorithm and even Fourier transform methods of doing multiplication! For our purpose today, let’s start with a simple shift-and-add multiplier, and then we’ll decrease its cost by a factor of two.
If you’ve never heard the term “shift-and-add multiplier”, then relax. It’s the same basic long multiplication algorithm that you learned in grade school, only this time we are going to accomplish it using binary numbers. The algorithm itself is straightforward. Imagine you had two six-bit numbers, a and b, that you wanted to multiply together. You’d start by walking through all of the digits (i.e. bits) in b, from the least to the greatest. If the digit is a 1, you’ll copy a to a table, only shifting it by the position within b.
Perhaps this sounds harder than it is. Here’s a figure showing what I’m talking about.
(Fig 1: Long multiplication of two six-bit binary numbers, a and b, showing the shifted partial product rows)
In this figure, the six p0* digits represent the multiplication of a by b0. If b0 is 1, p0* will be equal to a, otherwise zero. The six p1* digits represent multiplication of a by b1. If b1 is 1, p1* will be equal to a. Notice from the figure, though, that the p1* row is shifted left from the p0* row. p2* is calculated the same way, only it gets shifted over one more column, and so forth.
Again, this should look very much like the long-multiplication algorithm you are already familiar with, with the only difference being that we’re now looking at binary digits instead of decimal ones.
Once all the partial product rows, that is the pnm rows, where n is the row number and m is the position within the row, have been generated, they are all then added together to yield a result.

This is a basic “shift-and-add” multiplication algorithm. We’ve taken a, shifted it to the left one bit (digit) at a time, and added it to our result accumulator any time the respective bit in b was a 1.
If this still sounds confusing, let me try explaining this idea one more time. In C++ code, this multiply might look like:
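(The listing below is a reconstruction for illustration, handling unsigned values only, rather than the post’s original code.)

```cpp
#include <stdint.h>

// Shift-and-add multiply: walk the bits of b from least to most
// significant, and whenever a bit is set, add a copy of a, shifted up
// into that bit's position, to the running sum.
uint64_t longmpy(uint32_t a, uint32_t b) {
	uint64_t	result = 0;

	for(int k=0; k<32; k++) {
		if (b & (1u << k))
			result += ((uint64_t)a) << k;
	}

	return result;
}
```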
Signed multiplication will take a bit more work, so let’s just focus on unsigned multiplication for now.
This form of multiplication makes sense. It’s nearly the same as the multiplication we all learned in grade school, modified only for binary arithmetic. Indeed, it’s easy to implement in Verilog, where the core of the basic algorithm will look something like:
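(Again, the snippet below is a sketch rather than the post’s original listing; the register names follow the discussion that comes next.)

```verilog
// Naive shift-and-add core, unsigned only.  On each cycle, examine one
// bit of the multiplier and, if it is set, add a shifted copy of p_a
// into the 64-bit accumulator.
always @(posedge i_clk)
if (!o_busy)
begin
	p_a    <= i_a;
	p_b    <= i_b;
	count  <= 0;
	result <= 0;
end else begin
	count <= count + 1;
	if (p_b[count])
		result <= result + ({ 32'h0, p_a } << count);
end
```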
Sadly, this is a really inefficient implementation. If you count the LUT4s used on an iCE40, you’ll get 307 LUT4s required to implement a 32x32 bit multiply.
Part of the reason why this multiplication implementation is so expensive can be seen by evaluating the logic for each bit in result.

- The bit must be reset on !o_busy.
- The next step depends upon p_b[count]. Calculating this value requires a multiplexer. For 32-bits, that’s a 32-bit mux–costing many LUT4s to accomplish.
- If the result of this multiplexer is one, we’ll then add to this bit the result of another multiplexer applied to p_a, plus a carry bit.
That’s a lot of 32-bit multiplexers, especially since this logic needs to be repeated for all 64 bits of the accumulator.
How much does a 32-bit multiplexer cost? Let’s find out, by taking a little snippet of code and running Yosys on it.
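(A guess at the kind of file in question: a hypothetical mux32.v selecting one bit out of a 32-bit word.)

```verilog
module mux32(
		input	wire	[4:0]	i_sel,
		input	wire	[31:0]	i_word,
		output	wire		o_bit);

	assign	o_bit = i_word[i_sel];
endmodule
```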
With just the three commands read_verilog mux32.v, synth_ice40, and then stat, Yosys reveals that this design costs us 25 LUT4s.
Now consider that we are applying this 32-bit mux not only to p_b[count], but also to every single bit of p_a by evaluating p_a << count. A quick run of Yosys reveals the shift alone will cost us 209 LUT4s in total. That’s expensive.
Suppose we removed these LUT4s. For example, we could shift p_a and p_b on each round, so that p_a was always the upshifted version of a and p_b[0] always contained the count bit from our input b.
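A sketch of that adjustment follows. Here p_a has been widened so it can be shifted left without losing bits; the original listing may have arranged this differently:

```verilog
// Shift the operand copies on every round, instead of multiplexing by
// count: p_a marches left, p_b marches right, and only bit zero of
// p_b ever needs to be examined.
always @(posedge i_clk)
if (!o_busy)
begin
	p_a    <= { 32'h0, i_a };	// 64-bit copy, shifted left below
	p_b    <= i_b;			// Shifted right below
	result <= 0;
end else begin
	p_a <= p_a << 1;
	p_b <= p_b >> 1;
	if (p_b[0])
		result <= result + p_a;
end
```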
This gets us from 307 LUT4s down to 141 LUT4s. This is a nice 54% improvement.
We can do even better.
A Better Multiplication
What keeps us from simplifying this initial shift-add-multiply algorithm is the 64-bit addition (assuming we are multiplying two 32x32 bit numbers). This addition is something of a waste, since if you look at Fig. 1, you’ll notice we are never adding more than 32-bits at a time. Every line in our long multiplication accumulation table includes some fixed number of least-significant bits that aren’t changing, as well as some number of more significant bits that remain at zero. Sure, the 32-bits we are adding slowly move across our accumulator, but we never use any more than 32-bits of that 64-bit accumulator.
Why should we be paying for all the logic it takes to add nothing?
What if we instead shifted our accumulator register to the right, so that we were always adding to the same physical 32-bits? Then, after every add, we’d shift one accumulator bit off to the right. So, instead of shifting p_a left on every step as we did above, we’ll shift the result accumulator to the right on every step.
Let’s work through what this might look like.
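Here’s a sketch of the idea. The signal names follow the walkthrough below, though the actual source may differ in its details:

```verilog
// Shift the 64-bit partial sum right on every step, and add p_a (or
// zero) into its upper, active bits.  Only bit zero of p_b is ever
// examined, since p_b also shifts right each step.
always @(posedge i_clk)
if (!o_busy)
begin
	p_a     <= i_a;
	p_b     <= i_b;
	partial <= 0;
end else begin
	p_b <= p_b >> 1;

	partial[30:0]  <= partial[31:1];
	partial[63:31] <= { 1'b0, partial[63:32] }
			+ ((p_b[0]) ? { 1'b0, p_a } : 33'h0);
end
```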
Let’s walk through this slowly.
- First, we are shifting p_b to the right on every clock tick. This gives us access to the next bit of p_b at every clock tick. Unlike our first approach, we don’t need a multiplexer to find the correct bit within p_b–it’s always going to be bit zero. This is essentially what we did in our “better” algorithm above.
- Next, we shift our partial sum to the right. By itself, this is essentially a no-cost operation. Sure, it costs us some flip-flops, 64 to be exact–but we had to use those anyway. What’s different is that once these bits are shifted, there’s no more logic used to generate them: no 32-bit multiplexers, no adds, nothing–just a basic shift.
- We then add p_a to the upper 32 bits, the active bits, of our result accumulator.
That’s it. All we’ve done is move the bits we are accumulating to the right as we do our summation. Now, consider this from the perspective of the bits in the result:
- For the lower 32-bits, we only need a single LUT4: on !o_busy we set them to zero, otherwise they are set based upon the bit to the left of them. That’s simple.
- For the upper 32-bits, we perform an addition. In every case, the addition is from the bit to the left plus a corresponding bit from p_a, plus a carry. Let’s count those inputs: 1) the carry, 2) the previous bit, 3) a bit from p_a, and 4) p_b[0], which determines whether we’ll add or not. From this count alone, it looks like it fits within a single LUT4.

It doesn’t. Don’t forget that we need to set our value to zero if !o_busy, and we also need to calculate a carry bit, but it at least comes close.
How much does this logic require per bit to calculate? Let’s find out. In this case, we can use another very simple Verilog file and run Yosys again, using essentially the same commands.
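(The file below is a hypothetical reconstruction of that experiment, not the post’s original; it implements one step of the shift-and-add update described above.)

```verilog
// One step of the accumulator-shifting multiply: shift the 64-bit
// accumulator right, adding i_a into the active bits whenever i_b0 is
// set, or clearing everything when idle.
module addshift(
		input	wire		i_clk, i_busy, i_b0,
		input	wire	[31:0]	i_a,
		output	reg	[63:0]	o_acc);

	always @(posedge i_clk)
	if (!i_busy)
		o_acc <= 0;
	else begin
		o_acc[30:0]  <= o_acc[31:1];
		o_acc[63:31] <= { 1'b0, o_acc[63:32] }
				+ ((i_b0) ? { 1'b0, i_a } : 33'h0);
	end
endmodule
```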
In this case, we use only 4 LUT4s per bit.
Has our algorithm really changed? Not really. In many ways this is the same identical algorithm we had before. We’re still doing a shift-and-add multiply. The big difference now is that we are shifting our result register, rather than our input multiplicand.
Even better, since we are no longer using the entire 64-bit accumulator in this modified 32x32-bit multiply, our carry chain just got shorter by a factor of two. Perhaps this multiplier is faster?
Let’s take a look at the cost of this algorithm. Remember how the algorithm cost 141 LUT4s before? We’re now down to 112 LUT4s.

Not bad.
Twos Complement
When I built my first soft-multiply in Verilog, I didn’t know how to handle twos complement numbers. My approach was instead to take the absolute magnitude of both inputs, record the incoming signs, multiply the two unsigned numbers, and then negate the result if necessary. This is painful. A basic NxN shift-add multiply requires N clocks, whereas this signed multiply costs N+2 clocks.
Then I found this Wikipedia page. According to the page, I can make a few simple adjustments to my original algorithm and get twos complement multiplication. Here’s what the adjustments look like:
(Fig 2: The adjustments to the partial product rows required for a twos complement, i.e. signed, multiply)
Wow, that’s easy, I can do that! Even better, since it (almost) follows from the same long-multiply structure as before, I don’t need to change my algorithm (much).
Ok, it’s not quite as straightforward as it looks. I had a bit of a misunderstanding implementing it at first. Specifically, if p_b[0] is zero, the corresponding row is no longer zero. Even if you include the 1 in the first row, if p_b[0]==0 then p05 is zero and !p05 is ONE.
This was my first “discovery” while attempting to implement this algorithm. My next discovery really dampened my thoughts on this.
See, what the Wikipedia page doesn’t tell you is that the algorithm only works on an NxN multiplier. It doesn’t work on an NxM multiplier where N != M. When I first discovered this, I felt so burned I actually edited the Wikipedia page to note this fact.
Someone removed my edits later. Who knows why. Maybe it’s been edited again since; I haven’t looked in a while.
I’ve since tried to rederive the result shown in Fig. 2. After all my work, all I can tell you now is that I know it works. I had been hoping to rederive it in order to know how to modify it so that it applies to a generic NxM multiply.

I have yet to be successful, and not for a lack of trying.
Sorry, this part of the story doesn’t have a happy ending (yet–I’m still hoping). I may need to come back and discuss this again later, but for now I’m simply accepting this two’s complement multiplication approach as something that “just works”. I’ve also had to file it under “I’m not really sure why this works”.
Feel free to verify it with me, and double-check my own work on this.
For now, let’s just add an OPT_SIGNED parameter to our multiply, and adjust our algorithm for both signed and unsigned.
A second block then adds the two 1s to our partial product in order to create the result.
This is now the algorithm that we’ll implement below. At 243 LUT4s it’s not nearly as good as the 112 LUT4 option from the last section, so I may yet need to come back and “optimize” this twos complement implementation again. Indeed, it’s worse than that, since it feels like I just slapped a twos-complement band-aid onto an awesome algorithm.
Yes, I will need to come back and optimize this signed option. For now, the algorithm has an awesome LUT usage count for unsigned multiplication.
Source code
Now that you know the algorithm, it’s time to start taking a look at the source code we’ll be using to implement it.
In many ways, the ports are much like what you’d expect: there’s a clock and a reset line, an i_stb line to request that a multiply be accomplished, and an o_busy line to say the process is ongoing. I’ve also added o_done, a signal which will be high on the first clock the output is available.
The basic multiply operands themselves involve inputs i_a and i_b, with the output o_p (product). i_aux is just an auxiliary bit that will be kept with the data, and returned (unchanged) with the product. This will make a traveling valid pipeline signal possible if desired.
Ideally, I’d like this multiply to work on NA bits from i_a times NB bits from i_b, but like I’ve said before, I have yet to figure out how to extend this generic approach to dissimilar bit widths.
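A sketch of the resulting parameter and port list follows. The names match the discussion above, though the post’s actual source may differ in its defaults and declarations:

```verilog
module	slowmpy #(
		parameter		NA = 32,	// Width of both operands
		parameter [0:0]		OPT_SIGNED = 1'b1
	) (
		input	wire			i_clk, i_reset,
		input	wire			i_stb,	// Request a multiply
		input	wire	[NA-1:0]	i_a, i_b,
		input	wire			i_aux,	// Travels with the data
		output	reg			o_busy,	// Calculation on-going
		output	reg			o_done,	// High when o_p is valid
		output	reg	[2*NA-1:0]	o_p,	// The product
		output	reg			o_aux
	);
	// ... internal logic, discussed piece by piece below ...
```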
As you may remember, I dislike reasoning about multiple bits at a time within the cascaded if structure of a control loop. almost_done helps me avoid this, by capturing whether we need to set the o_done bit on the next cycle. Hence, when almost_done is true, we’ll be on the last cycle of our logic.
The state machine control variables themselves are o_done and o_busy, and the logic is shown below.
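(What follows is a sketch of that handshaking logic, reconstructed from the description in the surrounding text rather than copied from the post’s source.)

```verilog
	// The count==0 condition, one cycle before almost_done
	wire	pre_done = (count == 0);

	always @(posedge i_clk)
	if (i_reset)
		{ o_done, o_busy, almost_done } <= 3'b000;
	else if (!o_busy)
	begin
		o_done      <= 1'b0;
		o_busy      <= i_stb;	// Accept a new request
		almost_done <= 1'b0;
	end else begin
		almost_done <= pre_done;
		if (almost_done)
		begin
			o_done <= 1'b1;	// The product is now valid
			o_busy <= 1'b0;
		end
	end
```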
Fig. 3 below shows how these signals play out in practice.
Here you can see the multiply being requested by i_stb. Once the multiply unit receives the i_stb request, it then sets o_busy high and starts counting its cycles down from NA-1. Once count==0, pre_done becomes true, causing the almost_done signal on the next cycle and o_done on the final cycle. This clock cycle, with o_done true and o_busy false, is the first clock cycle when the next request can be made of the core.
The handshaking logic above is separated from the main multiply calculation, below, simply because these are the only registers in this algorithm that require a reset. Why generate reset logic if you don’t need it?
The next item of interest is the current partial product. This is the product of one of the digits of i_b with all of i_a, and should be familiar from long multiplication. We could even write this as pwire = p_a * p_b[0]. I want to separate this from the logic below, however, because the algorithm above requires toggling particular bits in this word. Separating this into two steps seems to make more sense to me.
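(A sketch of that separate step: multiplying by a single bit is really just an AND, or here a mux, against that bit.)

```verilog
	wire	[NA-1:0]	pwire;
	assign	pwire = (p_b[0]) ? p_a : { (NA){1'b0} };
```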
With all that behind us, it’s now time to code up the multiply proper.
The first step in the multiply is to copy the multiplicands from the input. This not only gives us scratch values to work with, but it also allows the module feeding us to change these values while we are in the middle of our calculation. All any external logic needs to know, therefore, is that when i_stb && !o_busy a request is accepted. New data may then be placed on the input ports.
When I first started coding handshakes like this, I’d use the condition of i_stb && !o_busy. In many of my designs, such as this one, I’ve dropped the check for i_stb. If !i_stb && !o_busy, the values above are don’t care values. They could be anything. By dropping the check for i_stb, the total logic for each of these elements simplifies.
However, there’s a power cost every time a flip-flop changes. By skipping the check for i_stb, we’ve not only lowered our logic but also increased our power. Adding a quick check for i_stb, but only if we are implementing a low power design, can keep these extra flip-flops from toggling.
Now comes the real work, as we cycle through each step of this algorithm.

Now that we are busy, our first task will be to shift p_b, our copy of the second multiplicand, to the right. That way we can multiply by the next bit on the next cycle.
We’re keeping our copy of the partial product in partial. Much like the pseudo code above, we first shift all of the bits right by one.
The next action depends upon whether this is a signed multiply or not.
If this is a signed multiply, and if we are on the last row, then we need to treat it special, as shown in Fig. 2 above.
Otherwise, if this is just a normal row for the signed multiply, we add p_b[0]*p_a to our accumulator while negating the high order bit. Again, this follows from Fig. 2 above.
Finally, we have our example partial product for the case where our operands were unsigned. This is the lowest logic path through the code.
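Gathering those pieces together, a sketch of the core’s main always block might look like the following. The names match the discussion above, but the details (and the exact arrangement) of the post’s actual source may differ:

```verilog
	always @(posedge i_clk)
	if (!o_busy)
	begin
		// Accept a new request: copy the operands and aux bit,
		// clear the accumulator, and reload the down-counter
		p_a     <= i_a;
		p_b     <= i_b;
		aux     <= i_aux;
		partial <= 0;
		count   <= NA-1;
	end else begin
		// Expose the next multiplier bit on p_b[0]
		p_b <= (p_b >> 1);

		// Shift the accumulator right by one ...
		partial[NA-2:0] <= partial[NA-1:1];

		// ... then add the current partial product into its
		// active (upper) bits
		if ((OPT_SIGNED)&&(count == 0))
			// Last row of a signed multiply: keep the top
			// product bit, invert all of the others
			partial[2*NA-1:NA-1] <= { 1'b0, partial[2*NA-1:NA] }
				+ { 1'b0, pwire[NA-1], ~pwire[NA-2:0] };
		else if (OPT_SIGNED)
			// Any other row of a signed multiply: invert
			// only the top product bit
			partial[2*NA-1:NA-1] <= { 1'b0, partial[2*NA-1:NA] }
				+ { 1'b0, ~pwire[NA-1], pwire[NA-2:0] };
		else
			// Unsigned: add the partial product unchanged
			partial[2*NA-1:NA-1] <= { 1'b0, partial[2*NA-1:NA] }
				+ { 1'b0, pwire };

		count <= count - 1;
	end
```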
At each step through this algorithm, we’ll also drop our state machine counter, here called count, by one. This is the same count used in our state machine logic to know when we are done. This count starts at NA-1 and counts down to zero, essentially counting each of the NA bits of our operands–once for each bit in the multiply.
Right after the clock cycle where count == 0, we’ll want to copy our data to the output. We’ll also add all of those extra 1 bits here, in the case of the signed product.
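(Another sketch, following the same assumptions as the blocks above: the two correction ones land at bit NA and at bit 2*NA-1 of the product.)

```verilog
	always @(posedge i_clk)
	if (almost_done)
	begin
		if (OPT_SIGNED)
			// Add the two 1s required by the signed algorithm
			o_p <= partial
				+ { 1'b1, {(NA-2){1'b0}}, 1'b1, {(NA){1'b0}} };
		else
			o_p <= partial;
		o_aux <= aux;
	end
```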
o_aux is just a copy of what i_aux was initially. Here we just complete that copy.
While I suppose these output registers could’ve been controlled in the logic block before this one, placing them in their own block helps convince me that their outputs will only change when the product is complete.
Test bench
If you’ve read this blog much, you’ll know that I love formal verification, and for the last year I’ve tried to formally verify every core I’ve presented here.
Not this time.
Why not?
Because formal verification struggles to handle multiplies. That’s just one of the (current) weaknesses of formal methods.
What does that mean? “Formal Verification struggles to handle multiplies”? It doesn’t mean that you cannot formally describe a multiplication algorithm. It doesn’t mean that you cannot create appropriate formal properties.
Perhaps I can explain what I mean best by an example. I once built a 12x12 multiplication algorithm, and a set of formal properties to capture it. I started the formal proof on a Monday morning. On Tuesday, Clifford (who now goes by Claire) warned me that the proof might not finish successfully. On Thursday, I got frustrated and built an exhaustive Verilator test. When the exhaustive Verilator test finished in less than 15 minutes, I then killed the formal proof that I had started three days earlier.
This is what I mean by “Formal Verification struggles to handle multiplies.” The solvers just don’t know how to handle them (yet).
Therefore, we’ll build a quick Verilator script to verify that our multiply works and properly produces the right answer.
Unlike my FFT code, this script includes hardwired values for the parameters the Verilog code was built with.
I’d much rather self-discover these parameters from the code itself, but at the time I wrote this I hadn’t yet found a way to do this that I liked. Often, when I create something like this from a core generator, I’ll have the core generator create a header file defining the choices for the parameters. More recently, I’ve started setting output values from the core based upon the parameters within the design. This test bench does neither, meaning that I’ll need to change both Verilog and C++ code any time I want to change the parameters within the core.
Two C++ snippets of code have served me well for some time when working with Verilator. They convert from a limited bit representation to an integer representation more appropriate for working with word-sized integers in software. The first is sbits, the next ubits, for work with signed and unsigned numbers respectively.
sbits takes a value of bits width, and sign extends it to the full long width of the local architecture.

ubits is a similar function, but instead of sign extending it just drops the top several bits.
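Sketches of both helpers follow; these are reconstructions of the idea rather than the test bench’s exact code:

```cpp
// Sign extend a bits-wide value into a native long
long	sbits(const long val, const int bits) {
	long	r;

	if (bits >= (int)(8*sizeof(long)))
		return val;		// Already full width
	r = val & ((1l << bits) - 1);
	if (r & (1l << (bits-1)))
		r -= (1l << bits);	// Negative: fold back below zero
	return r;
}

// Zero extend: simply mask off everything above the given width
unsigned long	ubits(const long val, const int bits) {
	if (bits >= (int)(8*sizeof(long)))
		return (unsigned long)val;
	return ((unsigned long)val) & ((1ul << bits) - 1);
}
```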
I also like to capture the test device in a class of its own when working with Verilator. In this case, we’ll call it SLOWMPYTB. The actual Verilated core is given by m_slow within this class. This also helps me handle the boilerplate associated with running Verilator and capturing traces from within it.
Just to make sure, we’ll double check that the number of bits in each incoming value, summed together, will fit in the result. That is, for NA incoming bits, we’ll have 2*NA output bits–make sure these will fit within whatever our word size is.
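(One way to write such a check, assuming NA is the hardwired copy of the core’s parameter:)

```cpp
	// The 2*NA-bit product must fit within the long used to hold it
	assert(2*NA <= (int)(8*sizeof(long)));
```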
We’ve discussed the basics of my tick() method before. Basically, it just advances the clock, but does so in a way that our interaction with the core, as well as the VCD file generated by it, will remain consistent.
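A sketch of such a tick() routine is shown below. Here m_slow is the Verilated core, m_trace the (optional) VerilatedVcdC trace, and m_tickcount a simple cycle counter; the post’s actual method may differ:

```cpp
	void	tick(void) {
		m_tickcount++;

		// Settle any combinatorial inputs before the clock edge
		m_slow->i_clk = 0;
		m_slow->eval();
		if (m_trace) m_trace->dump(10*m_tickcount - 2);

		// Rising edge
		m_slow->i_clk = 1;
		m_slow->eval();
		if (m_trace) m_trace->dump(10*m_tickcount);

		// Falling edge
		m_slow->i_clk = 0;
		m_slow->eval();
		if (m_trace) {
			m_trace->dump(10*m_tickcount + 5);
			m_trace->flush();
		}
	}
```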
Don’t forget to flush the VCD file before returning from our tick() routine!
There’s been more than one time when I’ve caught bugs within my design using C++ assert() statements in my test script, but where I then struggled to figure out what was going on simply because the key information never got written into the trace file. The Verilator flush() function above helps to prevent that from happening.
Our reset routine is fairly basic as well. I’ve used rand() numbers to try to compensate for the fact that I’m not using formal methods, but as you’ll see going forward it’s just a token effort that doesn’t really help that much.
The actual key to this C++ test bench code is the test(ia, ib) function. I use this function to send values ia and ib to the core to be multiplied. That way the test script can just call test(ia,ib) over and over to test various multiplication products and be sure we did them right.
The first step of any test is to set the input values and the i_stb value.
Of course, this entry point only works if the core is not busy. While the test bench should ensure this, I still throw an assertion in here–just because I don’t necessarily trust anything that’s “under test”. Well, that and bugs can be really difficult to find when you aren’t using formal. The assertion helps walk the bug back in time to where I can find it.
We’ve now requested the multiply, so let’s wait for it to be complete. That should take NA+1 clock cycles. During this time, i_stb must be kept low, although we can randomize i_a and i_b to try to verify that the design will ignore such changes mid-multiply. Finally, we also check that the output signals are exactly what they need to be here.
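Here’s a sketch of what this first half of test() might look like. The member names follow the discussion above (m_slow, tick(), ubits(), NA); the exact checks in the original script may differ:

```cpp
	void	test(const long ia, const long ib) {
		// The core must be idle before we make a new request
		assert(!m_slow->o_busy);

		m_slow->i_stb = 1;
		m_slow->i_a   = ubits(ia, NA);
		m_slow->i_b   = ubits(ib, NA);
		m_slow->i_aux = 1;
		tick();			// The request is accepted here

		m_slow->i_stb = 0;	// Keep i_stb low while busy ...
		for(int k=0; k<NA+1; k++) {
			// ... and scramble the (now ignored) inputs
			m_slow->i_a = rand();
			m_slow->i_b = rand();
			assert(m_slow->o_busy);
			assert(!m_slow->o_done);
			tick();
		}

		// (the result check, sketched further below, goes here)
	}
```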
Once done, busy should be low and the done bit should be high.
We can now print out the two values used to create this product, together with their results to the terminal. This kind of output is really great early on, although it gets kind of verbose the closer you get to a working routine.
The next step is to verify that we received the right answer. For that purpose, we’ll sign extend the result.
Did we get the right answer? For this, we’ll use a second method of multiplying two numbers together and compare the results. In this case, we can (now) use the C multiplication operator. We’ll first sign extend our values, then calculate what the result should’ve been.
The test will succeed if the design’s output, sout, matches the predicted output, sval, exactly.
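(A sketch of that check, using the sbits() helper from earlier; sout and sval follow the names used in the text, and m_slow->o_p is the assumed name of the Verilated product port.)

```cpp
		long	sout, sval;

		// The design's (sign extended) answer ...
		sout = sbits(m_slow->o_p, 2*NA);
		// ... versus what the C compiler says it should be
		sval = sbits(ia, NA) * sbits(ib, NA);

		printf("%9ld x %9ld = %18ld (expecting %18ld)\n",
			sbits(ia, NA), sbits(ib, NA), sout, sval);

		if (sout != sval) {
			printf("TEST FAILURE!\n");
			exit(EXIT_FAILURE);
		}
```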
By stopping and exiting on any failure as soon as a failure is encountered, we can help limit how far we have to search the trace for a bug–should a bug be encountered.
We can then close our test case by returning if we have been successful.
That’s really all the heavy lifting. The rest of our test bench is the easy stuff. We’ll start out by constructing a test bench object and (optionally) starting a VCD file to trace everything.
The next step is to work our way through some basic test cases. I chose to start with some very basic tests, before trying to get complicated. Why? Because basic tests, like 0 * 0 or 1 * 0, are a whole lot easier to debug than more complex tests like 425,682 * 934,255,346.
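(For example, where tb is the test bench object constructed above, the first few calls might be as simple as:)

```cpp
	tb->test(0, 0);
	tb->test(1, 0);
	tb->test(0, 1);
	tb->test(1, 1);
```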
I now repeat the same basic tests, but this time for signed numbers. Again, I’m starting the test bench tests out nice and easy–just to make debugging simpler if there’s a bug here.
Now let’s try some corner cases, where the inputs are near their maximum values.
That’s the corners, so let’s now try multiplying 2^n times one. This should test every bit in a, to make sure it is properly multiplied.
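(A sketch of that loop, walking a single set bit across the a operand:)

```cpp
	for(int k=0; k<NA; k++)
		tb->test(1l << k, 1);
```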
We can repeat this with b, this time multiplying by 2^15 in a.
Again, these checks are a bit ad-hoc, but they are designed to capture all of the corner cases where 1) I might expect a bug, and 2) where bugs are easy to catch.
Having gotten all the corner cases I expect, I then turn my attention to some random tests.
Now, let me ask, was that good enough? I don’t know. This isn’t a formal proof, so it’s hard to tell. However, if our multiply is small enough, we might be able to check all possible inputs.
On the other hand, if you are trying to multiply two 32-bit numbers together, you might never see the end of that loop.
Still, if we’ve gotten this far, the algorithm works. All that’s left is to close up the test case and declare success.
Conclusion
This article presents the best slow-multiplication algorithm I have yet come across. It’s not fast–nor is it intended to be. However, it can be used cheaply in an FPGA that either doesn’t have any DSPs within it, or one that is currently using all of its DSP infrastructure. Indeed, this is the multiply that the ZipCPU uses in these cases.
Without this algorithm, the ZipCPU would’ve never passed muster on an iCE40 HX8K FPGA.
I should also point out that the key to this algorithm, shifting the accumulator so that only the relevant 32-bits of the 64 are added on each cycle, can be used in a divide algorithm as well. So, I tried it. It worked, and it helped, it just … didn’t help as much as it did the multiply. It’s been a while since I did it, but I recall the improvement was 20% or less. It was good enough to keep, but not nearly enough to write home about.
And God blessed them, saying, Be fruitful, and multiply, and fill the waters in the seas, and let fowl multiply in the earth. (Gen 1:22)