A Simple ALU, drawn from the ZipCPU

Many digital logic design courses end with a discussion of how to build a CPU. The common lesson tends to focus on the arithmetic logic unit (ALU) as the work horse within the center of the CPU.

Since ALUs tend to be very simple, they are easy to look at and examine.

For this post, we’ll look at the ALU found within the ZipCPU. Unlike many of the “classroom” ALUs you may have come across, this simple ALU is also complete enough to support the newlib C library. As a result, you might find a couple of features within this ALU that you are not expecting.

A Basic ALU

If you’ve never seen how to build an ALU before, the logic to build one is actually very simple. It’s basically a big huge case statement that selects from among several possible outputs.

case(i_op[3:0])
	4'b0000:{c,o_c } <= {1'b0,i_a}-{1'b0,i_b};// SUB
	4'b0001:   o_c   <= i_a & i_b;		// And
	4'b0010:{c,o_c } <= i_a + i_b;		// Add
	4'b0011:   o_c   <= i_a | i_b;		// Or
	4'b0100:   o_c   <= i_a ^ i_b;		// Xor
	// ....
	default:   o_c   <= i_b;		// MOV, LDI

In this example, taken from the ZipCPU’s ALU, the two inputs, i_a and i_b, are both 32-bit values. The input to the routine also includes a number, i_op, identifying the operation that needs to be calculated. The result is placed into an output register, o_c. The c bit you see above is associated with the ZipCPU’s carry logic–something we’ll come back to later in this post.

Structurally, within an FPGA, the logic looks like Fig 1 below.

Fig 1: ALU Hardware Structure

Each of the blocks in this figure takes up logic when implemented within hardware. As a result, even if i_op requests that the two values be subtracted, all of the other operations (addition, and, or, xor, etc.) will still be calculated. These other results, though, are just ignored. Thus, on the final clock of the ALU, all of the operations have been calculated, but only the result of the selected operation is stored into the output register.

So that’s what an ALU looks liike in general. Let’s now turn our attention to the ZipCPU’s ALU.

The ZipCPU ALU

The actual case statement within the ZipCPU’s ALU has sixteen operations that the instruction selects from among, not just the six shown above. In this section, we’ll look at all of that logic save the multiply. (Although the multiply takes up most of the code space within the cpuops.v file, it doesn’t fit into this lesson very well.) Put together, the full case statement for the ZipCPU’s ALU looks like:

always @(posedge i_clk)
if (i_ce)
begin
	c <= 1'b0;
	casez(i_op)
	4'b0000:{c,o_c } <= {1'b0,i_a}-{1'b0,i_b};// CMP/SUB
	4'b0001:   o_c   <= i_a & i_b;		// BTST/And
	4'b0010:{c,o_c } <= i_a + i_b;		// Add
	4'b0011:   o_c   <= i_a | i_b;		// Or
	4'b0100:   o_c   <= i_a ^ i_b;		// Xor
	4'b0101:{o_c,c } <= w_lsr_result[32:0];	// LSR
	4'b0110:{c,o_c } <= w_lsl_result[32:0]; // LSL
	4'b0111:{o_c,c } <= w_asr_result[32:0];	// ASR
	4'b1000:   o_c   <= w_brev_result;	// BREV
	4'b1001:   o_c   <= { i_a[31:16], i_b[15:0] }; // LODILO
	4'b1010:   o_c   <= mpy_result[63:32];	// MPYHU
	4'b1011:   o_c   <= mpy_result[63:32];	// MPYHS
	4'b1100:   o_c   <= mpy_result[31:0];	// MPY
	default:   o_c   <= i_b;		// MOV, LDI
	endcase
end else // if (mpydone)
	// set the carry based upon a multiply result
	o_c <= (mpyhi)?mpy_result[63:32]:mpy_result[31:0];

Let’s walk through each of these operations.

The first operation, 4'h0 supports either a compare or a subtract instruction. This instruction subtracts two numbers in order to produce its result. The difference between the CMP and SUB instructions is that the compare doesn’t write the results back to any registers in the end while the subtract does–but since that’s external to the ALU implementation, you won’t see that difference here.

You may notice that the subtract that is taking place is a 33-bit subtract rather than a 32-bit subtract. The reason for this is the carry, c, bit. In the case of a subtract, this bit will be true if i_b had to borrow from the (unspecified) high order bit in order to complete. We’ll discuss this flag in more detail further down.

The TST and AND instructions are similar to the CMP and SUB instructions. If the operation is an AND, then both the result and the flags will be set during writeback, whereas only the flags are set in the case of the TST operation. Again, this difference is external to the ALU implementation itself.

The ADD, OR, and XOR instructions should need no further explanation. (XOR Ref) The c bit is set on the ADD, though, if the result of adding unsigned i_a to i_b overflows 32-bits. This bit then makes it possible to string 32-bit additions together to create a 64-bit or larger operation.

The shift instruction(s) needs some additional discussion. The logical shift left, or LSL, is what the compiler creates from an A = B << C instruction. The result is created by shifting all of the bits in B to the left by one and filling the results in with zeros. The logical shift right, or LSR, comes from an A = B >> C instruction in C, but only when B is unsigned. In this case, all the bits shift to the right and the upper bits are filled in with zeros.

If B is signed, however, the compiler will create an arithmetic shift right, ASR, instruction instead. The ASR is similar to the LSR instruction with one exception: the ASR instruction propagates the high order bit during the shift. Hence the incoming i_a[31] will always set the outgoing o_c[31] bit. So while logically shifting a 32'hffff_fffe to the right by one bit will create a 32'7fff_ffff, arithmetically shifting a 32'hffff_fffe to the right by one bit will create a 32'hffff_ffff. The sign bit, i_a[31] propagates on an ASR.

Where things get interesting is what happens when you shift farther than the number of bits in a register. For example, what should be the result of 32'h0000_ffff when shifted left by 32? 32'h0000_0000, right? Sure. Now what happens when you shift left by 34 bits? Some CPUs examine only the lower 5 bits of the shift amount, and require the compiler to make certain the shift is within bounds. The ZipCPU, however, examines all 32-bits of the shift request contained in i_b. Hence, any attempt to logically shift more than 32-bits on the ZipCPU results in a zero.

The ZipCPU handles shifts with one further difference that isn’t necessarily used by the compiler: the carry bit is set to the last bit shifted off the register.

Now that all that is said, the LSR logic is given by:

assign  w_lsr_result =
	// Check if the shift amount will overflow, return 33'h00 if it does
	((|i_b[31:6])||(i_b[5]&&(i_b[4:0]!=0)))? 33'h00
	// On a shift of 32 exactly, keep i_a[31] in the carry
				:((i_b[5])?{32'h0,i_a[31]}
	// Otherwise just shift the results by i_b
				: ( { i_a, 1'b0 } >> (i_b[4:0]) ));// LSR

The logic for the LSL and ASR instructions is quite similar.

A second unique ZipCPU instruction is the bit reverse instruction, BREV. This is a zero cost instruction that does nothing but re-order the wires from the i_b input:

genvar	k; generate
for(k=0; k<32; k=k+1)
begin : bit_reversal_cpuop
	assign w_brev_result[k] = i_b[31-k];
end endgenerate

This particular instruction is not found in other CPUs. It was placed into the ZipCPU in order to support the bit reversed addressing required by a Fast Fourier transform. It has since become integral to the ZipCPU’s instruction set. Here’s why:

Every CPU needs an ability to load a value with as many bits as a register into a register. That is to say, a 32-bit CPU needs the ability to load a 32-bit immediate value into a 32-bit register–binutils, and particularly the linker within it, requires this capability. Since instructions are 32-bits wide, you can’t fit both an instruction and a 32’bit value into the same instruction. While the x86 solved this problem by storing the 32’bit value directly in the instruction stream following this instruction, this risks mis-aligning the instruction stream and therefore adding complication to the instruction decoder. To avoid this extra complication, RISC CPUs tend to handle this problem with a two instruction load: the first instruction loads the upper half of the register, while the second instruction loads the lower half of the register.

In the case of the ZipCPU, the BREV instruction is the first instruction in this pair–it can be used to load the upper 18-bits of a register. The other instruction, shown in the code above as LDILO (load immediate into lower 16-bits), loads the lower 16-bits of any 32-bit value. GCC requires that neither of these operations affect the flags, something we can come back and discuss another time. Together, these two instructions allow the ZipCPU to load any 32-bit immediate value into a register. Since the extra cost of bit reversing a 32-bit value is handled by the assembler and linker (there’s no hardware cost for doing this), there’s no performance penalty to the ZipCPU for having this instruction.

That leaves two pieces of logic we haven’t discussed within the ZipCPU ALU. The first is the flag generation logic which we will come to next. The other piece of logic within the ALU that we haven’t discussed is the multiply. The multiply logic in the ZipCPU’s cpuops.v file is particularly complicated for the sole reason that the ZipCPU can support a 32x32-bit multiply across a wide variety of hardware architectures. Some of these architectures can do a multiply in a single clock, while other FPGAs require two, three, or even four clocks to execute a multiply.

Let’s look at calculating the flags next.

The Flags Results

The ALU creates and outputs a set of flags from the operation. Many instructions will cause these to be placed into the condition codes register. The ZipCPU supports all four of the common condition codes, or flags as we’ll call them here: Z (zero), C (carry), N (negative) and V (overflow).

The Z flag is set whenever the result is zero:

assign	z = (o_c == 32'h0000);

The compiler uses this flag whenever two numbers need to be tested and compared for equality.

The C or “carry” flag was set above whenever an addition or subtraction required a carry. I also mentioned above how the carry flag on the ZipCPU is also set for shift operations. All other operations clear the carry flag.

The compiler uses this flag whenever two numbers need to be tested and compared for an unsigned less than, or whenever an extended 64-bit addition (or subtraction) needs to be carried out. (Unlike many other CPUs, the ZipCPU doesn’t have either “add with carry” or “subtract with carry” instructions.)

The N flag is perhaps the simplest to discuss. This is set whenever the sign bit is set on the output:

assign	n = (o_c[31]);

Well, not quite.

It turns out that there’s an ugly problem associated with setting the N flag to the outgoing sign bit. Specifically, if you wish to compare whether or not A<B when A and B are large and signed then o_c[31] isn’t sufficient. To do this comparison, the CMP instruction will enter the ALU via a subtract, A-B. The result of this subtraction will then be checked to determine whether or not the result is less than zero. However, if the result overflows the sign bit, then you still want to continue to set the N flag appropriately. If you don’t, then 32'h8000_0000 won’t be less than 32'h7fff_ffff.

As proof, consider what happens on a subtraction overflow:

  32'h7fff_ffff ( 2^(31)-1)
- 32'h8000_0000 (-2^(31)  )
---------------
= 32`hffff_ffff

The result of this subtraction should be a positive 2^(32)-1. However, since an overflow took place, the MSB no longer reflects the correct sign. We’re going to need to adjust N therefore to reflect that this result is positive and not negative.

While there may be a simpler way, the ZipCPU ALU solves this problem in three steps. The first is to keep track of the original sign of i_a in a register called pre_sign. The second part is to determine if that sign needs to be kept on an overflow. Then, on any overflow, the sign is flipped when determining N.

always @(posedge i_clk)
	if (i_ce) // 1 LUT
	begin
		pre_sign <= (i_a[31]);
		keep_sgn_on_ovfl<=
			(((i_op==4'h0)&&(i_a[31] != i_b[31]))//SUB&CMP
			||((i_op==4'h2)&&(i_a[31] == i_b[31]))); // ADD
	end

assign n = (o_c[31])
	^ ((keep_sgn_on_ovfl)&&(pre_sign != o_c[31]));

When to keep a sign on overflow needs a touch more explanation:

If you add two values with the same sign, then the result should also have the same sign regardless of any overflow. Hence, two positive numbers should yield a positive result in spite of any overflow. In this case, the sign should be given by i_a’s sign, captured here in pre_sign.
The same is true of subtract, save only that the sign needs to be preserved any time the initial values have opposite signs. In particular, if you negate the second value, a subtract becomes the same as an addition, and then the addition rule above applies again.

Now, if the sign changes but yet was supposed to be kept, then the resulting N flag needs to be swapped–hence the exclusive OR in the code above.

The N bit is used by the compiler to test whether or not i_a < i_b when both numbers are signed. By preserving the meaning of this bit in spite of overflow, the compiler does what you would expect it to do–even when comparing large numbers.

The final flag bit, the overflow or V bit, requires a touch more logic as well.

always @(posedge i_clk)
	if (i_ce) // 1 LUT
		set_ovfl<=(((i_op==4'h0)&&(i_a[31] != i_b[31]))//SUB&CMP
			||((i_op==4'h2)&&(i_a[31] == i_b[31])) // ADD
			||(i_op == 4'h6) // LSL
			||(i_op == 4'h5)); // LSR

The ZipCPU sets the overflow flag on one of four conditions:

If this is an addition, and the signs of both operands are the same, then the result should be positive and the overflow bit will be set if the signs don’t match.

This alone is almost identical to the N logic above.
The second case is the case for the subtract. If the incoming signs are identical, and the outgoing sign is different, then the overflow bit needs to be set. This is also very similar to the N logic above.
The last two cases regard shifts. In particular, the ZipCPU sets the overflow if either of the logical shift instructions changes the sign of the output.

Put together, this yields the following logic for the V bit:

assign	v = (set_ovfl)&&(pre_sign != o_c[31]);

Since the C language doesn’t have a way to test for overflow within the language, this flag has gone largely unused. Further, the compiler support for overflow checking for those other languages that do use it hasn’t been implemented yet.

Conclusion

That’s basically how an ALU works: it’s a series of operations whose output is selected via a case statement. While every CPU will support different instructions, at some level all CPUs will have a case statement selecting between various operations within them.

In the case of a CPU with a condition codes register. such as the ZipCPU, these codes are also easily calculated within the ALU as well.

Turning a simple ALU into a full blown CPU, such as the ZipCPU, takes a bit more work than we presented above. In fact, it takes a lot more work. Let’s consider that a good thing, though, because it will give us something to learn on another day.