A Simple ALU, drawn from the ZipCPU
Many digital logic design courses end with a discussion of how to build a CPU. The common lesson tends to focus on the arithmetic logic unit (ALU) as the work horse within the center of the CPU.
Since ALUs tend to be very simple, they are easy to look at and examine.
For this post, we’ll look at the ALU found within the ZipCPU. Unlike many of the “classroom” ALUs you may have come across, this simple ALU is also complete enough to support the newlib C library. As a result, you might find a couple of features within this ALU that you are not expecting.
A Basic ALU
If you’ve never seen how to build an ALU before, the logic to build one is actually very simple. It’s basically a big huge case statement that selects from among several possible outputs.
In this example, taken from the
ZipCPU’s
ALU, the
two inputs, i_a
and i_b
, are both 32bit values. The input to the
routine also includes a number, i_op
, identifying the operation that
needs to be calculated. The result is placed into an output register, o_c
.
The c
bit you see above is associated with the
ZipCPU’s carry logic–something we’ll
come back to later in this post.
Structurally, within an FPGA, the logic looks like Fig 1 below.
Each of the blocks in this figure takes up logic when implemented within
hardware. As a result, even if i_op
requests that the two values be
subtracted, all of the other operations (addition, and, or, xor, etc.) will
still be calculated. These other results, though, are just ignored. Thus,
on the final clock of the
ALU,
all of the operations have been calculated, but only the result of
the selected operation is stored into the output register.
So that’s what an ALU looks liike in general. Let’s now turn our attention to the ZipCPU’s ALU.
The ZipCPU ALU
The actual case statement within the ZipCPU’s ALU has sixteen operations that the instruction selects from among, not just the six shown above. In this section, we’ll look at all of that logic save the multiply. (Although the multiply takes up most of the code space within the cpuops.v file, it doesn’t fit into this lesson very well.) Put together, the full case statement for the ZipCPU’s ALU looks like:
Let’s walk through each of these operations.
The first operation, 4'h0
supports either a compare or a subtract
instruction. This instruction subtracts two numbers in order to produce its
result. The difference between the CMP
and SUB
instructions is that the
compare doesn’t write the results back to any registers in the end while the
subtract does–but since that’s external to the
ALU
implementation, you won’t see that difference here.
You may notice that the subtract that is taking place is a 33bit subtract
rather than a 32bit subtract. The reason for this is the carry, c
, bit.
In the case of a subtract, this bit will be true if i_b
had to borrow
from the (unspecified) high order bit in order to complete. We’ll discuss
this flag in more detail further down.
The TST
and AND
instructions are similar to the CMP
and SUB
instructions. If the operation is an AND
, then both the result
and the flags will be set during writeback, whereas only the flags are set in
the case of the TST
operation. Again, this difference is external to the
ALU
implementation itself.
The ADD
, OR
, and XOR
instructions should need no further explanation.
(XOR Ref)
The c
bit is set on the ADD
, though, if the result of adding unsigned
i_a
to i_b
overflows 32bits. This bit then makes it possible to string
32bit additions together to create a 64bit or larger operation.
The shift instruction(s) needs some additional discussion. The logical
shift left, or LSL
, is
what the compiler creates from an A = B << C
instruction. The result is
created by shifting all of the bits in B
to the left by one and filling
the results in with zeros.
The logical shift
right, or LSR
, comes from an A = B >> C
instruction in C, but only when
B
is unsigned. In this case, all the bits shift to the right and the
upper bits are filled in with zeros.
If B
is signed, however, the compiler will create an
arithmetic shift
right, ASR
, instruction instead. The ASR
is similar to the LSR
instruction with one exception: the ASR
instruction propagates the high order
bit during the shift. Hence the incoming i_a[31]
will always set the
outgoing o_c[31]
bit. So while
logically shifting a
32'hffff_fffe
to the right by one bit will create a 32'7fff_ffff
,
arithmetically shifting a
32'hffff_fffe
to the right by one bit will create a 32'hffff_ffff
.
The sign bit, i_a[31]
propagates on an ASR
.
Where things get interesting is what happens when you shift farther than
the number of bits in a register. For example, what should be the result of
32'h0000_ffff
when shifted left by 32? 32'h0000_0000
, right? Sure. Now
what happens when you shift left by 34 bits? Some
CPUs
examine only the lower 5 bits of the shift amount, and require the compiler
to make certain the shift is within bounds. The
ZipCPU,
however, examines all 32bits of the shift request contained in i_b
.
Hence, any attempt to logically shift more than 32bits on the
ZipCPU results in a zero.
The ZipCPU handles shifts with one further difference that isn’t necessarily used by the compiler: the carry bit is set to the last bit shifted off the register.
Now that all that is said, the LSR
logic is given by:
The logic for the LSL
and ASR
instructions is quite similar.
A second unique
ZipCPU
instruction is the bit
reverse instruction,
BREV
. This is a zero cost instruction that does nothing but reorder the
wires from the i_b
input:
This particular instruction is not found in other CPUs. It was placed into the ZipCPU in order to support the bit reversed addressing required by a Fast Fourier transform. It has since become integral to the ZipCPU’s instruction set. Here’s why:
Every CPU needs an ability to load a value with as many bits as a register into a register. That is to say, a 32bit CPU needs the ability to load a 32bit immediate value into a 32bit register–binutils, and particularly the linker within it, requires this capability. Since instructions are 32bits wide, you can’t fit both an instruction and a 32’bit value into the same instruction. While the x86 solved this problem by storing the 32’bit value directly in the instruction stream following this instruction, this risks misaligning the instruction stream and therefore adding complication to the instruction decoder. To avoid this extra complication, RISC CPUs tend to handle this problem with a two instruction load: the first instruction loads the upper half of the register, while the second instruction loads the lower half of the register.
In the case of the ZipCPU,
the BREV
instruction is the first instruction in this pair–it can be used to
load the upper 18bits of a register. The other instruction, shown in the
code above as LDILO
(load immediate into lower 16bits), loads the lower
16bits of any 32bit value. GCC requires that neither
of these operations affect the flags, something we can come back and discuss
another time. Together, these two instructions allow the
ZipCPU
to load any 32bit immediate value into a register. Since the extra cost
of bit reversing
a 32bit value is handled by the
assembler and linker (there’s no
hardware cost for doing this), there’s no performance penalty to the
ZipCPU for having this instruction.
That leaves two pieces of logic we haven’t discussed within the ZipCPU ALU. The first is the flag generation logic which we will come to next. The other piece of logic within the ALU that we haven’t discussed is the multiply. The multiply logic in the ZipCPU’s cpuops.v file is particularly complicated for the sole reason that the ZipCPU can support a 32x32bit multiply across a wide variety of hardware architectures. Some of these architectures can do a multiply in a single clock, while other FPGAs require two, three, or even four clocks to execute a multiply.
Let’s look at calculating the flags next.
The Flags Results
The ALU
creates and outputs a set of flags from the operation. Many instructions will
cause these to be placed into the condition codes
register.
The ZipCPU
supports all four of the common condition codes, or flags as we’ll call them
here: Z
(zero), C
(carry), N
(negative) and V
(overflow).
The Z
flag is set whenever the result is zero:
The compiler uses this flag whenever two numbers need to be tested and compared for equality.
The C
or “carry” flag was set above whenever an addition or subtraction
required a carry. I also mentioned above how the carry flag on the
ZipCPU
is also set for shift operations. All other operations clear the carry flag.
The compiler uses this flag whenever two numbers need to be tested and compared for an unsigned less than, or whenever an extended 64bit addition (or subtraction) needs to be carried out. (Unlike many other CPUs, the ZipCPU doesn’t have either “add with carry” or “subtract with carry” instructions.)
The N
flag is perhaps the simplest to discuss. This is set whenever the
sign bit is set on the output:
Well, not quite.
It turns out that there’s an ugly problem associated with setting the N
flag to the outgoing sign bit. Specifically,
if you wish to compare whether or not A<B
when A
and B
are large and
signed then o_c[31]
isn’t sufficient. To do this comparison, the CMP
instruction will enter the
ALU
via a subtract, AB
. The result of this subtraction will then be checked
to determine whether or not the result is less than zero. However, if the
result overflows the sign bit, then you still want to continue to set the
N
flag appropriately. If you don’t, then 32'h8000_0000
won’t be less
than 32'h7fff_ffff
.
As proof, consider what happens on a subtraction overflow:
The result of this subtraction should be a positive 2^(32)1
. However,
since an overflow took place, the MSB no longer reflects the correct sign.
We’re going to need to adjust N
therefore to reflect that this result is
positive and not negative.
While there may be a simpler way, the
ZipCPU
ALU
solves this problem in three steps. The first is to keep track of
the original sign of i_a
in a register called pre_sign
. The second part
is to determine if that sign needs to be kept on an overflow.
Then, on any overflow, the sign is flipped when determining N
.
When to keep a sign on overflow needs a touch more explanation:

If you add two values with the same sign, then the result should also have the same sign regardless of any overflow. Hence, two positive numbers should yield a positive result in spite of any overflow. In this case, the sign should be given by
i_a
’s sign, captured here inpre_sign
. 
The same is true of subtract, save only that the sign needs to be preserved any time the initial values have opposite signs. In particular, if you negate the second value, a subtract becomes the same as an addition, and then the addition rule above applies again.
Now, if the sign changes but yet was supposed to be kept, then the resulting
N
flag needs to be swapped–hence the
exclusive OR
in the code above.
The N
bit is used by the compiler to test whether or not i_a < i_b
when
both numbers are signed. By preserving the meaning of this bit in spite
of overflow, the compiler does what you would expect it to do–even when
comparing large numbers.
The final flag bit, the overflow or V
bit, requires a touch more logic
as well.
The ZipCPU sets the overflow flag on one of four conditions:

If this is an addition, and the signs of both operands are the same, then the result should be positive and the overflow bit will be set if the signs don’t match.
This alone is almost identical to the
N
logic above. 
The second case is the case for the subtract. If the incoming signs are identical, and the outgoing sign is different, then the overflow bit needs to be set. This is also very similar to the
N
logic above. 
The last two cases regard shifts. In particular, the ZipCPU sets the overflow if either of the logical shift instructions changes the sign of the output.
Put together, this yields the following logic for the V
bit:
Since the C language doesn’t have a way to test for overflow within the language, this flag has gone largely unused. Further, the compiler support for overflow checking for those other languages that do use it hasn’t been implemented yet.
Conclusion
That’s basically how an ALU works: it’s a series of operations whose output is selected via a case statement. While every CPU will support different instructions, at some level all CPUs will have a case statement selecting between various operations within them.
In the case of a CPU with a condition codes register. such as the ZipCPU, these codes are also easily calculated within the ALU as well.
Turning a simple ALU into a full blown CPU, such as the ZipCPU, takes a bit more work than we presented above. In fact, it takes a lot more work. Let’s consider that a good thing, though, because it will give us something to learn on another day.
In all thy ways acknowledge him, and he shall direct thy paths. (Prov 3:6)