Since ALUs tend to be very simple, they are easy to look at and examine.
For this post, we’ll look at the ALU found within the ZipCPU. Unlike many of the “classroom” ALUs you may have come across, this simple ALU is also complete enough to support the newlib C library. As a result, you might find a couple of features within this ALU that you are not expecting.
A Basic ALU
If you’ve never seen how to build an ALU before, the logic to build one is actually very simple. It’s basically a big huge case statement that selects from among several possible outputs.
In this example, taken from the
i_b, are both 32-bit values. The input to the
routine also includes a number,
i_op, identifying the operation that
needs to be calculated. The result is placed into an output register,
c bit you see above is associated with the
ZipCPU’s carry logic–something we’ll
come back to later in this post.
Structurally, within an FPGA, the logic looks like Fig 1 below.
Each of the blocks in this figure takes up logic when implemented within
hardware. As a result, even if
i_op requests that the two values be
subtracted, all of the other operations (addition, and, or, xor, etc.) will
still be calculated. These other results, though, are just ignored. Thus,
on the final clock of the
all of the operations have been calculated, but only the result of
the selected operation is stored into the output register.
The ZipCPU ALU
The actual case statement within the ZipCPU’s ALU has sixteen operations that the instruction selects from among, not just the six shown above. In this section, we’ll look at all of that logic save the multiply. (Although the multiply takes up most of the code space within the cpuops.v file, it doesn’t fit into this lesson very well.) Put together, the full case statement for the ZipCPU’s ALU looks like:
Let’s walk through each of these operations.
The first operation,
4'h0 supports either a compare or a subtract
instruction. This instruction subtracts two numbers in order to produce its
result. The difference between the
SUB instructions is that the
compare doesn’t write the results back to any registers in the end while the
subtract does–but since that’s external to the
implementation, you won’t see that difference here.
You may notice that the subtract that is taking place is a 33-bit subtract
rather than a 32-bit subtract. The reason for this is the carry,
In the case of a subtract, this bit will be true if
i_b had to borrow
from the (unspecified) high order bit in order to complete. We’ll discuss
this flag in more detail further down.
AND instructions are similar to the
instructions. If the operation is an
AND, then both the result
and the flags will be set during writeback, whereas only the flags are set in
the case of the
TST operation. Again, this difference is external to the
XOR instructions should need no further explanation.
c bit is set on the
ADD, though, if the result of adding unsigned
i_b overflows 32-bits. This bit then makes it possible to string
32-bit additions together to create a 64-bit or larger operation.
The shift instruction(s) needs some additional discussion. The logical
shift left, or
what the compiler creates from an
A = B << C instruction. The result is
created by shifting all of the bits in
B to the left by one and filling
the results in with zeros.
The logical shift
LSR, comes from an
A = B >> C instruction in C, but only when
B is unsigned. In this case, all the bits shift to the right and the
upper bits are filled in with zeros.
B is signed, however, the compiler will create an
ASR, instruction instead. The
ASR is similar to the
instruction with one exception: the
ASR instruction propagates the high order
bit during the shift. Hence the incoming
i_a will always set the
o_c bit. So while
logically shifting a
32'hffff_fffe to the right by one bit will create a
arithmetically shifting a
32'hffff_fffe to the right by one bit will create a
The sign bit,
i_a propagates on an
Where things get interesting is what happens when you shift farther than
the number of bits in a register. For example, what should be the result of
32'h0000_ffff when shifted left by 32?
32'h0000_0000, right? Sure. Now
what happens when you shift left by 34 bits? Some
examine only the lower 5 bits of the shift amount, and require the compiler
to make certain the shift is within bounds. The
however, examines all 32-bits of the shift request contained in
Hence, any attempt to logically shift more than 32-bits on the
ZipCPU results in a zero.
The ZipCPU handles shifts with one further difference that isn’t necessarily used by the compiler: the carry bit is set to the last bit shifted off the register.
Now that all that is said, the
LSR logic is given by:
The logic for the
ASR instructions is quite similar.
This particular instruction is not found in other CPUs. It was placed into the ZipCPU in order to support the bit reversed addressing required by a Fast Fourier transform. It has since become integral to the ZipCPU’s instruction set. Here’s why:
Every CPU needs an ability to load a value with as many bits as a register into a register. That is to say, a 32-bit CPU needs the ability to load a 32-bit immediate value into a 32-bit register–binutils, and particularly the linker within it, requires this capability. Since instructions are 32-bits wide, you can’t fit both an instruction and a 32’bit value into the same instruction. While the x86 solved this problem by storing the 32’bit value directly in the instruction stream following this instruction, this risks mis-aligning the instruction stream and therefore adding complication to the instruction decoder. To avoid this extra complication, RISC CPUs tend to handle this problem with a two instruction load: the first instruction loads the upper half of the register, while the second instruction loads the lower half of the register.
In the case of the ZipCPU,
BREV instruction is the first instruction in this pair–it can be used to
load the upper 18-bits of a register. The other instruction, shown in the
code above as
LDILO (load immediate into lower 16-bits), loads the lower
16-bits of any 32-bit value. GCC requires that neither
of these operations affect the flags, something we can come back and discuss
another time. Together, these two instructions allow the
to load any 32-bit immediate value into a register. Since the extra cost
of bit reversing
a 32-bit value is handled by the
assembler and linker (there’s no
hardware cost for doing this), there’s no performance penalty to the
ZipCPU for having this instruction.
That leaves two pieces of logic we haven’t discussed within the ZipCPU ALU. The first is the flag generation logic which we will come to next. The other piece of logic within the ALU that we haven’t discussed is the multiply. The multiply logic in the ZipCPU’s cpuops.v file is particularly complicated for the sole reason that the ZipCPU can support a 32x32-bit multiply across a wide variety of hardware architectures. Some of these architectures can do a multiply in a single clock, while other FPGAs require two, three, or even four clocks to execute a multiply.
Let’s look at calculating the flags next.
The Flags Results
creates and outputs a set of flags from the operation. Many instructions will
cause these to be placed into the condition codes
supports all four of the common condition codes, or flags as we’ll call them
N (negative) and
Z flag is set whenever the result is zero:
The compiler uses this flag whenever two numbers need to be tested and compared for equality.
C or “carry” flag was set above whenever an addition or subtraction
required a carry. I also mentioned above how the carry flag on the
is also set for shift operations. All other operations clear the carry flag.
The compiler uses this flag whenever two numbers need to be tested and compared for an unsigned less than, or whenever an extended 64-bit addition (or subtraction) needs to be carried out. (Unlike many other CPUs, the ZipCPU doesn’t have either “add with carry” or “subtract with carry” instructions.)
N flag is perhaps the simplest to discuss. This is set whenever the
sign bit is set on the output:
Well, not quite.
It turns out that there’s an ugly problem associated with setting the
flag to the outgoing sign bit. Specifically,
if you wish to compare whether or not
B are large and
o_c isn’t sufficient. To do this comparison, the
instruction will enter the
via a subtract,
A-B. The result of this subtraction will then be checked
to determine whether or not the result is less than zero. However, if the
result overflows the sign bit, then you still want to continue to set the
N flag appropriately. If you don’t, then
32'h8000_0000 won’t be less
As proof, consider what happens on a subtraction overflow:
The result of this subtraction should be a positive
since an overflow took place, the MSB no longer reflects the correct sign.
We’re going to need to adjust
N therefore to reflect that this result is
positive and not negative.
While there may be a simpler way, the
solves this problem in three steps. The first is to keep track of
the original sign of
i_a in a register called
pre_sign. The second part
is to determine if that sign needs to be kept on an overflow.
Then, on any overflow, the sign is flipped when determining
When to keep a sign on overflow needs a touch more explanation:
If you add two values with the same sign, then the result should also have the same sign regardless of any overflow. Hence, two positive numbers should yield a positive result in spite of any overflow. In this case, the sign should be given by
i_a’s sign, captured here in
The same is true of subtract, save only that the sign needs to be preserved any time the initial values have opposite signs. In particular, if you negate the second value, a subtract becomes the same as an addition, and then the addition rule above applies again.
Now, if the sign changes but yet was supposed to be kept, then the resulting
N flag needs to be swapped–hence the
in the code above.
N bit is used by the compiler to test whether or not
i_a < i_b when
both numbers are signed. By preserving the meaning of this bit in spite
of overflow, the compiler does what you would expect it to do–even when
comparing large numbers.
The final flag bit, the overflow or
V bit, requires a touch more logic
The ZipCPU sets the overflow flag on one of four conditions:
If this is an addition, and the signs of both operands are the same, then the result should be positive and the overflow bit will be set if the signs don’t match.
This alone is almost identical to the
The second case is the case for the subtract. If the incoming signs are identical, and the outgoing sign is different, then the overflow bit needs to be set. This is also very similar to the
Put together, this yields the following logic for the
Since the C language doesn’t have a way to test for overflow within the language, this flag has gone largely unused. Further, the compiler support for overflow checking for those other languages that do use it hasn’t been implemented yet.
That’s basically how an ALU works: it’s a series of operations whose output is selected via a case statement. While every CPU will support different instructions, at some level all CPUs will have a case statement selecting between various operations within them.
Turning a simple ALU into a full blown CPU, such as the ZipCPU, takes a bit more work than we presented above. In fact, it takes a lot more work. Let’s consider that a good thing, though, because it will give us something to learn on another day.
In all thy ways acknowledge him, and he shall direct thy paths. (Prov 3:6)