Now that we’ve discussed some general pipeline strategies, it’s time to take a look at how pipelining can work within a simple, in order, pipelined CPU. Let’s take a look, therefore, at the ZipCPU and see how it handles its pipeline logic. What you’ll see is that the ZipCPU uses a variant of the handshaking strategy we discussed earlier.
There’s the prefetch and instruction cache stage, an instruction decode stage, a read operands stage, an ALU stage and a writeback stage. The ALU stage is also placed parallel to a memory operations unit, a divide unit, and a space for (a still undefined) floating point unit.
As currently built, the ZipCPU is an in-order processor. It’s pipeline structure requires that values going into the write-back unit be strictly in order.
If you choose to look through the main ZipCPU processor’s code, you may find it not as simple to read. That’s because the ZipCPU was written to thread a fine line between two separate purposes. The first, stated purpose, is to be a simple CPU–as measured by the amount of logic used by the CPU. The second purpose is to be a fast CPU. These two purposes are often in conflict. Therefore, the ZipCPU implementation allows you to select the performance you want, and adjust as necessary to fit the amount of logic you have available for a CPU within your FPGA.
In order to make the following examples make sense, it might help to understand how the various pipeline stages are supposed to interact. This will make examples easier to understand.
First, almost all ZipCPU instructions have the form:
In this example,
OP is the operation. It’s a 5-bit field identifying
which instruction is described by this instruction. Think of this as what
is being asked to do. Indeed, four of the five
OP bits form the
multiplexer selection input within the
we discussed earlier.
C term is the condition. This controls whether or not any value
is written back from the instruction. For example, the
Z condition which will cause the instruction to only write back
its results if the
Z flag is set. The other seven supported conditions are
discussed within the ZipCPU
is primarily a two operand machine. It has no three operand instructions.
instruction provides two inputs to the
Rb+I and the output gets written back into register
Finally, and this becomes important for the following discussion, almost
all instructions allow an immediate, usually 14-bits, to be added
to the second register,
Rb. Alternatively, an 18-bit immediate value,
may replace the ALU’s
Rb input so that
Rb is not used at all.
So, how does this impact pipelining?
In the first stage, prefetch, an instruction is read from memory.
As an example, this stage determines whether or not the instruction even has registers
Rbencoded within it that need to be read from the register set. At the end of this stage the CPU, knows:
a. what registers it needs to read (
b. if it needs to read from any register (
c. what the immediate value is (
dcd_I), and if that value is zero (
The third stage, the read operands stage, not only reads
Rbfrom the register file, but also adds
The fourth stage calculates the
OPfunction on the values from
If this is a memory operation, a store will place the value
Rainto memory location
Rb+I. A load will read from memory
Rb+Iand present the result to the writeback unit.
Memory accesses, divides, and multiplies may all take longer than a single clock in this stage. We’ll need to come back and discuss how to handle this later.
Fig 2: Answers go back into the ALU
However, if the output of the ALU gets placed into the
Rbregister input of the next instruction, and if that instruction has an immediate,
I, that needs to be added to it (
dcd_zIis non-zero), then the output cannot go directly back into the ALU. It will instead need to go back to the read operands stage, causing a pipeline bubble in the process. This other path is shown by the dotted line in Fig 2.
Finally, in the writeback unit, if the
Ccondition matches, the result is written back into the register set.
Conditional branches are detected in this stage. We’ll have to come back and discuss how to handle this within the pipeline later as well.
This is nominally how the ZipCPU handles and uses its pipeline. This simple pipeline, though, has all kinds of hazards—many of which we’ll discuss below.
Pipeline Stalls and Bubbles
Before diving into pipeline specifics, let’s define a pipeline stall.
According to wikipedia, a pipeline stall is a condition where there’s no valid instruction within a particular pipeline stage. Wikipedia declares a pipeline stall to be synonymous with a pipeline bubble.
For example, the ZipCPU divide takes roughly 32-clocks to complete. During these 32-clocks, instructions in the prior stage, the read operands stage, will be stuck there until the divide completes. Likewise, the writeback stage will be idle during this time, being reserved for writing back the results of the divide instruction.
Wikipedia’s definition declares the condition where no valid data is within a pipeline stage to be called a stall. This seems to me to be more of a consequence of a stall, than a stall itself. The pipeline bubble term describes this better, although Wikipedia declares the two terms to be synonymous.
If you watch the pipeline, such as within the demonstration we’ll discuss later, you can visually see bubbles form within it where there are no valid instructions.
Unique Pipeline Needs
There are two basic goals that a CPU has with respect to pipelining. These two goals are common among many pipeline systems:
Keep every stage filled.
For example, the ZipCPU is capable of completing and retiring one ALU instruction per clock. All of the pipeline logic within the CPU has been designed so as to keep one instruction completing on every clock.
This is a common goal of any high speed pipeline.
Satisfy any pre-requisites for instruction operations.
An ALU operation cannot complete if its values haven’t finished being produced from other parts of the CPU. For example, if register
R0is read from slow memory in one instruction and then immediately used on the next instruction, the pipeline logic will need to stall the second instruction, waiting for the ALU, until
This problem is even worse if an immediate is to be added to the value read from memory prior to heading into the ALU. In that case, it’s not just the ALU that needs to stall, but the ZipCPU also needs to stall the read operands stage where that immediate addition takes place. (We discussed this with Fig 2 above)
A CPU, though, has some other pipeline needs beyond the more traditional data flow processing pipeline. Specifically, CPU’s need to be able to jump from one instruction in memory to another. This creates some unusual pipeline requirements.
At its most basic level, a
needs be able to flush or clear its pipeline any time a branch renders
the work done in a prior stage irrelevant.
The worst case scenario is when a branch takes place that can’t be
caught prior to the writeback stage. In that case, the whole pipeline will
need to be flushed, costing the
one clock per pipeline stage that cannot be completed.
captures this logic with the
Other things can cause sudden pipeline changes as well. For example, what happens on an interrupt (peripheral initiated), a trap (user initiated), or a fault? The CPU will need to start processing instructions from a new location in the instruction stream–that of the interrupt service routine (ISR). The ZipCPU handles this condition by flushing the pipeline on any change of interrupt status. The ZipCPU is also unique in that the interrupt address is kept within an alternate register set. (See the ZipCPU specification for details.
Indeed, lots of things need to change within the
every time it switches to an interrupt context, or back again.
If you want to trace this logic, feel free to
w_release_from_interrupt wires. The first
will be true any time the
switches to an interrupt context, the second will be true anytime the
Now, with all that as background, we can discuss the logic necessary to handle a pipeline handshake.
pipeline is controlled primarily with three logic signals per stage:
CE, as shown in Fig 3 to the right. The logic is
designed around the idea of processing the data from the previous stage any
CE line is set. The basic logic is this: if a stage is not stalled,
and if the previous stage is valid, then the
CE line will be set. The
control signal is used to determine when to clock the data from the last stage
Let’s walk though that for a moment.
First, each stage has a condition (or set of conditions) that might stall the prior stage from entering this stage. This is the stall logic for this stage.
For example, the read operands stage will stall any time the
is already busy, the memory
unit is busy
with a read, or the
divide is busy.
div_busy are used to capture these
conditions. An exception is made if the memory unit is busy writing a value
over the bus. In this case,
mem_rdbusy captures the idea that the memory
is busy with a read and not a write. Likewise, the
will also stall if the external halt request line is true. (We’ll come back and
discuss this when discussing how to debug a
Another, less common, example is that the
will stall any time the condition codes are written to manually–lest the
write a value back according to the wrong conditions.
In general, though, the stall logic looks like:
In this case, I’m using
stage[n]_stalled to describe the
for this stage. If you look within the
you’ll find variables named
mem_stalled. These variables capture this
logic, together with all of the more obscure reasons why the
might be made true as well–such as on a debugging
Second, if any stage is not stalled, and if the prior stage has valid data within it, then we can step that stage forward.
If you pay close attention, you might find this looks a lot like the
(i_ce)&&(!o_busy), that we discussed in our last post on
these various pipeline variables are named
div_ce. There’s also a similar
master_ce which is controlled from the external debugging interface–something
we’ll get to later.
With these signals out of the way, we can start working through the pipeline
signals. The first registered signal is the
valid signal. This signal
indicates whether or not a particular stage has valid data within it.
To get valid data into a stage, the prior stage must have valid data, and
this stage cannot be stalled. Hence, we have the following logic:
the basic valid registers are called
op_valid signal is also broken into three separate signals, depending
upon which stage the read operands stage moves to next:
op_valid_div. Finally, the output
of the ALU/MEM/DIV stage is captured by the
div_valid signals. Indeed, these signals are used to determine whether
or not valid data is ready to be written back.
One particular difficulty with this language is the idea of an illegal
instruction. Within the
a stage containing an illegal instruction has the
valid line set high.
This is because an illegal instruction needs to be processed like any other
instruction, and the
valid line controls when processing moves forward.
The alternative, moving instructions forward any time
illegal were true, just costs more logic than required.
Each stage has many more logic lines beyond the ones we just discussed.
These can often be treated in a much simpler fashion, though. As a result,
these other signals are often sate based upon the simple
My best advice for others, when trying to find these pipeline problems, is to methodically debug your CPU using Verilator and some small programs. Use GCC both all optimizations turned on, as well as with all optimizations turned off. The two environments create very different pipeline environments, and programs working in the one may not work in the other. My second piece of advice would be to create a program containing every pipeline hazard you can think of, and to turn this program into a test program for the CPU. You can find the ZipCPU’s CPU testing program within the OpenArty distribution.
I have a wonderful demonstration of all of this that I would like to post, but to see it I may need some help from my readers.
allows you to watch how well the pipeline is filled at any given clock, as
well as the ability to “watch” the signals discussed above. Further, by
adjusting the pipeline
you can see how the pipeline and
are affected as the logic within the
is increased or decreased. Indeed, you can watch the
stall lines as they get set and adjusted on a clock by clock basis.
Nominally, the instructions to do this would require something like:
This is notional only, the above script has not been “tested” … but I think it captures the idea of what I would have you do.
My specific problem is that, while the design works nicely on my own system, I’d love to have some help from others who would like to try running the design on their own systems–so as to get some redundance across operating systems and version differences. If you are interested in trying this, please write be at the address below if you have problems, or perhaps just to tell me of your success. Enough successes, and I’ll write a more complete post about how to watch the ZipCPU pipeline in action.
That’s it! The ZipCPU basically uses the handshake approach to handling pipeline stalls that we discussed earlier. A couple extra variables for logic expression allow the ZipCPU to be able to detect and respond appropriately to pipeline hazards, stalling the CPU. anytime it is necessary to do so.
Now that we’ve gone through this example, I’d like to come back and discuss how to debug a CPU running within an FPGA in general. That post has been mostly written for some time, but has been waiting for this background beneath it to be explained.
I’d also like to post more complete instructions for the pipeline demonstration above, outlining how data moves through the various stages of the ZipCPU pipeline. I’ll hold off on the demonstration, post, until I have some confidence that it will work on the greatest number of computers.
For I know that my redeemer liveth, and that he shall stand at the latter day upon the earth (Job 19:25)