Building a prefetch module for the ZipCPU
At its most basic level, any CPU works by fetching instructions from memory, acting upon those instructions, and repeating the process over and over again as shown in Fig 1. The ZipCPU is no different. It also needs to fetch instructions from memory and then act upon them in a tight loop.
However, while the ZipCPU accomplishes this same basic loop, the pipelining within the CPU might render these steps a touch more difficult to recognize. Indeed, the ZipCPU splits up the “do-instruction” into multiple pipeline steps,
as illustrated in Fig 2.
We’ve already discussed how the ZipCPU handles the do instruction stage within its arithmetic logic unit. We’ve also discussed how the ZipCPU handles the signaling between its pipeline stages. What we haven’t discussed is how the ZipCPU reads instructions from memory. This is accomplished by the prefetch unit.
The prefetch is that portion of a CPU that reads instructions from memory and presents those instructions to the rest of the CPU for execution. While today’s Wikipedia authors argue that there’s a difference between an instruction fetch, which gets the next instruction from memory, and a prefetch, which fetches an instruction before it is needed, I’m going to equate the two terms under the single name prefetch for simplicity. The result of this abuse of terminology is that I can describe all instruction fetch modules with the same term, at the cost of losing this subtle difference in meaning.
Using this common term, the prefetch is a critical component of any CPU because the CPU can only run as fast as it has instructions. A good prefetch unit, therefore, is optimized to keep the CPU’s pipeline filled with instructions.
The prefetch is so important to how the ZipCPU runs that the ZipCPU has had four prefetch units over time. The first of these, prefetch.v, simply fetches a single instruction and then waits for the CPU to ask for the next one. This is the code we shall discuss below. The next prefetch version tried to keep the Wishbone pipeline full with memory requests while caching a window of memory. This prefetch has since been abandoned in favor of a more traditional combined prefetch and instruction cache version. (The Dhrystone benchmark was part of the motivation for this change.) The fourth version was written recently for a low-logic application; it improves upon the simple prefetch by fetching two instructions back to back.
Today, though, we’re going to discuss the first of these units–the one that fetches only a single instruction at a time. The ZipCPU can be caused to use this routine by defining the OPT_SINGLE_FETCH option within its cpudefs.v configuration file. Going from this simple prefetch to a better one requires adding a cache and some other things–but it’s still built upon the basis of how to build a prefetch in the first place. For these reasons, the simple prefetch is well worth studying and understanding.
Further, since the ZipCPU’s prefetch module is a Wishbone master, this will be another opportunity to discuss how to build a Wishbone master interface. While the Wishbone slave interface is fairly simple, the master interface can be a touch more complicated. When we’re done, we’ll even formally prove that this interface works, thus showing you how to build a more generic Wishbone master. (We presented another version of how to build a Wishbone master earlier, as part of our debugging bus discussion.)
The CPU Interface
Any prefetch module must support two interfaces: an interface to memory, and an interface to the CPU. Both of these interfaces are shown in Fig 3.
We discussed the Wishbone interface at length in an earlier article. Today, we’ll introduce the interface with the CPU. If you remember our prior discussion on pipeline strategies, then you will probably recognize several examples of handshaking between pipeline stages as we go along.
Internally, a CPU keeps track of the address of the next instruction in a register known as the program counter. This counter normally changes in a one-up fashion from one instruction to the next. The common exception to this rule is when a branch takes place. Such a branch may be the result of a branch instruction, a jump to an interrupt service routine (ISR), or even a CPU reset request. In all of these examples, the CPU needs to command the prefetch unit to read from a brand-new memory address.
Well, not quite. If you look through the ZipCPU core, you’ll find the address of the “current” instruction wandering through the various pipeline stages as one of several registers, ending with alu_pc. The actual register that stores the PC upon instruction completion is either the ipc (interrupt PC register) or the upc (user PC register). Inside the prefetch, the CPU’s pf_pc request is received as the i_pc address input, while the address of the returned instruction is presented back as o_pc.
From the prefetch’s side of this interaction, nothing begins until the CPU raises the i_new_pc signal. When this signal is valid, the address of the instruction that the CPU wants is captured from the i_pc input. The prefetch must respond to this input by fetching the instruction from the memory address given by this i_pc value.
Once the new instruction has been read, the prefetch needs to do several things. First, it places that new instruction in the o_insn output. Second, it places the instruction’s address into the o_pc register. Finally, it sets the o_valid flag to let the CPU know that there’s a valid instruction ready for it. This flag is part of a simple handshake with the CPU. As a result, it needs to be held high until the CPU also raises the i_stall_n line (a ready signal) to indicate that it is ready to read an instruction. Once (o_valid)&&(i_stall_n) are both true, the prefetch needs to drop the o_valid signal and fetch the next instruction from the address given by o_pc+1 (counting in 32-bit words).
This continues until another branch takes place. When that happens, the CPU will communicate this to the prefetch by raising the i_new_pc signal. In response, the prefetch will drop the o_valid line (regardless of whether i_stall_n is true), fetch the instruction from the new address, and the process repeats.
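Gathering these signals together, the CPU-facing side of this interface might be declared as in the following sketch. This is an illustration only: the names follow the discussion above, but the widths and exact port declarations are assumptions, and the actual prefetch.v may differ.

```verilog
module	prefetch(i_clk, i_rst,
		// The CPU interface
		i_new_pc, i_clear_cache, i_stall_n, i_pc,
		o_insn, o_pc, o_valid, o_illegal,
		// The Wishbone (master) bus interface
		o_wb_cyc, o_wb_stb, o_wb_we, o_wb_addr, o_wb_data,
		i_wb_ack, i_wb_stall, i_wb_err, i_wb_data);
	input	wire		i_clk, i_rst;
	// The CPU requests a branch via i_new_pc/i_pc, and accepts
	// an offered instruction whenever i_stall_n is true
	input	wire		i_new_pc, i_clear_cache, i_stall_n;
	input	wire [31:0]	i_pc;
	// The fetched instruction, its address, and its status flags
	output	reg [31:0]	o_insn, o_pc;
	output	reg		o_valid, o_illegal;
	// The Wishbone master wires, discussed in later sections
	output	reg		o_wb_cyc, o_wb_stb;
	output	wire		o_wb_we;
	output	reg [31:0]	o_wb_addr;
	output	wire [31:0]	o_wb_data;
	input	wire		i_wb_ack, i_wb_stall, i_wb_err;
	input	wire [31:0]	i_wb_data;

	// ... (the fetch logic is developed below)
endmodule
```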
This is the basics of how this simple prefetch works. As with any FPGA project, however, we’ll need to pay close attention to the details.
For example, what happens when the bus returns an error? Suppose, for example, that the CPU tried to branch to a non-existent memory address. In this case, the bus would return an error, and so the prefetch needs to return everything as before, only in this case it also sets an o_illegal flag so that the CPU can recognize and properly respond to the invalid instruction.
Another corner case might seem more relevant for a cache interaction than for this simple prefetch, but it’s actually still relevant. In this example, the prefetch is instructed to clear its cache. It may be, for example, that the CPU is halted and under the control of the debugger. The debugger may have rewritten the memory the CPU is about to execute–but the prefetch needs to know that the instruction it has already read is no longer valid. This is the purpose of the i_clear_cache line for even this simple prefetch module. When raised, the CPU is telling the prefetch that any instruction that has been read and is waiting for the CPU to issue is invalid, and thus needs to be read again.
Finally, what happens when the prefetch is given a new address while it is in the middle of requesting another one? In that case, the prefetch needs to abandon (abort) its current bus transaction and initiate a new read from the newly requested address.
These subtle details help to describe some of the more interesting cases when dealing with this (otherwise) simple prefetch. However, the operation is still simple enough that we might try to build it in a straightforward fashion–the topic of our next section.
Now that we know how this prefetch module is supposed to interact with both the bus and the rest of the CPU, let’s outline some of the detailed basics of how this might take place.
The process starts out with the bus idle: both the o_wb_cyc and o_wb_stb outputs are held low. Similarly, on any reset request we’ll want to bring the bus back to this idle state. Since these two lines qualify all of the other bus output lines (prefixed herein with o_wb_*), all it takes to end a bus cycle is to lower these two wires.
What about any bus transaction in process when this reset is received? That was part of our discussion when we worked through a formal description of the Wishbone bus. Our conclusion, from that discussion, was that it is important to be able to abort a bus transaction at any stage in that transaction. Were we in the middle of a transaction, the logic above would simply abort that transaction as we’ve discussed.
The o_valid line initialized above isn’t part of the bus interface at all. As we discussed in the last section, this is the signal to the CPU that a valid instruction is ready to be read from the prefetch. This signal also needs to be reset along with the bus interaction, so we clear o_valid at the same time.
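In pseudocode form, this reset behavior might look something like the sketch below. The signal names follow the discussion above; the final code organizes this logic somewhat differently.

```verilog
// On any reset, return the bus to idle and invalidate our output
initial	o_wb_cyc = 1'b0;
initial	o_wb_stb = 1'b0;
initial	o_valid  = 1'b0;
always @(posedge i_clk)
if (i_rst)
begin
	o_wb_cyc <= 1'b0;	// Lowering CYC and STB ends any
	o_wb_stb <= 1'b0;	// bus cycle in progress
	o_valid  <= 1'b0;	// No valid instruction for the CPU
end
```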
The next step is to recognize when a new transaction needs to begin. There are two basic conditions indicating that we want to start a new bus transaction. The first is if the CPU wants us to branch to a new address. In this case, the CPU will have raised the i_new_pc signal and placed the new address on the i_pc input. A new transaction also needs to begin any time the CPU accepts the instruction the prefetch presented to it. This condition is indicated by an instruction being valid, o_valid, at the same time the CPU is not stalled, i_stall_n.
The address appropriate for this new request depends upon the reason for it. If the CPU gives us a new instruction address, indicated by i_new_pc, then that address is the memory address we need to fetch.
On the other hand, if the last instruction was just accepted, then we want to grab the next instruction–found one address later.
In each of these cases, the output instruction needs to be marked as no longer valid. If this were a branch, every other stage in the CPU pipeline would be marking its data as invalid as well.
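Continuing the pseudocode, starting a new transaction might be sketched as below. This is one arm of the giant always block discussed above, shown on its own for clarity; the actual code may differ.

```verilog
always @(posedge i_clk)
if ((i_new_pc)||((o_valid)&&(i_stall_n)))
begin
	// Start a new bus cycle to fetch an instruction
	o_wb_cyc  <= 1'b1;
	o_wb_stb  <= 1'b1;
	// Fetch either the newly requested address, or the word
	// following the instruction just accepted
	o_wb_addr <= (i_new_pc) ? i_pc : (o_pc + 1'b1);
	// Either way, any instruction we were offering is now stale
	o_valid   <= 1'b0;
end
```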
The final section of this overview pseudocode discussion involves how the controller should respond while a bus interaction is taking place.
The first item to pay attention to during a bus transaction is to ensure that only one transaction request is issued. (Other prefetches issue multiple requests in quick succession; this one only issues a single request at a time.) From the Wishbone specification, we know that a transaction request has been accepted any time the master’s STB signal is high and the slave’s STALL signal is low, or (o_wb_stb)&&(!i_wb_stall). We can short-circuit this full test in this case by just setting o_wb_stb low anytime i_wb_stall is low during a bus transaction. Should i_wb_stall be low when o_wb_stb is already zero, then this statement will have no effect–as desired.
The second item to deal with is when to end our request. In this module, the transaction ends on the first acknowledgement: o_wb_cyc is returned to zero following any i_wb_ack from the memory slave peripheral. On this same clock, we can set the value of the instruction, o_insn, to be sent to the CPU, as well as the valid flag, o_valid, to indicate this instruction is valid.
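In sketch form, this arm of the always block might read as follows (again, the final code splits this logic differently):

```verilog
always @(posedge i_clk)
if (o_wb_cyc)
begin
	// Once the one request has been accepted, stop requesting
	if (!i_wb_stall)
		o_wb_stb <= 1'b0;

	// The transaction ends on the first acknowledgement
	if (i_wb_ack)
	begin
		o_wb_cyc <= 1'b0;
		o_insn   <= i_wb_data;	// Capture the instruction
		o_valid  <= 1'b1;	// ... and offer it to the CPU
	end
end
```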
This is the basic outline of how this prefetch module works. When we get into the details below, you will likely find them very similar to this discussion above. However, because breaking this one giant “always” block into multiple processes can reduce our logic requirements, you may not necessarily recognize the code above in the code presented below.
You’ll also see another difference below, associated with having to deal with some of the subtle details of the corner cases–things you may not expect unless you’ve had your CPU fail because you haven’t dealt with them. These will be the topic of the next section.
The Actual Prefetch Code
At this point, we’ve outlined how this prefetch module needs to interact with both the CPU, and the Wishbone bus. We’ve also outlined the basics of the module we’d like to implement above. The task left before us now is to finally implement the details of this module, and then to prove that it works below. In other words, it’s now time to get specific about those corner cases.
Since this prefetch is a read-only structure, we’ll set the bus wires associated with writing to the bus to zeros. Of these, only the o_wb_we signal is truly relevant, since o_wb_data is ignored unless we are within a write transaction and o_wb_we is high.
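Since we never write, these wires can simply be tied off, along the lines of:

```verilog
// A prefetch only ever reads: tie the write-related wires to zero
assign	o_wb_we   = 1'b0;
assign	o_wb_data = 32'h0;
```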
Having dealt with the constants, we can now turn our focus to the actual implementation of the logic above. We’ll start with what’s left of the giant always block controlling the bus wires, o_wb_cyc and o_wb_stb. As before, we’ll start by initializing ourselves into an idle state. Unlike before, we’ll also return to this idle state upon any bus error. These two signals need to be qualified by the o_wb_cyc line since, as we saw in our last Wishbone discussion, either the i_wb_ack or the i_wb_err signal might still be true on the clock after o_wb_cyc is dropped as part of an abort.
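The resulting logic might look like the sketch below; the actual module may differ in its details.

```verilog
// Start idle; return to idle on any reset or bus error
initial	o_wb_cyc = 1'b0;
initial	o_wb_stb = 1'b0;
always @(posedge i_clk)
if ((i_rst)||((o_wb_cyc)&&(i_wb_err)))
begin
	o_wb_cyc <= 1'b0;
	o_wb_stb <= 1'b0;
end
```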
The next task is to start a new bus request. There are several reasons for starting one.
We’ll want to start a new request if the last instruction was accepted, and it wasn’t the result of a bus error.
There should only be two ways out of a bus error condition.
The first way out of a bus error condition is by the CPU branching to a new instruction. Two examples will help illustrate this. The first example would be if the pipeline has gotten ahead of the CPU and read past the end of the memory device, while the software program itself has not. Perhaps the last item in memory is a branch statement and the software hasn’t gotten that far, even though the pipeline is beyond it. The second example is that of a CPU branch in response to an error condition. This would be the case if the CPU actually tried to execute the instruction at the address that caused the bus error. In this case, we’d get a new request for an instruction, on an i_new_pc, only the instruction address requested, i_pc, would be the address of the instruction error handler.
The second way out of a bus error condition is via the i_clear_cache signal.
We’ll also want to start a new transaction if the last transaction was aborted. In this case, we’ll use an invalid flag to indicate that the last bus transaction ended in an invalid manner–such as if the CPU issued us an i_new_pc signal during a memory transaction. In this case, the invalid flag is our memory that we need to start a new transaction to get that updated address.
Finally, we’ll start a new transaction following any branch to a new address, indicated by the i_new_pc signal. This differs from the invalid case above in that this request may take place while the bus is already idle.
In all three of these cases, a new bus transaction is initiated by raising both the o_wb_cyc and o_wb_stb lines high. We’ll also need to adjust o_wb_addr at this time, but we’ll come back to that later as part of its own always block.
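Put together, the three conditions above might be coded along these lines (a sketch, using the invalid flag described above):

```verilog
always @(posedge i_clk)
if ((!o_wb_cyc)&&(
		// The CPU accepted our last instruction (and it
		// wasn't the result of a bus error), ...
		((o_valid)&&(i_stall_n)&&(!o_illegal))
		// ... the last cycle was aborted, ...
		||(invalid)
		// ... or the CPU is branching to a new address
		||(i_new_pc)))
begin
	o_wb_cyc <= 1'b1;
	o_wb_stb <= 1'b1;
end
```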
The next question is how to handle an ongoing bus transaction.
We already dealt with any bus transaction aborts due to bus errors above in the reset logic, so that leaves only two items to deal with. The first is dropping the strobe line once our request has been accepted, and the second is aborting the transaction any time a branch request is received during the bus transaction.
In this latter case, invalid will be true on the next cycle to let us know that we need to restart our transaction with a new address. Here’s the logic associated with letting us know that an aborted transaction needs to be restarted.
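A sketch of that invalid flag logic:

```verilog
// Remember any branch request that arrives mid-cycle, so the
// aborted transaction can be restarted at the new address
initial	invalid = 1'b0;
always @(posedge i_clk)
if (!o_wb_cyc)
	invalid <= 1'b0;
else if (i_new_pc)
	invalid <= 1'b1;
```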
The next value of interest is the address of the instruction we are interested in. This address is set any time the CPU issues a new address via i_new_pc. In all other respects it is incremented any time a valid instruction is accepted by the CPU–the one exception to this choice being the case of an illegal instruction resulting from a bus error. In that case, the instruction address doesn’t change–and we don’t issue new bus requests. Since this is just a simple prefetch, one that only returns a single instruction, we can re-use the bus address wires, o_wb_addr, as the instruction address lines when sending them to the CPU.
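A sketch of this address logic might then read:

```verilog
always @(posedge i_clk)
if (i_new_pc)
	// On any branch, fetch from the CPU's requested address
	o_wb_addr <= i_pc;
else if ((o_valid)&&(i_stall_n)&&(!o_illegal))
	// Once an instruction is accepted, move on to the next,
	// unless the last fetch ended in a bus error
	o_wb_addr <= o_wb_addr + 1'b1;
```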
As for the instruction passed to the CPU, that’s one of the simpler things we need to do–especially since this version only requests one instruction at a time. We’ll set this value on any bus return, found in the i_wb_data word, as indicated by both the fact that we are in a bus cycle (o_wb_cyc is high) and by the acknowledgement flag, i_wb_ack.
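In sketch form:

```verilog
always @(posedge i_clk)
if ((o_wb_cyc)&&(i_wb_ack))
	// Capture the instruction the bus just returned
	o_insn <= i_wb_data;
```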
The last step is to handle the two flags, o_valid and o_illegal, sent to the CPU to let it know whether or not the instruction presented in the o_insn register is a valid instruction.
Initially, the instruction wires will always be invalid.
Likewise, following any reset, branch, or clear cache request, we’ll need to mark the instruction as invalid as well.
We’ll also want to mark the instruction, o_insn, as valid immediately following any bus acknowledgement, i_wb_ack. Since this acknowledgement flag is only meaningful during a bus cycle (and may accidentally show up after a bus cycle, as the result of an abort), we’ll have to check o_wb_cyc as well to know whether we need to set this. Further, as we mentioned above, the !o_illegal signal is being used as an indicator that the result of the bus request is a valid instruction, versus just being a valid response. Hence, if this was the result of a bus error, we need to set o_illegal at the same time we set o_valid.
Once the CPU accepts our instruction, that is once (o_valid)&&(i_stall_n) are both true, then we need to clear the o_valid flag, lest the CPU accidentally read the same instruction twice. While you might be tempted to clear the o_illegal flag as well, doing so would be a mistake. In particular, you want to keep the CPU from trying to fetch, refetch, and refetch again any response that was returned as a bus error. Hence, we’ll leave the o_illegal flag true following any bus error, and use it as a flag (above) to keep us from re-initiating a new bus request prior to a new PC being given to us to recover from this error condition.
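Pulling the last several paragraphs together, the flag logic might be sketched as follows; the exact structure in prefetch.v may differ.

```verilog
initial	o_valid   = 1'b0;
initial	o_illegal = 1'b0;
always @(posedge i_clk)
if ((i_rst)||(i_new_pc)||(i_clear_cache))
begin
	// Nothing we hold survives a reset, branch, or clear-cache
	o_valid   <= 1'b0;
	o_illegal <= 1'b0;
end else if (o_wb_cyc)
begin
	// The bus returns either an instruction or an error
	if ((i_wb_ack)||(i_wb_err))
		o_valid <= 1'b1;
	if (i_wb_err)
		o_illegal <= 1'b1;
end else if ((o_valid)&&(i_stall_n))
	// Once the CPU accepts our instruction, drop o_valid,
	// but leave o_illegal alone until the next i_new_pc
	o_valid <= 1'b0;
```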
One item worth noting about the code above is that the giant always blocks that remain only control a small number of signals. The largest groups of signals within this design are associated with the bus address, o_wb_addr, and the instruction output, o_insn. These two groups of signals depend upon only a minimum number of inputs, helping to keep our logic to a minimum. The registers that require complex logic, such as o_illegal or even o_valid, are all single-bit registers–minimizing the impact of any difficult logic on our overall core.
Now that we know how this prefetch component interacts with the ZipCPU, and now that we’ve presented the how’s and the why’s of the logic within it, it’s now time to take a look at formally proving whether or not it does what we are expecting. We’ll separate this section into four subsections below: assumptions about logic coming from the CPU, assumptions about logic coming from the Wishbone bus, assertions about logic controlling the Wishbone bus, and then assertions about the logic used to communicate with the CPU. As before, the basic rule of formal verification remains: assume properties of inputs, assert properties of outputs.
Assumptions about logic coming from the CPU
There are four basic control lines coming from the CPU that this prefetch needs to interact with: the reset line, i_rst; the branch request, i_new_pc; the request for us to clear our cache, i_clear_cache; and the CPU ready (not stalled) line, i_stall_n. The fifth input from the CPU, i_pc, is only relevant when i_new_pc is valid.
We’ll start out with a standard assumption: Everything begins in a reset state.
You may remember the ASSUME macro from my first experiences with formal methods. This macro is set to assume inputs from another part of the design only when we are tested in isolation, and to assert those same properties any time we are tested as a component of a larger design. The macro itself is defined within the formal section of this module.
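That definition looks something like the following sketch, assuming the formal tool defines PREFETCH when this module is tested as the top level:

```verilog
`ifdef	PREFETCH
// Tested in isolation: constraints on our inputs are assumptions
`define	ASSUME	assume
`else
// Tested within the CPU: the same constraints become assertions
`define	ASSUME	assert
`endif
```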
Moving on, we also know that, following any reset request from the CPU, the first thing the CPU will do is send us an i_new_pc command–requesting a read from a new address. The same is true of the i_clear_cache signal. The CPU will always follow this signal with an i_new_pc request.
Now let’s look at the i_stall_n signal. This signal comes from the CPU and tells us when the CPU is not stalled. This is a handshake signal, much like the ready signals we discussed when covering pipeline strategies. Hence, the only time this signal matters to us is when o_valid is true. We can still constrain it, however.
The first constraint on this signal is that following any reset, the rest of the CPU will be idle. Stages beyond this one cannot be busy following a reset.
The next constraint on this signal is that the CPU cannot suddenly become stalled without being given an instruction. Stalls only take place when there’s an instruction in the following stage that is taking more than one clock to execute–they can’t happen without an instruction. Hence, if the CPU wasn’t stalled on the last clock, and we didn’t pass the CPU an instruction on the last clock, then it cannot be stalled on this clock.
Our last criteria isn’t so much a characteristic of the CPU, but rather one required by the proof. In order for induction to be successful, all of the various states need to be flushed within a given number of clocks. To make certain this happens, we’ll insist that the CPU can only be stalled for less than four clocks.
In practice, the CPU can be stalled for much longer. Divide instructions, for example, will stall the entire pipeline for 32+ clocks. This is just about speeding things up enough so that the solver can prove a solution.
To meet this limit, we’ll first count the number of clocks we need to wait for the CPU to be receptive to our instruction.
Finally, we’ll assume that this number remains less than the parameterized (but fixed) delay above.
Further, we’ll caveat this last test so that it will only take place if the prefetch is being tested in isolation, and not require it any time the prefetch is being tested as part of the CPU.
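A sketch of this counter and its bounding assumption follows. The parameter name F_CPU_DELAY is a placeholder of my own; the actual module may name it differently.

```verilog
parameter	F_CPU_DELAY = 4;	// (illustrative name and value)
reg	[2:0]	f_cpu_delay;

// Count how long the CPU has left a valid instruction unaccepted
initial	f_cpu_delay = 0;
always @(posedge i_clk)
if ((!o_valid)||(i_stall_n))
	f_cpu_delay <= 0;
else
	f_cpu_delay <= f_cpu_delay + 1'b1;

`ifdef	PREFETCH
// Only when tested in isolation: bound the CPU's stall time
always @(posedge i_clk)
	assume(f_cpu_delay < F_CPU_DELAY);
`endif
```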
Those are the assumptions we need to make regarding how the CPU controls this prefetch module. In many ways, these assumptions form the CPU’s side of a contract: the prefetch will work as long as the CPU (and the Wishbone bus, which we’ll discuss next) keeps its end of the contract.
Assumptions about logic coming from the Wishbone bus
Making assumptions about the Wishbone bus, however, is now really easy. Because we put together a module describing the properties of the Wishbone bus, we only need to include that module to get a copy of all of the various assumptions (and assertions) associated with interacting with this bus.
Once included, and given that our proof succeeds, we will then know that we are interacting validly with any peripheral on the Wishbone bus.
There are a couple of options we set above, however. These include the size of the address bus and data bus, as well as the log (base two) of the length of any interaction (F_LGDEPTH=2). We also indicated that this would be a source (one where CYC and STB go high together), and that we will only ever make one request of the bus (F_MAX_REQUESTS(1)). Since the prefetch isn’t involved in writes, we can leave the read-modify-write option (F_OPT_RMW_BUS_OPTION) off. We’re also not going to be restarting requests while CYC is high, so we can leave the F_OPT_DISCONTINUOUS option low.
This part is just that easy: include assertions and assumptions from elsewhere and we’re done. Well … almost. We still need to make certain that the number of requests and acknowledgements counted by this formal description of a Wishbone master match the logic within our prefetch module. That’s coming up in the next section.
Assertions about logic controlling the Wishbone bus
We’ve now finished with the assumptions about our inputs. It’s now time to turn to look at any assertions we wish to make about our outputs. We’ll start with the assertions about the Wishbone bus.
Our first assertion is that we are reading only from the bus. This may seem silly, but … a prefetch should never do more than read from a bus. It’s worth knowing that that’s all we are going to do.
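In property form, this is a one-liner:

```verilog
// A prefetch should never write to the bus
always @(*)
	assert(!o_wb_we);
```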
We’re also going to assert that, two clocks after an i_clear_cache request, we’ve abandoned any ongoing bus cycle. Why two clocks? Well, the first clock should be the i_clear_cache request itself. The second clock should be the i_new_pc signal. Then, on the third clock, o_wb_cyc may once again indicate that we are within a (new) transaction.
Once we have a valid result (instruction) to present to the CPU, then the address of this result shouldn’t change, and neither should the instruction itself–as long as we are holding o_valid high. Since this address is our reference for the next instruction address, we can’t allow it to change until the next bus cycle.
We’re also going to assert that we start a new bus transaction following any abort. Since invalid will be true following any abort based upon a new PC, this assertion captures the logic in question.
Those are the things we need to assert regarding our bus interaction–things specific to this prefetch.
Assertions about logic responding to the CPU
The last set of assertions are those associated with our responses to the CPU. These are primarily about the integrity of our return signals.
Since this is a single instruction prefetch, any time the CPU accepts an instruction we’ll have to go get a new one. This means that the o_valid line must immediately drop–at least until the next instruction is received.
We can go further and insist that any time we are within a bus cycle, the o_valid line must also be low. Consider the consequences if this weren’t the case: if the prefetch were allowed to present a valid instruction to the CPU while a bus request was outstanding, and a new instruction was then received from the bus, where should it be stored?
Further, any time we get an instruction from the bus, we need to assert that we are telling the CPU that we have a valid instruction on the next clock cycle–the first cycle in which o_insn is valid.
Of course, following an i_clear_cache request, we’ll need to make sure that the instruction presented isn’t valid. Not only that, but two clocks following an i_clear_cache request we want to make certain we are still invalid. This makes sure we don’t abort a bus cycle and somehow turn on the o_valid flag anyway. You may remember the discussion above regarding two clocks past the i_clear_cache signal, having to do with the o_wb_cyc output. This is really just an assertion of (roughly) the same thing.
Now let’s start looking at the content of what we are returning to the CPU. As long as we are presenting a valid instruction to the CPU, and the CPU is stalled and not yet ready for the instruction, then we need to continue presenting our valid instruction. Exceptions to this rule include the clock following any reset, branch, or clear-cache request.
Further, any time we present a valid instruction for two clocks in a row, none of the information associated with that instruction should be able to change. This goes for not only the instruction itself, o_insn, but also the address of the instruction, o_pc, and the illegal flag, o_illegal. Another way to say this would be to say that these lines shouldn’t change until the CPU has accepted the instruction. The o_illegal line needs to remain valid even after the instruction has been accepted–at least until the next i_new_pc command, since we are using it as an indication not to refetch an instruction that is no longer o_valid. Indeed, as long as o_wb_cyc remains low (with exceptions), o_illegal needs to remain unchanging.
That leaves us with two more assertions, both about the returned address, o_pc.
The first of these address assertions is that, unless the CPU tells us otherwise, we need to walk through the instruction stream one address at a time. There are a couple of parts to making this assertion. We’ll need to keep track of any time we have a valid past PC address to compare against. Following any clock with a true o_valid signal, we have a valid PC to load into our f_last_pc register.
There’s a trick to making this work, though, and that is that we can’t allow the f_last_pc register to be just anything–even when it isn’t being referenced. This is a requirement of the formal induction step, which will start in any random (valid) state. Without the assertion below, the induction step might start with an unreasonable f_last_pc value, and then conclude that a correct o_pc was in error.
Finally, we’ll make this first assertion associated with the output address: following any valid instruction, and without an intervening i_new_pc, the next address must be one more than the last.
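A sketch of this pair of properties follows. The helper register f_last_pc_valid, tracking when f_last_pc holds a meaningful address, is an assumption of mine; the actual proof may structure this differently.

```verilog
reg		f_last_pc_valid;
reg	[31:0]	f_last_pc;

// We have a valid prior PC any time an instruction was valid,
// until the next reset, branch, or clear-cache request
initial	f_last_pc_valid = 1'b0;
always @(posedge i_clk)
if ((i_rst)||(i_new_pc)||(i_clear_cache))
	f_last_pc_valid <= 1'b0;
else if (o_valid)
	f_last_pc_valid <= 1'b1;

always @(posedge i_clk)
if (o_valid)
	// Remember the address of the last valid instruction
	f_last_pc <= o_pc;
else if (f_last_pc_valid)
	// Between instructions, the next address must follow the
	// last one; this also pins f_last_pc down for induction
	assert(o_pc == f_last_pc + 1'b1);
```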
That’s the first of the two address-based assertions.
The second of these two assertions is a more complete one, this time dealing with the address of the next request. In this case, we keep track of the last address requested by the prefetch in a register, f_req_addr, and increment it on any acknowledgement. Now, any time a value is being requested from the bus, it should be the value found within f_req_addr.
This is the last assertion we need to test o_wb_addr but, as with the last address assertion, this assertion alone isn’t sufficient. In particular, if we don’t constrain it further, the induction step might assume that f_req_addr has a random value (since we haven’t told it otherwise), and then draw an invalid conclusion as a result. Hence we’ll need to assert that, if the bus isn’t active, the f_req_addr must be the same as the bus address.
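These last requirements might be sketched as follows:

```verilog
reg	[31:0]	f_req_addr;

// Track the address of the current (or last) bus request
always @(posedge i_clk)
if (i_new_pc)
	f_req_addr <= i_pc;
else if ((o_wb_cyc)&&(i_wb_ack))
	f_req_addr <= f_req_addr + 1'b1;

// Any outstanding request must be for this address
always @(*)
if (o_wb_stb)
	assert(o_wb_addr == f_req_addr);

// Pin f_req_addr down while the bus is idle, lest induction
// start from a contradictory state
always @(*)
if (!o_wb_cyc)
	assert(f_req_addr == o_wb_addr);
```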
As a last assertion, we’ll insist that the
invalid signal only ever be
true for a single clock.
This ends the list of assertions used to prove that the single instruction prefetch works as designed.
If you are not familiar with the ZipCPU, you should know that one of the reasons why I built the ZipCPU was to allow me to experiment with CPU design on really cheap FPGA hardware. Achieving this goal required me to pay a lot of attention to logic minimization. It also means that any time I walk through my own code, I am forever asking myself, “Is this wire needed?” “Can I remove the dependence of this flip-flop on this logical condition?” Doing this, though, requires two specific capabilities from my tool set.
First, I need to know, with certainty, any time I adjust a piece of logic, that the module will continue to work as designed. This is the purpose of the formal model checks above, and/or any test benches I might use.
Second, I also need to be able to know how many LUTs, flip-flops, and so forth my logic uses. Traditionally, I have been using the vendor’s synthesis tools to build my entire design and then report the logic used by the design. This can take a long time (10+ minutes). On the other hand, as part of putting this post together, I discovered that I can use yosys with either the synth_ice40 command, or (hopefully soon) its equivalent for Altera chips, to quickly estimate the logic required. Below, for example, is the output from processing the prefetch.v file above with yosys.
Indeed, I was pleased to discover that the number of LUTs required by this prefetch went down as a result of building a formal proof of this prefetch.
This prefetch only fetches one instruction
The presentation above demonstrated how one of the ZipCPU prefetch modules was put together. As I mentioned above, the ZipCPU has three other prefetch modules (although only two are actively maintained). This one is also the first, and in many ways the simplest, of the prefetch modules the ZipCPU has had.
Why did I switch?
I switched from this prefetch module to another when I watched how fast the pipelined ZipCPU performed when using this prefetch module. Indeed, the performance was so pitiful it was almost unbearable to watch the instructions flow through the pipeline–with never more than one instruction in the pipeline at any given time. Eventually, I measured the ZipCPU’s performance against the Dhrystone benchmark. Using a prior version of this prefetch module, the ZipCPU managed to achieve 0.128 DMIPS/MHz–a pitiful score. The score, however, should be compared with the 0.95 DMIPS/MHz score the ZipCPU achieved when fully pipelined. [Ref]
Since that comparison, however, the ZipCPU has been extensively modified–to include adjusting this prefetch. As a result of one of those changes, this prefetch will now start fetching a new instruction as soon as the CPU has accepted the last one, rather than waiting for the CPU to flush the last instruction through the pipeline before starting the next fetch. As a result, while it’s still painful to watch this prefetch operate, it’s not nearly as bad as it was originally.
If you like this blog, please consider supporting it on Patreon. Thank you!
And now I have told you before it come to pass, that, when it is come to pass, ye might believe. (John 14:29)