When I first built the ZipCPU, I built it for the Wishbone bus. Wishbone is very easy to work with, and a good Wishbone pipeline implementation should be able to achieve (roughly) the same performance as AXI-lite. At the time, I had yet to build a crossbar interconnect, so my basic interconnect designs were fairly simple and depended upon the existence of no more than a single master. This forced the ZipCPU to have an internal arbiter, and to only expose that one Wishbone interface. You can see this basic structure in Fig. 1 below.
My first memory controller was quite simple. It could handle issuing a single read or write at a time and waiting for the return.
When this memory controller turned out to be a CPU performance bottleneck, I chose to write a pipelined memory controller. To get there, I first noticed that the CPU doesn’t need the results of any write operation, so nothing keeps the CPU from continuing with further operations while the write operation is ongoing. Even better, you could issue a string of write operations and as long as the memory controller was able to issue further bus requests, nothing kept the CPU from allowing many requests to be outstanding only to be retired later.
I extended this reasoning to reads as well. A string of memory reads could be issued by the CPU, under the condition that none of those reads overwrote either the base address register, from which the read addresses were being determined, or the program counter. When these conditions held, multiple read requests could then be issued to be retired later–just like the write requests above.
To see how this concept might work, consider Fig. 2 below showing a notional subroutine.
In this notional example, the CPU starts out with a jump to the subroutine instruction. The subroutine then creates a stack frame by subtracting from the stack pointer (SUB), and stores three registers to the stack frame via three store-word (SW) instructions. The memory controller then becomes busy handling these three requests. While the requests are active, further requests of the same type are allowed. Moreover, since the requests are to store data to memory, the CPU can go on with other instructions. It doesn’t wait for the stores to complete, and so the CPU issues an ADD and continues on from there.

Once the CPU is finished, it cleans up the stack frame by loading (LW) copies of the registers it used back from the stack. These loads, however, need to first wait for the stores to complete–and so they stall the CPU. Once all the loads have been issued, we then add to the stack pointer to return the stack frame to what it was. However, since the CPU doesn’t keep track of which load requests are outstanding, it can’t tell if this ADD is to a value yet to be returned from a LOAD. Therefore, the CPU stalls again until all loads are complete.
While this might seem slow, consider the alternative. What if the CPU had to wait for every load or store to complete before issuing the next one? Fig. 3 below gives a taste of what that might look like, save that we’ve allowed the CPU to still continue while store operations are ongoing.
There were a couple of issues with this new approach, however. One was that my original interconnect implementation didn’t understand the concept of a currently active slave. Any slave could respond to a bus request and the interconnect would be none the wiser. Keeping the returns in order meant insisting that the memory accesses were to incrementing addresses, and that slaves were sorted on the bus by how long they would take to respond to a request–so that the fastest responding slaves were always at lower addresses. I handled this by insisting, within the instruction decoder, that any string of memory operations had to be to either the same address or subsequent addresses.
A second issue with this pipelined memory approach involved how to handle bus errors. Once a CPU can issue requests without waiting for their responses, then it becomes possible for the CPU to issue requests for multiple operations before the first one returns a bus error. While this makes analyzing a program in the debugger that much more challenging, the speed benefit provided by this approach was really quite tremendous, and often outweighed any drawbacks.
The result was a basic pipelined memory controller. As an example of the performance that could be achieved using this technique, the ZipCPU can toggle an output pin at 47MHz while running the bus at 100MHz, whereas others have measured the Zynq running a 250MHz bus as only able to toggle the same pin at 3.8MHz. In percentages, the ZipCPU was able to demonstrate a 47% bus utilization using this technique vs. the Zynq’s 1.5% bus utilization.
This pipelined memory architecture worked quite well for the ZipCPU. Hand-optimized loops could easily be unrolled for much better performance. Without hand optimization, however, the greatest benefit of this technique came when generating or recovering stack frames, where it was an awesome fit.
Indeed, I was a bit taken aback later when I finally built a data cache only to discover the pipelined memory controller was often as fast or faster than the data cache. What?? How could that happen? Well, part of the problem was the time it took to load the cache in the first place. Loading the cache could generate more memory requests than necessary, such as if the CPU only wanted a single value but had to load the entire cache line, and so the cache might unnecessarily slow down the CPU. The other problem was that my original data cache implementation resorted to single operations when accessing uncachable memory. As a result, I had to go back and retrofit the data cache to handle pipelined operations for uncached memory just to recover the lost performance.
Recently, however, I’ve found myself doing a lot of work with AXI rather than Wishbone. How should the ZipCPU be modified to handle AXI? One approach would be to use my Wishbone to AXI bridge. This approach, however, loses some of the benefits of AXI. The Wishbone to AXI bridge will never allow both read and write transactions to be outstanding (nor will the CPU …), neither will it allow the CPU to use AXI bursts or to issue exclusive access requests. The straw breaking the camel’s back, however, is simply the performance lost going through a bus bridge.
Bus Agnostic CPU Design
At present, I’m still in the process of making the ZipCPU bus agnostic. As a result, I don’t (yet) have any good examples of completed designs to show you how well (or poorly) the newly updated design works. Expect those within the year. For now, however, I’d like to discuss some of the changes that have taken place.
The ZipCPU as originally written had two problems when it came to building a bus agnostic implementation. The first was that Wishbone was central to the CPU. The bus interface therefore needed to be removed from the CPU itself and made into a sort of wrapper. The second problem was that the Wishbone arbiter was integrated into the CPU. This also needed to be removed from the CPU core and placed into an external wrapper.
This naturally led to what I’m calling the ZipCore. The ZipCore is the logic left over after removing the bus logic from the original ZipCPU. The ZipCore is independent of any bus implementation. Instead, it exports a custom interface to both the instruction fetch and the memory controller.
This also presented a wonderful opportunity to separate the formal verification of the ZipCPU from the verification of the instruction and data bus interfaces. This is shown in Fig. 5 by the introduction of custom interface property sets sitting between the CPU and these two sets of interface modules. I now have one custom property set for verifying the instruction fetch, and another for verifying the memory controller. This means that any instruction fetch or memory controller meeting these properties will then be able to work with the ZipCore. As a result, I no longer need to verify that the ZipCore will work with a particular instruction fetch or a particular memory controller implementation. Instead, I just need to prove that those controllers will work with the appropriate custom interface property set. If they do, then they’ll work with the CPU.
For now, let’s look at what the ZipCPU’s memory interface looks like.
The ZipCPU’s memory controller interface can support one of two basic operations: read and write. Each leads to a slightly different sequence. These are shown in Fig. 6.
In the case of a write, the CPU provides the address and the value to be written to the controller. The controller then becomes busy. Once it finishes the task, if all goes well, it quietly ceases to be busy. If something went wrong, the memory controller will instead return an error.
Reads are similar, with the difference that the result needs to be returned to the CPU once the operation is complete. In this case, the memory controller sets a valid signal for the CPU, returns the value read from the bus, and also returns the register address that this value is to be written into. At least the way the ZipCPU handles this interface, it is the memory controller that keeps track of which register the result will be written into. That’s what happens if all goes well. If the bus returns an error, however, then the controller will set an error flag instead of the valid flag. It’s up to the CPU then to determine what to do in case of a bus error.
In general, the ZipCPU will do one of two things on a bus error. If the CPU is in user mode, it will switch to supervisor mode. If, on the other hand, the CPU is in supervisor mode then it will halt. If desired, an external wrapper can reset the CPU in an attempt to recover from the error, but in general it just halts and waits for the debugger. The S6SoC project was my one exception to this rule, since there was no room for an external debugging bus in that design. In that case, the CPU would simply restart, dump the last CPU register contents, and then attempt to continue a reboot from there.
No matter how the software handles the bus error, the memory controller will not return further results from any ongoing set of operations. Returns from outstanding reads following a bus error will be ignored. Outstanding writes may or may not be completed–depending on their status within the memory controller and the bus implementation. At a minimum, only one bus error will be returned. Further error responses from any outstanding accesses on the bus will not be returned to the CPU.
Fig. 7 on the left shows the basic interface between the CPU core and its memory controller used to implement these operations. Let’s take a moment before going any further to discuss the various signals in this interface. Indeed, the basic interface is fairly simple:
The CPU reset, i_cpu_reset: With Wishbone, it’s easy to reset the CPU separately from the bus. All you need to do is drop the STB lines. With AXI, this is a bit harder, since you will still get responses back from the bus for any requests that were made before your reset if you don’t also reset the bus. This is why the memory interface separates the CPU reset from the system reset, so that the CPU can be reset separately from the rest of the design. It’s up to the memory controller to make sure that no stale memory results, from requests made prior to the reset, get returned to the CPU afterwards.
i_stb: This is the basic request line. When the CPU raises
i_stb, it wants to initiate a memory operation. For those familiar with the AXI stream protocol, you can think of this as
TVALID && TREADY.
o_pipe_stalled: This is the basic stall line. When raised, the CPU will not raise i_stb to make a request of the bus. Continuing with the AXI stream analogy from above, this is similar to the !TREADY signal in AXI stream.
i_op: This specifies the type of operation. To keep logic counts low, the bits of the memory operation are drawn directly from the instruction word. i_op[0] will be true for a write (store) instruction, and false for a read (load) instruction. i_op[2:1] then specifies the size of that operation: 2'b11 specifies a byte-wise read or write, 2'b10 a half-word/short (16b) operation, and 2'b01 a full word operation.
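Putting those fields together, the decode might be sketched as follows. This is an illustrative sketch only; the wire names op_store and op_size are mine, not necessarily those of the actual controller.

```verilog
// Illustrative decode of the i_op port described above
wire		op_store;	// True for stores (writes), clear for loads
wire	[1:0]	op_size;	// 2'b11: byte, 2'b10: halfword, 2'b01: word

assign	op_store = i_op[0];
assign	op_size  = i_op[2:1];
```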
i_addr: The address to be written to or read from. This only has meaning when i_stb is also true.
i_data: The data to be written to the address above. This only has meaning when both i_stb and the write bit of i_op are true. For 8-bit writes, only the lower 8 bits have meaning. Likewise, for 16-bit writes only the lower 16 bits have any meaning.
i_oreg: For reads, this specifies the register address that the read result will be placed into upon completion. The memory unit will hold onto this value, and then return it to the CPU again later. In the case of the pipelined implementations, this value will go into a FIFO to be returned later with any read results. This value is ignored in the case of writes.
o_busy: If the memory core is busy doing anything, it will set o_busy. For example, if you issue a bus operation, then o_busy will go true. If you later reset the CPU, o_busy will remain true until the memory core can accept another operation. o_busy is subtly different from o_pipe_stalled in that the CPU may issue additional memory operations while o_busy && !o_pipe_stalled. However, the CPU will not start a new string of memory operations, nor will it change direction, while the memory core asserts o_busy. It’s important to note that the CPU may move on to a non-memory instruction while o_busy is true, as long as o_rdbusy remains low.
o_rdbusy: If the memory core is busy reading values from the memory, then o_rdbusy will be set to indicate that a read is in progress and that the CPU should not proceed to any other instructions (other than additional reads).
o_valid: Once a read value is returned, o_valid will be set to indicate the need to write the value returned from the read to the register file. If all goes well, there will be exactly one o_valid return for every accepted i_stb && !i_op[0] request, although CPU resets and bus errors may keep this count from being exact.
o_err: This will be set following any bus error, with two exceptions: First, if the CPU is reset while operations are outstanding, then any bus error response for those outstanding operations will not be returned. Second, after the first bus error, the memory controller will first flush any ongoing operations before returning any more bus errors to the CPU.
o_wreg: When returning a data value to the CPU, the o_wreg value tells the CPU where to write the value. This is basically the i_oreg value given to the memory controller, reflected back to the CPU together with the data value that goes with it.
o_result: The value from any read return is provided in o_result. In the case of an 8-bit read, the upper 24 bits will be cleared. Likewise, for a 16-bit read, the upper 16 bits will be cleared.

Some CPUs sign extend byte reads to the full word size; some do not. By default, the ZipCPU simply clears any upper bits. Two following instructions, a TEST instruction followed by a conditional OR, can turn a zero-extended read into a sign-extended read. Alternatively, changing the memory controller from one behavior to the other is fairly easy to do. Adjusting the GCC toolchain and following support, however, can take a bit more work.
There are two other important signals in this interface. These are signals we won’t be addressing today, but they are important parts of the more complex controller implementations.
i_clear_cache: This is my way of dealing with caches and DMAs. The CPU can issue a special instruction to clear the cache if the memory may have changed independent of the CPU. This input is also asserted if the debug interface changes memory in a way the CPU is unaware of. If raised, the memory controller will mark any and all cached data as invalid–forcing the cache to reload from scratch on the next request.
i_lock: This flag is used when implementing atomic memory instructions. It will be raised by a LOCK instruction, and then lowered three instructions later. This allows for certain four-instruction sequences: LOCK, LOAD, (ALU operation), STORE. A large variety of atomic instructions can be implemented this way. Examples include atomic adds, subtracts, or even the classic test-and-set instruction.
During these three instructions, the CPU is prevented from switching to supervisor mode on any interrupt until all three instructions following the lock have completed.
Atomic access requests are generally easy to implement when using Wishbone. The Wishbone cycle line is simply raised with the first LOAD instruction (LB, for load byte in Fig. 9), and then held high between the LOAD and STORE instructions (SB, for store byte in Fig. 9). Things are a bit more complicated with AXI, however, since AXI doesn’t allow a bus master to lock the bus. Instead, the CPU will only discover whether its atomic instruction was successful when/if the final store operation fails. In that case, the memory controller needs to tell the CPU to back up to the LOCK instruction and start over. How to make this happen, however, is a longer discussion for the day we discuss the full AXI version of this same memory controller.
Note the key steps:
The CPU makes a request by setting i_stb, placing the data to be written into i_data, and the address of the transaction into i_addr.
The memory controller then becomes busy. It raises both M_AXI_AWVALID and M_AXI_WVALID to request a transaction of the bus. In this example, we also raise M_AXI_BREADY, as a bit in our state machine, to indicate that we are expecting a response to be returned from the bus in the future.
M_AXI_AWVALID must remain high, and M_AXI_AWADDR must remain constant, until M_AXI_AWREADY is high. In this highly compressed example, M_AXI_AWREADY just happens to be high when M_AXI_AWVALID is set, so that M_AXI_AWVALID can be dropped on the next cycle.
The same rule applies to the write data channel: M_AXI_WVALID must stay high, and M_AXI_WDATA and M_AXI_WSTRB must stay constant, until M_AXI_WREADY is high.
I’ve seen several beginner mistakes with this handshake. Remember: this chart in Fig. 10 is only representative! Some slaves will delay setting M_AXI_AWREADY longer than others, some will set M_AXI_WREADY first, and others will set the two in a different order. To be compliant, an AXI master must be able to deal with all of these situations.
In this compressed example,
M_AXI_BVALIDis set on the clock immediately following
M_AXI_AWVALID && M_AXI_AWREADY && M_AXI_WVALID && M_AXI_WREADY.
Do not depend upon this condition! I’ve seen beginner mistakes where the beginner’s logic requires all four of these signals to be high at the same time. Remember, either one of these two channels might get accepted before the other.
Once M_AXI_BVALID has been received, the memory controller drops o_busy to indicate that it is idle. A new request may then be made on that same cycle.
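The write sequence above might be sketched in logic along the following lines. This is a hedged sketch of just the write channel, not the controller’s actual source, and the guard conditions are simplified.

```verilog
// Sketch: AWVALID/WVALID are set on a store request and held until
// accepted; BREADY doubles as the "response outstanding" state bit.
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	{ M_AXI_AWVALID, M_AXI_WVALID, M_AXI_BREADY } <= 3'b000;
else begin
	// Drop each VALID once the slave has accepted that channel
	if (M_AXI_AWREADY)
		M_AXI_AWVALID <= 1'b0;
	if (M_AXI_WREADY)
		M_AXI_WVALID <= 1'b0;

	// The write response closes out the transaction
	if (M_AXI_BVALID)
		M_AXI_BREADY <= 1'b0;

	// A new store request starts everything over again
	if (i_stb && i_op[0] && !o_busy)
		{ M_AXI_AWVALID, M_AXI_WVALID, M_AXI_BREADY } <= 3'b111;
end
```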
Now let’s take a look at a read example using this interface.
While this example is very similar to the previous write example, there are some key differences. Therefore, let’s walk through it.
The CPU indicates the desire to read from memory by raising i_stb and placing the address to be read from in i_addr. The register that the result will be written into is also placed into i_oreg–the memory controller will need to return this value back when the operation is complete.
Not shown is the i_op input indicating the size of the read, whether byte (8b), halfword (16b), or word (32b).
Once the memory controller receives i_stb, it immediately sets M_AXI_ARVALID, together with M_AXI_ARADDR, using the information it is given.
This controller also sets M_AXI_RREADY high at this point, as part of its internal state tracking. This is to indicate that a read return is expected.
Finally, the controller sets both o_busy and o_rdbusy. The first indicates that a memory operation is ongoing, and the second indicates that we will be writing back to a register upon completion. This latter flag, o_rdbusy, is used to prevent the CPU from moving on to its next operation, and so helps avoid any pipeline hazard.
M_AXI_ARVALID must stay high and M_AXI_ARADDR constant until the slave asserts M_AXI_ARREADY. In this example, that happens immediately, but this will not be the case with all slaves. Holding M_AXI_ARVALID high for a second cycle of M_AXI_ARVALID && M_AXI_ARREADY would request a second read. Since we don’t want that here, we immediately drop M_AXI_ARVALID.
Once the slave accomplishes the read, it sets M_AXI_RVALID together with the resulting M_AXI_RDATA. Since the memory controller is holding M_AXI_RREADY high, these will only be set for a single cycle.
The memory controller then copies the data from M_AXI_RDATA into o_result to send it back to the CPU. o_valid is set to indicate a result is valid. o_rdbusy is dropped, since we are no longer in the middle of any operation. Finally, o_wreg returns the register address that the CPU entrusted to the memory controller.
These are examples drawn from the controller we’ll be examining today. Just to prove that the throughput of this CPU interface isn’t bus limited in general, here is a trace drawn from an AXI-lite memory controller capable of issuing multiple ongoing operations.
Just for the purposes of illustration, I dropped M_AXI_ARREADY on the first cycle of the request for address A3; all of this behavior is highly slave dependent. Doing this, however, helps to illustrate how a bus stall will propagate through the design. Notice how the CPU then suffers one stall, and that the result takes an extra cycle to return the item from that address. Beyond that, however, we’ll need to save the examination of that controller for another day. For now we’ll limit ourselves to a controller that can only handle a 33% bus throughput at best.
33% throughput? Is that the performance that can be expected from this type of controller? Well, not really. That would be the performance you’d see if this memory controller were connected directly to a (good) block RAM memory. If you connect it to a crossbar interconnect instead, you can expect it to cost you two clock cycles going into the interconnect, and another clock cycle coming out. Hence, to read from a block RAM memory, it will now cost you 6 cycles, not 3, for a 16% bus throughput. Worse, if you connect it to Xilinx’s AXI block RAM controller, it’ll then take you an additional 4 clock cycles. As a result, your blazing fast ZipCPU would be crippled down to one access for every 10 clock cycles simply due to a non-optimal bus architecture. Unfortunately, it only gets worse from there when you attach your CPU to a slower AXI slave.
In this trace, we have the outputs of our controller, such as M_AXI_ARADDR, going into a crossbar. The crossbar forwards these as the BRAM_AXI_* signals, such as BRAM_AXI_ARADDR, the inputs to Xilinx’s AXI block RAM controller. This block RAM controller takes a clock cycle to raise BRAM_ARREADY, and then two more clock cycles before it raises its output on the result pipe. From there it will take another clock to go back through the crossbar. This clock is the minimum timing allowed by the AXI spec. As a result, the read takes a full 10 cycles. The ZipCPU’s memory interface will allow a second request as soon as this one returns, yielding a maximum throughput of 11%.
We’ll get there.
For now, a working AXI memory controller is a good place to start from. We can come back to this project and optimize it later if we get the chance.
Now that we know what our interface looks like, let’s peel the onion back another layer deeper to see how we might implement these operations when using AXI-lite.
First, let me answer the question of why AXI-lite and not AXI? And, moreover, what will the consequences be of not using the full AXI4 interface? For today’s discussion, I have several reasons for not using the full AXI4 interface:
AXI-lite is simpler.
This may be my biggest reason.
The CPU memory unit doesn’t need AXI IDs. While a CPU might use two separate AXI IDs, only one would ever be needed for any given source. Therefore, the fetch unit might use one ID and the memory controller another. If a DMA were integrated into the CPU, it might use a third ID, and so on. There’s just no need for separate IDs in the memory controller itself.
Since we’re only implementing a single access at a time today, or in the case of misaligned accesses two accesses at a time, there’s no reason to use AXI bursts.
While it might make sense to issue a burst request when dealing with misaligned accesses later, AXI’s requirement that burst accesses never cross 4kB boundaries could make this a challenge. By leaving adjacent memory accesses as independent, we don’t need to worry about this 4kB requirement.
There is one critical bus capability that we will lose by implementing this memory controller for AXI4-lite rather than AXI4 (full), and that is the ability to implement atomic access instructions. If for no other reason, let’s consider this implementation only a first draft of a simple controller, so that we can come back with a more complicated and full-featured controller at a later time. Indeed, if you compare this core to a comparable full AXI memory controller, you’ll see the two mostly share the same structure.
For now, let’s work our way through a first draft of setting our various AXI4-lite signals.
The first signals we’ll control will be the various VALID and READY signals associated with any AXI request. As discussed above, we’ll use the xREADY signals as internal state variables to know when something is outstanding. Hence, on a write request we’ll set M_AXI_BREADY, and we’ll clear it once the response is returned. We’ll treat read requests similarly, only using M_AXI_RREADY for that purpose instead.
The first step will be to clear these signals on any reset.
While it’s a little out of order, the next group in this block controls how to handle an ongoing operation. In general, if ever AxVALID && AxREADY, then the AxVALID signal will be cleared. Likewise, once BVALID or RVALID are returned, we can close up and finish our operation by clearing the corresponding xREADY signal.
As I mentioned above, getting this signaling wrong is a common beginner AXI mistake. Remember, the AW* and W* channels are independent, and neither VALID can be lowered until its respective READY is also true.
The last step in controlling these signals is to set them on any request. Assuming a request is incoming, we’ll want to set the various write channel VALID signals if i_op[0] is ever true–indicating a write operation request. Otherwise, for read operations, we’ll want to set M_AXI_ARVALID instead.

Of course, that’s only if a request is being made on this cycle. Hence, let’s caveat these new values. If there’s no request being made, then these lines should be kept clear. Likewise, if the request is for an unaligned address then (in our first draft) we’ll return an error to the CPU and not issue any request. Finally, on either a bus error or a CPU reset we’ll need to make certain that we don’t start a new request that will immediately be unwanted on the next cycle.
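Taken together, these steps might combine into a single block along these lines. This is a hedged sketch only; the w_misaligned name, and the exact guard conditions, are my assumptions rather than the controller’s actual source.

```verilog
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
begin	// Clear all request state on any bus reset
	{ M_AXI_AWVALID, M_AXI_WVALID, M_AXI_BREADY } <= 3'b000;
	{ M_AXI_ARVALID, M_AXI_RREADY } <= 2'b00;
end else begin
	// Drop each VALID once its channel has been accepted
	if (M_AXI_AWREADY)	M_AXI_AWVALID <= 1'b0;
	if (M_AXI_WREADY)	M_AXI_WVALID  <= 1'b0;
	if (M_AXI_ARREADY)	M_AXI_ARVALID <= 1'b0;

	// Close out the operation once its response returns
	if (M_AXI_BVALID)	M_AXI_BREADY <= 1'b0;
	if (M_AXI_RVALID)	M_AXI_RREADY <= 1'b0;

	// Start a new request only if one is being made, the core is
	// idle, the address is aligned, and we aren't being reset or
	// recovering from a bus error
	if (i_stb && !o_busy && !w_misaligned && !i_cpu_reset && !o_err)
	begin
		M_AXI_AWVALID <=  i_op[0];
		M_AXI_WVALID  <=  i_op[0];
		M_AXI_BREADY  <=  i_op[0];
		M_AXI_ARVALID <= !i_op[0];
		M_AXI_RREADY  <= !i_op[0];
	end
end
```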
Judging from the AXI requests associated with Xilinx forum posts that I’ve examined, getting those five signals right tends to be half the battle.
There is another signal, however, that we’ll need to pay attention to, and this is the one capturing whether or not the CPU was reset separately from the system. In such cases, we’ll need to flush any ongoing operation without returning its results to the CPU at a later time. To handle this, we’re going to implement an r_flushing signal. This signal will capture the idea of the bus being busy, even though the CPU isn’t expecting a result from it.
This signal will be cleared on any system reset.
The primary purpose of this signal is to let us know to flush any outstanding returns following a CPU reset while a bus operation is ongoing without also needing to reset the bus.
There’s one caveat to this, however, and that is that we don’t want to set
r_flushing if the CPU is reset on the same cycle the outstanding value is
returned to the CPU.
Otherwise, if the bus is idle, we can leave the r_flushing signal at zero–no matter whether the CPU is reset or not.
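A sketch of this flag might look as follows. Since this simple controller only ever has one request outstanding, a single bit suffices; the conditions here are my reading of the rules above, not necessarily the original logic.

```verilog
// Sketch: flush tracking for a CPU reset that arrives mid-transaction
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	r_flushing <= 1'b0;	// Cleared on any system reset
else if (i_cpu_reset && o_busy && !(M_AXI_BVALID || M_AXI_RVALID))
	// The CPU was reset while the bus was busy, and the response
	// hasn't arrived on this same cycle: ignore it when it does
	r_flushing <= 1'b1;
else if (M_AXI_BVALID || M_AXI_RVALID)
	r_flushing <= 1'b0;	// The flushed response has now returned
```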
Handling the bus address for this simple controller is really easy. As long as we aren’t in the middle of any operations, we can set the address to the CPU’s requested address. Even better, we can use the same logic for both read and write addresses.
AXI requires that an AxPROT signal accompany any request. Looking through the AXI spec, it looks like 3'h0 will work nicely for us. This will specify an unprivileged, secure data access.
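Both pieces might be sketched as below. The axi_addr register name is my assumption; the shared-register approach follows the description above.

```verilog
// Sketch: one address register feeds both the read and write channels
always @(posedge S_AXI_ACLK)
if (i_stb && !o_busy)
	axi_addr <= i_addr;

assign	M_AXI_AWADDR = axi_addr;
assign	M_AXI_ARADDR = axi_addr;

// Unprivileged, secure data access
assign	M_AXI_AWPROT = 3'h0;
assign	M_AXI_ARPROT = 3'h0;
```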
That brings us to setting M_AXI_WDATA and its associated strobes, M_AXI_WSTRB. In general, we’re going to need to upshift these values based upon where the data given to us will fit on the bus. I like to use AXILSB to capture the number of address bits, in an AXI interface, necessary to define which octet the address is referencing.
Remember not to copy Xilinx’s formula for this bus width, since their calculation is only valid for 16, 32, or 64-bit bus widths. (You can see their bug here. In their defense, this doesn’t really matter in an AXI-lite interface, since Xilinx only allows AXI-lite to ever have a data width of 32-bits. Sadly, they made the same mistake in their AXI full demonstrator.)
We can now use this value to shift our data input by eight times the value of these lower address bits to place our write data in its place on the bus.
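In logic, that shift might look like the sketch below, assuming a 32-bit AXI-lite bus so that AXILSB is 2. The axi_wdata name appears in the discussion below; the rest is illustrative.

```verilog
localparam	AXILSB = 2;	// i.e. $clog2(C_AXI_DATA_WIDTH/8)

always @(posedge S_AXI_ACLK)
if (i_stb && !o_busy)
	// Shift the CPU's data up into the addressed byte lanes,
	// multiplying the subword address bits by eight
	axi_wdata <= i_data << (8 * i_addr[AXILSB-1:0]);
```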
We’ll come back in a moment and assign M_AXI_WDATA to be the same as axi_wdata. For now, let’s just note that the logic for the write strobes is almost identical. In this case, we’re upshifting a series of ones, one for each byte we wish to write, by the subword address bits. The big difference is that we aren’t multiplying the low order address bits by eight like we did for the data.
There’s one last step here, and that is that we need to keep track of both the operation size as well as the lower bits of the address. We’re going to need these later, on a read return to know how to grab the byte of interest from the bus.
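Both the strobe shift and the saved state might be sketched as follows, again assuming a 32-bit bus. The axi_wstrb name and the r_op packing are assumptions on my part; r_op itself is used again in the read-return discussion below.

```verilog
always @(posedge S_AXI_ACLK)
if (i_stb && !o_busy)
begin
	// One strobe bit per byte written, shifted by byte lanes only:
	// no multiply by eight this time
	casez (i_op[2:1])
	2'b01:   axi_wstrb <= 4'b1111 << i_addr[AXILSB-1:0]; // 32-bit
	2'b10:   axi_wstrb <= 4'b0011 << i_addr[AXILSB-1:0]; // 16-bit
	default: axi_wstrb <= 4'b0001 << i_addr[AXILSB-1:0]; //  8-bit
	endcase

	// Remember the size and subword address for the read return
	r_op <= { i_op[2:1], i_addr[AXILSB-1:0] };
end
```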
That leaves only one other signal required to generate a bus request, and that signal is going to tell us if and when we need to abort the request due to the fact that it will require two operations. For this initial implementation, we’ll simply return an error to the CPU in this case. We’ll come back to this in a moment to handle misaligned accesses, but this should be good enough for a first pass.
An access is misaligned if the access doesn’t fit within a single bus word. For a 4-byte request, if adding 3 to the address moves you into the next word then the request is misaligned. For a 2-byte request, if adding one moves you to the next word then the request is misaligned. Single byte requests, however, cannot be misaligned.
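With a 32-bit bus word, those rules reduce to a check on the two low address bits. A sketch, with w_misaligned as an assumed name:

```verilog
reg	w_misaligned;

always @(*)
casez (i_op[2:1])
2'b01:   w_misaligned = (i_addr[1:0] != 2'b00); // 4B: +3 crosses words
2'b10:   w_misaligned = (i_addr[1:0] == 2'b11); // 2B: +1 crosses words
default: w_misaligned = 1'b0;                   // 1B: always fits
endcase
```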
That’s what it takes to make a request of the bus.
The next step is to handle the return from the bus, and to forward it to the CPU.
The first part of any return to the CPU is returning a value. We’ll have a value to return if and when RVALID is true. We’ll take a clock cycle to set this o_valid flag, so as to allow us a clock cycle to shift the result into its proper position.

For now, notice that o_valid needs to be kept clear following a reset of any type. Further, it needs to be kept clear if we are flushing responses as part of a CPU reset separate from an AXI bus reset. Finally, we’ll set the o_valid flag on RVALID as long as the response didn’t indicate an error.
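Those rules might reduce to a sketch like the one below. The exact conditions are assumptions drawn from the text above, not the controller’s verbatim source.

```verilog
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN || i_cpu_reset || r_flushing)
	o_valid <= 1'b0;	// Never valid during a reset or a flush
else
	// RRESP[1] is set for both the SLVERR and DECERR responses
	o_valid <= M_AXI_RVALID && !M_AXI_RRESP[1];
```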
We now turn our attention to the CPU error flag. In general, an error will be returned when either BVALID or RVALID is true and the response is an error. We’ll also return an error on any request to send something misaligned to the bus. The exceptions, however, are important. If the CPU is reset, we don’t want to return an error, nor do we want to return one while we are waiting for that reset to complete.
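In sketch form, again under the assumptions above (w_misaligned is an assumed name):

```verilog
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN || i_cpu_reset || r_flushing)
	o_err <= 1'b0;	// The exceptions: no errors on reset or flush
else
	o_err <= (M_AXI_BVALID && M_AXI_BRESP[1])	// Write error
		|| (M_AXI_RVALID && M_AXI_RRESP[1])	// Read error
		|| (i_stb && !o_busy && w_misaligned);	// Misaligned req.
```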
We’ll also need to return some busy flags. This core is busy if ever M_AXI_BREADY or M_AXI_RREADY are true. We’ll also set our o_pipe_stalled flag to be equivalent to o_busy for this simple controller, but that setting will be external to this logic. Similarly, the CPU can expect a response if M_AXI_RREADY is true and we aren’t flushing the result.
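For this single-operation controller, those flags reduce to simple combinatorial assignments, roughly:

```verilog
assign	o_busy         = M_AXI_BREADY || M_AXI_RREADY;
assign	o_pipe_stalled = o_busy;	// One outstanding op at most
assign	o_rdbusy       = M_AXI_RREADY && !r_flushing;
```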
When returning a result to the CPU, we need to tell the CPU which register to write the read result into. Since this simple memory controller only ever issues a single read or write request of the bus, we can choose to simply capture the register on any new request and know that there will never be any other register to return.
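Since only one register ever needs remembering, a sketch of this capture is tiny:

```verilog
always @(posedge S_AXI_ACLK)
if (i_stb && !o_busy)
	o_wreg <= i_oreg;	// Only ever one register to return
```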
That leaves us only one more signal to return to the CPU, the result of a data read. There are two parts to returning this value. The first part is that we’ll need to shift the value back down from wherever it is placed in the return word. This was why we kept the subword address bits in our r_op register. We also kept the size of our operation in the upper bits of r_op. We can use these now to zero extend octets and half words into 32 bits.
Some CPU’s sign extend sub word values on reading. Not the ZipCPU. The ZipCPU zero extends subword values to full words on any read. This behavior, however, is easy enough to adjust if you want a different behavior.
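As a sketch of this return path, consider the following Python model of the shift-and-zero-extend step on a 32-bit little-endian bus. The names here are mine, not the controller’s; this is only meant to illustrate the idea.

```python
def extract_return(bus_word, addr, size_bytes):
    """Model the CPU read return on a 32-bit little-endian bus.

    Shift the requested bytes down from their lane within the bus word
    (selected by the sub-word address bits), then zero extend -- never
    sign extend -- to a full 32-bit result.
    """
    shift = (addr % 4) * 8                 # sub-word address selects the lane
    mask = (1 << (8 * size_bytes)) - 1     # keep only the requested bytes
    return (bus_word >> shift) & mask
```

Reading one byte from address 1 of the bus word 0xDDCCBBAA, for example, yields 0xBB, zero extended to 32 bits.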
There you go, a basic AXI-lite based CPU memory controller.
Handling Misaligned Requests
Perhaps I should have been satisfied with that first draft of a basic memory controller.
The draft controller will return a bus error response to the CPU if you ever try to write a misaligned word to the bus. Try, for example, to write a 32-bit word to address three. The operation will fail with a bus error. This was by design. Why? Because otherwise you’d then need to write across multiple words.
Well, why can’t we build a controller that will read or write across multiple words when requested? Such a controller could handle misaligned requests.
So, let’s start again, using the design template above, and see if we can adjust this controller to handle misaligned requests.
The first thing we are going to need are some flags to capture a bit of state. Let’s try these:
misaligned_aw_request: This is the first request of two AW* requests, as a result of a misaligned write.
misaligned_request: This is the first request of either two W* requests or two AR* requests.
misaligned_response_pending: Two responses are expected. As a result, if misaligned_response_pending is ever true, then we still expect either two BVALID returns or two RVALID returns. (One might be present on this clock cycle.)
misaligned_read: This signal is very similar to misaligned_response_pending, except that it isn’t cleared on the first read response. It’s used at the end to let us know that two read results need to be merged together into one before returning them to the CPU.
pending_err: Of our two responses, the first has returned an error. Since it is only the first of two, we haven’t returned the error response to the CPU yet. Hence, if pending_err is set, then we need to return a bus error to the CPU on the next bus return–regardless of what status response is returned with it.
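To see how a pair of these flags interact, here’s a small Python walkthrough of a two-beat misaligned return. This is my own paraphrase of the rules above, not the design’s actual logic.

```python
def misaligned_return_status(first_err, second_err):
    """Walk both beats of a misaligned return through the flags above.

    misaligned_response_pending clears on the first response, while
    pending_err latches an error seen on that first beat so the CPU
    still gets a bus error when the second response finally arrives.
    """
    misaligned_response_pending = True   # two responses still expected
    pending_err = False
    for err in (first_err, second_err):
        if misaligned_response_pending:
            # First beat: latch any error, but say nothing to the CPU yet.
            pending_err = err
            misaligned_response_pending = False
        else:
            # Second beat: report a bus error if either beat erred.
            return pending_err or err
```

Either beat erring is enough to turn the final return into a bus error.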
We can now go back to the top and re-examine our handshaking request flags.
The big difference here is in how we handle a return. If a misaligned request is outstanding, then you don’t want to drop xVALID on the first cycle–you will want to wait for the second return. The same applies to waiting for two returns.
That’s the big change there. The logic required to start a memory operation won’t change.
The r_flushing signal, indicating that we shouldn’t forward results to the CPU, is a little more complex. The big difference here is if a misaligned response is pending. In that case, we don’t want to clear our r_flushing signal on a return, but rather on the next return.
Address handling gets just a touch more complicated as well. In this case, any time an address is accepted we’ll increment it to the next word address.
There are a couple of things to remember here. First, when handling a misaligned request, we must always move to the next word–that’s what a misaligned request is. Second, the low address bits should be zero. This will be appropriate for little endian systems. It’s not necessarily appropriate for big endian systems like the ZipCPU, but at least it won’t hurt.
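In Python terms (my sketch, assuming a 32-bit bus), the second-beat address is simply:

```python
def next_word_addr(addr, bus_bytes=4):
    """Address used for the second beat of a misaligned request:
    the next bus word, with the low (sub-word) address bits zeroed."""
    return (addr & ~(bus_bytes - 1)) + bus_bytes
```

So a misaligned request starting at address 3 or 6 continues at address 4 or 8 respectively.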
The next trick is generating the M_AXI_WSTRB values indicating which bytes to write, together with the values to be written to them. The trick to making this work is to map the request onto two separate bus words. Once mapped to two words, we can then send the result to those words one at a time.
We’ll add one intermediate step here, though, which is to create this value combinatorially first. This just simplifies writing the logic out, but not much more. Note that we are again shifting a set of ones up by the address in the low order bits of i_addr–just like we did before, only this time onto two words worth of byte enables instead of just one.
The next change is that we’ll add two new registers: next_wdata and next_wstrb. These will hold the next values of M_AXI_WDATA and M_AXI_WSTRB–the values we’ll use for them on the second clock cycle of any misaligned request.
Here’s the first of the key steps, with next_wdata and next_wstrb. The logic is identical to the logic we used before, save that it is applied across two bus words.
Since axi_wdata and axi_wstrb are going to map directly to M_AXI_WDATA and M_AXI_WSTRB, that just leaves handling the second write cycle. For that, we just copy next_wdata into axi_wdata as soon as the channel isn’t stalled. We’ll likewise do the same for the write strobes.
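Here’s a hedged Python model of that two-word mapping, assuming a 32-bit little-endian bus. The function and tuple layout are my own illustration of the shift described above, not the Verilog itself.

```python
def split_write(addr, data, size_bytes):
    """Map one (possibly misaligned) write onto two 32-bit bus words.

    Shift the data and a run of ones up by the sub-word address, then
    slice the two-word-wide result into the first-cycle values and the
    second-cycle (next_wdata/next_wstrb style) values.
    """
    lanes = addr % 4                            # low-order address bits
    wide_data = data << (8 * lanes)             # two words worth of data
    wide_strb = ((1 << size_bytes) - 1) << lanes
    first = (wide_data & 0xFFFFFFFF, wide_strb & 0xF)
    second = (wide_data >> 32, wide_strb >> 4)  # all zero when aligned
    return first, second
```

For an aligned write the second tuple comes back all zero, so no second beat is needed.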
To handle both capabilities, we’ll create a single-bit OPT_ALIGNMENT_ERR parameter. If this bit is set, misaligned requests will generate bus errors. If not, misaligned requests will be allowed to take place.
We’ll also split our misalignment signal into two. The first signal, w_misaligned, will simply indicate a misaligned request. The second, w_misaligned_err, will indicate that we want this misaligned request to turn into a bus error.
The next big component will be handling our new misalignment signals. Obviously, if we are just generating errors on any misaligned request, then we won’t need these signals and they can be kept at zero. This will allow the optimizer to simplify our logic whenever OPT_ALIGNMENT_ERR is set.
On the other hand, if we are generating misaligned requests, then we’ll need to define these signals. The first, misaligned_request, indicates that this is a misaligned request, and that a second W* or AR* operation is required.
Since the AW* and W* channels need to be handled independently, we need a separate signal to handle the second request on the AW* channel. This signal is almost identical to misaligned_request above, save that it is tied to the AW* channel handshake.
Knowing if a response will be the first of two expected is the purpose of misaligned_response_pending. It’s set much the same as the other two. The big difference in this signal is that it is cleared on either M_AXI_BVALID or M_AXI_RVALID–the first return of the misaligned response.
The next signal, misaligned_read, simply tells us we will need to reconstruct the read response from two separate read values before returning it to the CPU.
Finally, our last misalignment signal is the
pending_err signal. This signal
gets set on any write or read error, and then cleared when that error is
returned to the CPU. Once set, we’ll clear it any time the interface
clears. This guarantees that we’ll be clear following any request or response
to the CPU as well.
The next several signals have only minor modifications. The o_valid signal, indicating a valid read return to the CPU, needs to be adjusted so that it waits for the second return of any misaligned response. Similarly, we don’t return o_valid if either the current or past response, in the case of a pair of responses, indicates a bus error. In those cases, we’ll set o_err instead.
The error return is also quite similar. There are only a few differences. The first is that we don’t want to return an o_err response if there’s still a response pending. The second difference is that we’ll also return an error response if the prior response indicated a bus error.
Our busy signal returns to the CPU don’t change. Those are the same as before, as is the register address returned with any read.
That leaves one complicated piece left of this–reconstructing the read return. This is sort of the reverse of the next_wdata and axi_wdata handling from above, save that this time we are using M_AXI_RDATA and last_result. Note the reverse ordering–the first value is always going to be on the right in a little endian bus.
The first step is to construct the two-word wide return, and then to shift it appropriately so the desired data starts at the bottom byte. We handle this with a separate logic block so that we don’t get lint errors when shifting from a value of one size to another.
Now that we have this pre-result, we can construct our final value.
First, on any read return we copy the return to our last_result register, in case this is a misaligned return.
The next step is to turn this
pre_result value into the value we return
to the CPU. If this is a half-word or octet request, we’ll zero the upper
bits as well.
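Putting those two steps together, a Python sketch of the reconstruction might look like this (again, my model of the idea, not the controller’s code):

```python
def merge_read(first_word, second_word, addr, size_bytes):
    """Rebuild a (possibly misaligned) read on a 32-bit little-endian bus.

    The first return (held in something like last_result) sits on the
    right of the two-word value; the second return sits on the left.
    Shift down by the sub-word address, then zero the upper bits for
    half-word and octet requests.
    """
    wide = (second_word << 32) | first_word    # first return on the right
    shifted = wide >> (8 * (addr % 4))         # pre_result-style alignment
    return shifted & ((1 << (8 * size_bytes)) - 1)
```

This is the mirror image of the write-side split: a 32-bit value written across addresses 3 through 6 comes back whole once the two returned words are merged and shifted.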
In many ways, this second pass at this design illustrates the way most of my development has taken place recently. I’ll often draft a simple version of a design, and then slowly layer on top of it more and more complicated functionality until it’s everything I want.
In hindsight, the misalignment processing wasn’t nearly as complicated as I was fearing. I know, I tend to dread handling misaligned requests. However, it never seems to be that hard when I actually get down to building it. Once you adjust the signaling to handle two requests, the remaining process is fairly basic: place the data into a two word shift register and shift it as appropriate, then deal with each half of that register.
If you look over the design for this memory controller, you might notice other options as well. For example, there’s an option that will force all unused signals to zero, and there’s an option to sign extend the return data. We’ve already mentioned the OPT_ALIGNMENT_ERR option. Finally, there are some experimental SWAP_ENDIANNESS options that I’m still working with–as part of hoping that I can somehow keep a big endian CPU running on a little endian bus without massive changes. (I’m not convinced any of these endianness parameters, whether the SWAP_ENDIANNESS or the SWAP_WSTRB options, will work–they’re still part of my ongoing development.)
At this point in my design, I’ve only formally verified this memory controller. I haven’t yet simulated it. Yes, I’m expecting some problems when I get to simulation, but, hey, one step at a time, right?
Let’s now take some time, though, to look over some of the major parts of that proof. These include the AXI-lite interface properties, the CPU interface properties, and some cover checks to make sure the design works. This follows from what I’ve learned from previous experiences about what works for verifying a design. Perhaps it will work the first time I try it in simulation. We’ll see. (I’m still not convinced the big-endian CPU will work with this little-endian controller, formal proof or not … but we’ll see.)
Two years ago, I posted a set of interface properties for working with AXI-lite. At the time, I was very excited about these properties. By capturing all the requirements of an AXI-lite interface into a set of formal properties, I could simplify any future verification problems. I predicted AXI-lite designs would become easy to build as a result.
I haven’t been disappointed. While I’ve made small adjustments to those properties since that time, they’ve seen me through a lot. Using them, I’ve been able to very quickly check designs posted on Xilinx’s forums. The check tends to take about a half hour or so. Even better, it’s pretty conclusive.
So how hard is it? There are only a couple steps. First, on any new design, I start by instantiating my AXI-lite master property set into the design.
I’ll then create a SymbiYosys script file. These files are pretty basic, enough so that I now have a script to handle generating just about all but three lines of the file. At this point, I’ll run the design and often find any bugs.
This design is almost that simple. In this case, I also need to incorporate a CPU interface property file as well, but we’ll get to that part in the next section.
At this point, SymbiYosys will either return a bug in 20 clock cycles in about 5 seconds, or there will likely not be a bug in the design at all. Sometimes I’ll just run it for 40-50 cycles if I’m not sure–or longer, depending on my patience level.
Once I get that far, most of the bugs in the design are gone.
Perhaps I’m a bit of a perfectionist, but this is rarely enough for me. I like to go further and verify these same properties for all time via induction. This, to me, is just a part of being complete.
So let’s spend some time working through some properties we might use to guarantee that this design passes an induction check.
In the case of the AXI-lite bus, this primarily consists of constraining the faxil_*_outstanding counters, such as faxil_rd_outstanding. We’ll go a bit farther here, and constrain some of our internal signals as well.
For example, if we are ever in a misaligned_request, then either M_AXI_WVALID or M_AXI_ARVALID should be set, since this is our signal that we are in the first of two request cycles.
Similarly, if misaligned_aw_request is ever true, then we are in the first of two AWVALID cycles. That means M_AXI_AWVALID had better be true.
If no misaligned responses are pending, then we should be able to at least limit the number of outstanding items. If any of the request lines, whether M_AXI_AWVALID, M_AXI_WVALID, or M_AXI_ARVALID, are true, then, since there are no misaligned responses pending, there must be nothing outstanding. In all other cases, with no misaligned responses pending, the number of outstanding items must be no more than one.
Inequality constraints like this aren’t usually very effective, but they’re often where I’ll start a proof. Over time, I usually turn these inequalities into exact descriptions–although I didn’t do so for this design. Indeed, this particular proof is unusual in that the inequalities above are still important parts of my proof. (If I remove them, the proof fails …)
Of course, if there are no misaligned responses pending, then there can’t be any misaligned requests.
On the other hand, if a misaligned response is pending, and we are in a read
cycle, then the
misaligned_read signal should be true.
Now let’s turn our attention to flags specific to read cycles. For example, if we aren’t in a read cycle, then M_AXI_ARVALID, M_AXI_RREADY, and the number of outstanding read requests should all be zero.
On the other hand, if this is a read request, then this can only be a misaligned request if nothing else is outstanding. After that, we do our best to come up with the correct read count. (It’s still an inequality, but it’s enough …)
Let’s now turn our attention to write signals. If we aren’t in the middle of a write cycle, then the write signals should all be zero. There should be no writes outstanding, nor any being requested.
On the other hand, if we are within a write cycle, then what conclusions can we draw? If we are still within a request, then the number of outstanding items must be zero. Likewise, we will only ever have at most two requests outstanding at a time.
But once I got this far, I punted. I just wasn’t certain how to constrain the write counters. So, I fell back on an old trick I’ve come across: the case statement. Using a case statement, I can often work my way through all the possibilities of something. A case statement also forces me to think about each of the possibilities individually.
As I’ve mentioned before on this blog, don’t worry about creating too many assertions. If you do, the worst that will happen is that there will be a minor performance penalty–assuming that your assertions are valid. If you assert something that isn’t true, the formal tool will catch it, and you’ll be patiently corrected. Indeed, no number of extra assertions will ever get a design to pass an assertion that isn’t so. The problem usually isn’t too many assertions, but rather not having enough of them.
Moving on, and perhaps I should’ve asserted this earlier, we can either be in a write cycle, a read cycle, or no cycle. There should never be a time when both M_AXI_BREADY and M_AXI_RREADY are true together.
Let’s put a quick constraint on r_flushing: if we aren’t busy, then we shouldn’t be flushing any responses. Since we’ve constrained r_flushing to only ever be true if either M_AXI_BREADY or M_AXI_RREADY is, this also forces r_flushing to zero if nothing is outstanding and none of the xVALID request lines are active.
When putting this design together, I made some of the signals combinatorial. One example is o_busy, which is set if either M_AXI_BREADY or M_AXI_RREADY are true. I may wish to come back later and adjust this design so that o_busy is registered. Indeed, this sort of task is common enough that it forms the basis for a project I often use in my formal verification courseware: given a working design, with a working set of constraints, adjust a combinatorial value so that it is now registered, and then prove that the design still works. In order to support this possibility later, I’ve included the combinatorial o_rdbusy among my formal property set.
In general, I like to have one or more constraints forcing every register into its correct setting with respect to everything else. Here, we constrain pending_err. If we are busy, and there’s a misaligned response pending, then we haven’t yet gotten our first response back in return. Therefore, if we haven’t gotten the first of the two responses back, pending_err should be zero. It shouldn’t get set until and unless one of our return responses comes back in error.
While I have more assertions in this section of the design, that’s probably enough to convince you that I’ve fully constrained the faxil_*_outstanding counters to the internal state of the design.
What we haven’t done yet, however, is constrain the other half of the design: the CPU interface. Let’s do that next.
One of the challenges associated with blindly attempting to formally verify an AXI design you’ve never seen before is that many AXI designs, like this one, are effectively bridges. That means they have two or more interfaces to them. An interface property file will only provide you with properties for one of those interfaces. You’ll still need to constrain the other interface.
In the case of the ZipCPU, there are two interfaces to memory: instruction and data. The ZipCPU also has many memory interface implementations split across those two categories. When it comes to instruction fetching, the ZipCPU has a very simple and basic single instruction fetch, a pipelined fetch that keeps two instructions in flight, and an instruction fetch with a cache. In a similar vein, the ZipCPU has three basic data interfaces: a basic single load or store interface, a pipelined memory controller, and a data cache. These three implementation levels have served the ZipCPU well, allowing me to easily adjust the CPU to fit in smaller spaces, or to use more logic in order to run faster in larger spaces.
Those original interfaces, however, are also all Wishbone interfaces.
When it came time to build an AXI interface, I stepped back to rethink my verification approach. The problem with each of those prior memory controllers was that they each had their own assumptions about the CPU within them. When I then verified the CPU, I switched those assumptions to assertions, but otherwise verified the CPU with the memory interfaces intact within it. The consequence of this approach was that I needed to re-verify the CPU with every possible data interface it might have.
This seemed rather painful, so I separated the instruction and data interface assumptions from their respective controllers into one of two property files: one for the instruction interface, and another for the data interface. The property files therefore describe a contract between the CPU core and the instruction and data interfaces. Anything the CPU core needs to assume about those interfaces gets asserted when verifying the interface, or assumed when verifying the CPU. By capturing this contract in one place, verifying new interfaces has become much easier.
All of the former Wishbone memory interfaces have now been re-verified using one of these two property sets as appropriate.
Not only that, but now the ZipCPU has new AXI interfaces. There’s an AXI-lite instruction fetch module that can handle 1) a single outstanding transaction, 2) two outstanding instruction fetch bus transactions, or even 3) an arbitrary number of outstanding instruction fetch transactions. I’ve also rebuilt the ZipCPU’s prefetch and instruction cache. One neat feature of these new AXI or AXI-lite interfaces is that they are all parameterized by bus width. That means that I won’t need to slow a 64-bit memory bus down to a 32-bit width for the CPU anymore.
It’s not just instruction fetch interfaces, either. This approach has made it easy to build data interfaces in the same way.
For now, let’s take a look at how easy it is to use this new interface.
The first step is to declare some signal wires to be shared between the memory module and the interface property set. These extra (formal verification only) signals are:
cpu_outstanding: A count of how many requests the CPU thinks the memory unit is working on. This count will get cleared on a CPU reset.
f_done: This signal is generated by the memory controller to tell the property set that an operation has completed–whether read or write. Normally, something like this would be part of the interface between the memory unit and the CPU, something like o_err above. However, there’s no means in this interface to announce the completion of a write operation other than by dropping o_busy, so f_done takes its place.
f_last_reg: A copy of the register targeted by the last CPU load operation. This is important for the CPU pipeline, since there’s enough room in the CPU pipeline to read into any register but the last one, and so this last register needs to be tracked by the memory property set.
f_addr_reg: One of the rules of pipelined memory operations is that, in any string of ongoing operations, they all need to use the same base address register, from which the read or write addresses are determined. In particular, the address register shall not be written to by any operation–save perhaps the last one. The CPU will ensure this by never issuing a read request into the address register unless it waits for the memory controller to finish all of its reads first. The property set will accept this value as f_areg–again, it’s not part of the CPU’s interface proper, so we just assume its presence here. The CPU will actually produce such a register, since it knows what it is, and properties of that register will be asserted there–here they are only assumed.
f_pc: This flag, returned from the memory property set, will be true if the last read request is to read into either the program counter or the condition codes, both of which might cause the CPU to branch. Reads into the program counter or condition codes, if done at all, need to be the last read in any string. This return wire, from the property set, helps to make sure that property is kept.
f_gie: The ZipCPU has a lot of “GIE” flags all throughout it. “GIE” in this case stands for “Global Interrupt Enable”. In supervisor mode, the “GIE” bits are clear, whereas they are set in user mode–the only mode where the ZipCPU can be interrupted. These are also the most significant bit in any register address–since the ZipCPU has one register set for user mode (GIE=1) and one for supervisor mode (GIE=0).
Any string of read (or write) operations must have the same GIE bit, so this flag captures what that bit is.
f_read_cycle: This value, returned by the interface property set, just keeps track of whether we are in a read cycle or a write cycle. To avoid hazards, the ZipCPU will only ever do reads or writes–never both at the same time. Knowing this value helps keep track of what types of request are currently outstanding, so we can make sure we don’t switch cycles.
f_axi_write_cycle: This one won’t get used below. It’s a new one I had to create to support exclusive access when using AXI.
First, a brief overview of how AXI exclusive access works when using the ZipCPU: the CPU must first issue a LOCK instruction, and then a load instruction of some size. This load is treated as an AXI exclusive access read, so M_AXI_ARLOCK will be set. If the read comes back as OKAY, rather than EXOKAY, a bus error is returned to the CPU indicating that the memory doesn’t support exclusive access. Otherwise, an ALU operation may (optionally) take place, followed by a store instruction. If the store instruction fails, that is, if the result is OKAY in spite of receiving the EXOKAY result from the previous read access, then the memory controller returns a read result. (From a write operation? Yes!) That read result writes into the program counter a jump back to the original LOCK instruction to start over.
For this reason, an AXI exclusive access store instruction is the only type of write instruction that will set o_rdbusy.
That’s a long story, just to explain why this flag is necessary–to explain why o_rdbusy might be set on a store instruction, and to help guarantee that the result (if any) will either be written to the program counter or quietly ignored if the write was successful.
One last step before instantiating this property set is to create the f_done signal. For this AXI interface, that’s pretty easy. An operation is done when we receive either the M_AXI_BVALID or the M_AXI_RVALID signal–with a couple of exceptions. We’re not done if a bus error will be produced. That’s another, separate, signal. Neither are we done if there’s a pending error, if this is the first of two responses, or if we are flushing requests that a recently reset CPU wouldn’t know about.
Still, it’s not much more complicated than anything we’ve already done.
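As a summary of those conditions, here’s a boolean sketch in Python. The argument names mirror the signals discussed above, though the grouping is mine.

```python
def f_done(bvalid, rvalid, resp_err, pending_err,
           misaligned_response_pending, r_flushing):
    """Model of the f_done criteria: a response beat completes an
    operation only if it isn't an error, isn't the first of two
    beats, and isn't being flushed after a CPU reset."""
    if not (bvalid or rvalid):
        return False                      # nothing returned this cycle
    if resp_err or pending_err:
        return False                      # an error is signaled separately
    if misaligned_response_pending:
        return False                      # only the first of two responses
    if r_flushing:
        return False                      # a reset CPU isn't expecting this
    return True
```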
With that out of the way, we can simply instantiate the formal memory property interface.
Now, with all this background out of the way, we can finally verify this memory core. As I mentioned in the AXI-lite verification section above, if it weren’t for wanting to pass induction, these two property sets alone might well be sufficient to verify all but the data path through the logic.
How well does it work? Well, typically the formal tool takes less than twenty seconds to return any bugs. Even better, it points me directly to the property that failed, and the exact timestep where it failed.
That’s not something you’ll get from simulation.
However, since I like to verify a design using induction as well, I’ll often want to add some more properties.
Our first additional property just asserts that if we are ever flushing, then the CPU must have just been reset, so it shouldn’t be expecting anything more. If we aren’t flushing, then either this design is busy, or it is in the process of returning a result or an error to the CPU.
If f_pc is ever set, then our one (and only) output must be to either the program counter, o_wreg[3:0] == 4'hf, or the condition codes register, o_wreg[3:0] == 4'he. Otherwise, if f_pc is clear, then no reads can read into either the PC or CC registers.
If any items are outstanding, then o_wreg must match the last register address requested. Hence, following a load into any given register, o_wreg and f_last_reg should both point to the address of that register.
As long as we are busy, the high bit of the return register must match f_gie. This finishes our constraints upon all of the bits of o_wreg.
As one last property, let’s make sure f_read_cycle matches our logic. If M_AXI_RREADY is true, then we should be in a read cycle–unless we are flushing things out of our pipeline following a CPU reset. Similarly, if M_AXI_RREADY is not true and M_AXI_BREADY is, then we should be in a write cycle, and so we can assert that f_read_cycle is clear.
Notice how easy that was? All we had to do was to tie a couple of return wires from the interface property set together to the internal state of our design, and we then have all the properties we need.
As a final step in this proof, I’d like to see how well it works. For that, let’s just create some quick counters and count the number of returns we receive–both for writes and then again for reads.
Once we have this count, a simple cover check can produce some very useful and instructive traces. Indeed, these traces will show this core operating at its fastest speed.
The traces themselves are shown in Figs. 10 and 11 above. They show that this core can only ever achieve a 33% throughput at best.
No, 33% is not great. In fact, when you put 33% in context of the rest of any surrounding system, as we did in Fig. 13 above, it’s downright dismal performance. However, all designs need to start somewhere, and in many ways this is a minimally working AXI master memory controller design.
This is now our third article on building AXI masters. The first article discussed AXI masters in general followed by a demonstration Wishbone to AXI bridge. The second article discussed several of the problems associated with getting AXI bursts working properly, and why they are so challenging. This one returns to the simple AXI-lite master protocol, while also illustrating a working CPU memory interface.
As I’ve alluded to earlier, this is only the first of a set of three AXI memory controllers for the ZipCPU: This was the single access controller. I’ve also built a pipelined controller which should get much better performance. These are both AXI-lite designs. This particular controller also has an AXI (full) sister core, implementing the same basic design while also supporting exclusive access. My intent is to make a similar sister design to support pipelined access and AXI locking as well, but I haven’t gotten that far yet. I have gotten far enough, though, to have ported my basic Wishbone data cache to AXI. While usable, that work isn’t quite done yet, since it doesn’t (yet) support either pipelined memory access or locking, but it’s at least a data cache implementation using AXI that should be a step in the right direction. (Remember, I tend to design things in layers these days …)
Lord willing, I’d like to spend some time discussing AXI exclusive access operations next. I’ve recently modified my AXI property sets so that they can handle verifying AXI designs with exclusive access, and I’ve also tested the approach on an updated version of my AXI (full) demonstration slave. Sharing those updates will be valuable, especially since neither Xilinx’s MIG-based DDR3 memory controller nor their AXI block RAM controller appear to support exclusive access at all. (Remember, the ZipCPU will return a bus error on any attempt at an exclusive access operation on a memory that doesn’t support it, so having a supporting memory is a minimum requirement for using this capability.) This can then be a prelude to a companion article to this one, discussing how to modify this controller so that it handles exclusive access requests in the future.
Let me also leave you with one last haunting thought: What would happen if, during a misaligned read operation across two words, a write took place at the same time? That’ll be something to think about.
Whom shall he teach knowledge? And whom shall he make to understand doctrine? Them that are weaned from the milk, and drawn from the breasts. For precept must be upon precept, precept upon precept; line upon line, line upon line; here a little, and there a little (Is 28:9-10)