Building a Simple AXI-lite Memory Controller
When I first built the ZipCPU, I built it for the Wishbone bus. Wishbone is very easy to work with, and a good Wishbone pipeline implementation should be able to achieve (roughly) the same performance as AXI-lite. At the time, I had yet to build a crossbar interconnect, so my basic interconnect designs were fairly simple and depended upon the existence of no more than a single master. This forced the ZipCPU to have an internal arbiter, and to only expose that one Wishbone interface. You can see this basic structure in Fig. 1 below.
My first memory controller was quite simple. It could handle issuing a single read or write at a time and waiting for the return.
When this memory controller turned out to be a CPU performance bottleneck, I chose to write a pipelined memory controller. To get there, I first noticed that the CPU doesn’t need the results of any write operation, so nothing keeps the CPU from continuing with further operations while the write operation is ongoing. Even better, you could issue a string of write operations and as long as the memory controller was able to issue further bus requests, nothing kept the CPU from allowing many requests to be outstanding only to be retired later.
I continued this reasoning to reads as well. A string of memory reads could be issued by the CPU, under the condition that none of those reads overwrote either the base address register, from which the read addresses were being determined, or the program counter. When these conditions held, multiple read requests could then be issued to be retired later–just like the write requests above.
To see how this concept might work, consider Fig. 2 below showing a notional subroutine.
In this notional example, the CPU starts out with a jump to the subroutine instruction. The subroutine then creates a stack frame by subtracting from the stack pointer (`SUB`), and stores three registers to the stack frame via three store-word (`SW`) instructions. The memory controller then becomes busy handling these three requests. While the requests are active, further requests of the same type are allowed. Moreover, since the requests are to store data to memory, the CPU can go on with other instructions. It doesn't wait for the stores to complete, and so the CPU issues first an `ADD` and then an `AND` instruction.

Once the CPU is finished, it cleans up the stack frame by loading (`LW`) the copies of the registers it used back from the stack. These loads, however, need to first wait for the stores to complete–and so they stall the CPU.

Once all the loads have been issued, we then add to the stack pointer to return the stack frame to what it was. However, since the CPU doesn't keep track of which load requests are outstanding, it can't tell if this `ADD` is to a value yet to be returned from a `LW`. Therefore, the CPU stalls again until all loads are complete.
While this might seem slow, consider the alternative. What if the CPU had to wait for every load or store to complete before issuing the next one? Fig. 3 below gives a taste of what that might look like, save that we’ve allowed the CPU to still continue while store operations are ongoing.
There were a couple issues with this new approach, however. One was that my original interconnect implementation didn’t understand the concept of a currently active slave. Any slave could respond to a bus request and the interconnect would be none the wiser. Keeping the returns in order meant insisting that the memory accesses were to incrementing addresses, and that slaves were sorted on the bus by how long they would take to respond to a request–so that the fastest responding slaves were always at lower addresses. I handled this by insisting, within the instruction decoder, that any string of memory operations had to be to either the same address or subsequent addresses.
A second issue with this pipelined memory approach involved how to handle bus errors. Once a CPU can issue requests without waiting for their responses, then it becomes possible for the CPU to issue requests for multiple operations before the first one returns a bus error. While this makes analyzing a program in the debugger that much more challenging, the speed benefit provided by this approach was really quite tremendous, and often outweighed any drawbacks.
The result was a basic pipelined memory controller. As an example of the performance that could be achieved using this technique, the ZipCPU can toggle an output pin at 47MHz while running the bus at 100MHz, whereas others have measured the Zynq running a 250MHz bus as only able to toggle the same pin at 3.8MHz. In percentages, the ZipCPU was able to demonstrate a 47% bus utilization using this technique vs. the Zynq’s 1.5% bus utilization.
This pipelined memory architecture worked quite well for the ZipCPU. Hand optimized loops could easily be unrolled for much better performance. Without hand optimization, however, the greatest benefit of this technique was when generating or recovering stack frames where the technique was an awesome fit.
Indeed, I was a bit taken aback later when I finally built a data cache only to discover the pipelined memory controller was often as fast or faster than the data cache. What?? How could that happen? Well, part of the problem was the time it took to load the cache in the first place. Loading the cache could generate more memory requests than necessary, such as if the CPU only wanted a single value but had to load the entire cache line, and so the cache might unnecessarily slow down the CPU. The other problem was that my original data cache implementation resorted to single operations when accessing uncachable memory. As a result, I had to go back and retrofit the data cache to handle pipelined operations for uncached memory just to recover the lost performance.
Recently, however, I've found myself doing a lot of work with AXI and not Wishbone. How should the ZipCPU be modified to handle AXI? One approach would be to use my Wishbone to AXI bridge. This approach, however, loses some of the benefits of AXI. The Wishbone to AXI bridge will never allow both read and write transactions to be outstanding (nor will the CPU …), neither will it allow the CPU to use AXI bursts or to issue exclusive access requests. The straw breaking the camel's back, however, is simply the lost performance going through a bus bridge.
To avoid any lost performance when driving an AXI bus interface I would need to make the ZipCPU bus agnostic.
Bus Agnostic CPU Design
At present, I’m still in the process of making the ZipCPU bus agnostic. As a result, I don’t (yet) have any good examples of completed designs to show you how well (or poorly) the newly updated design works. Expect those within the year. For now, however, I’d like to discuss some of the changes that have taken place.
The ZipCPU as originally written had two problems when it comes to building a bus agnostic implementation. The first is that Wishbone was central to the CPU. The bus interface therefore needed to be removed from the CPU itself and made into a sort of wrapper. The second problem was that the Wishbone arbiter was integrated into the CPU. This also needed to be removed from the CPU core and placed into an external wrapper.
This naturally led to what I’m calling the ZipCore. The ZipCore is the logic left over after removing the bus logic from the original ZipCPU. The ZipCore is independent of any bus implementation. Instead, it exports a custom interface to both the instruction fetch and the memory controller.
This also presented a wonderful opportunity to separate the formal verification of the ZipCPU from the verification of the instruction and data bus interfaces. This is shown in Fig. 5 by the introduction of custom interface property sets sitting between the CPU and these two sets of interface modules. I now have one custom property set for verifying the instruction fetch, and another for verifying the memory controller. This means that any instruction fetch or memory controller meeting these properties will then be able to work with the ZipCore. As a result, I no longer need to verify that the ZipCore will work with a particular instruction fetch or a particular memory controller implementation. Instead, I just need to prove that those controllers will work with the appropriate custom interface property set. If they do, then they'll work with the CPU.
Of course, they’ll also need to work with the bus they are connected to, and so this requires a bus interface property set–either Wishbone or AXI, but we’ll get to that in a bit.
For now, let’s look at what the ZipCPU’s memory interface looks like.
CPU Interface
The ZipCPU’s memory controller interface can support one of two basic operations: read and write. Each leads to a slightly different sequence. These are shown in Fig. 6.
In the case of a write, the CPU provides the address and the value to be written to the controller. The controller then becomes busy. Once it finishes the task, if all goes well, it quietly ceases to be busy. If something went wrong, the memory controller will instead return an error.
Reads are similar, with the difference that the result needs to be returned to the CPU once the operation is complete. In this case, the memory controller sets a valid signal for the CPU, provides the value returned from the bus, and also returns to the CPU the register address that this value is to be written into. At least the way the ZipCPU handles this interface, it is the memory controller that keeps track of what register the result will be written into. That's what happens if all goes well. However, if the bus returns an error, then the controller will set an error flag instead of the valid flag. It's up to the CPU then to determine what to do in case of a bus error.
In general, the ZipCPU will do one of two things on a bus error. If the CPU is in user mode, it will switch to supervisor mode. If, on the other hand, the CPU is in supervisor mode then it will halt. If desired, an external wrapper can reset the CPU as an attempt to recover from the error, but in general it just halts and waits for the debugger. The S6SoC project was my one exception to this rule, since there was no room for an external debugging bus in that design. In that case, the CPU would simply restart, dump the last CPU register contents, and then attempt to continue a reboot from there.
No matter how the software handles the bus error, the memory controller will not return further results from any ongoing set of operations. Returns from outstanding reads following a bus error will be ignored. Outstanding writes may, or may not be, completed–depending on their status within the memory controller and the bus implementation. At a minimum, only one bus error will be returned. Further error responses from any outstanding accesses on the bus will not be returned to the CPU.
Fig. 7 on the left shows the basic interface between the CPU core and its memory controller used to implement these operations. Let's take a moment before going any further to discuss the various signals in this interface. Indeed, the basic interface is fairly simple:
- The bus reset, `i_bus_reset`: This is just another name for the system reset pin. Everything resets when the bus reset is asserted.

- The CPU reset, `i_cpu_reset`: With Wishbone, it's easy to reset the CPU separate from the bus. All you need to do is drop the `CYC` and `STB` lines. With AXI, this is a bit harder, since you will still get responses back from the bus from any requests that were made before your reset if you don't also reset the bus. This is why the memory interface separates the CPU reset from the system reset, so that the CPU can be reset separate from the rest of the design. It's up to the memory controller to make sure that no stale memory results, from before the reset, get returned to the CPU afterwards.

- `i_stb`: This is the basic request line. When the CPU raises `i_stb`, it wants to initiate a memory operation. For those familiar with the AXI stream protocol, you can think of this as `TVALID && TREADY`.

- `o_pipe_stalled`: This is the basic stall line. When raised, the CPU will not raise `i_stb` to make a request of the bus. Continuing with the AXI stream analogy from above, this is similar to the `!TREADY` signal in AXI stream.

- `i_op`: This specifies the type of operation. To keep logic counts low, the bits describing the memory operation are drawn directly from the instruction word. `i_op[0]` will be true for a write (store) instruction, and false for a read (load) instruction. `i_op[2:1]` then specifies the size of that operation: `2'b11` specifies a byte-wise read or write, `2'b10` a half-word/short (16b) operation, and `2'b01` a full word operation.

- `i_addr`: The address to be written to or read from. This only has meaning when `i_stb` is true.

- `i_data`: The data to be written to the address above. This only has meaning when both `i_stb` and `i_op[0]` are true. For 8-bit writes, only the lower 8 bits have meaning. Likewise, for 16-bit writes only the lower 16 bits have any meaning.

- `i_oreg`: For reads, this specifies the register address that the read result will be placed into upon completion. The memory unit will hold onto this value, and then return it to the CPU again later. In the case of the pipelined operators, this value will go into a FIFO to be returned later with any read results. This value is ignored in the case of writes.

- `o_busy`: If the memory core is busy doing anything, it will set `o_busy`. For example, if you issue a bus operation, then `o_busy` will go true. If you later reset the CPU, `o_busy` will remain true until the memory core can accept another operation. `o_busy` is subtly different from `o_pipe_stalled` in that the CPU may issue additional memory operations while `o_busy && !o_pipe_stalled`. However, the CPU will not start a new string of memory operations, nor will it change direction, while the memory core asserts `o_busy`.

  It's important to note that the CPU may move on to a non-memory instruction if `o_busy` is true, as long as `o_rdbusy` is low.

- `o_rdbusy`: If the memory core is busy reading values from the memory, then `o_rdbusy` will be set to indicate that a read is in progress and the CPU should not proceed to any other instructions (other than additional reads).

- `o_valid`: Once a read value is returned, `o_valid` will be set to indicate the need to write the value returned from the read to the register file. If all goes well, there will be exactly one `o_valid` for every `i_stb && !i_op[0]`, although CPU resets and bus errors may keep this count from being exact.

- `o_err`: This will be set following any bus error, with two exceptions: First, if the CPU is reset while operations are outstanding, then any bus error response for those outstanding operations will not be returned. Second, after the first bus error, the memory controller will first flush any ongoing operations before returning any more bus errors to the CPU.

- `o_wreg`: When returning a data value to the CPU, the `o_wreg` value tells the CPU where to write the value. This is basically the `i_oreg` value given to the memory controller reflected back to the CPU, together with the data value that goes with it.

- `o_result`: The value from any read return is provided in `o_result`. In the case of an 8-bit read, the upper 24 bits will be cleared. Likewise, for a 16-bit read, the upper 16 bits will be cleared.

  Some CPUs sign extend byte reads to the full word size; some do not. By default, the ZipCPU simply clears any upper bits. Two following instructions, a `TEST` instruction followed by a conditional `OR`, can turn a zero-extended read into a sign-extended read. Alternatively, changing the memory controller from one behavior to the other is fairly easy to do. Adjusting the GCC toolchain and following support, however, can take a bit more work.
There are two other important signals in this interface. These are signals we won’t be addressing today, but they are important parts of the more complex controller implementations.
- `i_clear_cache`: This is my way of dealing with caches and DMAs. The CPU can issue a special instruction to clear the cache if the memory may have changed independent of the CPU. This input is also asserted if the debug interface changes memory in a way the CPU is unaware of. If raised, the memory controller will mark any and all cached data as invalid–forcing the cache to reload from scratch on the next request.

- `i_lock`: This flag is used when implementing atomic memory instructions. It will be raised by a `LOCK` instruction, and then lowered three instructions later. This allows for certain four-instruction sequences: LOCK, LOAD, (ALU operation), STORE. A large variety of atomic instructions can be implemented this way. Examples include atomic adds, subtracts, or even the classic test-and-set instruction.

During these three instructions, the CPU is prevented from switching to supervisor mode on any interrupt until all three instructions following the lock have completed.

Atomic access requests are generally easy to implement when using Wishbone. The Wishbone cycle line is simply raised with the first LOAD instruction (LB, for load byte, in Fig. 9), and then held high between the LOAD and STORE instructions (SB, for store byte, in Fig. 9). Things are a bit more complicated with AXI, however, since AXI doesn't allow a bus master to lock the bus. Instead, the CPU will only discover if its atomic instruction was successful when/if the final store operation fails. In that case, the memory controller needs to tell the CPU to back up to the lock instruction and start over. How to make this happen, however, is a longer discussion for the day we discuss the full AXI version of this same memory controller.
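For reference, the CPU-facing portion of such a memory controller's port list might look something like the following sketch. The module name, widths, and port ordering here are assumptions of mine for illustration; the clock, resets, and the AXI-lite master ports are elided.

```verilog
module	memops_axil #(		// Hypothetical module name
		parameter	AW = 32	// CPU address width (assumed)
	) (
		// i_clk, i_bus_reset, i_cpu_reset, and the AXI-lite
		// master ports are not shown here
		// CPU request channel
		input	wire		i_stb, i_lock,
		input	wire	[2:0]	i_op,	// [0]=write, [2:1]=size
		input	wire [AW-1:0]	i_addr,
		input	wire	[31:0]	i_data,
		input	wire	[4:0]	i_oreg,
		input	wire		i_clear_cache,
		// CPU return channel
		output	wire		o_busy, o_rdbusy, o_pipe_stalled,
		output	reg		o_valid, o_err,
		output	reg	[4:0]	o_wreg,
		output	reg	[31:0]	o_result
		// ...
	);
```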
To see how this interface might work when driving an AXI bus, I thought I might provide examples of both writing to and reading from the bus. Here’s the write example.
Note the key steps:
- The CPU makes a request by setting `i_stb`, placing the data to be written into `i_data`, and the address of the transaction into `i_addr`.

- The memory controller then becomes busy. It raises both `M_AXI_AWVALID` and `M_AXI_WVALID` to request a transaction of the bus. In this example, we also raise `M_AXI_BREADY` as a bit in our state machine, to indicate that we are expecting a response to be returned from the bus in the future.

- `M_AXI_AWVALID` must remain high, and `M_AXI_AWADDR` must remain constant, until `M_AXI_AWREADY` is high. In this highly compressed example, `M_AXI_AWREADY` just happens to be high when `M_AXI_AWVALID` is set, so that `M_AXI_AWVALID` can be dropped on the next cycle.

  The same rule applies to `M_AXI_WVALID`. `M_AXI_WVALID` must stay high, and `M_AXI_WDATA` and `M_AXI_WSTRB` must stay constant, until `M_AXI_WREADY` is high.

  I've seen several beginner mistakes with this handshake. Remember: this chart in Fig. 10 is only representative! Some slaves will delay setting `M_AXI_AWREADY` longer than others, some will set `M_AXI_AWREADY` before `M_AXI_WREADY`, and others will set them in a different order. To be compliant, an AXI master must be able to deal with all these situations.

- In this compressed example, `M_AXI_BVALID` is set on the clock immediately following `M_AXI_AWVALID && M_AXI_AWREADY && M_AXI_WVALID && M_AXI_WREADY`.

  Do not depend upon this condition! I've seen beginner mistakes where the beginner's logic requires all four of these signals to be high at the same time. Remember, either one of these two channels might get accepted before the other.

- Once `M_AXI_BVALID` has been received, the memory controller drops `o_busy` to indicate that it is idle. A new request may then be made on the same cycle.
Now let’s take a look at a read example using this interface.
While this example is very similar to the previous write example, there are some key differences. Therefore, let’s walk through it.
- The CPU indicates the desire to read from memory by raising `i_stb` and placing the address to be read from in `i_addr`. The register that will be read into is also placed into `i_oreg`–the memory controller will need to return this value back when the operation is complete.

  Not shown is the `i_op` input indicating the size of the read, whether byte (8b), halfword (16b), or word (32b).

- Once the memory controller receives `i_stb`, it immediately sets `M_AXI_ARVALID` and `M_AXI_ARADDR` with the information it is given.

  This controller also sets `M_AXI_RREADY` high at this point, as part of its internal state tracking. This is to indicate that a read return is expected.

  Finally, the controller sets both `o_busy` and `o_rdbusy`. The first indicates that a memory operation is ongoing, and the second indicates that we will be writing back to a register upon completion. This latter flag, `o_rdbusy`, is used to prevent the CPU from moving on to its next operation and so helps avoid any pipeline hazard.

- `M_AXI_ARVALID` must stay high and `M_AXI_ARADDR` constant until the slave asserts `M_AXI_ARREADY`. In this example, that happens immediately, but this will not be the case with all slaves.

  Holding `M_AXI_ARVALID` high past `M_AXI_ARVALID && M_AXI_ARREADY` would request a second read. Since we don't want that here, we immediately drop `M_AXI_ARVALID` upon seeing `M_AXI_ARREADY`.

- Once the slave accomplishes the read, it sets `M_AXI_RVALID` and `M_AXI_RDATA`. Since the memory controller is holding `M_AXI_RREADY` high, these will only be set for a single cycle.

- The memory controller then copies the data from `M_AXI_RDATA` to `o_result` to send it back to the CPU. `o_valid` is set to indicate the result is valid. `o_rdbusy` is dropped, since we are no longer in the middle of any operation. Finally, `o_wreg` returns the register address that the CPU entrusted to the memory controller.
These are examples drawn from the controller we'll be examining today. Just to prove that the throughput of this CPU interface isn't bus limited in general, here is a trace drawn from an AXI-lite memory controller capable of issuing multiple ongoing operations.
Just for the purposes of illustration, I dropped `M_AXI_ARREADY` on the first cycle of the request for address `A3`, although this behavior is highly slave dependent. Doing this, however, helps to illustrate how a bus stall will propagate through that controller.

Notice how the CPU then suffers one stall, and that the result takes an extra cycle to return the item from that address. Beyond that, however, we'll need to save the examination of that controller for another day. For now we'll limit ourselves to a controller that can only handle a 33% bus throughput at best.
33% throughput? Is that the performance that can be expected from this type of controller? Well, not really. That would be the performance you’d see if this memory controller were connected directly to a (good) block RAM memory. If you connect it to a crossbar interconnect instead, you can expect it to cost you two clock cycles going into the interconnect, and another clock cycle coming out. Hence, to read from a block RAM memory, it will now cost you 6 cycles, not 3, for a 16% bus throughput. Worse, if you connect it to Xilinx’s AXI block RAM controller, it’ll then take you an additional 4 clock cycles. As a result, your blazing fast ZipCPU would be crippled down to one access for every 10 clock cycles simply due to a non-optimal bus architecture. Unfortunately, it only gets worse from there when you attach your CPU to a slower AXI slave.
Here’s a trace showing what that whole operation, from CPU through interconnect to Xilinx’s AXI block RAM controller and back might look like.
In this trace, we have the outputs of our controller, `M_AXI_ARVALID` and `M_AXI_ARADDR`, going into a crossbar. The crossbar forwards these requests to `BRAM_AXI_ARVALID` and `BRAM_AXI_ARADDR`, the inputs to Xilinx's AXI block RAM controller. This block RAM controller takes a clock cycle to raise `BRAM_ARREADY`, and then two more clock cycles before it raises its output on the result pipe, `BRAM_RVALID` and `BRAM_RDATA`. From here it will take another clock to go through the crossbar. This clock is the minimum timing allowed by the AXI spec. As a result, the read takes a full 10 cycles. The ZipCPU's memory interface will allow a second request as soon as this one returns, yielding a maximum throughput of 11%.
As I mentioned above, fixing this horrendous throughput will require a redesigned memory controller. Of course, a better AXI block RAM controller would also help as well.
We’ll get there.
For now, a working AXI memory controller is a good place to start from. We can come back to this project and optimize it later if we get the chance.
Basic Operator
Now that we know what our interface looks like, let’s peel the onion back another layer deeper to see how we might implement these operations when using AXI-lite.
First, let me answer the question of why AXI-lite and not AXI? And, moreover, what will the consequences be of not using the full AXI4 interface? For today’s discussion, I have several reasons for not using the full AXI4 interface:
- AXI-lite is simpler. This may be my biggest reason.

- AXI-lite can easily be converted to AXI (full) by just setting the missing signals.

- The CPU memory unit doesn't need AXI IDs. While a CPU might use two separate AXI IDs, only one would ever be needed for any source. Therefore, the fetch unit might use one ID and the memory controller another. If a DMA were integrated into the CPU, it might use a third ID, and so on. There's just no need for separate IDs in the memory controller itself.

- Since we're only implementing a single access at a time today, or in the case of misaligned accesses two accesses at a time, there's no reason to use AXI bursts.

  When (if) we get to building an AXI instruction or data cache, then bursts will make sense. In such cases, a natural burst length will be the size of a single cache line.

  While it might make sense to issue a burst request when dealing with misaligned accesses later, AXI's requirement that burst accesses never cross 4kB boundaries could make this a challenge. By leaving adjacent memory accesses as independent, we don't need to worry about this 4kB requirement.
There is one critical bus capability that we will lose by implementing this memory controller for AXI4-lite rather than AXI4 (full), and that is the ability to implement atomic access instructions. If for no other reason, let's consider this implementation only a first draft of a simple controller, so that we can come back with a more complicated and full-featured controller at a later time. Indeed, if you compare this core to a comparable full AXI memory controller, you'll see the two mostly share the same structure.
For now, let’s work our way through a first draft of setting our various AXI4-lite signals.
The first signals we'll control will be the various `xVALID` and `xREADY` signals associated with any AXI request. As discussed above, we'll use the `xREADY` signals as internal state variables to know when something is outstanding. Hence, on a write request we'll set `M_AXI_BREADY`, and we'll clear it once the request is acknowledged. We'll treat read requests similarly, only using `M_AXI_RREADY` for that purpose instead.
The first step will be to clear these signals on any reset.
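In Verilog, that reset handling might look something like the sketch below. This is only a sketch of the idea; the actual controller may arrange this logic differently.

```verilog
	initial	{ M_AXI_AWVALID, M_AXI_WVALID, M_AXI_ARVALID } = 0;
	initial	{ M_AXI_BREADY,  M_AXI_RREADY } = 0;
	always @(posedge i_clk)
	if (i_bus_reset)
	begin
		// Drop any requests and forget any expected responses
		M_AXI_AWVALID <= 1'b0;
		M_AXI_WVALID  <= 1'b0;
		M_AXI_ARVALID <= 1'b0;
		M_AXI_BREADY  <= 1'b0;
		M_AXI_RREADY  <= 1'b0;
	end else begin
		// Completion handling and new requests, discussed next
	end
```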
While it's a little out of order, the next group in this block controls how to handle an ongoing operation. In general, if ever `AxREADY` is true then the associated `AxVALID` signal will be cleared. Likewise, once `BVALID` or `RVALID` are returned, we can close up and finish our operation and clear our `xREADY` signals.
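Continuing inside the else branch of that same block, a sketch of this completion handling might be:

```verilog
		// (inside the else branch of the block above)
		// Drop each VALID once its request has been accepted ...
		if (M_AXI_AWREADY)	M_AXI_AWVALID <= 1'b0;
		if (M_AXI_WREADY)	M_AXI_WVALID  <= 1'b0;
		if (M_AXI_ARREADY)	M_AXI_ARVALID <= 1'b0;
		// ... and retire the operation once its response returns
		if (M_AXI_BVALID)	M_AXI_BREADY  <= 1'b0;
		if (M_AXI_RVALID)	M_AXI_RREADY  <= 1'b0;
```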
As I mentioned above, getting this signaling wrong is a common beginner AXI mistake. Remember, the `AW*` and `W*` channels are independent, and `VALID` cannot be lowered until `READY` is seen.
The last step in controlling these signals is to set them on any request. Assuming a request is incoming, we'll want to set the various write flags if `i_op[0]` is ever true–indicating a write operation request. Otherwise, for read operations, we'll want to set `M_AXI_ARVALID` and `M_AXI_RREADY`.
Of course, that's only if a request is being made on this cycle. Hence, let's caveat these new values. If there's no request being made, then these lines should be kept clear. Likewise, if the request is for an unaligned address then (in our first draft) we'll return an error to the CPU and not issue any request. Finally, on either a bus error or a CPU reset we'll need to make certain that we don't start a new request that will immediately be unwanted on the next cycle.
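Putting those caveats together, the request logic in that block might look something like this sketch, where `w_misaligned` is the alignment check we'll build below:

```verilog
		// (also inside the else branch of the block above)
		if (i_stb && !i_cpu_reset && !o_err && !w_misaligned)
		begin
			if (i_op[0])
			begin	// Write (store) request
				M_AXI_AWVALID <= 1'b1;
				M_AXI_WVALID  <= 1'b1;
				M_AXI_BREADY  <= 1'b1;
			end else begin	// Read (load) request
				M_AXI_ARVALID <= 1'b1;
				M_AXI_RREADY  <= 1'b1;
			end
		end
```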
Judging from the AXI requests associated with Xilinx forum posts that I’ve examined, getting those five signals right tends to be half the battle.
There is another signal, however, that we'll need to pay attention to, and this is the one capturing whether or not the CPU was reset separately from the system. In such cases, we'll need to flush any ongoing bus operation without returning its results to the CPU at a later time. To handle this, we're going to implement an `r_flushing` signal. This signal will capture the idea of the bus being busy, even though the CPU isn't expecting a result from it.
This signal will be cleared on any system reset.
The primary purpose of this signal is to let us know to flush any outstanding returns following a CPU reset while a bus operation is ongoing without also needing to reset the bus.
There's one caveat to this, however, and that is that we don't want to set `r_flushing` if the CPU is reset on the same cycle the outstanding value is returned to the CPU. Otherwise, if the bus is idle, we can leave the `r_flushing` signal at zero–no matter whether the CPU is reset or not.
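Here's a sketch of how `r_flushing` might be built to capture those rules; the actual design may implement this differently.

```verilog
	initial	r_flushing = 1'b0;
	always @(posedge i_clk)
	if (i_bus_reset)
		// A full bus reset clears everything, so nothing to flush
		r_flushing <= 1'b0;
	else if (M_AXI_BREADY || M_AXI_RREADY)
	begin
		if (i_cpu_reset)
			// CPU reset while the bus is busy: flush the response
			r_flushing <= 1'b1;
		if (M_AXI_BVALID || M_AXI_RVALID)
			// ... unless that response arrives this same cycle
			r_flushing <= 1'b0;
	end else
		// Bus idle: nothing outstanding, so nothing to flush
		r_flushing <= 1'b0;
```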
Handling the bus address for this simple controller is really easy. As long as we aren’t in the middle of any operations, we can set the address to the CPU’s requested address. Even better, we can use the same logic for both read and write addresses.
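A sketch of that address handling, using an internal `axi_addr` register (the name is my own) to drive both channels:

```verilog
	always @(posedge i_clk)
	if (!o_busy && i_stb)
		// Idle: capture the CPU's requested address
		axi_addr <= i_addr;

	// The same register drives both the write and read address channels
	assign	M_AXI_AWADDR = axi_addr;
	assign	M_AXI_ARADDR = axi_addr;
```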
AXI requires an `AxPROT` signal to accompany any request. Looking through the AXI spec, it looks like `3'h0` will work nicely for us. This will specify an unprivileged, secure data access.
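Since these never change, they can simply be tied off:

```verilog
	// Unprivileged, secure, data access
	assign	M_AXI_AWPROT = 3'h0;
	assign	M_AXI_ARPROT = 3'h0;
```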
That brings us to setting `M_AXI_WDATA` and its associated `M_AXI_WSTRB`. In general, we're going to need to upshift these values based upon where the data given to us will fit on the bus. I like to use `AXILSB` to capture the number of address bits, in an AXI interface, necessary to define which octet the address is referencing.
Remember not to copy Xilinx’s formula for this bus width, since their calculation is only valid for 16, 32, or 64-bit bus widths. (You can see their bug here. In their defense, this doesn’t really matter in an AXI-lite interface, since Xilinx only allows AXI-lite to ever have a data width of 32-bits. Sadly, they made the same mistake in their AXI full demonstrator.)
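A width-safe way to define this parameter is just to use `$clog2`. (`C_AXI_DATA_WIDTH` is the bus width parameter name I'm assuming here.)

```verilog
	// Number of address bits selecting an octet within one bus word:
	// 2 for a 32-bit bus, 3 for a 64-bit bus, and so on
	localparam	AXILSB = $clog2(C_AXI_DATA_WIDTH/8);
```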
We can now use this value to shift our data input by eight times the value of these lower address bits to place our write data in its place on the bus.
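In other words, something along the lines of:

```verilog
	always @(posedge i_clk)
	if (i_stb)
		// Move the CPU's (right justified) data up into the byte
		// lanes selected by the subword address bits
		axi_wdata <= i_data << (8 * i_addr[AXILSB-1:0]);
```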
We'll come back in a moment and assign `M_AXI_WDATA` to be the same as this `axi_wdata`. For now, let's just note that the logic for `axi_wstrb` is almost identical. In this case, we're upshifting a series of `1`s, one for each byte we wish to write, by the subword address bits. The second big difference is that we aren't multiplying the low order address bits by eight like we did for the data.
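For a 32-bit bus, that strobe logic might look like this sketch, following the `i_op[2:1]` size encoding described earlier:

```verilog
	always @(posedge i_clk)
	if (i_stb)
	casez(i_op[2:1])
	2'b11:	 axi_wstrb <= 4'b0001 << i_addr[AXILSB-1:0];	// Byte
	2'b10:	 axi_wstrb <= 4'b0011 << i_addr[AXILSB-1:0];	// Half-word
	default: axi_wstrb <= 4'b1111 << i_addr[AXILSB-1:0];	// Full word
	endcase
```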
There’s one last step here, and that is that we need to keep track of both the operation size as well as the lower bits of the address. We’re going to need these later, on a read return to know how to grab the byte of interest from the bus.
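That bookkeeping only takes a register, here called `r_op` as in the discussion below:

```verilog
	always @(posedge i_clk)
	if (i_stb)
		// Remember the size and subword offset of this request, so
		// any read return can be shifted and masked appropriately
		r_op <= { i_op[2:1], i_addr[AXILSB-1:0] };
```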
That leaves only one other signal required to generate a bus request, and that signal is going to tell us if and when we need to abort the request due to the fact that it will require two operations. For this initial implementation, we’ll simply return an error to the CPU in this case. We’ll come back to this in a moment to handle misaligned accesses, but this should be good enough for a first pass.
An access is misaligned if the access doesn’t fit within a single bus word. For a 4-byte request, if adding 3 to the address moves you into the next word then the request is misaligned. For a 2-byte request, if adding one moves you to the next word then the request is misaligned. Single byte requests, however, cannot be misaligned.
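Assuming a 32-bit bus and the size encoding above, that check reduces to a quick inspection of the low address bits:

```verilog
	// Does this request spill into the next bus word?
	always @(*)
	casez(i_op[2:1])
	2'b01:	 w_misaligned = (i_addr[1:0] != 2'b00);	// 32-bit word
	2'b10:	 w_misaligned = (i_addr[1:0] == 2'b11);	// 16-bit half-word
	default: w_misaligned = 1'b0;			// Bytes always fit
	endcase
```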
Now, if this flag is ever true, we’ll skip issuing the request and instead return a bus error to the CPU. (We’ll get to that in a moment.)
That’s what it takes to make a request of the bus.
The next task is to handle the return from the bus, and to forward it to the CPU.
The first part of any return to the CPU is returning a value. We'll have a value to return if and when `RVALID` is true. We'll take a clock cycle to set this `o_valid` flag, so as to allow us a clock cycle to shift `RDATA` to the right value.
For now, notice that `o_valid` needs to be kept clear following a reset of any type. Further, it needs to be kept clear if we are flushing responses as part of a CPU reset separate from an AXI bus reset. Finally, we'll set the `o_valid` flag on `RVALID` as long as the bus didn't return an error.
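A sketch of that logic:

```verilog
	initial	o_valid = 1'b0;
	always @(posedge i_clk)
	if (i_bus_reset || i_cpu_reset || r_flushing)
		o_valid <= 1'b0;
	else
		// One cycle after RVALID, provided the response was okay
		o_valid <= M_AXI_RVALID && !M_AXI_RRESP[1];
```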
We now turn our attention to the CPU bus error flag. In general, a bus error will be returned when either `BVALID` or `RVALID` is returned and the response is an error. We'll also return a bus error on any request to send something misaligned to the bus. The exceptions, however, are important. If the CPU is reset, we don't want to return an error, nor do we if we are waiting for that reset to complete.
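Here's a sketch of how that error flag might be generated. (A set `BRESP[1]` or `RRESP[1]` bit is what marks an AXI error response.)

```verilog
	initial	o_err = 1'b0;
	always @(posedge i_clk)
	if (i_cpu_reset || r_flushing)
		// No errors while resetting, nor while flushing responses
		// belonging to a CPU state we've already thrown away
		o_err <= 1'b0;
	else if (M_AXI_BVALID || M_AXI_RVALID)
		o_err <= (M_AXI_BVALID && M_AXI_BRESP[1])
				|| (M_AXI_RVALID && M_AXI_RRESP[1]);
	else if (i_stb && w_misaligned)
		// First-draft behavior: misaligned requests become errors
		o_err <= 1'b1;
	else
		o_err <= 1'b0;
```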
We'll also need to return some busy flags. This core is busy if ever `M_AXI_BREADY` or `M_AXI_RREADY` are true. We'll also set our `o_pipe_stalled` flag to be equivalent to `o_busy` for this simple controller, but that setting will be external to this logic. Similarly, the CPU can expect a response if `M_AXI_RREADY` is true and we aren't flushing the result.
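These can all be simple combinatorial assignments:

```verilog
	// Busy whenever a response is still expected from the bus
	assign	o_busy   = M_AXI_BREADY || M_AXI_RREADY;
	// A register write-back is only pending for (unflushed) reads
	assign	o_rdbusy = M_AXI_RREADY && !r_flushing;
	// (o_pipe_stalled is tied to o_busy outside of this logic)
```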
When returning a result to the CPU, we need to tell the CPU which register to write the read result into. Since this simple memory controller only ever issues a single read or write request of the bus, we can choose to simply capture the register on any new request and know that there will never be any other register to return.
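Hence, capturing the destination register is a one-liner:

```verilog
	always @(posedge i_clk)
	if (i_stb)
		// One outstanding request at most, so one register suffices
		o_wreg <= i_oreg;
```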
That leaves us only one more signal to return to the CPU, the `o_result` from a data read. There are two parts to returning this value. The first part is that we'll need to shift the value back down from wherever it is placed in the returned bus word. This was why we kept the subword address bits in our `r_op` register. We also kept the size of our operation in the upper bits of `r_op`. We can use these now to zero extend octets and half words into 32 bits.
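Using the fields stored in `r_op`, a sketch of this return path might be:

```verilog
	always @(posedge i_clk)
	begin
		// Shift the requested octet(s) down to the bottom of the word
		o_result <= M_AXI_RDATA >> (8 * r_op[AXILSB-1:0]);

		// Then zero extend according to the operation's size
		casez(r_op[AXILSB +: 2])
		2'b11: o_result[31: 8] <= 0;	// Byte read
		2'b10: o_result[31:16] <= 0;	// Half-word read
		default: begin end		// Full word, keep all 32 bits
		endcase
	end
```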
Some CPUs sign extend subword values on reading. Not the ZipCPU. The ZipCPU zero extends subword values to full words on any read. This behavior, however, is easy enough to adjust if you want something different.
There you go, a basic AXI-lite based CPU memory controller.
Handling Misaligned Requests
Perhaps I should have been satisfied with that first draft of a basic memory controller.
I wasn’t.
The draft controller will return a bus error response to the CPU if you ever try to write a misaligned word to the bus. Try, for example, to write a 32-bit word to address three. The operation will fail with a bus error. This was by design. Why? Because otherwise you’d then need to write across multiple words.
Well, why can’t we build a controller that will read or write across multiple words when requested? Such a controller could handle misaligned requests.
So, let’s start again, using the design template above, and see if we can adjust this controller to handle misaligned requests.
The first thing we are going to need are some flags to capture a bit of state. Let’s try these:
- `misaligned_aw_request`: This is the first request of two `AW*` requests, as a result of a misaligned write.

- `misaligned_request`: This is the first request of either two `W*` requests, or two `AR*` requests.

- `misaligned_response_pending`: Two responses are expected. As a result, if `misaligned_response_pending` is ever true, then we still expect either two `BVALID` returns or two `RVALID` returns. (One might be present on this clock cycle.)

- `misaligned_read`: This signal is very similar to `misaligned_response_pending`, except that it isn't cleared on the first read response. It's used at the end to let us know that two read results need to be merged together into one before returning them to the CPU.

- `pending_err`: Of our two responses, the first has returned an error. Since it is only the first of two, we haven't returned the error response to the CPU yet. Hence, if `pending_err` is true, then we need to return a bus error to the CPU on the next bus return–regardless of what status response is returned with it.
We can now go back to the top and take another look at our `xVALID` and `xREADY` handshaking request flags.
The big difference here is in how we handle a return. If a misaligned request is outstanding, then you don't want to drop `xVALID` on the first cycle–you will want to wait for the second return. The same applies to waiting for two responses.
That’s the big change there. The logic required to start a memory operation won’t change.
The `r_flushing` signal, indicating that we shouldn't forward results to the CPU, is a little more complex. The big difference here is if a misaligned response is pending. In that case, we don't want to clear our `r_flushing` signal on the first return, but rather on the next one.
Address handling gets just a touch more complicated as well. In this case, any time an address is accepted we’ll increment it to the next word address.
There are a couple of things to remember here. First, when handling a misaligned request, we must always move to the next word–that's what a misaligned request is. Second, the low address bits should be zero. This will be appropriate for little endian systems. It's not necessarily appropriate for big endian systems like the ZipCPU, but at least it won't hurt.
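With those two caveats, the address logic might now look something like this sketch (the internal register name `axi_addr` is an assumption, as is this particular arrangement of the increment):

```verilog
	always @(posedge i_clk)
	if (i_stb && !o_busy)
		// New request: capture the address, word aligned
		axi_addr <= { i_addr[AW-1:AXILSB], {(AXILSB){1'b0}} };
	else if ((M_AXI_AWVALID && M_AXI_AWREADY)
			|| (M_AXI_ARVALID && M_AXI_ARREADY))
		// Once accepted, step to the next word so the second beat
		// of a misaligned request addresses the following word
		axi_addr <= axi_addr + (1 << AXILSB);
```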
The next trick is the `M_AXI_WDATA` and `M_AXI_WSTRB` values, indicating which bytes to write to and the values to be written to them. The trick to making this work is to map the request onto two separate bus words. Once mapped to two words, we can then send the result to those words one at a time.
We'll add one intermediate step here, though, which is to create the `M_AXI_WSTRB` value combinatorially first. This just simplifies writing the logic out, but not much more. Note that we are again shifting a set of ones up by the address in the low order bits of `i_addr`–just like we did before, only this time onto two words' worth of byte enables instead of just one.
The next change is that we'll add two new registers: `next_wdata` and `next_wstrb`. These will hold the next values of `M_AXI_WDATA` and `M_AXI_WSTRB`–the values we'll use for them on the second clock cycle of any misaligned request.
Here's the first of the key steps with `next_wdata` and `next_wstrb`: their logic is identical to the logic we used before, save that they are applied across two bus words.
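Concretely, for a 32-bit bus, the pair of registers might be loaded like this sketch, where `w_wstrb` stands in for the combinatorial strobe value mentioned above (the name is mine):

```verilog
	always @(posedge i_clk)
	if (i_stb)
	begin
		// Spread the request across two bus words.  The low word
		// goes out first, the high word (if needed) on beat two.
		{ next_wdata, axi_wdata } <= { 32'h0, i_data }
						<< (8 * i_addr[AXILSB-1:0]);
		{ next_wstrb, axi_wstrb } <= { 4'h0, w_wstrb }
						<< i_addr[AXILSB-1:0];
	end
```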
Given that `axi_wdata` and `axi_wstrb` are going to map directly to `M_AXI_WDATA` and `M_AXI_WSTRB`, that just leaves handling the second write cycle. For that, we just copy `next_wdata` to `axi_wdata` as soon as the channel isn't stalled. We'll likewise do the same for `next_wstrb` and `axi_wstrb`.
What about detecting a misalignment? More than that, what if we want this core to either generate a bus error as before on misalignment, or to issue multiple requests?
To handle both capabilities, we'll create a single-bit `OPT_ALIGNMENT_ERR` parameter. If this bit is set, misaligned requests will generate bus errors. If not, misaligned requests will be allowed to take place.

We'll also split our misalignment signal into two. The first signal, `w_misaligned`, will simply indicate a misaligned request. The second signal, `w_misaligned_err`, will indicate that we want this misaligned request to turn into a bus error.
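The detection itself is the same check we used in the first draft; the new signal simply gates what we do about it, something like:

```verilog
	// Misalignment only becomes a bus error if so configured
	assign	w_misaligned_err = OPT_ALIGNMENT_ERR && w_misaligned;
```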
The next big component will be handling our new misalignment signals. Obviously, if we are just generating errors on any misaligned request, then we won’t need these signals and they can be kept at zero.
This will allow the optimizer to simplify our logic when we just adjust the `OPT_ALIGNMENT_ERR` parameter.
On the other hand, if we are generating misaligned requests, then we'll need to define these signals. The first indicates that this is a misaligned request, and a second `W*` or `AR*` operation is required.
Since the `AW*` and `W*` channels need to be handled independently, we need a separate signal to handle the second request on the `AW*` channel. This signal is almost identical to `misaligned_request` above, save that it is cleared on `AWREADY`.
Knowing if a response will be the first of two expected is the purpose of `misaligned_response_pending`. It's set much the same as the other two. The big difference in this signal is that it is cleared on either `M_AXI_BVALID` or `M_AXI_RVALID`–the first return of the misaligned response.
The next signal, `misaligned_read`, simply tells us we will need to reconstruct the read response from two separate read values before returning it to the CPU.
Finally, our last misalignment signal is the `pending_err` signal. This signal gets set on any write or read error, and then cleared when that error is returned to the CPU. Once set, we'll clear it any time the interface clears. This guarantees that we'll be clear following any request or response to the CPU as well.
The next several signals have only minor modifications.
The `o_valid` signal, indicating a valid read return to the CPU, needs to be adjusted so that it waits for the second return of any misaligned response. Similarly, we don't return `o_valid` if either the current or past response, in the case of a pair of responses, indicates a bus error. In those cases, we'll set `o_err` instead, as discussed next.
The error return is also quite similar. There are only a few differences. The first is that we don't want to return an `o_err` response if there's still a response pending. The second difference is that we'll also return an `o_err` response if the prior response indicated a bus error.
Our busy signal returns to the CPU don't change. Those are the same as before, as is the `o_wreg` register.
That leaves one complicated piece remaining–reconstructing the read return. This is sort of the reverse of `next_wdata, axi_wdata` from above, save that this time we are using `M_AXI_RDATA, last_result`. Note the reverse ordering–the first value is always going to be on the right in a little endian bus.
The first step is to construct the two-word wide return, and then to shift it appropriately so the desired data starts at the bottom byte. We handle this with a separate logic block so that we don’t get lint errors when shifting from a value of one size to another.
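As a sketch, with `wide_result` as an assumed name for the two-word intermediate value:

```verilog
	always @(*)
	begin
		// For a misaligned read, the first (low) word arrived earlier
		// and was saved in last_result; otherwise RDATA stands alone
		if (misaligned_read)
			wide_result = { M_AXI_RDATA, last_result };
		else
			wide_result = { 32'h0, M_AXI_RDATA };
		// Shift so the first requested byte lands at bit zero
		pre_result = wide_result >> (8 * r_op[AXILSB-1:0]);
	end
```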
Now that we have this pre-result, we can construct our final value.
First, on any read return we copy the return to our `last_result` register–in case this is a misaligned return.
The next step is to turn this `pre_result` value into the value we return to the CPU. If this is a half-word or octet request, we'll zero the upper bits as well.
In many ways, this second pass at this design illustrates the way most of my development has taken place recently. I’ll often draft a simple version of a design, and then slowly layer on top of it more and more complicated functionality until it’s everything I want.
In hindsight, the misalignment processing wasn’t nearly as complicated as I was fearing. I know, I tend to dread handling misaligned requests. However, it never seems to be that hard when I actually get down to building it. Once you adjust the signaling to handle two requests, the remaining process is fairly basic: place the data into a two word shift register and shift it as appropriate, then deal with each half of that register.
If you look over the design for this memory controller, you might notice other options as well. For example, there's an `OPT_LOWPOWER` option that will force all unused signals to zero. There's an `OPT_SIGN_EXTEND` option to sign extend the return data. We've already mentioned the `OPT_ALIGNMENT_ERR` option. Finally, there are some experimental `SWAP_ENDIANNESS` options that I'm still working with–as part of hoping that I can somehow keep a big endian CPU running on a little endian bus without massive changes. (I'm not convinced any of these endianness parameters, either the `SWAP_ENDIANNESS` or the `SWAP_WSTRB` options, will work–they're still part of my ongoing development.)
Formal Verification
At this point in my design, I’ve only formally verified this memory controller. I haven’t yet simulated it. Yes, I’m expecting some problems when I get to simulation, but, hey, one step at a time, right?
Let’s now take some time, though, to look over some of the major parts of that proof. These include the AXI-lite interface properties, the CPU interface properties, and some cover checks to make sure the design works. This follows from what I’ve learned from previous experiences about what works for verifying a design. Perhaps it will work the first time I try it in simulation. We’ll see. (I’m still not convinced the big-endian CPU will work with this little-endian controller, formal proof or not … but we’ll see.)
AXI-lite interface
Two years ago, I posted a set of interface properties for working with AXI-lite. At the time, I was very excited about these properties. By capturing all the requirements of an AXI-lite interface into a set of formal properties, I could simplify any future verification problems. I predicted AXI-lite designs would become easy to build as a result.
I haven’t been disappointed. While I’ve made small adjustments to those properties since that time, they’ve seen me through a lot. Using them, I’ve been able to very quickly check designs posted on Xilinx’s forums. The check tends to take about a half hour or so. Even better, it’s pretty conclusive.
So how hard is it? There are only a couple steps. First, on any new design, I start by instantiating my AXI-lite master property set into the design.
I’ll then create a SymbiYosys script file. These files are pretty basic, enough so that I now have a script to handle generating just about all but three lines of the file. At this point, I’ll run the design and often find any bugs.
This design is almost that simple. In this case, I also need to incorporate a CPU interface property file as well, but we’ll get to that part in the next section.
At this point, SymbiYosys will either return a bug in 20 clock cycles in about 5 seconds, or there will likely not be a bug in the design at all. Sometimes I’ll just run it for 40-50 cycles if I’m not sure–or longer, depending on my patience level.
Once I get that far, most of the bugs in the design are gone.
Perhaps I’m a bit of a perfectionist, but this is rarely enough for me. I like to go further and verify these same properties for all time via induction. This, to me, is just a part of being complete.
So let’s spend some time working through some properties we might use to guarantee that this design passes an induction check.
In the case of the AXI-lite bus, this primarily consists of constraining the three counters: `faxil_awr_outstanding`, `faxil_wr_outstanding`, and `faxil_rd_outstanding`. We'll go a bit farther here, and constrain some of our internal signals as well.
For example, if we are ever in a `misaligned_request`, then either `WVALID` or `ARVALID` should be set, since this is our signal that we are in the first of two request cycles.
Similarly, if `misaligned_aw_request` is ever true, then we are in the first of two `AWVALID` cycles. That means `M_AXI_AWVALID` had better be true.
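Expressed as SystemVerilog assertions, those two constraints might read:

```verilog
	always @(*)
	if (misaligned_request)
		// First of two W*/AR* beats: one of them must still be valid
		assert(M_AXI_WVALID || M_AXI_ARVALID);

	always @(*)
	if (misaligned_aw_request)
		// First of two AW* beats
		assert(M_AXI_AWVALID);
```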
If no misaligned responses are pending, then we should be able to at least limit the number of outstanding items. If any of the request lines, whether `M_AXI_AWVALID`, `M_AXI_WVALID`, or `M_AXI_ARVALID`, are true, then, since there are no misaligned responses pending, there must be nothing outstanding. In all other cases, with no misaligned responses pending, the number of outstanding items must be no more than one.
Inequality constraints like this aren’t usually very effective, but they’re often where I’ll start a proof. Over time, I usually turn these inequalities into exact descriptions–although I didn’t do so for this design. Indeed, this particular proof is unusual in that the inequalities above are still important parts of my proof. (If I remove them, the proof fails …)
Of course, if there are no misaligned responses pending, then there can’t be any misaligned requests.
On the other hand, if a misaligned response is pending, and we are in a read cycle, then the `misaligned_read` signal should be true.
Now let’s turn our attention to flags specific to read cycles.
For example, if we aren't in a read cycle then `ARVALID`, `misaligned_read`, and the number of outstanding read requests should all be zero.
On the other hand, if this is a read request then this can only be a misaligned request if nothing else is outstanding. After that, we do our best to come up with the correct read count. (It's still an inequality, but it's enough …)
Let’s now turn our attention to write signals. If we aren’t in the middle of a write cycle, then the write signals should all be zero. There should be no writes outstanding, nor any being requested.
On the other hand, if we are within a write cycle, then what conclusions can we draw? If we are still within a request, then the number of outstanding items must be zero. Likewise, we will only ever have at most two requests outstanding at a time.
But once I got this far, I punted. I just wasn't certain how to constrain the write counters. So, I fell back on an old trick I've come across: the case statement. Using a case statement, I can often work my way through all the possibilities of something. A case statement also forces me to think about each of those possibilities individually.
As I've mentioned before on this blog, don't worry about creating too many assertions. If you do, the worst that will happen is that there will be a minor performance penalty–assuming that you have valid assertions. If you assert something that isn't true, the formal tool will catch it, and you'll be patiently corrected. Indeed, creating too many assertions will never let a design pass an assertion that isn't so. The problem isn't usually too many assertions, but rather not having enough of them.
Moving on, and perhaps I should've asserted this earlier, we can either be in a write cycle, a read cycle, or no cycle. There should never be a time when both `M_AXI_BREADY` and `M_AXI_RREADY` are true together.
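That's a one-line assertion:

```verilog
	// Never in both a write cycle and a read cycle at once
	always @(*)
		assert(!(M_AXI_BREADY && M_AXI_RREADY));
```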
Let's put a quick constraint on `r_flushing`: if we aren't busy, then we shouldn't be flushing any responses. Since we've constrained `o_busy` to only ever be true if either `M_AXI_BREADY` or `M_AXI_RREADY` is true, this also effectively forces `r_flushing` to zero if nothing is outstanding and none of the `AxVALID` or `WVALID` request lines are active.
When putting this core together, I made some of the signals combinatorial. One example is `o_busy`, which is set if either `M_AXI_BREADY` or `M_AXI_RREADY` are true. I may wish to come back later and adjust this design so that `o_busy` is registered. Indeed, this sort of task is common enough that it forms the basis for a project I often use in my formal courseware: given a working design, with a working set of constraints, adjust a combinatorial value so that it is now registered, and then prove that the design still works. In order to support this possibility later, I've included the combinatorial descriptions of `o_busy` and `o_rdbusy` among my formal property set.
In general, I like to have one or more constraints forcing every register into its correct setting with respect to everything else. Here, we constrain `pending_err`: if we are busy, and there's a misaligned response pending, then we haven't yet gotten our first response back in return. Therefore, if we haven't gotten the first of our two responses back, `pending_err` should be zero. It shouldn't get set until and unless one of our return responses comes back in error.
While I have more assertions in this section of the design, that's probably enough to convince you that I've fully constrained the various `faxil_*_outstanding` counters to the internal state of the design.
What we haven’t done yet, however, is constrain the other half of the design: the CPU interface. Let’s do that next.
CPU Interface
One of the challenges associated with blindly attempting to formally verify an AXI design you’ve never seen before is that many AXI designs, like this one, are effectively bridges. That means they have two or more interfaces to them. An interface property file will only provide you with instant properties for one of those interfaces. You’ll still need to constrain the other interface.
In the case of the ZipCPU, there are two interfaces to memory. The ZipCPU also has many memory interface implementations split across the two categories: instruction and data. When it comes to instruction fetching, the ZipCPU has a very simple and basic single instruction fetch, as well as a two-instruction pipelined fetch and an instruction fetch and cache. In a similar vein, the ZipCPU has three basic data interfaces: a basic single load or store interface, a pipelined memory controller, and a data cache. These three categories have served the ZipCPU well, allowing me to easily adjust the CPU to fit in smaller spaces, or to use more logic in order to run faster in larger spaces.
Those original interfaces, however, are also all Wishbone interfaces.
When it came time to build an AXI interface, I stepped back to rethink my verification approach. The problem with each of those prior memory controllers was that they each had their own assumptions about the CPU within them. When I then verified the CPU, I switched those assumptions to assertions, but otherwise verified the CPU with the memory interfaces intact within it. The consequence of this approach was that I needed to re-verify the CPU with every possible data interface it might have.
This seemed rather painful, so I separated the instruction and data interface assumptions from their respective controllers into one of two property files: one for the instruction interface, and another for the data interface. The property files therefore describe a contract between the CPU core and the instruction and data interfaces. Anything the CPU core needs to assume about those interfaces gets asserted when verifying the interface, or assumed when verifying the CPU. By capturing this contract in one place, verifying new interfaces has become much easier.
All of the former Wishbone memory interfaces have now been re-verified using one of these two property sets as appropriate.
Not only that, but now the ZipCPU has new AXI interfaces. There's an AXI-lite instruction fetch module that can handle 1) one outstanding transaction, 2) two outstanding instruction fetch bus transactions, or even 3) an arbitrary number of outstanding instruction fetch transactions. I've also rebuilt the ZipCPU's prefetch and instruction cache. One neat feature about these new AXI or AXI-lite interfaces is that they are all parameterized by bus width. That means that I won't need to slow a 64-bit memory bus down to a 32-bit width for the CPU anymore.
It’s not just instruction fetch interfaces, either. This approach has made it easy to build data interfaces in the same way.
For now, let’s take a look at how easy it is to use this new interface.
The first step is to declare some signal wires to be shared between the memory module and the interface property set. These extra (formal verification only) signals are:
- `cpu_outstanding`: A count of how many requests the CPU thinks the memory is working on. This count will get cleared on a CPU reset, `i_cpu_reset`.
- `f_done`: This signal is generated by the memory controller to tell the property set that an operation has completed, whether read or write. Normally, something like this would be part of the interface between the memory unit and the CPU, something like `o_valid` or `o_err` above. However, there’s no means in this interface to announce the completion of a write operation other than dropping `o_busy`, so `f_done` takes its place.
- `f_last_reg`: A copy of the last register target of any CPU load operation. This is important for the CPU pipeline, since there’s enough room in the CPU pipeline to read into any register but the last one, and so this last register needs to be tracked by the memory property set.
- `f_addr_reg`: One of the rules of pipelined memory operations is that, in any string of ongoing operations, they all need to use the same base address register. This keeps the CPU from needing to keep track of which register will be written to by the operation. In particular, the address register shall not be written by any operation, save perhaps the last one. The CPU will ensure this by never issuing a read request into the address register unless it waits for the memory controller to finish all of its reads first. The property set will accept this value as `f_areg`. Again, it’s not part of the CPU’s interface proper, so we just assume its presence here. The CPU will actually produce such a register, since it knows what it is, and properties of that register will be asserted there; here they are only assumed.
- `f_pc`: This flag, returned from the memory property set, will be true if the last read request is to read into either the program counter or the condition codes, both of which might cause the CPU to branch. Reads into the program counter or condition codes, if done at all, need to be the last read in any string. This return wire, from the property set, helps to make sure that property is kept.
- `f_gie`: The ZipCPU has a lot of “GIE” flags throughout it. “GIE” in this case stands for “Global Interrupt Enable”. In supervisor mode, the GIE bits are clear, whereas they are set in user mode, the only mode where the ZipCPU can be interrupted. This is also the most significant bit in any register address, since the ZipCPU has one register set for user mode (GIE=1) and one for supervisor mode (GIE=0). Any string of read (or write) operations must have the same GIE bit, so this flag captures what that bit is.
- `f_read_cycle`: This value, returned by the interface property set, just keeps track of whether we are in a read cycle or a write cycle. To avoid hazards, the ZipCPU will only ever do reads or writes, never both at the same time. Knowing this value helps keep track of what types of requests are currently outstanding, so we can make sure we don’t switch cycles midstream.
- `f_axi_write_cycle`: This one won’t get used below. It’s a new one I had to create to support exclusive access when using AXI. First, a brief overview of how AXI exclusive access works when using the ZipCPU: the CPU must first issue a `LOCK` instruction, and then a load instruction of some size. This load is treated as an AXI exclusive access read, so `M_AXI_ARLOCK` will be set. If the read comes back as `OKAY`, rather than `EXOKAY`, a bus error is returned to the CPU indicating that the memory doesn’t support exclusive access. Otherwise, an ALU operation may (optionally) take place, followed by a store instruction. If the store instruction fails, that is, if the result is `OKAY` rather than `EXOKAY` in spite of receiving the `EXOKAY` result from the previous read access, then the memory controller returns a read result. (From a write operation? Yes!) That read result writes into the program counter a jump back to the original `LOCK` instruction to start over. For this reason, an AXI exclusive access store instruction is the only type of write instruction that will set `o_rdbusy`. That’s a long story, just to explain why this flag is necessary: to explain why `o_rdbusy` might be set on a store instruction, and to help guarantee that the result (if any) will either be written to the program counter or quietly ignored if the write was successful.
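Putting the list above together, here’s a rough sketch of how these formal-only signals might be declared within the memory controller. The widths and the `F_LGDEPTH` parameter name are assumptions for illustration; the real values come from the shared property set itself.

```verilog
	// Formal-verification-only declarations.  The widths and the
	// F_LGDEPTH name are illustrative assumptions; the actual values
	// are defined by the shared property set.
	localparam	F_LGDEPTH = 4;

	wire	[F_LGDEPTH-1:0]	cpu_outstanding; // CPU's view of requests in flight
	reg			f_done;		// One operation (read or write) has completed
	wire	[4:0]		f_last_reg,	// Register targeted by the last load
				f_addr_reg;	// Base address register for this string
	wire			f_pc,		// Last read targets the PC or CC register
				f_gie,		// GIE context shared by this string
				f_read_cycle,	// True while handling reads (vs. writes)
				f_axi_write_cycle; // AXI exclusive-access write in progress
```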
One last step before instantiating this property set is to create the `f_done` signal. For this AXI interface, that’s pretty easy. An operation is done when we receive the `M_AXI_BVALID` or `M_AXI_RVALID` signal, with a couple of exceptions. We’re not done if a bus error will be produced. That’s another, separate, signal. Neither are we done if there’s a pending error, if this is the first of two responses, or if we are flushing requests that a recently reset CPU wouldn’t know about.
Still, it’s not much more complicated than anything we’ve already done.
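As a rough sketch, and leaving out the “first of two responses” and pending-error checks for brevity, the logic might look something like the following. The `r_flushing` name follows the description above and is an assumption here, not necessarily the exact name used in the core.

```verilog
	// Sketch only: signal completion to the property set whenever a
	// write (BVALID) or read (RVALID) response arrives, unless that
	// response carries a bus error, or we are flushing requests issued
	// before a CPU reset.  (The two-response misaligned case is omitted.)
	always @(*)
		f_done = !r_flushing
			&& ((M_AXI_BVALID && !M_AXI_BRESP[1])
				|| (M_AXI_RVALID && !M_AXI_RRESP[1]));
```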
With that out of the way, we can simply instantiate the formal memory property interface.
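For illustration only, the instantiation might look roughly like the sketch below. I’m calling the property module `fmem` here, and both that name and every port name are placeholders; the actual property file defines its own names and parameters.

```verilog
	// Hypothetical instantiation: the module name (fmem) and all port
	// names below are placeholders for the actual property set
	fmem #(
		.F_LGDEPTH(F_LGDEPTH)
	) fmem_check (
		.i_clk(S_AXI_ACLK), .i_bus_reset(!S_AXI_ARESETN),
		.i_cpu_reset(i_cpu_reset),
		// Request/return handshaking with the CPU ...
		.i_stb(i_stb), .i_busy(o_busy), .i_rdbusy(o_rdbusy),
		.i_valid(o_valid), .i_done(f_done), .i_err(o_err),
		.i_wreg(o_wreg),
		// ... plus the formal-only return wires discussed above
		.f_outstanding(cpu_outstanding),
		.f_pc(f_pc), .f_gie(f_gie),
		.f_read_cycle(f_read_cycle),
		.f_axi_write_cycle(f_axi_write_cycle),
		.f_last_reg(f_last_reg), .f_addr_reg(f_addr_reg)
	);
```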
Now, with all this background out of the way, we can finally verify this memory core. As I mentioned in the AXI-lite verification section above, if it weren’t for wanting to pass induction, these two property sets alone might well be sufficient to verify all but the data path through the logic.
How well does it work? Well, the formal tool typically takes less than twenty seconds to turn up any bugs. Even better, it points me directly to the property that failed, and the exact timestep where it failed.
That’s not something you’ll get from simulation.
However, since I like to verify a design using induction as well, I’ll often want to add some more properties.
Our first additional property just asserts that if `r_flushing` is ever true, then the CPU must have just been reset, so it shouldn’t be expecting anything more. If we aren’t flushing, then either this design is busy, or it is in the process of returning a result or an error to the CPU.
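In Verilog, that property might be sketched as follows; the exact expression in the core may differ, but the intent is as described above.

```verilog
	// If we're flushing, a freshly-reset CPU isn't expecting anything
	// back.  Otherwise, anything outstanding must still be in flight,
	// or in the process of being returned (possibly as an error).
	always @(*)
	if (r_flushing)
		assert(cpu_outstanding == 0);
	else if (cpu_outstanding > 0)
		assert(o_busy || f_done || o_err);
```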
If `f_pc` is ever set, then our one (and only) output must be to either the program counter, `o_wreg[3:0] == 4'hf`, or the condition codes register, `o_wreg[3:0] == 4'he`. Otherwise, if `f_pc` is clear, then no reads can read into either the PC or CC registers.
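A sketch of that pair of assertions might look like this; gating the second check on `o_rdbusy` is an assumption here, since the register target only matters while a read is outstanding.

```verilog
	// Only the last read in a string may target the PC (4'hf) or the
	// CC (4'he) register, and then only when f_pc is set
	always @(*)
	if (f_pc)
		assert(o_wreg[3:0] == 4'hf || o_wreg[3:0] == 4'he);
	else if (o_rdbusy)
		assert(o_wreg[3:0] != 4'hf && o_wreg[3:0] != 4'he);
```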
If any items are outstanding, then `o_wreg` must match the last register address requested. Hence, following a load into `R0`, both `o_wreg` and `f_last_reg` should point to the register address of `R0`.
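A minimal sketch of this check, again assuming `o_rdbusy` marks an outstanding read:

```verilog
	// While a read is outstanding, the write-back register must be
	// the last register the CPU asked us to load into
	always @(*)
	if (o_rdbusy && cpu_outstanding > 0)
		assert(o_wreg == f_last_reg);
```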
As long as we are busy, the high bit of the return register must match `f_gie`. This finishes our constraints upon all of the bits of `o_wreg`.
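Sketched out, under the assumption that `o_wreg[4]` is the GIE bit of the return register:

```verilog
	// Every request in a string shares the same GIE context, so the
	// top bit of the return register must match f_gie while busy
	always @(*)
	if (o_busy)
		assert(o_wreg[4] == f_gie);
```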
As one last property, let’s make sure `f_read_cycle` matches our logic. If `M_AXI_RREADY` is true, then we should be in a read cycle, unless we are flushing things out of our pipeline following a CPU reset. Similarly, if `M_AXI_RREADY` is not true and `M_AXI_BREADY` is, then we should be in a write cycle, and so we can assert `!f_read_cycle`.
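Those two checks might be sketched as:

```verilog
	// RREADY high means we're expecting read returns; BREADY (alone)
	// high means we're expecting write acknowledgments instead
	always @(*)
	if (M_AXI_RREADY && !r_flushing)
		assert(f_read_cycle);
	else if (!M_AXI_RREADY && M_AXI_BREADY)
		assert(!f_read_cycle);
```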
Notice how easy that was? All we had to do was tie a couple of return wires from the interface property set to the internal state of our design, and we then have all the properties we need.
Cover properties
As a final step in this proof, I’d like to see how well it works. For that, let’s just create some quick counters and count the number of returns we receive–both for writes and then again for reads.
Once we have this count, a simple `cover` check can produce some very useful and instructive traces. Indeed, these traces will show how fast this core can operate at its fastest speed.
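A sketch of those counters and the cover checks follows; the counter widths and the target counts are arbitrary choices here, made only to force a few back-to-back returns into the trace.

```verilog
	// Count successful write and read returns, then cover a handful of
	// each so the resulting traces show the core at full speed
	reg	[3:0]	cvr_writes, cvr_reads;

	initial	{ cvr_writes, cvr_reads } = 0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		{ cvr_writes, cvr_reads } <= 0;
	else begin
		if (M_AXI_BVALID && M_AXI_BREADY && !M_AXI_BRESP[1])
			cvr_writes <= cvr_writes + 1;
		if (M_AXI_RVALID && M_AXI_RREADY && !M_AXI_RRESP[1])
			cvr_reads <= cvr_reads + 1;
	end

	always @(*)
	begin
		cover(cvr_writes == 4);
		cover(cvr_reads  == 4);
	end
```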
The traces themselves are shown in Figs. 10 and 11 above. They show that this core can only ever achieve a 33% throughput at best.
No, 33% is not great. In fact, when you put 33% in the context of the rest of any surrounding system, as we did in Fig. 13 above, it’s downright dismal performance. However, all designs need to start somewhere, and in many ways this is a minimally working AXI master memory controller design.
Conclusions
This is now our third article on building AXI masters. The first article discussed AXI masters in general followed by a demonstration Wishbone to AXI bridge. The second article discussed several of the problems associated with getting AXI bursts working properly, and why they are so challenging. This one returns to the simple AXI-lite master protocol, while also illustrating a working CPU memory interface.
As I’ve alluded to earlier, this is only the first of a set of three AXI memory controllers for the ZipCPU: This was the single access controller. I’ve also built a pipelined controller which should get much better performance. These are both AXI-lite designs. This particular controller also has an AXI (full) sister core, implementing the same basic design while also supporting exclusive access. My intent is to make a similar sister design to support pipelined access and AXI locking as well, but I haven’t gotten that far yet. I have gotten far enough, though, to have ported my basic Wishbone data cache to AXI. While usable, that work isn’t quite done yet, since it doesn’t (yet) support either pipelined memory access or locking, but it’s at least a data cache implementation using AXI that should be a step in the right direction. (Remember, I tend to design things in layers these days …)
Lord willing, I’d like to spend some time discussing AXI exclusive access operations next. I’ve recently modified my AXI property sets so that they can handle verifying AXI designs with exclusive access, and I’ve also tested the approach on an updated version of my AXI (full) demonstration slave. Sharing those updates will be valuable, especially since neither Xilinx’s MIG-based DDR3 memory controller nor their AXI block RAM controller appears to support exclusive access at all. (Remember, the ZipCPU will return an error on any attempt at an exclusive access operation on a memory that doesn’t support it, so having a supporting memory is a minimum requirement for using this capability.) This can then be a prelude to a companion article to this one, discussing how to modify this controller so that it handles exclusive access requests in the future.
Let me also leave you with one last haunting thought: What would happen if, during a misaligned read operation across two words, a write took place at the same time? That’ll be something to think about.
Whom shall he teach knowledge? And whom shall he make to understand doctrine? Them that are weaned from the milk, and drawn from the breasts. For precept must be upon precept, precept upon precept; line upon line, line upon line; here a little, and there a little (Is 28:9-10)