When I first built the ZipCPU, I built it for the Wishbone bus. Wishbone is very easy to work with, and a good Wishbone pipeline implementation should be able to achieve (roughly) the same performance as AXI-lite. At the time, I had yet to build a crossbar interconnect, so my basic interconnect designs were fairly simple and depended upon the existence of no more than a single master. This forced the ZipCPU to have an internal arbiter, and to only expose that one Wishbone interface. You can see this basic structure in Fig. 1 below.

Fig 1. Basic ZipCPU architecture

My first memory controller was quite simple. It could handle issuing a single read or write at a time and waiting for the return.

When this memory controller turned out to be a CPU performance bottleneck, I chose to write a pipelined memory controller. To get there, I first noticed that the CPU doesn’t need the results of any write operation, so nothing keeps the CPU from continuing with further operations while the write operation is ongoing. Even better, you could issue a string of write operations and as long as the memory controller was able to issue further bus requests, nothing kept the CPU from allowing many requests to be outstanding only to be retired later.

I then extended this reasoning to reads as well. A string of memory reads could be issued by the CPU, under the condition that none of those reads overwrote either the base address register, from which the read addresses were being determined, or the program counter. When these conditions held, multiple read requests could then be issued and retired later–just like the write requests above.

To see how this concept might work, consider Fig. 2 below showing a notional subroutine.

Fig 2. Memory operation sequences

In this notional example, the CPU starts out with a jump to the subroutine instruction. The subroutine then creates a stack frame by subtracting from the stack pointer (SUB), and stores three registers to the stack frame via three store-word (SW) instructions. The memory controller then becomes busy handling these three requests. While the requests are active, further requests of the same type are allowed. Moreover, since the requests are to store data to memory, the CPU can go on with other instructions. It doesn’t wait for the stores to complete, and so the CPU issues first an ADD and then an AND instruction. Once the CPU is finished, it cleans up the stack frame by loading (LW) the copies of the registers it used back from the stack. These loads, however, need to first wait for the stores to complete–and so they stall the CPU. Once all the loads have been issued, we then add to the stack pointer to return the stack frame to what it was. However, since the CPU doesn’t keep track of which load requests are outstanding, it can’t tell whether this ADD depends on a value yet to be returned by one of those loads. Therefore, the CPU stalls again until all loads are complete.

While this might seem slow, consider the alternative. What if the CPU had to wait for every load or store to complete before issuing the next one? Fig. 3 below gives a taste of what that might look like, save that we’ve allowed the CPU to still continue while store operations are ongoing.

Fig 3. Singleton operations only

There were a couple of issues with this new approach, however. One was that my original interconnect implementation didn’t understand the concept of a currently active slave. Any slave could respond to a bus request and the interconnect would be none the wiser. Keeping the returns in order therefore meant insisting that memory accesses went to incrementing addresses, and that slaves were sorted on the bus by how long they took to respond to a request–so that the fastest responding slaves were always at lower addresses. I handled this by insisting, within the instruction decoder, that any string of memory operations had to go to either the same address or subsequent addresses.
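
In terms of logic, that decoder check might look something like the sketch below. This is notional only–the signal names are hypothetical, not those of the ZipCPU’s actual decoder.

	// Notional sketch: allow a memory operation to join an ongoing
	// pipelined sequence only if it uses the same base register and
	// either the same offset or the next (word) offset.
	assign	w_pipeable = r_prior_mem_valid
			&& (w_base_reg == r_prior_base_reg)
			&& ((w_offset == r_prior_offset)
				|| (w_offset == r_prior_offset + 1));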

A second issue with this pipelined memory approach involved how to handle bus errors. Once a CPU can issue requests without waiting for their responses, then it becomes possible for the CPU to issue requests for multiple operations before the first one returns a bus error. While this makes analyzing a program in the debugger that much more challenging, the speed benefit provided by this approach was really quite tremendous, and often outweighed any drawbacks.

Fig 4. Comparing several GPIO toggle rates

The result was a basic pipelined memory controller. As an example of the performance that could be achieved using this technique, the ZipCPU can toggle an output pin at 47MHz while running the bus at 100MHz, whereas others have measured the Zynq running a 250MHz bus as only able to toggle the same pin at 3.8MHz. In percentages, the ZipCPU was able to demonstrate a 47% bus utilization using this technique vs. the Zynq’s 1.5% bus utilization.

This pipelined memory architecture worked quite well for the ZipCPU. Hand optimized loops could easily be unrolled for much better performance. Without hand optimization, however, the greatest benefit of this technique came when generating or recovering stack frames, where it was an awesome fit.

Indeed, I was a bit taken aback later when I finally built a data cache only to discover the pipelined memory controller was often as fast or faster than the data cache. What?? How could that happen? Well, part of the problem was the time it took to load the cache in the first place. Loading the cache could generate more memory requests than necessary, such as if the CPU only wanted a single value but had to load the entire cache line, and so the cache might unnecessarily slow down the CPU. The other problem was that my original data cache implementation resorted to single operations when accessing uncachable memory. As a result, I had to go back and retrofit the data cache to handle pipelined operations for uncached memory just to recover the lost performance.

Recently, however, I’ve found myself doing a lot of work with AXI and not Wishbone. How should the ZipCPU be modified to handle AXI? One approach would be to use my Wishbone to AXI bridge. This approach, however, loses some of the benefits of AXI. The Wishbone to AXI bridge will never allow both read and write transactions to be outstanding (nor will the CPU …), nor will it allow the CPU to use AXI bursts or to issue exclusive access requests. The straw that broke the camel’s back, however, is simply the performance lost going through a bus bridge.

To avoid any lost performance when driving an AXI bus interface, I would need to make the ZipCPU bus agnostic.

Bus Agnostic CPU Design

At present, I’m still in the process of making the ZipCPU bus agnostic. As a result, I don’t (yet) have any good examples of completed designs to show you how well (or poorly) the newly updated design works. Expect those within the year. For now, however, I’d like to discuss some of the changes that have taken place.

The ZipCPU as originally written had two problems when it came to building a bus agnostic implementation. The first was that Wishbone was central to the CPU. The bus interface therefore needed to be removed from the CPU itself and made into a sort of wrapper. The second was that the Wishbone arbiter was integrated into the CPU. This also needed to be removed from the CPU core and placed into an external wrapper.

This naturally led to what I’m calling the ZipCore. The ZipCore is the logic left over after removing the bus logic from the original ZipCPU. The ZipCore is independent of any bus implementation. Instead, it exports a custom interface to both the instruction fetch and the memory controller.

Fig 5. Separating the ZipCPU into independent proofs

This also presented a wonderful opportunity to separate the formal verification of the ZipCPU from the verification of the instruction and data bus interfaces. This is shown in Fig. 5 by the introduction of custom interface property sets sitting between the CPU and these two sets of interface modules. I now have one custom property set for verifying the instruction fetch, and another for verifying the memory controller. This means that any instruction fetch or memory controller meeting these properties will then be able to work with the ZipCore. As a result, I no longer need to verify that the ZipCore will work with a particular instruction fetch or a particular memory controller implementation. Instead, I just need to prove that those controllers will work with the appropriate custom interface property set. If they do, then they’ll work with the CPU.

Of course, they’ll also need to work with the bus they are connected to, and so this requires a bus interface property set–either Wishbone or AXI, but we’ll get to that in a bit.

For now, let’s look at what the ZipCPU’s memory interface looks like.

CPU Interface

The ZipCPU’s memory controller interface can support one of two basic operations: read and write. Each leads to a slightly different sequence. These are shown in Fig. 6.

Fig 6. Memory operation sequences

In the case of a write, the CPU provides the address and the value to be written to the controller. The controller then becomes busy. Once it finishes the task, if all goes well, it quietly ceases to be busy. If something went wrong, the memory controller will instead return an error.

Reads are similar, with the difference that the result needs to be returned to the CPU once the operation is complete. In this case, the memory controller sets a valid signal for the CPU, provides the value returned from the bus, and returns the register address that this value is to be written into. At least the way the ZipCPU handles this interface, it is the memory controller that keeps track of which register the result will be written into. That’s what happens if all goes well. However, if the bus returns an error, then the controller will set an error flag instead of the valid flag. It’s up to the CPU then to determine what to do in case of a bus error.

In general, the ZipCPU will do one of two things on a bus error. If the CPU is in user mode, it will switch to supervisor mode. If, on the other hand, the CPU is in supervisor mode then it will halt. If desired, an external wrapper can reset the CPU in an attempt to recover from the error, but in general it just halts and waits for the debugger. The S6SoC project was my one exception to this rule, since there was no room for an external debugging bus in that design. In that case, the CPU would simply restart, dump the last CPU register contents, and then attempt to continue rebooting from there.

No matter how the software handles the bus error, the memory controller will not return further results from any ongoing set of operations. Returns from outstanding reads following a bus error will be ignored. Outstanding writes may, or may not, be completed–depending on their status within the memory controller and the bus implementation. Only one bus error will be returned: further error responses from any outstanding accesses on the bus will not be forwarded to the CPU.

Fig 7. The memory controller interface

Fig. 7 on the left shows the basic interface between the CPU core and its memory controller used to implement these operations. Let’s take a moment before going any further to discuss the various signals in this interface. Indeed, the basic interface is fairly simple:

  • The bus reset, i_bus_reset: This is just another name for the system reset pin. Everything resets when the bus reset is asserted.

  • The CPU reset, i_cpu_reset: With Wishbone, it’s easy to reset the CPU separately from the bus. All you need to do is drop the CYC and STB lines. With AXI, this is a bit harder, since you will still get responses back from the bus for any requests that were made before your reset if you don’t also reset the bus. This is why the memory interface separates the CPU reset from the system reset, so that the CPU can be reset separately from the rest of the design. It’s up to the memory controller to make sure that the CPU doesn’t receive any stale results, returned after the reset, from requests issued before it.

  • i_stb: This is the basic request line. When the CPU raises i_stb, it wants to initiate a memory operation. For those familiar with the AXI stream protocol, you can think of this as TVALID && TREADY.

  • o_pipe_stalled: This is the basic stall line. When raised, the CPU will not raise i_stb to make a request of the bus. Continuing with the AXI stream analogy from above, this is similar to the !TREADY signal in AXI stream.

Fig 8. i_op encoding
  • i_op: This specifies the type of operation. To keep logic counts low, the bits of the memory operation are drawn directly from the instruction word. i_op[0] will be true for a write (store) instruction, and false for a read (load) instruction. i_op[2:1] then specifies the size of that operation: 2'b11 specifies a byte-wise read or write, 2'b10 a half-word/short (16b) operation, and 2'b01 a full word operation.

  • i_addr: The address to be written to or read from. This only has meaning when i_stb is true.

  • i_data: The data to be written to the address above. This only has meaning when both i_stb and i_op[0] are true. For 8-bit writes, only the lower 8 bits have meaning. Likewise, for 16-bit writes only the lower 16 bits have any meaning.

  • i_oreg: For reads, this specifies the register address that the read result will be placed into upon completion. The memory unit will hold onto this value, and then return it to the CPU again later. In the case of a pipelined controller, this value will go into a FIFO to be returned later with any read results. This value is ignored in the case of writes.

  • o_busy: If the memory core is busy doing anything, it will set o_busy. For example, if you issue a bus operation, then o_busy will go true. If you later reset the CPU, o_busy will remain true until the memory core can accept another operation.

    o_busy is subtly different from o_pipe_stalled in that the CPU may issue additional memory operations while o_busy && !o_pipe_stalled. However, the CPU will not start a new string of memory operations, nor will it change direction while the memory core asserts o_busy.

    It’s important to note that the CPU may move on to a non-memory instruction if o_busy is true as long as o_rdbusy is low.

  • o_rdbusy: If the memory core is busy reading values from the memory, then o_rdbusy will be set to indicate that a read is in progress and the CPU should not proceed to any other instructions (other than additional reads).

  • o_valid: Once a read value is returned, o_valid will be set to indicate the need to write to the register file the value returned from the read. If all goes well, there will be exactly one o_valid for every i_stb && !i_op[0], although CPU resets and bus errors may keep this count from being exact.

  • o_err: This will be set following any bus error, with two exceptions: First, if the CPU is reset while operations are outstanding, then any bus error response for those outstanding operations will not be returned. Second, after the first bus error, the memory controller will first flush any ongoing operations before returning any more bus errors to the CPU.

  • o_wreg: When returning a data value to the CPU, the o_wreg value tells the CPU where to write the value. This is basically the i_oreg value given to the memory controller reflected back to the CPU, together with the data value that goes with it.

  • o_result: The value from any read return is provided in o_result. In the case of an 8-bit read, the upper 24 bits will be cleared. Likewise, for a 16-bit read, the upper 16 bits will be cleared.

    Some CPUs sign extend byte reads to the full word size; some do not. By default, the ZipCPU simply clears any upper bits. Two following instructions, a TEST instruction followed by a conditional OR, can turn a zero-extended read into a sign-extended read. Alternatively, changing the memory controller from one behavior to the other is fairly easy to do. Adjusting the GCC toolchain and following support, however, can take a bit more work.
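
Gathering these signals together, a rough sketch of the CPU-side port list might look like the following. Widths here are illustrative, chosen for a 32-bit CPU, and the module name is mine–the declarations in the actual core may differ. (The i_lock and i_clear_cache ports are described below.)

	module	memctrl_sketch (
		input	wire		i_clk, i_bus_reset, i_cpu_reset,
		// CPU request channel
		input	wire		i_stb, i_lock, i_clear_cache,
		input	wire	[2:0]	i_op,
		input	wire	[31:0]	i_addr,
		input	wire	[31:0]	i_data,
		input	wire	[4:0]	i_oreg,
		// Controller status and read-return channel
		output	wire		o_busy, o_rdbusy, o_pipe_stalled,
		output	wire		o_valid, o_err,
		output	wire	[4:0]	o_wreg,
		output	wire	[31:0]	o_result
		// ... the AXI-lite (or Wishbone) master ports would follow
	);
		// (Implementation omitted)
	endmodule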

There are two other important signals in this interface. These are signals we won’t be addressing today, but they are important parts of the more complex controller implementations.

  • i_clear_cache: This is my way of dealing with caches and DMAs. The CPU can issue a special instruction to clear the cache if the memory may have changed independent of the CPU. This input is also asserted if the debug interface changes memory in a way the CPU is unaware of. If raised, the memory controller will mark any and all cached data as invalid–forcing the cache to reload from scratch on the next request.

  • i_lock: This flag is used when implementing atomic memory instructions. It will be raised by a LOCK instruction, and then lowered three instructions later. This allows for certain four-instruction sequences: LOCK, LOAD, (ALU operation), STORE. A large variety of atomic instructions can be implemented this way. Examples include atomic adds, subtracts, or even the classic test and set instruction.

Fig 9. Comparing ZipCPU and MicroBlaze test and set implementations

While i_lock is asserted, the CPU is prevented from switching to supervisor mode on any interrupt: any such switch must wait until all three instructions following the LOCK have completed.

Atomic access requests are generally easy to implement when using Wishbone. The Wishbone cycle line is simply raised with the first LOAD instruction (LB, for load byte in Fig. 9), and then held high between the LOAD and STORE instructions (SB, for store byte in Fig. 9). Things are a bit more complicated with AXI, however, since AXI doesn’t allow a bus master to lock the bus. Instead, the CPU will only discover whether its atomic instruction was successful when/if the final store operation fails. In that case, the memory controller needs to tell the CPU to back up to the LOCK instruction and start over. How to make this happen, however, is a longer discussion for the day we discuss the full AXI version of this same memory controller.
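
To make the Wishbone case concrete, here’s roughly how that cycle-line handling might look. This is a notional sketch only, not a copy of the ZipCPU’s actual Wishbone memory controller, and the signal names are assumptions.

	// Hold CYC high through the LOAD/STORE pair whenever i_lock is
	// asserted, so no other master can grab the bus between the two
	// accesses.
	always @(posedge i_clk)
	if (i_reset)
		o_wb_cyc <= 1'b0;
	else if (i_stb)
		o_wb_cyc <= 1'b1;
	else if (o_wb_cyc && (i_wb_ack || i_wb_err))
		// Normally the cycle line drops once the final acknowledgment
		// returns.  With i_lock high, we instead keep the bus locked
		// for the STORE that is about to follow.
		o_wb_cyc <= i_lock && !i_wb_err;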

To see how this interface might work when driving an AXI bus, I thought I might provide examples of both writing to and reading from the bus. Here’s the write example.

Fig 10. An example write trace

Note the key steps:

  1. The CPU makes a request by setting i_stb, placing the data to be written into i_data, and the address of the transaction into i_addr.

  2. The memory controller then becomes busy. It raises both M_AXI_AWVALID and M_AXI_WVALID to request a transaction of the bus. In this example, we also raise M_AXI_BREADY as a bit in our state machine, to indicate that we are expecting a response to be returned from the bus in the future.

  3. M_AXI_AWVALID must remain high, and M_AXI_AWADDR must remain constant, until M_AXI_AWREADY is high. In this highly compressed example, M_AXI_AWREADY just happens to be high when M_AXI_AWVALID is set, so that M_AXI_AWVALID can be dropped on the next cycle.

    The same rule applies to M_AXI_WVALID. M_AXI_WVALID must stay high and M_AXI_WDATA and M_AXI_WSTRB must stay constant until M_AXI_WREADY.

    I’ve seen several beginner mistakes with this handshake. Remember: this chart in Fig. 10 is only representative! Some slaves will delay setting M_AXI_AWREADY longer than others, some will set M_AXI_AWREADY before M_AXI_WREADY and others will set them in a different order. To be compliant, an AXI master must be able to deal with all these situations.

  4. In this compressed example, M_AXI_BVALID is set on the clock immediately following M_AXI_AWVALID && M_AXI_AWREADY && M_AXI_WVALID && M_AXI_WREADY.

    Do not depend upon this condition! I’ve seen beginner mistakes where the beginner’s logic requires all four of these signals to be high at the same time. Remember, either one of these two channels might get accepted before the other. (A sketch of this mistake, and its fix, follows this list.)

  5. Once M_AXI_BVALID has been received, the memory controller drops o_busy to indicate that it is idle. A new request may then be made on the same cycle.
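
Since this mistake comes up so often, here’s a quick sketch of the difference between the broken approach and a compliant one. This is illustrative only; the controller’s actual logic follows below.

	// BROKEN: a common beginner pattern, requiring both channels to be
	// accepted on the very same clock cycle.  If the slave raises
	// AWREADY and WREADY on different cycles, the already-accepted
	// channel keeps its VALID high--presenting a duplicate request--and
	// the design may hang waiting for a cycle that never comes.
	if (M_AXI_AWVALID && M_AXI_AWREADY && M_AXI_WVALID && M_AXI_WREADY)
	begin
		M_AXI_AWVALID <= 0;
		M_AXI_WVALID  <= 0;
	end

	// BETTER: handle each channel independently
	if (M_AXI_AWREADY)
		M_AXI_AWVALID <= 0;
	if (M_AXI_WREADY)
		M_AXI_WVALID  <= 0;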

Now let’s take a look at a read example using this interface.

Fig 11. An example read trace

While this example is very similar to the previous write example, there are some key differences. Therefore, let’s walk through it.

  1. The CPU indicates the desire to read from memory by raising i_stb and placing the address to be read from in i_addr. The register that the result will be written into is also placed on i_oreg–the memory controller will need to return this value when the operation is complete.

    Not shown is the i_op input indicating the size of the read, whether byte (8b), halfword (16b), or word (32b).

  2. Once the memory controller receives i_stb, it immediately sets M_AXI_ARVALID and M_AXI_ARADDR with the information it is given.

    This controller also sets M_AXI_RREADY high at this point, as part of its internal state tracking. This is to indicate that a read return is expected.

    Finally, the controller sets both o_busy and o_rdbusy. The first indicates that a memory operation is ongoing, and the second indicates that we will be writing back to a register upon completion. This latter flag, o_rdbusy, is used to prevent the CPU from moving onto its next operation and so helps avoid any pipeline hazard.

  3. M_AXI_ARVALID must stay high and M_AXI_ARADDR constant until the slave asserts M_AXI_ARREADY. In this example, that happens immediately, but this will not be the case with all slaves.

    Holding M_AXI_ARVALID high past M_AXI_ARVALID && M_AXI_ARREADY will request a second read. Since we don’t want that here, we immediately drop M_AXI_ARVALID upon seeing M_AXI_ARREADY.

  4. Once the slave accomplishes the read, it sets M_AXI_RVALID and M_AXI_RDATA. Since the memory controller is holding M_AXI_RREADY high, these will only be set for a single cycle.

  5. The memory controller then copies the data from M_AXI_RDATA to o_result to send it back to the CPU. o_valid is set to indicate a result is valid. o_rdbusy is dropped, since we are no longer in the middle of any operation. Finally, o_wreg returns the register address that the CPU entrusted to the memory controller.

These are examples drawn from the controller we’ll be examining today. Just to prove that this CPU interface isn’t itself the throughput bottleneck, here is a trace drawn from an AXI-lite memory controller capable of issuing multiple ongoing operations.

Fig 12. An example pipelined read trace, from another controller

Just for the purposes of illustration, I dropped M_AXI_ARREADY on the first cycle of the request for address A3–this behavior is highly slave dependent. Doing this, however, helps to illustrate how a bus stall propagates through that controller. Notice how the CPU then suffers one stall, and that the result takes an extra cycle to return the item from that address. Beyond that, however, we’ll need to save the examination of that controller for another day. For now we’ll limit ourselves to a controller that can only handle a 33% bus throughput at best.

33% throughput? Is that the performance that can be expected from this type of controller? Well, not really. That would be the performance you’d see if this memory controller were connected directly to a (good) block RAM memory. If you connect it to a crossbar interconnect instead, you can expect it to cost you two clock cycles going into the interconnect, and another clock cycle coming out. Hence, to read from a block RAM memory, it will now cost you 6 cycles, not 3, for a 16% bus throughput. Worse, if you connect it to Xilinx’s AXI block RAM controller, it’ll then take you an additional 4 clock cycles. As a result, your blazing fast ZipCPU would be crippled down to one access for every 10 clock cycles simply due to a non-optimal bus architecture. Unfortunately, it only gets worse from there when you attach your CPU to a slower AXI slave.

Here’s a trace showing what that whole operation, from CPU through interconnect to Xilinx’s AXI block RAM controller and back might look like.

Fig 13. Adjacent single read requests through an interconnect and then Xilinx's AXI block RAM controller

In this trace, we have the outputs of our controller, M_AXI_ARVALID and M_AXI_ARADDR, going into a crossbar. The crossbar forwards these requests to BRAM_AXI_ARVALID and BRAM_AXI_ARADDR, the inputs to Xilinx’s AXI block RAM controller. This block RAM controller takes a clock cycle to raise BRAM_AXI_ARREADY, and then two more clock cycles before it raises its outputs on the return path, BRAM_AXI_RVALID and BRAM_AXI_RDATA. From here it will take another clock to go back through the crossbar. This clock is the minimum timing allowed by the AXI spec. As a result, the read takes a full 10 cycles. The ZipCPU’s memory interface will allow a second request as soon as this one returns, yielding a maximum throughput of 11%.

As I mentioned above, fixing this horrendous throughput will require a redesigned memory controller. Of course, a better AXI block RAM controller would also help as well.

We’ll get there.

For now, a working AXI memory controller is a good place to start from. We can come back to this project and optimize it later if we get the chance.

Basic Operator

Now that we know what our interface looks like, let’s peel the onion back another layer to see how we might implement these operations when using AXI-lite.

First, let me answer the question of why AXI-lite and not AXI? And, moreover, what will the consequences be of not using the full AXI4 interface? For today’s discussion, I have several reasons for not using the full AXI4 interface:

  1. AXI-lite is simpler.

    This may be my biggest reason.

  2. AXI-lite can easily be converted to AXI (full) by just setting the missing signals. (A sketch of this follows this list.)

  3. The CPU memory unit doesn’t need AXI IDs. While a CPU might use two separate AXI IDs, only one would ever be needed for any source. Therefore, the fetch unit might use one ID and the memory controller another. If a DMA were integrated into the CPU, it might use a third ID and so on. There’s just no need for separate IDs in the memory controller itself.

  4. Since we’re only implementing a single access at a time today, or in the case of misaligned accesses two accesses at a time, there’s no reason to use AXI bursts.

    When (if) we get to building an AXI instruction or data cache, then bursts will make sense. In such cases, a natural burst length will be the size of a single cache line.

    While it might make sense to issue a burst request when dealing with misaligned accesses later, AXI’s requirement that burst accesses never cross 4kB boundaries could make this a challenge. By leaving adjacent memory accesses as independent, we don’t need to worry about this 4kB requirement.
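
To illustrate points 2 and 3 above, here’s roughly what setting the missing signals amounts to. This is a sketch only: the constant choices are illustrative, not taken from any particular core.

	// Converting this AXI-lite master into a (single-beat) AXI4 master
	// mostly amounts to tying off the signals AXI-lite doesn't have.
	assign	M_AXI_AWID    = 0;		// One ID is all we need
	assign	M_AXI_AWLEN   = 8'h00;		// Single-beat bursts
	assign	M_AXI_AWSIZE  = 3'd2;		// 4 bytes/beat (32-bit bus assumed)
	assign	M_AXI_AWBURST = 2'b01;		// INCR
	assign	M_AXI_AWLOCK  = 1'b0;		// No exclusive access (yet)
	assign	M_AXI_AWCACHE = 4'h0;
	assign	M_AXI_AWQOS   = 4'h0;
	assign	M_AXI_WLAST   = 1'b1;		// Every beat is the last beat
	// ... and likewise for the AR* channel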

There is one critical bus capability that we will lose by implementing this memory controller for AXI4-lite rather than AXI4 (full), and that is the ability to implement atomic access instructions. If for no other reason, let’s consider this implementation only a first draft of a simple controller, so that we can come back later with a more complicated and full featured controller. Indeed, if you compare this core to a comparable full AXI memory controller, you’ll see the two mostly share the same structure.

For now, let’s work our way through a first draft of setting our various AXI4-lite signals.

The first signals we’ll control will be the various xVALID and xREADY signals associated with any AXI request. As discussed above, we’ll use the xREADY signals as internal state variables to know when something is outstanding. Hence, on a write request we’ll set M_AXI_BREADY and we’ll clear it once the request is acknowledged. We’ll treat read requests similarly, only using M_AXI_RREADY for that purpose instead.

The first step will be to clear these signals on any reset.

	always @(posedge i_clk)
	if (!S_AXI_ARESETN)
	begin
		M_AXI_AWVALID <= 1'b0;
		M_AXI_WVALID  <= 1'b0;
		M_AXI_ARVALID <= 1'b0;
		M_AXI_BREADY  <= 1'b0;
		M_AXI_RREADY  <= 1'b0;

While it’s a little out of order, the next group in this block controls how to handle an ongoing operation. In general, whenever an AxREADY is true, the associated AxVALID signal will be cleared. Likewise, once BVALID or RVALID is returned, we can finish our operation and clear our xREADY signals.

	end else if (M_AXI_BREADY || M_AXI_RREADY)
	begin // Something is outstanding
		if (M_AXI_AWREADY)
			M_AXI_AWVALID <= 0;
		if (M_AXI_WREADY)
			M_AXI_WVALID  <= 0;
		if (M_AXI_ARREADY)
			M_AXI_ARVALID <= 0;

		if (M_AXI_BVALID || M_AXI_RVALID)
		begin
			M_AXI_BREADY <= 1'b0;
			M_AXI_RREADY <= 1'b0;
		end

As I mentioned above, getting this signaling wrong is a common beginner AXI mistake. Remember, the AW* and W* channels are independent, and VALID cannot be lowered until the corresponding READY has been seen.

The last step in controlling these signals is to set them on any request. Assuming a request is incoming, we’ll want to set the various write flags if i_op[0] is ever true–indicating a write operation request. Otherwise, for read operations, we’ll want to set M_AXI_ARVALID and M_AXI_RREADY.

	end else begin // New memory operation
		// Initiate a request
		M_AXI_AWVALID <=  i_op[0];	// Write request
		M_AXI_WVALID  <=  i_op[0];	// Write request
		M_AXI_ARVALID <= !i_op[0];	// Read request

		// Set BREADY or RREADY to accept the response.  These will
		// remain ready until the response is returned.
		M_AXI_BREADY  <=  i_op[0];
		M_AXI_RREADY  <= !i_op[0];

Of course, that’s only if a request is being made on this cycle. Hence, let’s caveat these new values. If there’s no request being made, then these lines should be kept clear. Likewise, if the request is for an unaligned address then (in our first draft) we’ll return an error to the CPU and not issue any request. Finally, on either a bus error or a CPU reset we’ll need to make certain that we don’t start a new request that will immediately be unwanted on the next cycle.

		if (i_cpu_reset || o_err || !i_stb || w_misaligned)
		begin
			M_AXI_AWVALID <= 0;
			M_AXI_WVALID  <= 0;
			M_AXI_ARVALID <= 0;

			M_AXI_BREADY <= 0;
			M_AXI_RREADY <= 0;
		end
	end

Judging from the AXI designs I’ve examined in connection with Xilinx forum posts, getting those five signals right tends to be half the battle.

There is another signal, however, that we’ll need to pay attention to, and this is the one capturing whether or not the CPU was reset separately from the system. In such cases, we’ll need to flush any ongoing bus operation without returning its results to the CPU at a later time. To handle this, we’re going to implement an r_flushing signal. This signal will capture the idea of the bus being busy, even though the CPU isn’t expecting a result from it.

This signal will be cleared on any system reset.

	always @(posedge i_clk)
	if (!S_AXI_ARESETN)
		r_flushing <= 1'b0;

The primary purpose of this signal is to let us know to flush any outstanding returns when the CPU is reset while a bus operation is ongoing, without also needing to reset the bus.

	else if (M_AXI_BREADY || M_AXI_RREADY)
	begin
		if (i_cpu_reset)
			// If only the CPU is reset, however, we have a problem.
			// The bus hasn't been reset, and so it is still active.
			// We can't respond to any new requests from the CPU
			// until we flush any transactions that are currently
			// active.
			r_flushing <= 1'b1;

There’s one caveat to this, however, and that is that we don’t want to set r_flushing if the CPU is reset on the same cycle the outstanding value is returned to the CPU.

		if (M_AXI_BVALID || M_AXI_RVALID)
			// A request just came back, therefore we can clear
			// r_flushing
			r_flushing <= 1'b0;

Otherwise if the bus is idle, we can leave the r_flushing signal at zero–no matter whether the CPU is reset or not.

	end else
		// If nothing is active, we don't care about the CPU reset.
		// Flushing just stays at zero.
		r_flushing <= 1'b0;

Handling the bus address for this simple controller is really easy. As long as we aren’t in the middle of any operations, we can set the address to the CPU’s requested address. Even better, we can use the same logic for both read and write addresses.

	initial	M_AXI_AWADDR = 0;
	always @(posedge i_clk)
	if (!M_AXI_BREADY && !M_AXI_RREADY)
		M_AXI_AWADDR <= i_addr;

	always @(*)
		M_AXI_ARADDR = M_AXI_AWADDR;

AXI requires that an AxPROT signal accompany any request. Looking through the AXI spec, it looks like 3'h0 will work nicely for us. This will specify an unprivileged, secure data access.

	assign	M_AXI_AWPROT  = 3'h0;
	assign	M_AXI_ARPROT  = 3'h0;

That brings us to setting M_AXI_WDATA and its associated M_AXI_WSTRB. In general, we’re going to need to upshift these values based upon where the data given to us will fit on the bus. I like to use AXILSB to capture the number of address bits, in an AXI interface, necessary to define which octet within the bus word the address is referencing.

	localparam	AXILSB = $clog2(C_AXI_DATA_WIDTH/8);

Remember not to copy Xilinx’s formula for this bus width, since their calculation is only valid for 16, 32, or 64-bit bus widths. (You can see their bug here. In their defense, this doesn’t really matter in an AXI-lite interface, since Xilinx only allows AXI-lite to ever have a data width of 32-bits. Sadly, they made the same mistake in their AXI full demonstrator.)
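
For the curious, here’s roughly what that calculation looks like next to the $clog2() version. Treat the exact expression as my recollection of the template rather than a quote of it, and the TEMPLATE_LSB name as mine.

	// The template formula in question is something like:
	localparam	TEMPLATE_LSB = (C_AXI_DATA_WIDTH/32) + 1;
	// For a 16, 32, or 64-bit bus this matches $clog2(C_AXI_DATA_WIDTH/8).
	// For a 128-bit bus, however, it yields 5 where the correct value is 4.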

We can now use this value to shift our data input left by eight times the value of these lower address bits, placing our write data into its proper byte lane(s) on the bus.

	always @(posedge i_clk)
	if (i_stb)
	begin
		casez(i_op[2:1])
		2'b10: axi_wdata // Half-word store
			<= { { (C_AXI_DATA_WIDTH-16){ 1'b0 }}, i_data[15:0] }
				<< (8*i_addr[AXILSB-1:0]);
		2'b11: axi_wdata // 8-bit (byte) store
			<= { { (C_AXI_DATA_WIDTH-8){ 1'b0 }}, i_data[7:0] }
				<< (8*i_addr[AXILSB-1:0]);
		default: axi_wdata // Full-word store
			<= { { (C_AXI_DATA_WIDTH-32){ 1'b0 }}, i_data }
				<< (8*i_addr[AXILSB-1:0]);
		endcase

We’ll come back in a moment and assign M_AXI_WDATA to be the same as this axi_wdata. For now, let’s just note that the logic for axi_wstrb is almost identical. The first difference is that we’re upshifting a series of 1s, one for each byte we wish to write, rather than the data itself. The second difference is that we aren’t multiplying the low order address bits by eight like we did for the data–byte enables shift by bytes, not bits.

		// next_wstrb, axi_wstrb
		casez(i_op[2:1])
		2'b0?: axi_wstrb <= { {(DW/8-4){1'b0}},
						4'b1111} << i_addr[AXILSB-1:0];
		2'b10: axi_wstrb <= { {(DW/8-4){1'b0}},
						4'b0011} << i_addr[AXILSB-1:0];
		2'b11: axi_wstrb <= { {(DW/8-4){1'b0}},
						4'b0001} << i_addr[AXILSB-1:0];
		endcase

There’s one last step here, and that is that we need to keep track of both the operation size and the lower bits of the address. We’re going to need these later, on a read return, to know how to grab the bytes of interest from the bus.

		r_op <= { i_op[2:1] , i_addr[AXILSB-1:0] };
	end

	always @(*)
		{ M_AXI_WSTRB, M_AXI_WDATA } = { axi_wstrb, axi_wdata };

That leaves only one other signal required to generate a bus request, and that signal is going to tell us if and when we need to abort the request because it would require two bus operations. For this initial implementation, we’ll simply return an error to the CPU in this case. We’ll come back to this later to handle misaligned accesses properly, but this should be good enough for a first pass.

An access is misaligned if the access doesn’t fit within a single bus word. For a 4-byte request, if adding 3 to the address moves you into the next word then the request is misaligned. For a 2-byte request, if adding one moves you to the next word then the request is misaligned. Single byte requests, however, cannot be misaligned.

	always @(*)
	casez(i_op[2:1])
	// Full word
	2'b0?: w_misaligned = ((i_addr[AXILSB-1:0]+3) >= (1<<AXILSB));
	// Half word
	2'b10: w_misaligned = ((i_addr[AXILSB-1:0]+1) >= (1<<AXILSB));
	// Bytes are always aligned
	2'b11: w_misaligned = 1'b0;
	endcase

Now, if this flag is ever true, we’ll skip issuing the request and instead return a bus error to the CPU. (We’ll get to that in a moment.)

That’s what it takes to make a request of the bus.

The next step is to handle the return from the bus and to forward it to the CPU.

The first part of any return to the CPU is returning a value. We’ll have a value to return if and when RVALID is true. We’ll take a clock cycle to set this o_valid flag, so as to allow ourselves a clock cycle to shift RDATA into place.

For now, notice that o_valid needs to be kept clear following a reset of any type. Further, it needs to be kept clear if we are flushing responses as part of a CPU reset separate from an AXI bus reset. Finally, we’ll set the o_valid flag on RVALID as long as the bus didn’t return an error.

	always @(posedge i_clk)
	if (i_cpu_reset || r_flushing)
		o_valid <= 1'b0;
	else
		o_valid <= M_AXI_RVALID && !M_AXI_RRESP[1];

We now turn our attention to the CPU bus error flag. In general, a bus error will be returned when either BVALID or RVALID is returned and the response is an error. We’ll also return a bus error on any request to send something misaligned to the bus. The exceptions, however, are important. If the CPU has been reset, we don’t want to return an error, nor do we want to return one while we are still flushing responses following that reset.

	initial	o_err = 1'b0;
	always @(posedge i_clk)
	if (i_cpu_reset || r_flushing || o_err)
		o_err <= 1'b0;
	else if (i_stb && w_misaligned)
		o_err <= 1'b1;
	else
		o_err <= (M_AXI_BVALID && M_AXI_BRESP[1])
			|| (M_AXI_RVALID && M_AXI_RRESP[1]);

We’ll also need to return some busy flags. This core is busy if ever M_AXI_BREADY or M_AXI_RREADY are true. We’ll also set our o_pipe_stalled flag to be equivalent to o_busy for this simple controller, but that setting will be external to this logic. Similarly, the CPU can expect a response if M_AXI_RREADY is true and we aren’t flushing the result.

	always @(*)
	begin
		o_busy   = M_AXI_BREADY || M_AXI_RREADY; // also pipe_stalled
		o_rdbusy = M_AXI_RREADY && !r_flushing;
	end
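
Since o_pipe_stalled is set outside of this block, here’s roughly what that external assignment boils down to for this simple controller. (A sketch; the actual core may express it differently.)

	// With only one operation in flight at a time, any time we're busy
	// the CPU must also stall.
	assign	o_pipe_stalled = o_busy;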

When returning a result to the CPU, we need to tell the CPU which register to write the read result into. Since this simple memory controller only ever issues a single read or write request of the bus, we can choose to simply capture the register on any new request and know that there will never be any other register to return.

	always @(posedge i_clk)
	if (i_stb)
		o_wreg    <= i_oreg;

That leaves us only one more signal to return to the CPU, the o_result from a data read. There are two parts to returning this value. The first part is that we’ll need to shift the value back down from wherever it is placed within the returned bus word. This was why we kept the subword address bits in our r_op register.

	// o_result
	always @(posedge i_clk)
	if (M_AXI_RVALID)
	begin
		o_result <= M_AXI_RDATA >> (8*r_op[AXILSB-1:0]);

We also kept the size of our operation in the upper bits of r_op. We can use these now to zero extend octets and half words into 32-bits.

		casez(r_op[AXILSB +: 2])
		2'b10: o_result[31:16] <= 0;
		2'b11: o_result[31: 8] <= 0;
		default: begin end
		endcase
	end

Some CPUs sign extend subword values when reading. Not the ZipCPU. The ZipCPU zero extends subword values to full words on any read. This behavior, however, is easy enough to adjust if you want something different.
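
To illustrate, here’s a hedged sketch of what a sign-extending variant of the logic above might look like. The actual core hides this choice behind its OPT_SIGN_EXTEND parameter; the intermediate w_shifted wire below is mine, introduced only for the illustration.

	// Sketch of a sign-extending variant (not the ZipCPU's default)
	wire	[31:0]	w_shifted;
	assign	w_shifted = M_AXI_RDATA >> (8*r_op[AXILSB-1:0]);

	always @(posedge i_clk)
	if (M_AXI_RVALID)
	begin
		o_result <= w_shifted;

		casez(r_op[AXILSB +: 2])
		// Replicate the sign bit rather than clearing the upper bits
		2'b10: o_result[31:16] <= {(16){ w_shifted[15] }};
		2'b11: o_result[31: 8] <= {(24){ w_shifted[ 7] }};
		default: begin end
		endcase
	end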

There you go, a basic AXI-lite based CPU memory controller.

Handling Misaligned Requests

Perhaps I should have been satisfied with that first draft of a basic memory controller.

I wasn’t.

The draft controller will return a bus error response to the CPU if you ever try to write a misaligned word to the bus. Try, for example, to write a 32-bit word to address three. The operation will fail with a bus error. This was by design. Why? Because otherwise you’d then need to write across multiple words.

Well, why can’t we build a controller that will read or write across multiple words when requested? Such a controller could handle misaligned requests.

So, let’s start again, using the design template above, and see if we can adjust this controller to handle misaligned requests.

The first thing we are going to need are some flags to capture a bit of state. Let’s try these:

  • misaligned_aw_request: This indicates that the current AW* request is the first of two, the result of a misaligned write.

  • misaligned_request: This indicates that the current W* or AR* request is the first of two, again the result of a misaligned access.

  • misaligned_response_pending: Two responses are expected. As a result, if misaligned_response_pending is ever true, then we still expect either two BVALID returns or two RVALID returns. (One might be present on this clock cycle.)

  • misaligned_read: This signal is very similar to misaligned_response_pending, except that it isn’t cleared on the first read response. It’s used at the end to let us know that two read results need to be merged together into one before returning them to the CPU.

  • pending_err: Of our two responses, the first has returned an error. Since it is only the first of two, we haven’t returned the error response to the CPU yet. Hence, if pending_err, then we need to return a bus error to the CPU on the next bus return–regardless of what status response is returned with it.

We can now go back to the top and take another look at our xVALID and xREADY handshaking request flags.

	initial	M_AXI_AWVALID = 1'b0;
	initial	M_AXI_WVALID = 1'b0;
	initial	M_AXI_ARVALID = 1'b0;
	initial	M_AXI_BREADY = 1'b0;
	initial	M_AXI_RREADY = 1'b0;
	always @(posedge i_clk)
	if (!S_AXI_ARESETN)
	begin
		M_AXI_AWVALID <= 1'b0;
		M_AXI_WVALID  <= 1'b0;
		M_AXI_ARVALID <= 1'b0;
		M_AXI_BREADY  <= 1'b0;
		M_AXI_RREADY  <= 1'b0;

The big difference here is in how we handle a return. If a misaligned request is outstanding, then we don’t want to drop the xVALID signals on the first acceptance–we still need to issue the second request. Likewise, we now need to wait for two responses before dropping the xREADY signals.

	end else if (M_AXI_BREADY || M_AXI_RREADY)
	begin // Something is outstanding

		if (M_AXI_AWREADY)
			M_AXI_AWVALID <= M_AXI_AWVALID && misaligned_aw_request;
		if (M_AXI_WREADY)
			M_AXI_WVALID  <= M_AXI_WVALID && misaligned_request;
		if (M_AXI_ARREADY)
			M_AXI_ARVALID <= M_AXI_ARVALID && misaligned_request;

		if ((M_AXI_BVALID || M_AXI_RVALID) && !misaligned_response_pending)
		begin
			M_AXI_BREADY <= 1'b0;
			M_AXI_RREADY <= 1'b0;
		end

That’s the big change there. The logic required to start a memory operation won’t change.

	end else begin // New memory operation

		// Initiate a request
		M_AXI_AWVALID <=  i_op[0];	// Write request
		M_AXI_WVALID  <=  i_op[0];	// Write request
		M_AXI_ARVALID <= !i_op[0];	// Read request

		// Set BREADY or RREADY to accept the response.  These will
		// remain ready until the response is returned.
		M_AXI_BREADY  <=  i_op[0];
		M_AXI_RREADY  <= !i_op[0];

		if (i_cpu_reset || o_err || !i_stb || w_misalignment_err)
		begin
			M_AXI_AWVALID <= 0;
			M_AXI_WVALID  <= 0;
			M_AXI_ARVALID <= 0;

			M_AXI_BREADY <= 0;
			M_AXI_RREADY <= 0;
		end
	end

The r_flushing signal, indicating that we shouldn’t forward results to the CPU, is a little more complex. The big difference here is what happens if a misaligned response is still pending. In that case, we don’t want to clear our r_flushing signal on the first return, but rather on the second.

	initial	r_flushing = 1'b0;
	always @(posedge i_clk)
	if (!S_AXI_ARESETN)
		// If everything is reset, then we don't need to worry about
		// or wait for any pending returns--they'll be canceled by the
		// global reset.
		r_flushing <= 1'b0;
	else if (M_AXI_BREADY || M_AXI_RREADY)
	begin
		if (i_cpu_reset)
			// If only the CPU is reset, however, we have a problem.
			// The bus hasn't been reset, and so it is still active.
			// We can't respond to any new requests from the CPU
			// until we flush any transactions that are currently
			// active.
			r_flushing <= 1'b1;
		if (M_AXI_BVALID || M_AXI_RVALID)
			// A request just came back, therefore we can clear
			// r_flushing
			r_flushing <= 1'b0;
		if (misaligned_response_pending)
			// ... unless we're in the middle of a misaligned
			// request.  In that case, there will be a second
			// return that we still need to wait for.  This request,
			// though, will clear misaligned_response_pending.
			r_flushing <= r_flushing || i_cpu_reset;
	end else
		// If nothing is active, we don't care about the CPU reset.
		// Flushing just stays at zero.
		r_flushing <= 1'b0;

Address handling gets just a touch more complicated as well. In this case, any time an address is accepted we’ll increment it to the next word address.

	initial	M_AXI_AWADDR = 0;
	always @(posedge i_clk)
	if (!S_AXI_ARESETN && OPT_LOWPOWER)
		M_AXI_AWADDR <= 0;
	else if (!M_AXI_BREADY && !M_AXI_RREADY)
	begin // Initial address
		M_AXI_AWADDR <= i_addr;

	end else if ((M_AXI_AWVALID && M_AXI_AWREADY)
			||(M_AXI_ARVALID && M_AXI_ARREADY))
	begin // Subsequent addresses
		M_AXI_AWADDR[C_AXI_ADDR_WIDTH-1:AXILSB]
			<= M_AXI_AWADDR[C_AXI_ADDR_WIDTH-1:AXILSB] + 1;

		// All subsequent addresses shall be aligned per spec
		M_AXI_AWADDR[AXILSB-1:0] <= 0;
	end

There are a couple of things to remember here. First, when handling a misaligned request, we must always move to the next word–that’s what a misaligned request requires. Second, the low address bits of this second request should be zero. This will be appropriate for little endian systems. It’s not necessarily appropriate for big endian systems like the ZipCPU, but at least it won’t hurt.

The next trick is the M_AXI_WDATA and M_AXI_WSTRB values indicating which bytes to write to and the values to be written to them. The trick to making this work is to map the request onto two separate bus words. Once mapped to two words, we can then send the result to those words one at a time.

We’ll add one intermediate step here, though, which is to create the M_AXI_WSTRB value combinatorially first. This just simplifies writing the logic out, but not much more. Note that we are again shifting a set of ones up by the address in the low order bits of i_addr–just like we did before, only this time onto two words’ worth of byte enables instead of just one.

	always @(*)
		shifted_wstrb_word = { {(2*DW/8-4){1'b0}},
						4'b1111} << i_addr[AXILSB-1:0];

	always @(*)
		shifted_wstrb_halfword = { {(2*DW/8-4){1'b0}},
						4'b0011} << i_addr[AXILSB-1:0];

	always @(*)
		shifted_wstrb_byte = { {(2*DW/8-4){1'b0}},
						4'b0001} << i_addr[AXILSB-1:0];

The next change is that we’ll add two new registers: next_wdata and next_wstrb. These will hold the next values of M_AXI_WDATA and M_AXI_WSTRB–the values we’ll use for them on the second clock cycle of any misaligned request. (The swapped_wstrb_* values referenced below are derived from the shifted_wstrb_* values above; the optional endianness handling between the two has been elided here.)

	initial	axi_wdata = 0;
	initial	axi_wstrb = 0;
	initial	next_wdata  = 0;
	initial	next_wstrb  = 0;
	always @(posedge i_clk)
	if (OPT_LOWPOWER && !S_AXI_ARESETN)
	begin
		axi_wdata <= 0;
		axi_wstrb <= 0;

		next_wdata  <= 0;
		next_wstrb  <= 0;

		r_op <= 0;

Here’s the first of the key steps with next_wdata and next_wstrb: their logic is identical to the logic we used before, save that they are applied across two bus words.

	end else if (i_stb)
	begin
		casez(i_op[2:1])
		2'b10: { next_wdata, axi_wdata }
			<= { {(2*C_AXI_DATA_WIDTH-16){1'b0}},
			    i_data[15:0] } << (8*i_addr[AXILSB-1:0]);
		2'b11: { next_wdata, axi_wdata }
			<= { {(2*C_AXI_DATA_WIDTH-8){1'b0}},
			    i_data[7:0] } << (8*i_addr[AXILSB-1:0]);
		default: { next_wdata, axi_wdata }
			<= { {(2*C_AXI_DATA_WIDTH-32){1'b0}},
			    i_data } << (8*i_addr[AXILSB-1:0]);
		endcase

		// next_wstrb, axi_wstrb
		casez(i_op[2:1])
		2'b0?: { next_wstrb, axi_wstrb } <= swapped_wstrb_word;
		2'b10: { next_wstrb, axi_wstrb } <= swapped_wstrb_halfword;
		2'b11: { next_wstrb, axi_wstrb } <= swapped_wstrb_byte;
		endcase

		r_op <= { i_op[2:1] , i_addr[AXILSB-1:0] };

Given that axi_wdata and axi_wstrb are going to map directly to M_AXI_WDATA and M_AXI_WSTRB, that just leaves handling the second write cycle. For that, we just copy next_wdata to axi_wdata as soon as the channel isn’t stalled. We’ll likewise do the same for next_wstrb and axi_wstrb.

	end else if (M_AXI_WREADY)
	begin
		axi_wdata <= next_wdata;
		axi_wstrb <= next_wstrb;
	end

	always @(*)
		{ M_AXI_WSTRB, M_AXI_WDATA } = { axi_wstrb, axi_wdata };

What about detecting a misalignment? More than that, what if we want this core to either generate a bus error as before on misalignment, or to issue multiple requests?

To handle both capabilities, we’ll create a single-bit OPT_ALIGNMENT_ERR parameter. If this bit is set, misaligned requests will generate bus errors. If not, misaligned requests will be allowed to take place.
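
For reference, the corresponding parameter declarations might look like the following. The default values here are just my choices for illustration–check the actual core for its defaults.

	parameter [0:0]	OPT_ALIGNMENT_ERR = 1'b1;
	// OPT_LOWPOWER, referenced in a couple of the blocks above, would be
	// declared similarly
	parameter [0:0]	OPT_LOWPOWER = 1'b0;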

We’ll also split our misalignment signal into two. The first signal, w_misaligned, will simply indicate a misaligned request. The second signal, w_misalignment_err, will indicate that we want this misaligned request to turn into a bus error.

	always @(*)
	casez(i_op[2:1])
	// Full word
	2'b0?: w_misaligned = ((i_addr[AXILSB-1:0]+3) >= (1<<AXILSB));
	// Half word
	2'b10: w_misaligned = ((i_addr[AXILSB-1:0]+1) >= (1<<AXILSB));
	// Bytes are always aligned
	2'b11: w_misaligned = 1'b0;
	endcase

	always @(*)
		w_misalignment_err = OPT_ALIGNMENT_ERR && w_misaligned;

The next big component will be handling our new misalignment signals. Obviously, if we are just generating errors on any misaligned request, then we won’t need these signals and they can be kept at zero.

	generate if (OPT_ALIGNMENT_ERR)
	begin

		assign	misaligned_request = 1'b0;

		assign	misaligned_aw_request = 1'b0;
		assign	misaligned_response_pending = 1'b0;
		assign	misaligned_read = 1'b0;
		assign	pending_err = 1'b0;

This will allow the optimizer to simplify our logic when we just adjust the OPT_ALIGNMENT_ERR parameter.

On the other hand, if we are generating misaligned requests, then we’ll need to define these signals. The first indicates that this is a misaligned request, and a second W* or AR* operation is required.

	end else begin
		reg	r_misaligned_request, r_misaligned_aw_request,
			r_misaligned_response_pending, r_misaligned_read,
			r_pending_err;

		initial	r_misaligned_request = 0;
		always @(posedge i_clk)
		if (!S_AXI_ARESETN)
			r_misaligned_request <= 0;
		else if (i_stb && !o_err && !i_cpu_reset)
			r_misaligned_request <= w_misaligned
						&& !w_misalignment_err;
		else if ((M_AXI_WVALID && M_AXI_WREADY)
					|| (M_AXI_ARVALID && M_AXI_ARREADY))
			r_misaligned_request <= 1'b0;

		assign	misaligned_request = r_misaligned_request;

Since the AW* and W* channels need to be handled independently, we need a separate signal to handle the second request on the AW* channel. This signal is almost identical to misaligned_request above, save that it is cleared on AWREADY.

		initial	r_misaligned_aw_request = 0;
		always @(posedge i_clk)
		if (!S_AXI_ARESETN)
			r_misaligned_aw_request <= 0;
		else if (i_stb && !o_err && !i_cpu_reset)
			r_misaligned_aw_request <= w_misaligned && i_op[0]
					&& !w_misalignment_err;
		else if (M_AXI_AWREADY)
			r_misaligned_aw_request <= 1'b0;

		assign	misaligned_aw_request = r_misaligned_aw_request;

Knowing if a response will be the first of two expected is the purpose of misaligned_response_pending. It’s set much the same as the other two. The big difference in this signal is that it is cleared on either M_AXI_BVALID or M_AXI_RVALID–the first return of the misaligned response.

		initial	r_misaligned_response_pending = 0;
		always @(posedge i_clk)
		if (!S_AXI_ARESETN)
			r_misaligned_response_pending <= 0;
		else if (i_stb && !o_err && !i_cpu_reset)
			r_misaligned_response_pending <= w_misaligned
						&& !w_misalignment_err;
		else if (M_AXI_BVALID || M_AXI_RVALID)
			r_misaligned_response_pending <= 1'b0;

		assign	misaligned_response_pending
				= r_misaligned_response_pending;

The next signal, misaligned_read, simply tells us we will need to reconstruct the read response from two separate read values before returning it to the CPU.

		initial	r_misaligned_read = 0;
		always @(posedge i_clk)
		if (!S_AXI_ARESETN)
			r_misaligned_read <= 0;
		else if (i_stb && !o_err && !i_cpu_reset)
			r_misaligned_read <= w_misaligned && !i_op[0]
						&& !w_misalignment_err;

		assign	misaligned_read = r_misaligned_read;

Finally, our last misalignment signal is the pending_err signal. This signal gets set on any write or read error response. If that error was only the first of two expected responses, pending_err is what reminds us to return a bus error to the CPU once the second response arrives. It is cleared any time the interface goes idle, the CPU is reset, or we are flushing–guaranteeing that it will be clear before any new request begins.

		initial	r_pending_err = 1'b0;
		always @(posedge i_clk)
		if (i_cpu_reset || (!M_AXI_BREADY && !M_AXI_RREADY)
				|| r_flushing)
			r_pending_err <= 1'b0;
		else if ((M_AXI_BVALID && M_AXI_BRESP[1])
				|| (M_AXI_RVALID && M_AXI_RRESP[1]))
			r_pending_err <= 1'b1;

		assign	pending_err = r_pending_err;

	end endgenerate

The next several signals have only minor modifications.

The o_valid signal, indicating a valid read return to the CPU, needs to be adjusted so that it waits for the second return of any misaligned response. Similarly, we don’t return o_valid if either the current or past response, in the case of a pair of responses, indicates a bus error. In those cases, we’ll set o_err next.

	initial	o_valid = 1'b0;
	always @(posedge i_clk)
	if (i_cpu_reset || r_flushing)
		o_valid <= 1'b0;
	else
		o_valid <= M_AXI_RVALID && !M_AXI_RRESP[1] && !pending_err
				&& !misaligned_response_pending;

The error return is also quite similar. There are only a few differences. The first is that we don’t want to return an o_err response if there’s still a response pending. The second difference is that we’ll also return an o_err response if the prior response indicated a bus error.

	initial	o_err = 1'b0;
	always @(posedge i_clk)
	if (i_cpu_reset || r_flushing || o_err)
		o_err <= 1'b0;
	else if (i_stb && w_misalignment_err)
		o_err <= 1'b1;
	else if ((M_AXI_BVALID || M_AXI_RVALID) && !misaligned_response_pending)
		o_err <= (M_AXI_BVALID && M_AXI_BRESP[1])
			|| (M_AXI_RVALID && M_AXI_RRESP[1])
			|| pending_err;
	else
		o_err <= 1'b0;

The busy signals we return to the CPU don’t change. Those are the same as before, as is the o_wreg register.

	always @(*)
	begin
		o_busy   = M_AXI_BREADY || M_AXI_RREADY; // also pipe_stalled
		o_rdbusy = M_AXI_RREADY && !r_flushing;
	end

	always @(posedge i_clk)
	if (i_stb)
		o_wreg    <= i_oreg;

That leaves one complicated piece left of this–reconstructing the read return. This is sort of the reverse of the next_wdata and axi_wdata handling from above, save that this time we are working with { M_AXI_RDATA, last_result }. Note the ordering–the first value returned always ends up on the right, in the least significant position, on a little endian bus.

The first step is to construct the two-word wide return, and then to shift it appropriately so the desired data starts at the bottom byte. We handle this in a separate logic block so that we don’t get lint errors from shifting a value of one width into a result of another.

	always @(*)
	begin
		if (misaligned_read && !OPT_ALIGNMENT_ERR)
			pre_result = { M_AXI_RDATA, last_result }
					>> (8*r_op[AXILSB-1:0]);
		else
			pre_result = { 32'h0, M_AXI_RDATA }
						>> (8*r_op[AXILSB-1:0]);
	end

Now that we have this pre-result, we can construct our final value. First, on any read return we copy the return to our last_result register–in case this is a misaligned return. (The endian_swapped_rdata value below is M_AXI_RDATA after the optional endianness adjustment that we’ve elided here.)

	always @(posedge i_clk)
	if (M_AXI_RVALID)
	begin
		last_result <= endian_swapped_rdata;

		if (OPT_ALIGNMENT_ERR)
			last_result <= 0;

The next step is to turn this pre_result value into the value we return to the CPU. If this is a half-word or octet request, we’ll zero the upper bits as well.

		o_result <= pre_result;

		// Fill unused return bits with zeros
		casez(r_op[AXILSB +: 2])
		2'b10: o_result[31:16] <= 0;
		2'b11: o_result[31: 8] <= 0;
		default: begin end
		endcase
	end

In many ways, this second pass at this design illustrates the way most of my development has taken place recently. I’ll often draft a simple version of a design, and then slowly layer on top of it more and more complicated functionality until it’s everything I want.

In hindsight, the misalignment processing wasn’t nearly as complicated as I was fearing. I know, I tend to dread handling misaligned requests. However, it never seems to be that hard when I actually get down to building it. Once you adjust the signaling to handle two requests, the remaining process is fairly basic: place the data into a two word shift register and shift it as appropriate, then deal with each half of that register.

If you look over the design for this memory controller, you might notice other options as well. For example, there’s an OPT_LOWPOWER option that will force all unused signals to zero. There’s an OPT_SIGN_EXTEND option to sign extend the return data. We’ve already mentioned the OPT_ALIGNMENT_ERR option. Finally, there are some experimental SWAP_ENDIANNESS options that I’m still working with–as part of hoping that I can somehow keep a big endian CPU running on a little endian bus without massive changes. (I’m not convinced any of these endianness parameters, either the SWAP_ENDIANNESS or the SWAP_WSTRB options, will work–they’re still part of my ongoing development.)

Formal Verification

At this point in my design, I’ve only formally verified this memory controller. I haven’t yet simulated it. Yes, I’m expecting some problems when I get to simulation, but, hey, one step at a time, right?

Let’s now take some time, though, to look over some of the major parts of that proof. These include the AXI-lite interface properties, the CPU interface properties, and some cover checks to make sure the design works. This follows from what I’ve learned from previous experiences about what works for verifying a design. Perhaps it will work the first time I try it in simulation. We’ll see. (I’m still not convinced the big-endian CPU will work with this little-endian controller, formal proof or not … but we’ll see.)

AXI-lite interface

Two years ago, I posted a set of interface properties for working with AXI-lite. At the time, I was very excited about these properties. By capturing all the requirements of an AXI-lite interface into a set of formal properties, I could simplify any future verification problems. I predicted AXI-lite designs would become easy to build as a result.

I haven’t been disappointed. While I’ve made small adjustments to those properties since that time, they’ve seen me through a lot. Using them, I’ve been able to very quickly check designs posted on Xilinx’s forums. The check tends to take about a half hour or so. Even better, it’s pretty conclusive.

So how hard is it? There are only a couple steps. First, on any new design, I start by instantiating my AXI-lite master property set into the design.

	faxil_master #(
		.C_AXI_DATA_WIDTH(C_AXI_DATA_WIDTH),
		.C_AXI_ADDR_WIDTH(C_AXI_ADDR_WIDTH),
		.F_OPT_ASSUME_RESET(1'b1),
		.F_LGDEPTH(F_LGDEPTH)
	) faxil(.i_clk(S_AXI_ACLK), .i_axi_reset_n(S_AXI_ARESETN),
		//
		.i_axi_awready(M_AXI_AWREADY),
		.i_axi_awaddr( M_AXI_AWADDR),
		.i_axi_awcache(4'h0),
		.i_axi_awprot( M_AXI_AWPROT),
		.i_axi_awvalid(M_AXI_AWVALID),
		//
		.i_axi_wready(M_AXI_WREADY),
		.i_axi_wdata( M_AXI_WDATA),
		.i_axi_wstrb( M_AXI_WSTRB),
		.i_axi_wvalid(M_AXI_WVALID),
		//
		.i_axi_bresp( M_AXI_BRESP),
		.i_axi_bvalid(M_AXI_BVALID),
		.i_axi_bready(M_AXI_BREADY),
		//
		.i_axi_arready(M_AXI_ARREADY),
		.i_axi_araddr( M_AXI_ARADDR),
		.i_axi_arcache(4'h0),
		.i_axi_arprot( M_AXI_ARPROT),
		.i_axi_arvalid(M_AXI_ARVALID),
		//
		.i_axi_rresp( M_AXI_RRESP),
		.i_axi_rvalid(M_AXI_RVALID),
		.i_axi_rdata( M_AXI_RDATA),
		.i_axi_rready(M_AXI_RREADY),
		//
		.f_axi_rd_outstanding(faxil_rd_outstanding),
		.f_axi_wr_outstanding(faxil_wr_outstanding),
		.f_axi_awr_outstanding(faxil_awr_outstanding));

I’ll then create a SymbiYosys script file. These files are pretty basic, enough so that I now have a script to handle generating just about all but three lines of the file. At that point, I can run the design through the formal tool, which will often find any bugs present.
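
To give a flavor of what such a script might contain, here’s a minimal sketch. The design file and module name (axildata) are placeholders for illustration only, not the actual names from the repository; faxil_master.v holds the property set instantiated above.

	[options]
	mode prove
	depth 20

	[engines]
	smtbmc

	[script]
	read -formal faxil_master.v
	read -formal axildata.v
	prep -top axildata

	[files]
	faxil_master.v
	axildata.v

A prove-mode run like this one performs both the bounded check and the induction step we’ll discuss in a moment, and sby -f axildata.sby would then kick everything off.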

This design is almost that simple. In this case, I also need to incorporate a CPU interface property file as well, but we’ll get to that part in the next section.

At this point, SymbiYosys will either find a bug within 20 clock cycles in about 5 seconds, or there will likely not be a bug in the design at all. Sometimes I’ll just run it for 40-50 cycles if I’m not sure–or longer, depending on my patience level.

Once I get that far, most of the bugs in the design are gone.

Perhaps I’m a bit of a perfectionist, but this is rarely enough for me. I like to go further and verify these same properties for all time via induction. This, to me, is just a part of being complete.

So let’s spend some time working through some properties we might use to guarantee that this design passes an induction check.

In the case of the AXI-lite bus, this primarily consists of constraining the three counters: faxil_awr_outstanding, faxil_wr_outstanding, and faxil_rd_outstanding. We’ll go a bit farther here, and constrain some of our internal signals as well.

For example, if we are ever in a misaligned_request, then either WVALID or ARVALID should be set since this is our signal that we are in the first of two request cycles.

	always @(*)
	begin
		if (misaligned_request)
			assert(M_AXI_WVALID || M_AXI_ARVALID);

Similarly, if misaligned_aw_request is ever true, then we are in the first of two AWVALID cycles. That means M_AXI_AWVALID had better be true.

		if (misaligned_aw_request)
			assert(M_AXI_AWVALID);

If no misaligned responses are pending, then we should be able to at least limit the number of outstanding items. If any of the request lines, whether M_AXI_AWVALID, M_AXI_WVALID, or M_AXI_ARVALID, are still active, then, since there are no misaligned responses pending, there must be nothing outstanding on that channel. In all other cases, with no misaligned responses pending, the number of outstanding items can be at most one.

		if (!misaligned_response_pending)
		begin
			assert(faxil_awr_outstanding <= (M_AXI_AWVALID ? 0:1));
			assert(faxil_wr_outstanding  <= (M_AXI_WVALID ? 0:1));
			assert(faxil_rd_outstanding  <= (M_AXI_ARVALID ? 0:1));

Inequality constraints like this aren’t usually very effective, but they’re often where I’ll start a proof. Over time, I usually turn these inequalities into exact descriptions–although I didn’t do so for this design. Indeed, this particular proof is unusual in that the inequalities above are still important parts of my proof. (If I remove them, the proof fails …)

Of course, if there are no misaligned responses pending, then there can’t be any misaligned requests.

			assert(!misaligned_request);
			assert(!misaligned_aw_request);

On the other hand, if a misaligned response is pending, and we are in a read cycle, then the misaligned_read signal should be true.

		end else if (M_AXI_RREADY)
			assert(misaligned_read);

Now let’s turn our attention to flags specific to read cycles.

For example, if we aren’t in a read cycle then ARVALID, misaligned_read, and the number of outstanding read requests should all be zero.

		if (!M_AXI_RREADY)
		begin
			assert(!M_AXI_ARVALID);
			assert(faxil_rd_outstanding == 0);
			assert(misaligned_read == 1'b0);

On the other hand, if this is a read request then this can only be a misaligned request if nothing else is outstanding. After that, we do our best to come up with the correct read count. (It’s still an inequality, but it’s enough …)

		end else begin
			if (misaligned_request)
				assert(faxil_rd_outstanding == 0);
			assert(faxil_rd_outstanding <= 1 + (misaligned_read ? 1:0));
		end

Let’s now turn our attention to write signals. If we aren’t in the middle of a write cycle, then the write signals should all be zero. There should be no writes outstanding, nor any being requested.

		if (!M_AXI_BREADY)
		begin
			assert(!M_AXI_AWVALID);
			assert(!M_AXI_WVALID);
			assert(faxil_awr_outstanding == 0);
			assert(faxil_wr_outstanding  == 0);

On the other hand, if we are within a write cycle, then what conclusions can we draw? If we are still within the first request of a misaligned pair, then the number of outstanding items must be zero. Likewise, we will only ever have at most two requests outstanding at a time.

		end else begin
			if (misaligned_request)
				assert(faxil_wr_outstanding <= 0);
			if (misaligned_aw_request)
				assert(faxil_awr_outstanding <= 0);
			assert(faxil_awr_outstanding <= 2);
			assert(faxil_wr_outstanding <= 2);

But once I got this far, I punted. I just wasn’t certain how to constrain the write counters. So, I fell back on an old trick I’ve come across: the case statement. Using a case statement, I can often work my way through all the possibilities of a situation. A case statement also forces me to think about each of those possibilities individually.

			case({misaligned_request,
					misaligned_aw_request,
					misaligned_response_pending})
			3'b000: begin
				assert(faxil_awr_outstanding<= (M_AXI_BREADY ? 1:0));
				assert(faxil_wr_outstanding <= (M_AXI_BREADY ? 1:0));
				end
			3'b001: begin
				assert(faxil_awr_outstanding<= 1 + (M_AXI_AWVALID ? 0:1));
				assert(faxil_wr_outstanding <= 1 + (M_AXI_WVALID ? 0:1));
				end
			3'b010: assert(0);
			3'b011: begin
				assert(faxil_wr_outstanding<= 1 + (M_AXI_WVALID ? 0:1));
				assert(faxil_awr_outstanding == 0);
				assert(M_AXI_AWVALID);
				end
			3'b100: assert(0);
			3'b101: begin
				assert(faxil_awr_outstanding<= 1 + (M_AXI_AWVALID ? 0:1));
				assert(faxil_wr_outstanding == 0);
				assert(M_AXI_WVALID);
				end
			3'b110: assert(0);
			3'b111: begin
				assert(faxil_awr_outstanding == 0);
				assert(faxil_wr_outstanding == 0);
				assert(M_AXI_AWVALID);
				assert(M_AXI_WVALID);
				end
			default: begin end
			endcase
		end

As I’ve mentioned before on this blog, don’t worry about creating too many assertions. If you do, the worst that will happen is that there will be a minor performance penalty–assuming that your assertions are valid. If you assert something that isn’t true, the formal tool will catch it, and you’ll be patiently corrected. Indeed, creating too many assertions will never cause a design to pass an assertion that isn’t true. The problem isn’t usually too many assertions, but rather not having enough of them.

Moving on, and perhaps I should’ve asserted this earlier, we can either be in a write cycle, a read cycle, or no cycle. There should never be a time when both M_AXI_BREADY and M_AXI_RREADY are true together.

		// Rule: Only one of the two xREADY's may be valid, never both
		assert(!M_AXI_BREADY || !M_AXI_RREADY);
	end

Let’s put a quick constraint on r_flushing: if we aren’t busy, then we shouldn’t be flushing any responses. Since we’ve constrained o_busy to only ever be true if either M_AXI_BREADY or M_AXI_RREADY is set, this also effectively forces r_flushing to zero if nothing is outstanding and none of the AxVALID or WVALID request lines are active.

	always @(*)
	if (!o_busy)
		assert(!r_flushing);

When putting this core together, I made some of the signals combinatorial. One example is o_busy, which is set if either M_AXI_BREADY or M_AXI_RREADY are true. I may wish to come back later and adjust this design so that o_busy is registered. Indeed, this sort of task is common enough that it forms the basis for a project I often use in my formal courseware: given a working design, with a working set of constraints, adjust a combinatorial value so that it is now registered, and then prove that the design still works. In order to support this possibility later, I’ve included the combinatorial descriptions of o_busy and o_rdbusy among my formal property set.

	always @(*)
		assert(o_busy == (M_AXI_BREADY || M_AXI_RREADY));
	always @(*)
		assert(o_rdbusy == (M_AXI_RREADY && !r_flushing));
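
If I were to attempt that exercise on this design, a registered o_busy might look something like the following. This is only a rough sketch, built on the assumption that requests are accepted only while the core is idle and that a misalignment error generates no bus traffic; it is not the design as written.

	// Hypothetical sketch only: a registered o_busy, set once a request
	// is accepted and cleared when the final response returns.
	initial	o_busy = 1'b0;
	always @(posedge i_clk)
	if (!S_AXI_ARESETN)
		o_busy <= 1'b0;
	else if (i_stb && !w_misalignment_err)
		o_busy <= 1'b1;
	else if ((M_AXI_BVALID || M_AXI_RVALID) && !misaligned_response_pending)
		o_busy <= 1'b0;

The assertions above would then stay exactly as they are, and the induction engine would check that this registered copy still tracks M_AXI_BREADY and M_AXI_RREADY at all times.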

In general, I like to have one or more constraints forcing every register into its correct setting relative to everything else. Here, we constrain pending_err: if we are busy, and there’s a misaligned response pending, then we haven’t yet gotten our first response back. Therefore, if we haven’t received the first of the two responses, pending_err should be zero. It shouldn’t get set until and unless one of our return responses comes back in error.

	always @(*)
	if (o_busy && misaligned_response_pending)
		assert(!pending_err);

While I have more assertions in this section of the design, that’s probably enough to convince you that I’ve fully constrained the various faxil_*_outstanding counters to the internal state of the design.

What we haven’t done yet, however, is constrain the other half of the design: the CPU interface. Let’s do that next.

CPU Interface

One of the challenges associated with blindly attempting to formally verify an AXI design you’ve never seen before is that many AXI designs, like this one, are effectively bridges. That means they have two or more interfaces to them. An interface property file will only provide you with instant properties for one of those interfaces. You’ll still need to constrain the other interface.

Fig 14. Bridges require two interfaces to be constrained

In the case of the ZipCPU, there are two interfaces to memory. The ZipCPU also has many memory interface implementations split across two categories: instruction and data. When it comes to instruction fetching, the ZipCPU has a very simple and basic single instruction fetch, as well as a two-instruction pipelined fetch and an instruction fetch with a cache. In a similar vein, the ZipCPU has three basic data interfaces: a basic single load or store interface, a pipelined memory controller, and a data cache. These three categories have served the ZipCPU well, allowing me to easily adjust the CPU to fit in smaller spaces, or to use more logic in order to run faster in larger spaces.

Those original interfaces, however, are also all Wishbone interfaces.

When it came time to build an AXI interface, I stepped back to rethink my verification approach. The problem with each of those prior memory controllers was that they each had their own assumptions about the CPU within them. When I then verified the CPU, I switched those assumptions to assertions, but otherwise verified the CPU with the memory interfaces intact within it. The consequence of this approach was that I needed to re-verify the CPU with every possible data interface it might have.

This seemed rather painful, so I separated the instruction and data interface assumptions from their respective controllers into one of two property files: one for the instruction interface, and another for the data interface. The property files therefore describe a contract between the CPU core and the instruction and data interfaces. Anything the CPU core needs to assume about those interfaces gets asserted when verifying the interface, or assumed when verifying the CPU. By capturing this contract in one place, verifying new interfaces has become much easier.
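
To illustrate the idea, here’s a generic sketch of how such a contract property might be used, switching between assumption and assertion depending upon which side is being verified. The parameter name below is made up, and the particular condition is only an example, not a quote from the actual property file.

	// Illustration only.  When checking the memory controller, its
	// outputs (seen here as inputs to the property file) are asserted.
	// When the same file is instantiated in the CPU proof, the very
	// same condition becomes an assumption instead.
	parameter [0:0]	F_OPT_CPU_SIDE = 1'b0;	// hypothetical parameter

	generate if (F_OPT_CPU_SIDE)
	begin
		always @(*)
		if (i_rdbusy)
			assume(i_busy);
	end else begin
		always @(*)
		if (i_rdbusy)
			assert(i_busy);
	end endgenerate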

All of the former Wishbone memory interfaces have now been re-verified using one of these two property sets as appropriate.

Not only that, but now the ZipCPU has new AXI interfaces. There’s an AXI-lite instruction fetch module that can handle 1) one outstanding transaction, 2) two outstanding instruction fetch bus transactions, or even 3) an arbitrary number of outstanding instruction fetch transactions. I’ve also rebuilt the ZipCPU’s prefetch and instruction cache. One neat feature about these new AXI or AXI-lite interfaces is that they are all parameterized by bus width. That means that I won’t need to slow a 64-bit memory bus down to a 32-bit width for the CPU anymore.

It’s not just instruction fetch interfaces, either. This approach has made it easy to build data interfaces in the same way.

For now, let’s take a look at how easy it is to use this new interface.

The first step is to declare some signal wires to be shared between the memory module and the interface property set. These extra (formal verification only) signals are:

  • cpu_outstanding: A count of how many requests the CPU thinks the memory is working on. This count will get cleared on a CPU reset, i_cpu_reset.

  • f_done: This signal is generated by the memory controller to tell the property set that an operation has completed–whether read or write. Normally, something like this would be part of the interface between the memory unit and the CPU, something like o_valid or o_err above. However, there’s no means in this interface to announce the completion of a write operation other than dropping o_busy, so f_done takes its place.

  • f_last_reg: A copy of the register targeted by the last CPU load operation. This is important for the CPU pipeline, since there’s enough room in the CPU pipeline to read into any register but the last one, and so this last register needs to be tracked by the memory property set.

  • f_addr_reg: One of the rules of pipelined memory operations is that, in any string of ongoing operations, they all need to use the same base address register. This keeps the CPU from needing to keep track of which register will be written to by the operation. In particular, the address register shall not be written to by any operation–save perhaps the last one. The CPU will ensure this by never issuing a read request into the address register unless it waits for the memory controller to finish all of its reads first. The property set will accept this value as f_areg–again, it’s not part of the CPU’s interface proper, so we just assume its presence here. The CPU will actually produce such a register, since it knows what it is, and properties of that register will be asserted there–here they are only assumed.

  • f_pc: This flag, returned from the memory property set, will be true if the last read request is to read into either the program counter or the condition codes, both of which might cause the CPU to branch. Reads into the program counter or condition codes, if done at all, need to be the last read in any string. This return wire, from the property set, helps to make sure that property is kept.

  • f_gie: The ZipCPU has a lot of “GIE” flags all throughout it. “GIE” in this case stands for “Global Interrupt Enable”. In supervisor mode, the “GIE” bits are clear, whereas they are set in user mode–the only mode where the ZipCPU can be interrupted. The GIE bit is also the most significant bit in any register address–since the ZipCPU has one register set for user mode (GIE=1) and one for supervisor mode (GIE=0).

    Any string of read (or write) operations must have the same GIE bit, so this flag captures what that bit is.

  • f_read_cycle: This value, returned by the interface property set, keeps track of whether we are in a read cycle or a write cycle. To avoid hazards, the ZipCPU will only ever do reads or writes–never both at the same time. Knowing this value helps keep track of what type of requests are currently outstanding, so we can make sure we don’t switch cycles.

  • f_axi_write_cycle: This one won’t get used below. It’s a new one I had to create to support exclusive access when using AXI.

    First, a brief overview of how AXI exclusive access works when using the ZipCPU: the CPU must first issue a LOCK instruction, and then a load instruction of some size. This load is treated as an AXI exclusive access read, so M_AXI_ARLOCK will be set. If the read comes back as OKAY, rather than EXOKAY, a bus error is returned to the CPU indicating that the memory doesn’t support exclusive access. Otherwise, an ALU operation may (optionally) take place, followed by a store instruction. If the store instruction fails, that is if its result is OKAY rather than EXOKAY in spite of the EXOKAY result from the previous read access, then the memory controller returns a read result. (From a write operation? Yes!) That read result writes into the program counter a jump back to the original LOCK instruction to start over.

    For this reason, an AXI exclusive access store instruction is the only type of write instruction that will set o_rdbusy.

    That’s a long story, just to explain why this flag is necessary–to explain why o_rdbusy might be set on a store instruction, and to help guarantee that the result (if any) will either be written to the program counter or quietly ignored if the write was successful.

	wire	[3:0]	cpu_outstanding;
	reg		f_done;
	wire	[4:0]	f_last_reg, f_addr_reg;
	(* anyseq *) reg [4:0]	f_areg;
	wire		f_read_cycle, f_pc, f_gie;

One last step before instantiating this property set is to create the f_done signal. For this AXI interface, that’s pretty easy. An operation is done when we receive the M_AXI_BVALID or M_AXI_RVALID signal–with a couple of exceptions. We’re not done if a bus error will be produced. That’s another, separate, signal. Neither are we done if there’s a pending error, if this is the first of two responses, or if we are flushing requests that a recently reset CPU wouldn’t know about.

	initial	f_done = 1'b0;
	always @(posedge i_clk)
	if (!S_AXI_ARESETN || r_flushing || i_cpu_reset)
		f_done <= 1'b0;
	else
		f_done <= (M_AXI_RVALID && !M_AXI_RRESP[1]
			|| M_AXI_BVALID && !M_AXI_BRESP[1]) && !pending_err
				&& !misaligned_response_pending;

Still, it’s not much more complicated than anything we’ve already done.

With that out of the way, we can simply instantiate the formal memory property interface.

	fmem
	fcheck(
		.i_clk(S_AXI_ACLK),
		.i_bus_reset(!S_AXI_ARESETN),
		.i_cpu_reset(i_cpu_reset),
		.i_stb(i_stb),
		.i_pipe_stalled(o_busy),
		.i_clear_cache(1'b0),
		.i_lock(i_lock), .i_op(i_op), .i_addr(i_addr),
		.i_data(i_data), .i_oreg(i_oreg), .i_busy(o_busy),
			.i_areg(f_areg),
		.i_rdbusy(o_rdbusy), .i_valid(o_valid), .i_done(f_done),
		.i_err(o_err), .i_wreg(o_wreg), .i_result(o_result),
		.f_outstanding(cpu_outstanding),
		.f_pc(f_pc),
		.f_gie(f_gie),
		.f_read_cycle(f_read_cycle),
		.f_last_reg(f_last_reg), .f_addr_reg(f_addr_reg)
	);

Now, with all this background out of the way, we can finally verify this memory core. As I mentioned in the AXI-lite verification section above, if it weren’t for wanting to pass induction, these two property sets alone might well be sufficient to verify all but the data path through the logic.

How well does it work? Well, typically the formal tool takes less than twenty seconds to return any bugs. Even better, it points me directly to the property that failed, and the exact timestep where it failed.

That’s not something you’ll get from simulation.

However, since I like to verify a design using induction as well, I’ll often want to add some more properties.

Our first additional property just asserts that if r_flushing is ever true, then the CPU must have just been reset, so it shouldn’t be expecting anything more from us. If we aren’t flushing, then either this design is busy, or it is in the process of returning a result or an error to the CPU.

	always @(*)
	if (r_flushing)
	begin
		assert(cpu_outstanding == 0);
	end else
		assert(cpu_outstanding == (o_busy ? 1:0)
			+ ((f_done || o_err) ? 1 : 0));

If f_pc is ever set, then our one (and only) output must be to either the program counter, o_wreg[3:0] == 4'hf, or the condition codes register, o_wreg[3:0] == 4'he. Otherwise, if f_pc is clear, then no reads can read into either PC or CC registers.

	always @(*)
	if (f_pc)
		assert(o_wreg[3:1] == 3'h7);
	else if (o_rdbusy)
		assert(o_wreg[3:1] != 3'h7);

If any items are outstanding, then o_wreg must match the last register address requested. Hence, following a load into R0, both o_wreg and f_last_reg should point to the register address of R0.

	always @(*)
	if (cpu_outstanding > 0)
		assert(o_wreg == f_last_reg);

As long as we are busy, the high bit of the return register must match f_gie. This finishes our constraints upon all of the bits of o_wreg.

	always @(*)
	if (o_busy)
		assert(o_wreg[4] == f_gie);

As one last property, let’s make sure f_read_cycle matches our logic. If M_AXI_RREADY is true, then we should be in a read cycle–unless we are flushing things out of our pipeline following a CPU reset. Similarly, if M_AXI_RREADY is not true and M_AXI_BREADY is, then we should be in a write cycle and so we can assert !f_read_cycle.

	always @(*)
	if (M_AXI_RREADY)
		assert(f_read_cycle || r_flushing);
	else if (M_AXI_BREADY)
		assert(!f_read_cycle);

Notice how easy that was? All we had to do was tie a couple of return wires from the interface property set to the internal state of our design, and we then had all the properties we needed.

Cover properties

As a final step in this proof, I’d like to see how well it works. For that, let’s just create some quick counters and count the number of returns we receive–both for writes and then again for reads.

	reg	[3:0]	cvr_writes, cvr_reads;

	initial	cvr_writes = 0;
	always @(posedge i_clk)
	if (!S_AXI_ARESETN)
		cvr_writes <= 0;
	else if (M_AXI_BVALID && !misaligned_response_pending && !cvr_writes[3])
		cvr_writes <= cvr_writes + 1;

	initial	cvr_reads = 0;
	always @(posedge i_clk)
	if (!S_AXI_ARESETN)
		cvr_reads <= 0;
	else if (M_AXI_RVALID && !misaligned_response_pending && !cvr_reads[3])
		cvr_reads <= cvr_reads + 1;

Once we have this count, a simple cover check can produce some very useful and instructive traces. Indeed, these traces will show this core operating at its fastest speed.

	always @(posedge i_clk)
		cover(cvr_writes > 3);

	always @(posedge i_clk)
		cover(cvr_reads > 3);

The traces themselves are shown in Figs. 10 and 11 above. They show that this core can only ever achieve a 33% throughput at best.

No, 33% is not great. In fact, when you put 33% in context of the rest of any surrounding system, as we did in Fig. 13 above, it’s downright dismal performance. However, all designs need to start somewhere, and in many ways this is a minimally working AXI master memory controller design.

Conclusions

This is now our third article on building AXI masters. The first article discussed AXI masters in general followed by a demonstration Wishbone to AXI bridge. The second article discussed several of the problems associated with getting AXI bursts working properly, and why they are so challenging. This one returns to the simple AXI-lite master protocol, while also illustrating a working CPU memory interface.

As I’ve alluded to earlier, this is only the first of a set of three AXI memory controllers for the ZipCPU: This was the single access controller. I’ve also built a pipelined controller which should get much better performance. These are both AXI-lite designs. This particular controller also has an AXI (full) sister core, implementing the same basic design while also supporting exclusive access. My intent is to make a similar sister design to support pipelined access and AXI locking as well, but I haven’t gotten that far yet. I have gotten far enough, though, to have ported my basic Wishbone data cache to AXI. While usable, that work isn’t quite done yet, since it doesn’t (yet) support either pipelined memory access or locking, but it’s at least a data cache implementation using AXI that should be a step in the right direction. (Remember, I tend to design things in layers these days …)

Lord willing, I’d like to spend some time discussing AXI exclusive access operations next. I’ve recently modified my AXI property sets so that they can handle verifying AXI designs with exclusive access, and I’ve also tested the approach on an updated version of my AXI (full) demonstration slave. Sharing those updates will be valuable, especially since neither Xilinx’s MIG-based DDR3 memory controller, nor their AXI block RAM controller appear to support exclusive access at all. (Remember, the ZipCPU will return an error on any attempt at an exclusive access operation on a memory that doesn’t support it, so having a supporting memory is a minimum requirement for using this capability.) This can then be a prelude to a companion article to this one, discussing how to modify this controller so that it handles exclusive access requests in the future.

Let me also leave you with one last haunting thought: What would happen if, during a misaligned read operation across two words, a write took place at the same time? That’ll be something to think about.