Welcome to the ZipCPU blog. I started it years ago after building my own soft core CPU, the ZipCPU, and dedicated this blog to helping individuals stay out of FPGA Hell. I then transitioned from working on the ZipCPU to building bus components that might be used by every project: crossbars, bridges, DMAs, and such. Since then, my time has been spent primarily not on the CPU but rather on its peripherals. This last year, for example, has seen work on several memory controllers, including both NOR and NAND flash controllers, an SD Card (SDIO)/eMMC controller, and (now) a SATA controller. I’ve also had the opportunity to work on high speed networking, video, and even SONAR applications. All of this work is made easier by having both my own soft-core CPU and my own bus interconnect components, neither of which I’m afraid to dig into and debug if necessary.

With all of these distractions, it’s nice every now and then to come back to the ZipCPU.

One of my current projects requires that I benchmark AMD (Xilinx)’s DDR3 SDRAM MIG controller against the open source UberDDR3 controller. The performance differences are dramatic. My current (draft) article discussing these results works through a series of CPU and DMA based tests. For each test, the article first presents the C code for the test, then the assembly for the critical section, then a diagram of the CPU’s pipeline (reconstructed from simulation traces), and finally traces showing the differences between the two controllers.

All of that led me to this trace from the data cache, shown in Fig. 1 below.

Fig 1. ZipCPU Data Cache Miss

For quick reference, the top line is the clock. The JMP line beneath it is the signal from the CPU’s core to the instruction fetch indicating that the CPU needs to branch. The PF line shows the output of the prefetch (cache): whether an instruction is available for the CPU to consume and, if so, which one. The DCD line shows the output of the instruction decoder. OP is the output of the read operands pipeline stage, and WB is the writeback stage. The CYC, STB, and ACK lines are a subset of the Wishbone bus signaling used to communicate with memory. First come the Zip-* versions of these signals, showing them as they leave the CPU, and then the SDRAM-* signals from the crossbar, showing the same signals as they actually reach the memory controller itself.

At issue is how long it takes the CPU to respond to a cache miss. Notice how it takes the CPU 3 clock cycles from receiving an LW (load word) instruction from the read operands stage until the data cache initiates a bus request, another 3 cycles before the request can make it to the SDRAM controller, one cycle to return, and another 5 cycles from the completion of that request before the CPU can continue. That’s 11 clock cycles on every data cache miss above and beyond the cost of the memory access itself.

Ouch.

When it comes to raw performance, every cycle counts. Can we do better?

Yes, we can. Let’s talk about wrap addressing today.

That said, I’d like to focus this article on saving a couple of clock cycles in the instruction cache rather than the data cache shown in my example. Why? For the simple, practical reason that the instruction cache has been easier to update and get working (although I have yet to post the updates). My data cache upgrades, to date, remain a (broken) work in progress. Both, however, can be motivated by the diagram in Fig. 1 above.

Wrap Addressing

What might we do to improve the performance of the trace in Fig. 1?

The first thing we might do is reduce how long it takes to recognize that a particular value is not in the cache. There’s only so much that can be done here, however, since the cache tag memory is clocked. As a result, it will always take one clock cycle to look up the cache tag for any new request, another clock cycle to know it’s not the right tag, and then a third clock cycle to activate the bus.
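
To make that cost concrete, here’s a minimal sketch of a clocked tag lookup. This is illustrative only, not the ZipCPU’s actual cache: all of the names and parameters below are my own, and tag valid bits have been omitted for brevity.

	// A sketch showing why the clocked tag memory costs three cycles
	// before CYC/STB can rise on a miss.  Not the ZipCPU's actual code.
	module tag_lookup #(
		parameter	AW = 28,	// Word address bits
		parameter	CW = 10,	// log2(cache size, in words)
		parameter	LS =  3		// log2(line size, in words)
		) (
			input	wire			i_clk,
			input	wire			i_stb,	// New request
			input	wire	[AW-1:0]	i_addr,
			output	reg			o_bus	// Start a bus cycle
		);

		// One tag per cache line (valid bits omitted for brevity)
		reg	[AW-CW-1:0]	tag_mem	[0:(1<<(CW-LS))-1];

		reg	[AW-CW-1:0]	tag;
		reg	[AW-1:0]	r_addr;
		reg			r_check, r_miss;

		// Cycle 1: the tag memory is clocked, so the lookup takes a cycle
		always @(posedge i_clk)
		begin
			tag     <= tag_mem[i_addr[CW-1:LS]];
			r_addr  <= i_addr;
			r_check <= i_stb;
		end

		// Cycle 2: only now can we compare the tag against the request
		always @(posedge i_clk)
			r_miss <= r_check && (tag != r_addr[AW-1:CW]);

		// Cycle 3: register the decision to drive CYC/STB onto the bus
		always @(posedge i_clk)
			o_bus <= r_miss;
	endmodule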

The crossbar is separate from the CPU, and its timing is dominated by the need to run at the same clock rate as the CPU.

The UberDDR3 memory controller is a separate product from the CPU, so its performance is independent of the CPU itself.

How about the return? Once a value has been returned from memory to the cache, it then takes another clock cycle to shift the value into place for the CPU, so there’s not much to be done there … or is there?

There are two optimizations that can be made on this return path. The first is that we can take the value directly from the bus and return it to the CPU–rather than waiting for the value to first be written to and then read back from the cache’s memory. The second optimization is wrap addressing. We’ll discuss both of these optimizations today.

First, though, let me introduce the concept of a cache line. A cache line is the minimum amount of memory that can be read into the cache at a time. The cache itself is composed of many of these cache lines. Upon a cache miss, the cache controller will always go and read a whole cache line.
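
In hardware terms, the cache line shows up as a field of the address. As a sketch, assuming a word-addressed request i_addr, the 8-word (64-bit word) cache lines we’ll use below, and a hypothetical 128-line cache, the address might break down like this:

	// Hypothetical address breakdown: { tag, line_index, word_offset }
	wire	[2:0]	word_offset = i_addr[ 2:0];	// One of 8 words in the line
	wire	[6:0]	line_index  = i_addr[ 9:3];	// One of 128 cache lines
	wire	[17:0]	tag         = i_addr[27:10];	// Compared against the stored tag

On a miss, the controller reads all eight words sharing the same tag and line_index, that is, the full cache line.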

A long discussion can be had regarding how big a cache line can or should be. For me, I tend to follow the results published by Hennessy and Patterson, and keep my cache lines (roughly) 8 words in length. For simplicity, the ZipCPU’s caches are all one-way caches but, yes, a significant performance gain can be had by upgrading to two or even four-way caches. That, however, is a story for another day.

Now that you know what a cache line is, notice how the cache miss in Fig. 1 results in reading an entire cache line. As we’ll discuss in the memory performance benchmarking article (still to be finished), memory performance can be quantified by latency and throughput. Caches gain an advantage over single-beat reads or writes by reading more than one beat at a time, so increasing the line size improves efficiency. Increasing the line size, however, has two costs: 1) it increases the amount of time the bus is busy handling any request (remember, all requests are for a full cache line), and 2) it increases the risk that you spend a lot of time handling requests for instructions or data you’ll never use or need.

Now we can discuss wrap addressing. Wrap addressing is a means of reading the cache line out of order. Without wrap addressing, we might read the words in the cache line in order, from 0-7. With wrap addressing, the cache will read the requested item from the cache line first, then continue to the end of the line, and then go back and pick up what it skipped at the beginning. This way, as soon as the word that caused the cache miss in the first place has been read, the CPU can be unblocked to continue whatever it needs to do next while the cache controller finishes reading the rest of the cache line. The big difference is that, with wrap addressing, the cache line is read in more of a priority fashion. “Wrap addressing” is just the name given to this style of out of order addressing.
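
In hardware, wrap addressing can be as simple as letting the low bits of the burst address counter roll over. Here’s a sketch, again with made-up signal names rather than the ZipCPU’s actual fetch logic:

	// Request addresses in wrap order: start at the word that missed,
	// then increment modulo the line size (LS = log2 of the line size)
	reg	[LS-1:0]	req_word;

	always @(posedge i_clk)
	if (start_miss)
		// First beat: the word the CPU actually asked for
		req_word <= i_addr[LS-1:0];
	else if (o_wb_stb && !i_wb_stall)
		// Later beats: the counter naturally wraps from the last
		// word of the line back around to word zero
		req_word <= req_word + 1;

	// The upper address bits (the line base) stay fixed throughout
	assign	o_wb_addr = { line_base, req_word };

For a request that misses on word 6 of an 8-word line, this counter generates the sequence 6, 7, 0, 1, 2, 3, 4, 5: exactly the priority order described above.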

That’s what it is. Let’s now look at its impact.

Wrap Addressing with the ZipCPU’s Instruction Cache

Some years ago, I added wrap addressing to the ZipCPU’s AXI instruction and data caches. Up until that time, I had pooh-poohed the benefit that might be had by using it. The ZipCPU was designed to be a “simple” and “low-logic” CPU, and wrap addressing would just complicate things. Or so I judged. Then I tried it. At the time, I just needed something that used wrap addressing; the AXI bus functional model I had been given just wasn’t up to the task, but the ZipCPU could issue wrap addressing requests quite nicely. In the process, I was surprised at how much faster the ZipCPU ran when its caches used wrap addressing.

That experiment died, however, once the need was over. The big reason for its death was simply that I don’t use AXI often. Sure, the ZipCPU has AXI memory controllers, but they only fit the CPU so well. The AXI bus is little endian and the ZipCPU is big endian, so the two aren’t a natural fit; there’s plenty of pain at the seams. Further, adding wrap addressing to my Wishbone memory controllers was simply work that wasn’t being paid for. No, it doesn’t help that the Wishbone bus doesn’t really offer burst or wrap support, but I think you’ll find that issue to be irrelevant to today’s discussion.

Wishbone wrap addressing for the ZipCPU therefore languished until I was recently motivated by examining the MIG and UberDDR3 memory controller benchmark results. Indeed, I found myself a touch embarrassed at the performance the CPU was delivering.

For illustration, let’s look at the first several instructions of a basic ZipCPU test program I use. We’ll break the program into two portions, starting with its first several instructions.

	; Clear all registers
	;   The "|" separates two instructions, both of which are
	;   packed into a single instruction word.
 4000000:	86 00 8e 00 	CLR        R0            | CLR        R1
 4000004:	96 00 9e 00 	CLR        R2            | CLR        R3
 4000008:	a6 00 ae 00 	CLR        R4            | CLR        R5
 400000c:	b6 00 be 00 	CLR        R6            | CLR        R7
 4000010:	c6 00 ce 00 	CLR        R8            | CLR        R9
 4000014:	d6 00 de 00 	CLR        R10           | CLR        R11
 4000018:	66 00 00 00 	CLR        R12
	; Set up the initial stack pointer
 400001c:	6a 00 00 10 	LDI        0x08000000,SP	; Top of stack
 4000020:	6a 40 00 00 
	; Guarantee we are in supervisor mode, and trap into supervisor
	; mode if not
 4000024:	76 00 00 00 	TRAP
	; Provide a set of initial values for all of the user registers 
 4000028:	7b 47 c0 1e 	MOV        $120+PC,uPC
 400002c:	03 44 00 00 	MOV        R0,uR0
 4000030:	0b 44 00 00 	MOV        R0,uR1
 4000034:	13 44 00 00 	MOV        R0,uR2
 4000038:	1b 44 00 00 	MOV        R0,uR3
 400003c:	23 44 00 00 	MOV        R0,uR4

These get us to the end of the first cache line and now to the beginning of the second. Take note that there have been no jumps or branches in this assembly; it’s just a straightforward walk from one instruction to the next through the test program. (Yes, we’ll get to branches soon enough.)

The instructions then continue loading the user register set with default values.

 4000040:	2b 44 00 00 	MOV        R0,uR5
 4000044:	33 44 00 00 	MOV        R0,uR6
 4000048:	3b 44 00 00 	MOV        R0,uR7
 400004c:	43 44 00 00 	MOV        R0,uR8
 4000050:	4b 44 00 00 	MOV        R0,uR9
 4000054:	53 44 00 00 	MOV        R0,uR10
 4000058:	5b 44 00 00 	MOV        R0,uR11
 400005c:	63 44 00 00 	MOV        R0,uR12
 4000060:	6b 44 00 00 	MOV        R0,uSP
 4000064:	73 44 00 00 	MOV        R0,uCC
	; Finally, we call the bootloader function to load software into RAM
	; from flash if necessary (it isn't in this case), and to zero any
	; uninitialized global values
 4000068:	03 43 c0 02 	LJSR       @0x040000b4    // Bootloader
 400006c:	7c 87 c0 00 
 4000070:	04 00 00 b4 
	; Software continues, but the next section is outside the scope
	; of today's discussion.
	; ....

These end with a jump to subroutine instruction, followed by the beginning of the “_bootloader” subroutine below.

In this case, the cache line starts at address 0x04000080. However, we don’t start executing there in our example. Instead, we start executing partway through the cache line at the beginning of the bootloader subroutine.

040000b4 <_bootloader>:
	; Our first step is to create a stack frame.  For this, we
	; subtract from the stack pointer, and then store any
	; registers we might clobber onto the stack.  As before,
	; the "|" separates two instructions, both of which are
	; packed into a single instruction word.
 40000b4:	e8 10 ad 00 	SUB        $16,SP        | SW         R5,(SP)
 40000b8:	b5 04 bd 08 	SW         R6,$4(SP)     | SW         R7,$8(SP)
 40000bc:	44 c7 40 0c 	SW         R8,$12(SP)
 40000c0:	0a 00 00 00 	LDI        0x00000004,R1
 40000c4:	0a 40 00 04 
 40000c8:	0c 00 00 04 	CMP        $4,R1
 40000cc:	78 88 01 0c 	BZ         @0x040001dc
 40000d0:	0a 00 00 00 	LDI        0x00000004,R1
 40000d4:	0a 40 00 04 
 40000d8:	0c 00 00 04 	CMP        $4,R1
 40000dc:	32 08 00 20 	LDI.Z      0x04000000,R6
	; ....

Together, these two sets of instructions make an awesome example to see how wrap addressing would work from an instruction fetch perspective.

One of the things I like about this example is the fact that the test starts with many sequential instructions and no jumps (branches). This will help provide us a baseline of how things work–before jumps start making things complicated.

For today’s discussion, our cache line size is 8 words, each having 64 bits. The ZipCPU’s nominal instruction size is 32 bits. Therefore, each cache line will nominally contain 16 instructions. Our first cache line, however, contains many clear (CLR) instructions (really, load-immediate 0 into register …), and two of these instructions can be packed into a single 32b word. This is shown above using the “|” characters. Fig. 2 shows how the CPU pipeline works through these initial instructions, without wrap addressing.

Fig 2. Starting the cache, without wrap addressing

Following the CPU reset, the cache starts with the JUMP flag set. Following a jump, it takes us 4 clock cycles to determine that the new address is not in the cache and to therefore start a bus cycle.

This bus cycle is painful. When using the MIG, it requires (roughly) 35 cycles (on a good day) to read all eight words. When using the UberDDR3 controller, it requires (roughly) 18 cycles. Since the ZipCPU can nominally execute one instruction per cycle, this is a painful wait.

Once the bus cycle completes, we take another two cycles to present the first instruction from the newly read cache line to the CPU. The decoder then takes two clock cycles with this instruction, since it contains two instructions packed into a single word, and so forth. From here on out, instructions are passed to the CPU at one instruction word per clock cycle, unless the CPU needs to take more clock cycles with them, as is the case with these compressed instructions.

Some instructions, such as the load immediate (LDI) instruction, are actually two separate instructions: a bit-reverse instruction to load the high order bits, followed by a load-immediate-low. Other than that, things stay straightforward until the end of the cache line. Once we get to the end, it takes us another 4 cycles to determine that the next instruction is not in the cache, and so a new bus cycle begins again.

Now that we know how things work normally, we have our first chance for an improvement: what if we started feeding instructions to the CPU before all of the instructions had been read from memory and returned across the bus? What if we fed the next instruction to the CPU as soon as it was available?

In that case, we might see a trace similar to Fig. 3 below.

Fig 3. Feeding instructions straight from the bus returns

We can now overlap our instruction read time with our instruction issue, saving ourselves a full 10 cycles!
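
In logic, this first optimization is little more than a bypass around the cache memory. The sketch below, with illustrative names only, forwards an acknowledged bus word straight to the CPU whenever it happens to be the word the CPU is stalled on:

	always @(posedge i_clk)
	begin
		// Normal path: serve instructions from the (clocked) cache
		o_insn  <= cache_mem[cpu_addr[CW-1:0]];
		o_valid <= cache_hit;

		if (i_wb_ack && (beat_count == cpu_addr[LS-1:0]))
		begin
			// Bypass path: the word now returning on the bus is
			// the very word the CPU is waiting for, so forward it
			// without waiting for the line read to complete
			o_insn  <= i_wb_data;
			o_valid <= 1'b1;
		end
	end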

Let’s follow this further. What would happen in the case of a jump/branch? Without any modifications to our instruction cache (i.e. before wrap addressing), the JSR initiates a jump at the end of Fig. 4 below.

Fig 4. A Jump Instruction

This trace is a touch more eventful. For example, it includes a move to the CC register. On the ZipCPU, this register contains more than just the condition codes; it also contains the user vs. supervisor mode control. This creates a pipeline hazard, and so instructions need to be stalled throughout the pipeline until this instruction has had a chance to write back, clearing the hazard.

The ZipCPU’s JSR instruction follows, requiring three instruction words. The first instruction word moves the program counter plus two into R0, which will then contain the return address for the subroutine. On other architectures, such an instruction is often called a “link register” instruction, but on the ZipCPU this is simply the first word of the three-word JSR instruction. The second instruction loads a new value into the program counter. Technically, this is a LW (PC),PC instruction: loading the value of memory, as found at the program counter, into the program counter. Practically, it just allows us to place a 32b destination address into the instruction stream. Once this instruction is passed to the decoder, the decoder recognizes the unconditional jump and sets a flag telling the instruction cache that it now wants a new instruction out of order. The instruction cache then takes four clock cycles to determine that this new value is not in the cache, and our cycle repeats.
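
As a sketch of that decoder hand-off, consider the fragment below. The field names and opcode constants are placeholders of mine, not the ZipCPU’s actual instruction encoding:

	localparam [4:0]	OP_LW  = 5'h12;	// Placeholder opcode value
	localparam [3:0]	PC_REG = 4'hf;	// The program counter register

	// "LW (PC),PC": load the word at *PC into the PC, an absolute jump
	wire	w_ljmp = (w_opcode == OP_LW) && (w_breg == PC_REG)
				&& (w_dreg == PC_REG);

	always @(posedge i_clk)
	if (i_ce)
	begin
		// Tell the fetch we want our next instruction out of order
		o_early_branch <= w_ljmp;
		o_branch_pc    <= w_next_word;	// The 32b target address
	end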

As before, we can compress this a touch by serving our instructions to the CPU immediately as they are read from the bus–instead of waiting for the entire cache line to be read first. You can see how this optimization might speed things up in Fig. 5.

Fig 5. JSR instruction, post optimization

That’s how the first of our two optimizations works.

Following the jump, without wrap addressing, the pipeline would look like Fig. 6 below.

Fig 6. JSR Landing, no optimization

To see what’s happening here, notice that we just jumped to address 0x040000b4. Given our cache line size of eight words, with each word being 64 bits, this cache line starts at address 0x04000080. If we just returned values from the bus as soon as they were available, we’d have to read six bus words before we got to the one we’re interested in, as shown in Fig. 6.

 4000080:	; Word 0: I don't care about these instructions.  I'm jumping
 4000084:	;	to address 0x040000b4.  I just have to read
 4000088:	; Word 1: these excess instructions because I'm operating on an
 400008c:	;	entire cache line.
 4000090:	; Word 2:
 4000094:
 4000098:	; Word 3: Still haven't gotten to anything I care about ...
 400009c:
 40000a0:	; Word 4:
 40000a4:
 40000a8:	; Word 5:
 40000ac:
 40000b0:	; Word 6: This is the first half of the word I do care about
 40000b4:	;	THIS IS THE FIRST INSN OF INTEREST!
 40000b8:	; Word 7:
 40000bc:	;
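
The arithmetic behind that layout is straightforward. Eight 64-bit words make a 64-byte line, so the line base is the address with its low six bits cleared, and the word offset is address bits [5:3]:

	// Worked example for the jump target 0x040000b4:
	//   line_base   = 0x040000b4 & ~32'h3f      = 0x04000080
	//   word_offset = bits [5:3] of 0x040000b4  = 6
	// so the target lies in (the second half of) word 6 of the line
	wire	[31:0]	line_base   = i_addr & ~32'h3f;
	wire	[ 2:0]	word_offset = i_addr[5:3];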

Why not, instead, request the address we are interested in first? Instead of starting with word 0 and reading through word 6, we might instead start with word 6, read word 7, and then finish by reading the first part of the cache line (words 0-5) while the CPU takes our instruction and gets (potentially) busy doing useful things.

Fig. 7 shows how this wrap addressing might look.

Fig 7. Instruction cache miss using WRAP addressing

Here, we request the last two instruction words of the cache line, words 6 and 7, and then instruction words 0-5. Word 6 contains two instructions, but we’re only interested in the second of those two. That one is a compressed instruction, packing two instructions into 32 bits. Word 7 then contains another three instructions: one packed instruction word and one normal one.

The trace gets a touch more interesting, though, given that the second instruction wants to store a word into memory. The ZipCPU, however, has only one bus interface–an interface that needs to be shared between instruction and data bus accesses. This means that the data access, i.e. the store word instruction, must wait until the instruction cache’s bus cycle completes.
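
A sketch of that sharing, with illustrative names once again, might look like the arbiter fragment below: whichever unit owns the bus keeps it until its cycle completes, which is why the store above must wait out the fetch’s line read.

	reg	grant_data;

	always @(posedge i_clk)
	if (!o_wb_cyc)
		// The bus is idle, so a new owner may be chosen.  Here the
		// data unit wins whenever it is requesting the bus.
		grant_data <= dcache_cyc;
	// else: the current owner keeps the bus until it drops CYC

	assign	o_wb_cyc  = (grant_data) ? dcache_cyc  : icache_cyc;
	assign	o_wb_stb  = (grant_data) ? dcache_stb  : icache_stb;
	assign	o_wb_addr = (grant_data) ? dcache_addr : icache_addr;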

Conclusions

The next step in this article should really be an analysis section, one that analytically quantifies the additional performance achieved by using wrap addressing over what I had been using, and then compares that prediction against some actual performance measure. Sadly, that’s one part of caches that I haven’t managed to get right: the performance analysis. Even worse, the lack of a solid ability to analyze this improvement has kept me from writing an article introducing the ZipCPU’s instruction cache in the first place. Perhaps I’ll manage to come back to this later, although it’s held me back for a couple of years now.

Since I haven’t presented the instruction cache in the first place, it doesn’t really make sense to write an article presenting the modifications required to introduce wrap addressing. That said, it was easier to do than I was expecting.

Fig 8. Is formal worth it?

I suppose “easier” is a relative term. I upgraded both instruction and data caches quickly, perhaps even in an afternoon. Then, when everything failed in simulation, I reverted the data cache updates to focus on the instruction cache updates. Those updates are now complete, as is their formal proof, so I expect I’ll push them soon. All in all, the work took me a couple of days spread over a month or so, with (as expected) the verification taking the longest.

No, the updates aren’t (yet) posted. Why not? Because this update lies behind the ZipCPU’s AXI DMA upgrade, and … that one still has bugs to be worked out. What bugs? Well, after posting the DMA initially, I decided I wanted to change how the DMA handled unaligned FIXED addressing. My typical answer to unaligned FIXED addressing is to declare it disallowed in the user manual, but for some reason I thought I might support it this time. The new/changed requirements then broke everything, so I have some updates left to do there before the formal proofs and simulations pass again.

So my next steps are to 1) repeat this work with the data cache, and 2) finish working with the ZipCPU’s DMA, so that 3) I can post another upgrade to the ZipCPU’s repository. In the meantime, I’ll probably post my DDR3 controller memory performance benchmarks before these updates hit the ZipCPU official repository.

For now, let me point out that the wrap addressing performance is significantly better, and the logic cost associated with it is (surprisingly) minimal. How much better? Well, that answer will have to wait until I can do a better job quantifying cache performance …