Mystery post: The ugliest bug I've ever encountered

I’ve been working with software most of my life–since long before I ever went to college. Now, after several decades of working with software, I’d like to think I may have learned something about fixing bugs in software.

I’ve learned to debug bugs by their patterns, and the more often I see the same error lead to the same bug the more these patters are enforced in my mind. Hence, any time the symptoms of a bug match a pattern I’ve seen before, I know exactly where to look for the bug. When younger programmers ask me why their code isn’t working, I just match what’s going on to a pattern I’ve seen in the past, and I can usually find any problem they might have quite quickly.

I’ve got to believe I’m not alone in this approach, and that others have also used this approach to debugging software as well.

While this approach has served me well over time, there was one bug I encountered recently that didn’t match any of the patterns I’ve learned over my many decades.

The Symptoms

Are you ready to see if you can recognize this bug? Think you can do it? Here’s what I was looking at.

Fig 1. Writing past the end of memory

The code crashed with a bus error, created by trying to access a non-existent memory address using the stack pointer.

This isn’t the first time I’ve ever had a computer program do the wrong thing. As a result, I design my wishbone interconnects to both return an error upon any attempt to read from a non-existent address, as well as to record the address created the error. For this reason, I knew the address the interconnect was seeing–and it was indeed beyond the end of memory.
If you looked through and followed the code, there should never have been a bus error. GCC had put the proper logic into the code to calculate the correct address. For some reason, the correct address wasn’t getting calculated. Indeed, if you examined the CPU state following the bus error, the stack pointer wasn’t pointing to the right value.
If I used the debugger to step through the code, there would be no error.
Any attempt to add a debugging statement prior to the bug would keep the CPU from getting a bus error at that instruction.

This includes both the BREAK instruction as well any instructions which might have been used to trigger the wishbone scope.
In my frustration at one point while I was trying to figure out what the problem was, I decided to try Voodoo computing.

Defn: Voodoo computing, verb: The process of fixing what isn’t broken in an attempt to fix what is. It is similar to Cargo Cult programming.

So I restarted the ZipCPU from the beginning of the program. (This was a warm reboot, not a cold one from no power.) On this second program start, the code executed as designed and the program failed elsewhere.

Confused yet? I certainly was. Let me show you what the code looked like.

The Software in Question

The code that failed was the very first function call following a reboot.

Fig 2. Initial stack pointer

The ZipCPU would start at the reset address and it would load the stack pointer with a reference to the end of memory. While both of these are configurable addresses within the ZipCPU infrastructure, inspecting the code confirmed that the assembler and linker had done their jobs correctly. After a couple of other instructions upon CPU startup, the CPU would then jump to this first_subroutine and quickly fail.

Here’s what those first few instructions looked like,

reset_address:
	; Load the stack pointer with the address of the very last location
	; in memory.  (LDI = load immediate, SP = stack pointer)
	LDI	#end_of_memory,SP
	; ... some other setup lines
	JSR	first_subroutine
	; ... code continues

The LDI instruction within the ZipCPU can load an arbitrary value into any register over the course of up to two instructions. Specifically, it takes one instruction for anything up to 23-bits, and two instructions for anything longer.

Fig 3. The first Stack Frame

The program then jumped to a subroutine. This involves copying the program counter, PC, into the R0 general purpose register, and then jumping to the address given–in this case the CPU jumped to first_subroutine. (I’ve since forgotten the subroutine’s actual name.)

The first several instructions in the first_subroutine were much like any other subroutine. They created a stack frame by subtracting a value from the stack pointer–adjusting it to point earlier in memory. This creates an area in memory where the program can hold local variables, commonly called a stack frame. The program then attempted to save some registers to this new stack frame. In assembly, this looked like,

first_subroutine:
	SUB	16,SP	; Create a stack fromm, subtracting 16 from SP
	SW	R0,(SP)	; Store R0 into the first position in the stack frame
	SW	R1,4(SP); BUS ERROR HERE!!!!

If you’ve never seen ZipCPU assembly before, it reads from left to right, with the destination of the instruction on the right. Hence SUB 16,SP subtracts sixteen from the stack pointer and places the result back into the stack pointer. Likewise SW R0,(SP) stores the value of R0 into the address given by the stack pointer, and SW R1,4(SP) does roughly the same thing, with the exception that R1 is saved into the location four bytes later. Of these three instructions, only the SUBtract instruction modifies the stack pointer.

If I pulled the code up in a debugger following this bus error, the stack pointer SP would have the value given by the end_of_memory value it was initially loaded with. It was as though the stack frame was never created, despite the SUB 16,SP instruction.

When the ZipCPU then tried to write into this stack frame, it would write past the end of memory. Depending on whether end_of_memory pointed to the last address in memory (where I commonly set it), or just past the end of memory would then determine whether storing R0 to the location in the stack pointer caused the bus error or whether it was storing R1 to the location 4 bytes later that caused the bus error. Either way, the code caused a bus error and none of it made any sense to me.

Before reading any further, think about the description above, and then ask yourself: does this match any bug pattern you’ve ever seen before? The program fails the first time it is tried, but not the second–but then it would fail on the third time. Any attempt to add debugging code kept this error from happening.

Oh, and let me give you a hint: I was running my own code, on my own ZipCPU, within an FPGA.

Any ideas?

What was it?

In many ways, I was at a loss to figure out the problem until I started up the program in a Verilator based simulation. (Yes, I know, I was running code that I hadn’t simulated first–shame on me.) Once I saw the bug take place within the simulator, I knew I could find it quickly.

As it turned out, the problem lied in the prefetch and instruction cache module. This module was declaring that an instruction was valid one clock cycle before the cache logic had finished writing that instruction into the cache. In this case, there was often a NOOP in the cache instead of the SUB 16,SP instruction. Hence, when the CPU was erroneously given the NOOP instruction, the stack pointer, SP, remained where it was–pointing to the end of memory. The subsequent SW (store word) instruction then wrote past the end of memory, which then triggered the bus error.

I’ve tried to draw this in Fig 4 below.

Fig 4. Reading from the cache one clock too early

The figure shows a single cache line. The prefetch/i-cache module was filling that cache line with valid instructions. It had filled all but the last instruction, when it then declared the instruction was valid. Then, as it filled that last instruction with SUB 16,SP, the CPU read the NOOP instruction that had been there one clock earlier, hence leading to the bug.

Had the subroutine’s first instruction landed anywhere else in the cache line, being one clock early wouldn’t have mattered–since those locations would’ve already been filled by the cache logic at that point. This particular location in the cache line made this appear to be one of those “phantom” bugs–bugs that manifest at some times but not at others. It also made it harder to discover–as discussed above.

So let’s go back through those symptoms and explain what happened, shall we?

We just explained why the bus error took place–the stack pointer never got updated, and so the program tried to write off the end of memory.
Although the SUB 16,SP instruction was present in the code, the prefetch returned a no operation instruction (NOOP) to the CPU, so this subtract was never executed–leaving the stack pointer pointing to the end of memory.
The debugger never had a problem because it it operated at a much slower speed than the CPU was capable of operating. This gave the CPU in single stepping mode enough time to load the cache before trying to read from the last address in the cache line.
If I placed another instruction anywhere before the subtract in the program’s instruction stream, then the SUB 16,SP instruction would no longer be the last element in a cache line, and there wouldn’t be any timing error.
If I restarted the computer after this failure, the SUB 16,SP instruction would already be in the cache. Hence, when the prefetch declared the instruction to be valid one clock too early, it would still read what was in the cache from the last time, i.e. the SUB 16,SP instruction.

That the CPU then failed somewhere else should come as no surprise, when there is a bug of this type within its code.

This bug was so strange, and so different, that I though it would be fun to share here. I hope you enjoyed it.

Lessons Learned

So what are some of the lessons I learned from this bug?

I learned the value of the simulator. Traces taken using the wishbone scope just didn’t have enough information within them to permit me to recognize the bug quickly.
I (re)learned the fact that not all test benches will catch all bugs.. This bug was not caught by my prefetch testbench.
I learned to place several cache testing pieces of code into my CPU testing program. As a result, were this bug ever to show again, I’d detect it early on.
Even after all of this, I never quite trusted my prefetch and cache code until some time later when I formally proved that it worked.

Perhaps you’d like to see that formal proof? I’d love to share it! However, that will have to wait for another day and another post.