Mystery post: The ugliest bug I've ever encountered
I’ve been working with software most of my life–since long before I ever went to college. Now, after several decades of working with software, I’d like to think I may have learned something about fixing bugs in software.
I’ve learned to debug bugs by their patterns, and the more often I see the same error lead to the same bug the more these patters are enforced in my mind. Hence, any time the symptoms of a bug match a pattern I’ve seen before, I know exactly where to look for the bug. When younger programmers ask me why their code isn’t working, I just match what’s going on to a pattern I’ve seen in the past, and I can usually find any problem they might have quite quickly.
I’ve got to believe I’m not alone in this approach, and that others have also used this approach to debugging software as well.
While this approach has served me well over time, there was one bug I encountered recently that didn’t match any of the patterns I’ve learned over my many decades.
The Symptoms
Are you ready to see if you can recognize this bug? Think you can do it? Here’s what I was looking at.
-
The code crashed with a bus error, created by trying to access a non-existent memory address using the stack pointer.
This isn’t the first time I’ve ever had a computer program do the wrong thing. As a result, I design my wishbone interconnects to both return an error upon any attempt to read from a non-existent address, as well as to record the address created the error. For this reason, I knew the address the interconnect was seeing–and it was indeed beyond the end of memory.
-
If you looked through and followed the code, there should never have been a bus error. GCC had put the proper logic into the code to calculate the correct address. For some reason, the correct address wasn’t getting calculated. Indeed, if you examined the CPU state following the bus error, the stack pointer wasn’t pointing to the right value.
-
If I used the debugger to step through the code, there would be no error.
-
Any attempt to add a debugging statement prior to the bug would keep the CPU from getting a bus error at that instruction.
This includes both the
BREAK
instruction as well any instructions which might have been used to trigger the wishbone scope. -
In my frustration at one point while I was trying to figure out what the problem was, I decided to try Voodoo computing.
Defn: Voodoo computing, verb: The process of fixing what isn’t broken in an attempt to fix what is. It is similar to Cargo Cult programming.
So I restarted the ZipCPU from the beginning of the program. (This was a warm reboot, not a cold one from no power.) On this second program start, the code executed as designed and the program failed elsewhere.
Confused yet? I certainly was. Let me show you what the code looked like.
The Software in Question
The code that failed was the very first function call following a reboot.
The ZipCPU would start
at the reset address and it would load the
stack pointer
with a reference to the end of memory. While both of these are configurable
addresses within the
ZipCPU infrastructure, inspecting
the code confirmed that the assembler and
linker had done their jobs correctly.
After a couple of other instructions upon
CPU startup, the
CPU would then jump to this
first_subroutine
and quickly fail.
Here’s what those first few instructions looked like,
The LDI
instruction within the
ZipCPU
can load an arbitrary value into any register over the course of up to two
instructions. Specifically, it takes one instruction for anything up to
23-bits, and two instructions for anything longer.
The program then jumped to a subroutine. This involves copying the
program counter,
PC
, into the R0
general purpose register, and then jumping to the address
given–in this case the CPU
jumped to first_subroutine
. (I’ve since forgotten the subroutine’s actual
name.)
The first several instructions in the first_subroutine
were much like any
other subroutine. They created a stack frame by subtracting a value from the
stack pointer–adjusting
it to point earlier in memory. This creates an area in memory where the
program can hold local variables, commonly called
a stack frame. The program
then attempted to save some registers to this new
stack frame.
In assembly, this looked like,
If you’ve never seen
ZipCPU
assembly before, it reads from left to right, with the destination of the
instruction on the right. Hence SUB 16,SP
subtracts sixteen from the
stack pointer
and places the result back into the
stack pointer.
Likewise SW R0,(SP)
stores the value of R0
into the address given by the
stack pointer,
and SW R1,4(SP)
does roughly the same thing, with the
exception that R1
is saved into the location four bytes later. Of these
three instructions, only the SUB
tract instruction modifies the
stack pointer.
If I pulled the code up in a
debugger following
this bus error,
the stack pointer SP
would have the value given by the end_of_memory
value it was initially
loaded with. It was as though the
stack frame
was never created, despite the SUB 16,SP
instruction.
When the
ZipCPU then tried to write into this
stack frame,
it would write past the end of memory.
Depending on whether end_of_memory
pointed to the last address
in memory (where I commonly set it), or just past the end of memory would
then determine whether storing R0
to the location in the
stack pointer
caused the bus error
or whether it was storing R1
to the location 4 bytes later
that caused the bus error.
Either way, the code caused a
bus error
and none of it made any sense to me.
Before reading any further, think about the description above, and then ask yourself: does this match any bug pattern you’ve ever seen before? The program fails the first time it is tried, but not the second–but then it would fail on the third time. Any attempt to add debugging code kept this error from happening.
Oh, and let me give you a hint: I was running my own code, on my own ZipCPU, within an FPGA.
Any ideas?
What was it?
In many ways, I was at a loss to figure out the problem until I started up the program in a Verilator based simulation. (Yes, I know, I was running code that I hadn’t simulated first–shame on me.) Once I saw the bug take place within the simulator, I knew I could find it quickly.
As it turned out, the problem lied in the prefetch and instruction cache
module. This
module was declaring that an
instruction was valid
one clock cycle before the cache logic had finished writing that instruction
into the cache. In this case, there was often a NOOP
in the cache
instead of the SUB 16,SP
instruction. Hence, when the
CPU was erroneously given the
NOOP
instruction, the stack
pointer,
SP
, remained where it was–pointing to the end of memory. The subsequent
SW
(store word) instruction then wrote past the end of memory, which then
triggered the bus error.
I’ve tried to draw this in Fig 4 below.
The figure shows a single
cache line. The
prefetch/i-cache
module was filling that cache line with valid instructions. It had filled
all but the last instruction, when it then declared the instruction was
valid. Then, as it filled that last instruction with SUB 16,SP
, the
CPU
read the NOOP
instruction that had been there one clock earlier, hence
leading to the bug.
Had the subroutine’s first instruction landed anywhere else in the cache line, being one clock early wouldn’t have mattered–since those locations would’ve already been filled by the cache logic at that point. This particular location in the cache line made this appear to be one of those “phantom” bugs–bugs that manifest at some times but not at others. It also made it harder to discover–as discussed above.
So let’s go back through those symptoms and explain what happened, shall we?
-
We just explained why the bus error took place–the stack pointer never got updated, and so the program tried to write off the end of memory.
-
Although the
SUB 16,SP
instruction was present in the code, the prefetch returned a no operation instruction (NOOP
) to the CPU, so this subtract was never executed–leaving the stack pointer pointing to the end of memory. -
The debugger never had a problem because it it operated at a much slower speed than the CPU was capable of operating. This gave the CPU in single stepping mode enough time to load the cache before trying to read from the last address in the cache line.
-
If I placed another instruction anywhere before the subtract in the program’s instruction stream, then the
SUB 16,SP
instruction would no longer be the last element in a cache line, and there wouldn’t be any timing error. -
If I restarted the computer after this failure, the
SUB 16,SP
instruction would already be in the cache. Hence, when the prefetch declared the instruction to be valid one clock too early, it would still read what was in the cache from the last time, i.e. theSUB 16,SP
instruction.That the CPU then failed somewhere else should come as no surprise, when there is a bug of this type within its code.
This bug was so strange, and so different, that I though it would be fun to share here. I hope you enjoyed it.
Lessons Learned
So what are some of the lessons I learned from this bug?
-
I learned the value of the simulator. Traces taken using the wishbone scope just didn’t have enough information within them to permit me to recognize the bug quickly.
-
I (re)learned the fact that not all test benches will catch all bugs.. This bug was not caught by my prefetch testbench.
-
I learned to place several cache testing pieces of code into my CPU testing program. As a result, were this bug ever to show again, I’d detect it early on.
-
Even after all of this, I never quite trusted my prefetch and cache code until some time later when I formally proved that it worked.
Perhaps you’d like to see that formal proof? I’d love to share it! However, that will have to wait for another day and another post.
Be thou diligent to know the state of thy flocks, and look well to thy herds. (Prov 27:23)