I’ve always commented that the way to get an interface working is to lock the engineers responsible for each side of the interface together in a room until it works. I like to say it in jest, but in many ways there’s a lot of truth to it.

One of the challenges of working with open source anything is debugging. To be successful, an open source engineer needs to commit time to supporting their design, no matter how it is used. For example, what happens when an engineer uses an open source design inappropriately and then declares that it doesn’t work? That can reflect poorly on the quality of the design, even if the design was and remains fully functional.

As an example, someone recently attempted to use my digital PLL. They commented that the PLL worked great, as long as they didn’t attempt any frequency tracking. Was the PLL broken? Not at all. Was the frequency tracking broken? No, not that either. In this case, the user wanted to track a 2kHz clock using a 250MHz sample frequency. The problem was twofold. First, they didn’t adjust the gain coefficient appropriately. As a result, the first time the PLL noticed that the two clocks weren’t aligned, it adjusted the frequency by so much that it could never come into alignment. Second, they didn’t give the design enough phase bits to track such a low frequency.
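The phase-bit problem is easiest to see with a little arithmetic. The sketch below assumes a generic numerically controlled oscillator (NCO) with an N-bit phase accumulator that advances by an integer step every sample clock. That is a simplification, not the internals of my PLL, and the helper names phase_step() and nco_frequency() are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Generic NCO model (an assumption for illustration): an N-bit phase
// accumulator advances by `step` every sample, so it wraps
// f_sample * step / 2^N times per second.

// Integer phase increment (rounded to nearest) that best approximates
// f_track when sampling at f_sample with a `bits`-wide accumulator.
std::int64_t phase_step(double f_track, double f_sample, int bits) {
    assert(bits > 0 && bits < 63);
    return std::llround(std::ldexp(f_track / f_sample, bits));
}

// Frequency the NCO actually produces for a given integer step.
double nco_frequency(std::int64_t step, double f_sample, int bits) {
    assert(bits > 0 && bits < 63);
    return std::ldexp(f_sample * static_cast<double>(step), -bits);
}
```

With a 250MHz sample clock, a 2kHz target needs an ideal step of about 0.52 with a 16-bit accumulator. That rounds to 1, which produces roughly 3.8kHz instead. With 8 bits the step rounds to 0 and the oscillator never moves at all. With 32 bits the step is 34360, landing within about 0.02Hz of 2kHz.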

Finding these bugs required a test case from the user sufficient to trigger the bug, and a couple of hours running simulations. Thankfully, I knew where to look since I’d worked with the design before, and knew it was sensitive to how you set the tracking coefficient–something anyone who had worked with PLLs before would know.

In another example, I watched a user complain that Xilinx’s FFT wasn’t working. However, when he presented his logic it wasn’t too hard to discover that his AXI stream logic was broken.

Today’s story, however, regards the ZipCPU.

Cheap Hardware for Compressing Weather Data

Early on in the ZipCPU’s development, I met a kind gentleman who was interested in compressing massive amounts of weather data. Let’s call him Pi, to allow him to remain anonymous. His goal was to accomplish this compression on the cheapest commodity hardware he could find. He was interested in the ZipCPU back then specifically because it promised to use very little logic, and he wanted a CPU to help him do his work.

We’ve since interacted with each other off and on for, well, I suppose it’s been about five years. He has been very supportive of my efforts, and has always volunteered to help me test and verify any new distribution I put together.

Let’s come back to Pi again in a moment. For now, let me share some of the lessons I’ve since learned about verification from working with the ZipCPU.

Lessons and Stories from ZipCPU verification

When I first built the ZipCPU, I needed some way of testing it, so I built a small assembly-based test script to test each instruction. While my goal was to test each instruction in isolation, nothing ever really works out that way. In reality, testing any one instruction meant testing a sequence of at least two instructions.

Well, let’s be honest, that first test was built in machine code. A small C++ program helped me generate this code, but the instructions were written in C++, not as an input file. I then converted the test to assembly, and built my own assembler to turn it into machine code. Eventually, the test was assembled using the GNU assembler from binutils, and then turned into a C program that I now run on every updated design.

Once I knew that every instruction worked, I then declared the CPU operational.

Over the next several years, I was surprised to find further bugs in my “operational” CPU. The list below is just a small subset of some of those bugs.

  1. The ZipCPU was originally a 32-bit byte machine: back in the day, it couldn’t handle 8-bit bytes. Neither could I compile the C library for the ZipCPU. So, it needed to change.

    When I first converted the ZipCPU to using 8-bit bytes, I came across an ugly printf bug. This was because the data structure used by the newlib stdio library is a packed structure, and my initial data tests only read and wrote 32-bit words, never bytes within a larger structure.

    The broken design failed to pass Hello World, so Hello World is now one of my standard tests.

  2. When it came time to give the ZipCPU an instruction cache, I built one I called pfcache. I built a test bench for it, and verified that it did everything right when running the test bench.

    I was then convinced the prefetch worked. Indeed, everything worked well: the bench test, the CPU test, etc. until I first placed this design into hardware. Once I placed things into hardware, the CPU broke and I was looking everywhere else but the instruction cache for the bug.

  3. I continued using the ZipCPU for some time after that, and indeed still use it today. However, once I learned formal verification, it eventually became time to formally verify the ZipCPU.

    At this point, again, it passed all my test benches. It ran many programs successfully. I had used it in hardware successfully for many programs. I “knew” it worked, and had a lot of confidence in it. I just wanted to formally verify it.

    Much to my surprise, there were many bugs in the CPU that none of my simulation test benches ever caught. Many of these depended on specific instruction sequences that I didn’t have the vision to anticipate, and which weren’t triggered by my C test program.

  4. When I first added GCC support, I ran up against a difficult problem: the ZipCPU only had instruction space to support eight conditions. GCC wanted support for many more conditions. How should the missing conditions be generated?

    Specifically, I had an unsigned less than comparison, but no greater than or equal unsigned comparison. For example, to tell if the unsigned value Rx was less than another unsigned value Ry, and then branch if it was, one might write:

     CMP Rx,Ry		# Is (unsigned)Rx < (unsigned)Ry?
     BC  target		# Branch to target if the carry bit is set

To handle greater than, I could reverse the comparison.

     CMP Ry,Rx		# Is (unsigned)Rx > (unsigned)Ry?
     BC  target		# Branch to target if the carry bit is set

But how should I check for less than or equal?

My approach was to subtract one from Rx within the comparison, reasoning that Rx <= Ry is the same as Rx-1 < Ry, so that the comparison became,

     CMP -1+Rx,Ry	# Is (unsigned)Rx-1 < (unsigned)Ry?

Again, this passed all my tests.

Look closely at this solution, though. What would happen if Rx were zero? Subtracting one from zero wraps around to the greatest possible unsigned integer, so the comparison would fail, even though zero is less than or equal to every unsigned value.

It wasn’t until some time after I had GCC support “working” that I came across this bug, and I certainly didn’t expect to find it in my GCC back end.

Eventually, I solved this problem by adjusting the instruction set to get rid of the greater than comparison and replace it with a no-carry check. The solution isn’t perfect, and sometimes it breaks down to the point where I need to issue two branch instructions to cover the desired condition, but that’s really a topic for another day.

My point here is simply that, when debugging one part of my design, I found I needed to look somewhere else entirely to trace down this bug.
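The off-by-one is easy to reproduce in a few lines of C++. The sketch below models the carry-flag comparison as plain unsigned arithmetic. It is not ZipCPU or GCC back-end code, and the function names are made up for illustration. It also shows the identity that makes a no-carry check useful: Rx <= Ry is exactly "not (Ry < Rx)", so swapping the operands and branching when the carry is clear never wraps.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned "less than": the condition a set carry flag reports after a compare.
bool carry_lt(std::uint32_t a, std::uint32_t b) { return a < b; }

// The original back-end trick: test (Rx - 1) < Ry to mean Rx <= Ry.
// The subtraction wraps when Rx == 0, turning 0 into 0xFFFFFFFF.
bool le_by_decrement(std::uint32_t rx, std::uint32_t ry) {
    return carry_lt(static_cast<std::uint32_t>(rx - 1u), ry);
}

// A wrap-free alternative: Rx <= Ry is "not (Ry < Rx)", i.e. swap the
// operands and branch when the carry flag is clear.
bool le_by_no_carry(std::uint32_t rx, std::uint32_t ry) {
    return !carry_lt(ry, rx);
}

// Compare both against C++'s own <= over small values and the extremes.
// Returns how many cases the decrement trick gets wrong.
int count_decrement_failures() {
    const std::uint32_t vals[] = {0u, 1u, 2u, 3u, 0x7FFFFFFFu, 0xFFFFFFFEu, 0xFFFFFFFFu};
    int failures = 0;
    for (std::uint32_t x : vals) {
        for (std::uint32_t y : vals) {
            assert(le_by_no_carry(x, y) == (x <= y));  // the swap is always right
            if (le_by_decrement(x, y) != (x <= y)) ++failures;
        }
    }
    return failures;
}
```

With the seven sample values used here, the decrement trick is wrong exactly seven times: once for every Ry when Rx is zero, and never otherwise. That is precisely the kind of corner a handful of directed tests can miss.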

  5. After using the ZipCPU for many projects, I ran into trouble in one that was using a GbE network controller. For some strange reason, the ZipCPU appeared to be randomly hanging. I struggled to figure out why. I mean, it worked in my test bench, no?

    When I finally traced the problem down, it was due to a race condition in the interrupt logic. If an interrupt happened between the two halves of a compressed instruction, the CPU would lock up.

    At the time, I didn’t have any good test scripts for triggering interrupts on the CPU. Unfortunately, I still don’t, although I now have more formal properties to catch bugs like this.

  6. At one time, the HALT instruction wasn’t working. Sure, it would issue, but the CPU would never actually halt.

    The problem was another instruction sequence issue, combined with how my Verilator C++ test script handled the HALT instruction. In this bug, a particular instruction sequence could keep the CPU from halting after a HALT instruction.

    Does the test bench check the HALT instruction? Well, yes, but … only once. (Fixing this is on my to-do list …)

  7. The LOCK bug: Sometimes it’s just that you haven’t thought through all of the complex interactions within your logic. For example, how should the CPU step through a user instruction sequence that attempts to perform an atomic access, and yet do this from supervisor mode?

    In the case of this bug, the CPU faithfully allowed the supervisor to step through each of the sub-instructions associated with a LOCK instruction sequence:

   LDI  atomic_value,Ra		# Get the address of a semaphore
   LOCK				# Acquire a bus lock
   LW     (Ra),Rd		# Load the semaphore
   SUB    1,Rd			# Attempt to decrement it by one
   SW.GE  Rd,(Ra)		# If the updated semaphore is >= 0, write it back
   # The bus lock is then released, after the store instruction is issued.

The problem with doing this, though, is that stepping through a LOCK sequence destroys the LOCK operation on the bus. The LOCK instruction and the three instructions following it must complete or fail together: you can’t step through them or interrupt them.

Lesson learned.
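Both the interrupt race and the LOCK bug come down to the same question: when is it safe to break the instruction stream? Here is a hypothetical C++ model of such a gate. BreakGate, retire(), and may_break() are illustrative names, not the ZipCPU’s RTL. It assumes, as in the example above, that LOCK covers the three instructions that follow it.

```cpp
#include <cassert>

// Hypothetical model: decides whether the supervisor may single-step, or an
// interrupt may be taken, before the next instruction begins.
struct BreakGate {
    int lock_remaining = 0;            // instructions still inside a LOCK window
    bool second_half_pending = false;  // a compressed instruction word is half done

    // Call once per retired instruction.
    void retire(bool is_lock, bool first_half_of_compressed) {
        if (is_lock) {
            lock_remaining = 3;  // assumption: LOCK covers the next three instructions
        } else if (lock_remaining > 0) {
            --lock_remaining;
        }
        second_half_pending = first_half_of_compressed;
        assert(lock_remaining >= 0 && lock_remaining <= 3);
    }

    // Breaking in is only safe between complete, unlocked instructions.
    bool may_break() const {
        return lock_remaining == 0 && !second_half_pending;
    }
};
```

Walking the semaphore sequence through this model, stepping is allowed after LDI, refused after LOCK, LW, and SUB, and allowed again once SW.GE retires. The same check refuses to break in between the two halves of a compressed instruction.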

My point in going through this list is simple: in each case, the ZipCPU passed all of its simulation test cases. In each case, I was convinced the ZipCPU worked before placing it into either a larger simulation environment or hardware itself. In each case, debugging then became harder because the bug had escaped bench testing.

Yes, I now have tests that will catch most or even all of these bugs should they ever occur again. Am I convinced that the ZipCPU is now free of all bugs? Convinced enough to use it. Beyond that, only time will tell.
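To make the first bug on that list concrete: a test that only reads and writes whole 32-bit words never exercises byte lanes, yet a packed structure like newlib’s stdio state packs byte-sized fields into shared words. The sketch below is a hypothetical C++ memory model. The class WordMemory and its methods are made up for illustration, and it assumes big-endian byte order within each word, as the ZipCPU uses. The point is the read-modify-write of a single byte lane, which a word-only test never touches.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical word-addressed memory with byte access, big-endian within
// each 32-bit word (byte 0 of a word is its most significant byte).
class WordMemory {
public:
    void write_byte(std::uint32_t byte_addr, std::uint8_t value) {
        std::uint32_t& word = at(byte_addr);
        const unsigned shift = lane_shift(byte_addr);
        // Clear only this byte lane, then insert the new byte.
        word = (word & ~(0xFFu << shift)) | (static_cast<std::uint32_t>(value) << shift);
    }
    std::uint8_t read_byte(std::uint32_t byte_addr) const {
        return static_cast<std::uint8_t>(words_[index(byte_addr)] >> lane_shift(byte_addr));
    }
    void write_word(std::uint32_t byte_addr, std::uint32_t value) { at(byte_addr) = value; }
    std::uint32_t read_word(std::uint32_t byte_addr) const { return words_[index(byte_addr)]; }

private:
    static constexpr std::size_t kWords = 16;  // tiny, for illustration only
    std::array<std::uint32_t, kWords> words_{};

    static std::size_t index(std::uint32_t byte_addr) {
        const std::size_t i = byte_addr >> 2;
        assert(i < kWords);
        return i;
    }
    static unsigned lane_shift(std::uint32_t byte_addr) {
        return (3u - (byte_addr & 3u)) * 8u;
    }
    std::uint32_t& at(std::uint32_t byte_addr) { return words_[index(byte_addr)]; }
};
```

Writing 0xAB to byte address 5 of a word holding 0x11223344 should leave 0x11AB3344 behind. A byte-lane bug shows up as a clobbered neighbor, which is exactly what a packed structure will expose.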

SDRAM Problem

I bring all this up to begin another story.

Let’s go back to the story of the kind gentleman I mentioned above, Pi. Pi wanted to build a ZipCPU design for some hardware he had purchased. I didn’t have a copy of his hardware, but sure, go ahead, copy one of my designs and place it onto your hardware. God bless, and have an adventure!

His hardware required an SDRAM controller. I suggested one of my own, but cautioned him: not all SDRAM chips and protocols are the same. The required timing can change from one chip to another. Memory size can change, etc.
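The arithmetic behind “the required timing can change” is simple, but easy to get wrong when moving a controller to a new chip or clock. Here is a minimal sketch, assuming the datasheet gives a minimum time (expressed here in picoseconds so the math stays in integers) and the controller counts whole clock cycles. timing_cycles() is a hypothetical helper, not part of my controller, and the values below are illustrative rather than taken from any particular datasheet.

```cpp
#include <cassert>
#include <cstdint>

// Convert a minimum datasheet time (in picoseconds) into whole controller
// clock cycles, rounding up so the requirement is always met.
// One cycle lasts 1e6 / clk_mhz picoseconds, so
// cycles = ceil(t_ps * clk_mhz / 1e6).
std::uint32_t timing_cycles(std::uint64_t t_ps, std::uint64_t clk_mhz) {
    assert(clk_mhz > 0);
    return static_cast<std::uint32_t>((t_ps * clk_mhz + 999999u) / 1000000u);
}
```

For example, an illustrative 18ns row-to-column delay fits in 2 cycles at 100MHz but needs 3 at 143MHz. The same controller parameters can therefore be right for one chip or clock and wrong for another.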

I’m not sure how he did it, but he did manage to get it to work.

Later on, I made some updates to the ZipCPU. These changes included bug fixes, so it was worth upgrading his design to the new ZipCPU. The problem was that, when he upgraded his design, it stopped working. Your CPU, he said, was the problem.

Well, if the ZipCPU has a bug in it, then I want to fix it.

That said, this left me with a bit of a dilemma: this is a kind, retired gentleman. He has no significant money to hire an engineer. My time fixing his bug would never be paid for, and I had demanding jobs on my plate at the time. On the other hand, a bug in the ZipCPU would reflect poorly on my work, and I try to keep my GitHub repositories working and debugged.

So, I invested a Saturday into debugging his problem.

Sure enough, it wasn’t a bug in the ZipCPU. Yes, the ZipCPU test case was no longer running on his hardware, but the problem wasn’t in the ZipCPU. His problem was due to a misconfiguration of the SDRAM controller he had copied, and then changed to match his chip. That was a copy and change done with little (if any) understanding of how the SDRAM worked in the first place.

This was voodoo engineering at its best:

Voodoo Engineering, Defn: To change what isn’t broken, in an attempt to fix what is.

To make matters worse, the SDRAM RTL had been modified one way, and the SDRAM simulation model no longer matched it.

So, as a kind teacher, I tried to point out that he had no business trying to run or debug his design on hardware if it didn’t work in simulation.

Indeed, I went further: I pointed out that he had a bug in the SDRAM portion of his design.

However, this was no longer a controller I felt responsible for. Yes, it was originally my controller, but Pi had since changed and significantly modified it. Sure, I could debug it for him, but who would then pay for my time? It wasn’t a bug in my SDRAM controller, nor in my C++ SDRAM model, nor in the ZipCPU. It was a bug in Pi’s changes.

Needless to say, Pi was quite frustrated. To my knowledge, he remains stuck in FPGA Hell to this day. Worse, he seems to have given up on RTL design, and he has certainly stopped trying to get the ZipCPU working on his board.

Was his problem that hard? Not really, but you do have to know the basics, including how to properly debug a simulation and trace a problem from the symptom (the ZipCPU CPU test not working) back to its cause (the misconfigured SDRAM). That’s a lot of design to trace through to find a bug.

Conclusions

What conclusions might we draw from these stories? Hardware is hard? Maybe, but that’s not really the conclusion I am going to draw today.

  1. Do not place a design into hardware if the design doesn’t first work in simulation.

    This should go without saying.

  2. If you change the RTL controller, you should expect to have to change the simulation model to match.

    If not, then was your simulation model really good enough in the first place?

    In Pi’s case, I’m not sure he remembered that he had changed the simulation model …

  3. While I’d like to say that debugging hardware is hard, debugging a simulation really isn’t any harder than debugging anything in software. In fact, simulations are (technically) software. Unlike hardware, you have every signal available to you to analyze when running a simulation!

    • Debug by printf works in simulation

    • When using VCD/trace files, you can get even more information about what’s going on within a design than gdb will ever give you!

  4. Getting a single module working is easy–especially when it is one you’ve written yourself.

    Getting 5-6 modules to work together, and to interact with external hardware? That’s harder. Not only do you need to know enough of how those 5-6 modules work, and how the external module is supposed to work, but you have to know those parts and pieces well enough that you can debug them. You have to know them well enough that you can find the one condition within module 3 (or whichever one it is) that was set improperly. It doesn’t help if those modules were written by someone else either–this just makes the task of an integration engineer that much more challenging.

    I’ve often found hardware debugging sessions to bounce around from place to place, as I try to chase a bug from where it manifests to its cause. The process is time consuming and painful. It’s also why those whose work involves jobs like this can demand big bucks. (At least I think they’re big …)

  5. If the design is complicated enough, and a different engineer has written each of the modules that need to be made to work together, then it may be time to force all of those engineers into the same room to get the design to work.

    In business, where you have the $$ or control to make this happen, this is generally the most successful approach to solving integration bugs.

  6. The bug you are looking for is rarely in the place you are looking.

    I’ve written about this often enough that it has become a recurring theme on this blog, and I’ve already linked to several examples of this above. Not only do I experience this problem within my own work, I also come across it when participating in online forums. I saw it again when I was working customer support for Yosys.

    Customer: Your design doesn’t work!

    Me: Well, okay, let’s do some joint debugging …

    (After a lot of work …)

    Me: No, actually, it’s your own design that’s at fault.

    Of course, to get to this point, you have to understand how the customer’s design works (or doesn’t) well enough to state with confidence that the problem is theirs.

    As a professional engineer, this interaction tends to be rather frustrating: who do I bill for this time? Do I bill the project the complaint was lodged against? That project worked. Realistically, the bill should be given to the customer, but that’s just not how the open source world works.

Let me know if you want to help Pi out. I’m sure he’d like some help getting his design working again.

One final gem

As one final gem: some of the most challenging problems I’ve had to deal with have involved debugging memory. The CPU might read a value from memory and do something inappropriate with it. When you then try to debug the CPU, it can be very difficult to tell where the problematic value got written to memory in the first place.

If you ever find yourself stuck dealing with this problem, try the following.

First, let’s assume for discussion purposes that your memory model logic looks something like:

	always @(posedge i_clk)
	if (i_write)
		mem[i_addr] <= i_data;

Here’s the trick: watch the value at this memory location, and print something out every time it changes. In Verilog, this check could easily look like:

	// Whatever ...: set BUGGY_ADDRESS to the address being corrupted (0 is a placeholder)
	localparam [ADDRESS_WIDTH-1:0]	BUGGY_ADDRESS = 0;
	wire	[31:0]	track_value;

	assign	track_value = mem[BUGGY_ADDRESS];

	always @(track_value)
		$display("mem[0x%08x] <= %08x at time %t",
			BUGGY_ADDRESS, track_value, $time);

Of course, this logic won’t synthesize, so you’ll want to remove it as soon as you are done debugging, but it should be enough to get you to the next step.

The next step is to look at the output of your simulation to find where in the trace the wrong value got written to memory. Now go look up that time in the trace, and you’ll be able to continue your work backwards through the logic until you can find the source of your bug.

Oh, and yes, you can use this basic technique with a Verilator C++ model as well; the code for it will just look a bit different. Indeed, this technique would’ve sent Pi directly to his problem. Perhaps he’ll even read this article and manage to find his bug, since he is an avid reader of this blog.
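As a rough illustration, here is what the same watchpoint might look like when the memory being corrupted is modeled in C++, as with a C++ SDRAM model, rather than inside the Verilated RTL. The WatchedMemory class and its write()/read() methods are hypothetical. In a real test bench, the time argument would come from whatever tick counter drives your clock.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical C++ memory model with a single watchpoint: every write that
// changes the watched word is logged with the simulation time, mirroring
// the Verilog $display trick. Writes of an unchanged value are not logged.
class WatchedMemory {
public:
    WatchedMemory(std::size_t words, std::uint32_t watch_addr)
        : mem_(words, 0u), watch_addr_(watch_addr) {
        assert(watch_addr < words);
    }

    void write(std::uint64_t time, std::uint32_t addr, std::uint32_t data) {
        if (addr == watch_addr_ && mem_.at(addr) != data) {
            std::printf("mem[0x%08x] <= %08x at time %llu\n",
                        static_cast<unsigned>(addr), static_cast<unsigned>(data),
                        static_cast<unsigned long long>(time));
            ++changes_;
        }
        mem_.at(addr) = data;
    }

    std::uint32_t read(std::uint32_t addr) const { return mem_.at(addr); }
    int changes() const { return changes_; }

private:
    std::vector<std::uint32_t> mem_;
    std::uint32_t watch_addr_;
    int changes_ = 0;
};
```

Unlike the Verilog version, which notices a change no matter how it happens, this one only sees writes that go through write(). That is usually enough, since a C++ model typically has a single write path.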