Debugging the Hard Stuff

There are a couple of really hard problems in digital design. These include debugging FFTs, cryptographic algorithms, and Error Correction Coding (ECC). Debugging a good compression algorithm is also a solid runner up, but it’s not quite as difficult as the others.

What makes these problems hard is the debugging challenge: put simply, given a failing design, pinpoint the line (or lines) of failing logic. This debugging challenge is hard because these particular problems don’t break down easily into visually reducible sub-problems. That is, it can be a real challenge when working with problems like these is to first identify the problem in the trace, and then to be able to walk backwards from there to the bug.

Take, for example, an FFT. Some time ago, a user came to me telling me my FFT was broken. Nonsense I said, it works fine for me. So he sent me a VCD trace, and sure enough, the IFFT of an FFT wasn’t producing the input again. It wasn’t even close. The challenge here was that this user had introduced random data into the system. Because it was random, it was hard to tell at every stage what the right answer should’ve been and therefore where in the algorithm did the design break.

How then do you debug these complex algorithms?

To debug: Build new tools when necessary

Remember, the goal is to find the one (or more) lines of HDL that are broken.

Recently, I was asked to do some ECC work. It started out simple: here’s a working ECC design, now generate a piece of software that will replicate it. Okay, there’s more to the backstory. The ECC was part of a non-volatile memory controller, and the customer wanted to pre-load the device at the factory with data and valid ECC parity bits. Not a problem, right? Most, but not all, ECC is just straight linear algebra over Galois Field 2 (GF2), so this should be basic, no?

Fig 1. ECC wraps the CRC, which then wraps the user data

The problems got worse from there. Gee, Dr. Dan, since you’re an ECC expert (Wait, what? I am?), why not maintain some logic for me? Here’s your first task: a customer wants 32-bit error correction. Our standard 16-bit error correction isn’t good enough for this customer. Then, once that task was done, the next task was for a customer who wanted even more error correction than just 32-bits per packet. Then there was the customer who wanted not just ECC, but an ECC protected CRC so that he could tell if the ECC ever failed.

This is all well and good, but how do you debug this stuff? Remember, the debugging challenge is to find the line (or lines) of HDL (VHDL or Verilog) containing the bug. Also remember, I didn’t write the original logic–I’m just the one getting paid to maintain it.

Here’s the method I came up with:

Start with a golden reference.

This may seem like the hardest part at first. I mean, if you are building an algorithm for the first time, how will you know that your golden reference works properly in the first place?

It helps to start with the references from others. Numerical Recipes in C publishes the C source of an FFT algorithm that I’ve since used often. There’s even a Numerical Recipes in C++, which I would expect is the same recipes but with an updated coding standard. Then there’s the “Advanced Encryption Standard” (AES), which includes an appendix showing what the outputs should look like at each stage of the encryption standard given a known key and a known input. There’s also at least one AES algorithm on OpenCores that you might use. In general, references are available for your work with digital design, you might just need to spend some time looking for them.
Add debug $display statements to the design, and make them both human readable and computer friendly.
Use those debug statements to feed the golden reference.

This is the key feature I’ll be discussing in a moment.
Now look for the first difference between the reference and any broken trace.

Remember, the golden reference works. Therefore, any discrepancy or other difference between the golden reference and your design will be something you want to dig into.

Let’s walk through how this works, by discussing the last ECC problem I worked on: adding a CRC to a memory controller, prior to the ECC parity generator and then checking it at the back end.

What made this CRC task challenging was that the ECC parity bits were packed together–even across byte boundaries. The CRC bits, therefore, needed to be packed together with the ECC parity bits. Not only that, both needed to be gathered together and assembled separate from the data itself. Together, this created data blocks looking somewhat like Fig. 2 below when transmitted. (Note that the Fig 2. is not drawn to scale by any means.)

Fig 2. Data backs consisted of data blocks, followed by CRC then parity

The original algorithm I was given handled this via a combined shift register and memory implementation that had grown to over 2k lines of Verilog. It read like a series of cascaded case statements: if this ECC option, this data block size, the data block number, then adjust these bits, etc.) Indeed, by the time I started working with this logic, it had gotten so repetitive that I was using a C++ program to write and update these thousands of lines of Verilog for me. Now, this may have been fine originally, for the simple ECC codes this controller was originally built to handle, but the ECC bit-vector had now grown to over 7k bits, each individually accessed, and none able to be optimized into memory properly. It got so bad that this part of the algorithm was starting to dominate the total area cost–and I was going to add CRCs to this mess? No, that wasn’t going to happen–not without a rewrite.

So, I rewrote the algorithm with a proper gearbox coupled with a proper FIFO. Moreover, in the hopes that I might be able to reuse the implementation, I built the gearboxes using standard interfaces: AXI stream, VALID, READY, LAST, etc, and I even parameterized the CRC size so the customer could change it at a later time if they wished to. All told, the new implementation took just over 1k Verilog lines of code, and at least half of that was either comments or formal properties. That meant there was a rough 4x reduction in the number of lines of code. Does this mean the design is simpler? In this case, definitely! It was both simpler to understand, easier to see what was going on within it, and therefore simpler to maintain. The giant tables had been replaced. The new design also used less area, so we’re good all around.

But, how to debug this?

To give you an idea, then general test setup I was given looked something like Fig. 3 below.

Fig 3. Test bench setup

A test bench script would drive the entire test. This script would generate commands to write pseudorandom data to the memory device. Then, once all the data had been written, the script would command an error generator to insert errors into the read path, and then generate commands to read from the memory device. The results would be compared against the data written to the device. If (and only if) the number of errors inserted was beyond the ECC’s correction ability, then the CRC failure flag was to be set. In all other cases, the data returned from the device was required to match the data sent to it.

Now consider all the steps involved in this process, and ask yourself where and how would you debug all of this?

The (obvious) answer should be: starting with unit tests. Each individual unit should be tested and verified separately, and known to work before being integrated into the larger whole. For unit tests, I turn to formal methods. One of the things I like about formal methods is that, any time you get a property (i.e. an assertion) failure, that failure will lead you directly to the failed assertion, which is usually nearby the logic it defends, and so formal methods will typically find bugs in about 5-10 steps. In my case here, the gearboxes required 40 steps. Why so many? Because of a technical problem. Formal methods don’t handle division very well, and a full induction description would require that the formal methods implement a divide. As a result, I didn’t get induction working. Still, 40 time steps was enough to give me strong confidence the gearboxes worked.

My next step was to use a bench test with Verilator. You can see this basic setup in Fig. 4 below.

Fig 4. Verilog unit testing

In this case, the unit under test was the ECC block separated from the rest of the design. In particular, this unit testing approach allowed me to separate my test from the AXI bus interface, the device interface, and the device model. Data came into the test design via the AXI stream protocol, and then the same protocol was used to take data back out once encoded or decoded. Still, as you can see from the figure, the ECC block under test wasn’t quite the working design given to me, but rather my modified version of it, now containing CRC generation and checking.

Fig 5. Test vector choice

One key to this type of testing was the test vectors chosen. For example, when building an FFT, you’ll want to use both impulses and sine waves. Since this is ECC, i.e. linear math over GF2, I started with vectors of all zeros, and then vectors with a single bit set, and only then a small number of countable bits set. From there I moved on to all bits set and then random settings.

What if the design fails with random data? In that case, you just reduce the random data vector back to the relevant basis vectors, and try again.

Thankfully, it wasn’t too difficult to get this Verilator based test bench working. In general, the trace led me straight to any bugs–well, that and the fact that I knew that the ECC algorithm worked initially, so the only thing that could’ve broken would be associated with one of my changes. That meant that this Verilator test bench could then become my “golden reference” model.

Once things worked in this “golden” model, I could then integrate this updated design back into the memory controller it was originally a part of. Not only that, but remember this was a maintenance task. In other words, I needed to modify logic someone else had written years ago, logic that I wasn’t familiar with–a recipe for problems, no?

This brings us back to the integrated test bench design shown in Fig. 3. The good news is that we come into this integrated test design with something known to work.

Now, let me point out a problem with the type of integrated testing shown in Fig. 3: the test bench script won’t notice any data failures until late in the process. This is just the nature of the perspective of a black box. Data will first need to be written to a staging area, then pushed through the device path, transferred to the external memory device model, then transferred back, then copied out of the controller–and only then would any errors be noted and flagged–all in good black box testing fashion.

This wonderful black box testing fashion might be great for proving that a design does what it is supposed to, but it is horrible for telling you what fails when it fails. Why? Because the failure isn’t noticed until microseconds of simulation time later, and only after thousands of data transfers. That’s a lot of trace you have to back through manually to find any bugs. There’s another word for that: pain.

This is where the debugging statements came into the design. To show you what I mean, the following lines have been clipped from the design. First, as every word went into the ECC, and now the CRC+ECC algorithm, I dumped it to the simulation log.

	localparam [0:0]	OPT_ECC_DEBUG = 1'b1;

	if (OPT_ECC_DEBUG)
	begin : GEN_ECC_DEBUG
		integer ik;

		always @(posedge i_clk)
		if (i_reset_n && ecc_enabled
				&& tx_data_valid && tx_data_ready)
			$display("ECC-TX: %08x", ecc_data_tx);

We can call this test point #1.

Once an entire block of data had gone through the ECC encoder, it would then generate parity bits. I dumped these to the same log as well.

		always @(posedge i_clk)
		if (i_reset_n && ecc_enabled && tx_parity_valid
				&& tx_parity_ready)
		begin
			$write("ECC-TX: RES");
			for(ik=0; ik < PARITY_BITS; ik = ik+16)
				$write(" %04x", tx_parity_data[ik +: 16]);
			$display("");
		end

(This test point doesn’t get a number, because my diagram in Fig. 3 doesn’t really show this part of the path very well.)

I repeated the same process on receive. As data came into the ECC decoder, it was also dumped to the log.

		always @(posedge i_clk)
		if (i_reset_n && rx_data_valid && ecc_enabled)
		begin
			if (rx_data_last)
				$display("ECC-RX: %08x -- LAST at %t",
							rx_data, $time);
			else
				$display("ECC-RX: %08x", rx_data);
		end

We can call this test point #2.

Before moving on, I should point out a key feature of this test point that I don’t want to pass up, and that is $time. One of the challenges of working with both simulation logs and trace files together is synchronizing the two. That’s where $time comes in handy, and I’ve gotten to using it in this manner often. Because this test point outputs $time, I can now use the $time in this log to find the associated step in the trace and to then see the context associated with this $display() function call.

Finally, when the decoder made it’s error correction decisions, those were also dumped to the log.

		always @(posedge i_clk)
		if (i_reset_n && ecc_enabled
				&& ecc_data_valid && ecc_data_ready)
			$display("ECC-MASK: %08x", ecc_data_mask);
	end

We can call this test point #3.

For reference, you can see all of these test points in Fig. 6 below.

Fig 6. Test points

The first step to analyzing this log file was to search for all lines starting with “ECC-“. In the future, if I do this again, I’ll probably write to a specific ECC log file, but for now these messages were dumped together with the rest of the simulation log. The good news is that the rest of the simulation log provides the context for what’s going on. The bad news is that dumping 16kB of data for an ECC engine doesn’t make a pseudo human-readable log any more legible. Because of the mess this makes of the simulation log, I’ve gated all of this logic with the OPT_ECC_DEBUG parameter and only turn the parameter on when I need to.

Even before building any better capabilities, this helped: it allowed me to see the data going into the gearboxes, and then coming out again. I could then make sure that things were lined up properly.

It wasn’t enough.

One particular problem was that the TX and RX data were so far separated in time. That made it difficult to tell whether any problem lied between ECC components, such as in the transmitter to the device, the memory model, the test-bench error generator, or the receive logic on the other end. However, it wasn’t all that hard to generate a simple C++ program to process this pseudo human-readable data, to place the TX data next to the RX data and then notice if anything had changed. This one piece of software alone helped me catch nearly all of my bugs. (Remember, I started with a working ECC design–that means most bugs should be confined to my changes alone. Unfortunately, I changed a lot of stuff.)

Fig 7. Feeding the golden reference model

That software alone, however, wasn’t sufficient. It was close, but not quite there, and this is where today’s particular problem comes into play: because ECC is a challenge to debug, I needed to know more of what was going on inside the ECC algorithm. This is where the third debugging step comes into play: feeding the golden reference model. That is, I adjusted my log reading program so that it could generate a binary file that could then be ingested into my golden reference model, as shown in Fig. 7 above.

At one point, it was only by comparing the trace coming out of the golden reference model with the one from my larger integrated test environment that I found the bug. One of the bugs I found was even in the Verilator translation of my Verilog, and so I thank the Verilator team, and in this case Mr. Snyder in particular, for fixing it quickly.

Customers

I wish I could say this was the end of the story: the design worked, and should now be shipped. Sadly, it wasn’t. Three further problems then turned the simple update listed above into a nightmare for both my team and our customer.

First, as a simple background, let me set the stage: I offered to make this maintenance upgrade for a simple two weeks of work. The updates were simple enough, they should’ve only taken two weeks.

It was only once those two weeks were over that the challenges began.

Once I was ready to turn in the work, the customer started demanding other changes. (Changes that they had no intention of paying for …) Worse, they were asking for fundamental changes which would force me back to the drawing board for this design.

The laughable part of these demands? In the same breath they asked for a fundamental redesign, they also asked for the current design to be shipped in order to support their RTL freeze date in less than two more weeks. Then they were upset when we didn’t deliver according to their schedule, when they were the ones who changed the requirements mid-task. Seriously. You just can’t make this stuff up.
One problem I wasn’t expecting was that the memory device model implemented address regions not present in the actual memory device. Adding a four byte CRC required all of the addresses within to be updated to make room for these extra four bytes. How many places did these addresses need to be updated in? Let’s just say that I really dislike magic numbers, since their usage in this project by the prior developer made this part of the update all the more painful.

You can read some more of my thoughts on test bench design here.
The last problem was a repository merge nightmare. Sadly, I had made my changes to a design that wasn’t the “latest” design–the customer had submitted changes to the design that I was unaware of. I didn’t discover this until after I had turned my changes in. Then I discovered I was about two versions behind the official latest version. Bugs discovered here were … anything but what I was expecting.

For example, one of the test scripts depended upon a 1000 clock cycle reset. This script would wait 100 cycles, then set a “fixed” value that needed to be referenced when the design came out of reset. The test bench script then then waited for the signal that this (optional) startup process had taken place. Not knowing this, I had come along and tried to speed up the simulation by switching to a 3-cycle reset and … all kinds of regression hell broke lose when the two “working” designs needed to be merged. Was there any documentation discussing why a 1000 clock cycle reset was necessary, or why a key reset input wouldn’t get set until 100 clock cycles into the simulation? Well, one might hope.

The end result was that I spent another 2-3 months, beyond the 2-week update task, working on the test bench, rebuilding the model, fixing magic numbers, fixing the many places the same task was accomplished in error, and more.

My simple two week task? Sure, it took two weeks to do. It then took another two months arguing with the customer over what the requirements needed to be. Another month was then spent chasing down bugs associated with merging the repository, and then another two were spent dealing with further consequences of the original merge as they continued to ripple through the test model and test script library.

And the customer? The customer wanted to know why the design couldn’t be delivered the same afternoon that I submitted my changes to the official repository. Worse, they are now convinced that the problems associated with these updates are due to the unreliability of the IP they purchased, rather than rippling consequences of the changes they made working their way through the design.

Conclusions

When faced with a really hard HDL problem, consider using a golden reference model and comparing your design against that reference. Any differences between the two should lead you directly to any bugs.

As for difficult customers? Be honest. Smile, and do your best. There will be other customers.

Oh, and the end of this tale? I’m now working to port those same changes to another controller, which means I get to reuse both my golden reference model, as well as the software I used to reformat the data into something that could be ingested into it. Once I post this article, I’ll go back to looking for the differences between the two. The good Lord only knows what I’ll find at this point.