Chasing resets

A true story.

Some years ago, given a customer’s honest need and request, I proposed a change to a client’s ASIC IP. Specifically, I wanted to add CRC checking, based upon a CRC kept in an out-of-band memory region, to verify that memory regions could be read error free. I said the change shouldn’t take more than about two weeks, and I’d clean up some other problems I was aware of in the meantime. This change solved an urgent problem, so he agreed to it.

By the time I was done, my 80 hr proposal had turned into 270+ hrs of work.

Build it well

I’d like to start my discussion of what went wrong with a list of good practices to follow.

Fig 1. Basic test bench components

As background, a general test bench follows the format shown in Fig. 1. The “test bench” itself is composed of a series of scripts. These scripts then interact with a common test bench “library”, which then makes requests of an AXI bus via a “bus functional model”. This project was designed to make minor changes to the device under test.

With that vocabulary under our belt, here are some of the good practices I would expect to find in a well built design.

Avoid magic numbers.

Yes, I harp on magic numbers a lot. There’s a reason for it. While it wasn’t hard at all to make the requested changes, I had to come back later and spend more than two weeks chasing down magic numbers buried in the test bench.

Specifically, I wanted to add a hardware capability to calculate and store a CRC in an out of band area on a storage device, and then to check those values again when reading the data back. CRCs can be calculated and checked quickly and efficiently in hardware–especially if the data is already moving. Unfortunately, the test bench had hard coded locations where everything was supposed to land in the hardware, and as a result all of these locations needed updating in order to add room for the CRC.

I spent quite a bit of time chasing down all of these magic numbers.

This applies to register address names as well–but we’ll come back to these in a moment.

The “Rule of three”: If you have to write the same thing three times, refactor it.

If the magic numbers were confined to one or two places, that would be one thing. Unfortunately, they were found throughout the test library copied from place to place to place. Every one of those copies then needed personal attention to double check, in order to answer the question of whether or not the “copied” number was truly a copied number that could be modified or removed.

Name your register addresses. It makes moving them easier.

Or, in this case, four versions of this IP earlier someone had removed a control register from the IP. The address was then reallocated for another purpose. No one noticed the test scripts were still accessing the old register until I came along and tried to assign names to all of the registers within the IP. I then asked, where is the XYZ register? It’s not at this address …
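As a minimal sketch of what naming registers might look like: every name, address, and width below is a hypothetical stand-in, since the real register map isn't part of this story.

```verilog
// Hypothetical register map.  None of these names or addresses come
// from the IP in this story; they only illustrate the practice.
module regmap_sketch;
	localparam [7:0]	ADDR_CONTROL = 8'h00,
				ADDR_STATUS  = 8'h04,
				ADDR_IRQ     = 8'h08,
				ADDR_CRC     = 8'h0c;

	// Stand-in for a bus functional model's write call
	task automatic write_reg(input [7:0] addr, input [31:0] data);
		$display("WRITE 0x%02x <= 0x%08x", addr, data);
	endtask

	initial begin
		// Scripts reference the name only.  If ADDR_CRC moves, every
		// user follows.  If it is removed, every user fails to
		// compile instead of quietly writing to whatever register
		// now lives at the old address.
		write_reg(ADDR_CRC, 32'h0);
		$finish;
	end
endmodule
```

The compile failure is the point: a removed name gets noticed right away, whereas a reallocated raw address can go unnoticed for several versions.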

I hate coming across situations like this. “Fixing” such situations always risks making a change (which needs to be made) that then might break something else later. (Yes, that happens too …)

There’s a benefit to naming even one bit magic numbers.

Not to get sidetracked, but in another design there was a one-bit value to indicate data direction. Throughout the logic, you’d find expressions like: if (direction), or if (!direction). While you might think this was okay, the designer had written the design with the wrong sense.

I then came along and wanted to “fix” things.

Not knowing how deep the corruption ran, or whether or not I was getting the direction mapping right in the first place, I changed all of these expressions to if (direction == DIR_SOURCE) or if (direction == DIR_SINK). This way, if necessary, I could come back later and change DIR_SOURCE and DIR_SINK at one location (okay, one per file …) and then trust that everything would change consistently throughout the design.
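A minimal sketch of the idea follows. DIR_SOURCE and DIR_SINK are the names used above, but the bit encoding, the module, and its ports are assumptions made up for illustration.

```verilog
module dir_sketch (
		input	wire	i_clk, i_direction,
		output	reg	o_sending
	);
	// Assumed encoding.  If the sense turns out to be backwards, swap
	// these two values here and nowhere else in the file.
	localparam	[0:0]	DIR_SOURCE = 1'b1,
				DIR_SINK   = 1'b0;

	initial	o_sending = 1'b0;
	always @(posedge i_clk)
		// Reads as intent, rather than as a bare if (i_direction)
		o_sending <= (i_direction == DIR_SOURCE);
endmodule
```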

I got things “mostly” right on my first pass. The place where I struggled was in the test bench, where things were named backwards. Why? Because if the design was the source, the test bench needed to be the sink.

That reset delay.

This is really what I want to discuss today. How long should a design be held in reset before being released?

My personal answer? No longer than it needs to be. Xilinx asks for a 16 clock period AXI reset. Most designs don’t need this. Indeed, most digital designs can reset themselves in a single clock period, although some require two.

Some designs do very validly need a long reset. I’ve come across this often where an analog tracking circuit needs to start and lock before the digital logic should start working with the results of that circuit. This makes sense. I can understand it, and I’ve built this sort of thing before when the hardware requires it. SDRAMs often require long resets as well, on the order of 200us.

In the case of today’s example and lesson learned story, the test bench for the digital portion of the design was using a 1,000 clock reset. That is, the test bench held the design in reset for 1,000 clock cycles. Why? That’s a good question. Nothing in the IP required such a long reset. So, I changed it to 3 cycles. Three cycles was still overkill–one cycle should’ve been sufficient, but simulation time can be expensive. Why waste simulation time if you don’t need to?
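One way to keep a reset length from becoming yet another magic number is to make it a test bench parameter. This is a sketch only: the parameter name, the clock period, and the module itself are hypothetical.

```verilog
`timescale 1ns/1ps
module tb_reset_sketch;
	// Hypothetical parameter: clock cycles the bench holds reset active.
	// A test that truly needs a longer reset can override it.
	parameter	RESET_CLOCKS = 3;

	reg	clk, reset;

	initial	clk = 1'b0;
	always	#5 clk = !clk;

	initial begin
		reset = 1'b1;
		repeat (RESET_CLOCKS)
			@(posedge clk);
		reset <= 1'b0;

		@(posedge clk);
		$display("Reset released after %0d clocks", RESET_CLOCKS);
		$finish;
	end
endmodule
```

Anything in the bench that depends on the reset length can then refer to RESET_CLOCKS by name, rather than silently assuming a particular count.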

After changing to a 3 cycle reset, the design worked fine and passed its test cases. I turned my work in, and counted the project done. All my work had been completed in (roughly) the 80 hours I had projected. Nice.

(Okay, my notes say my initial turn in took closer to 120hrs, but I’m going to tell the story and pretend my cost estimate was 80hrs. I can eat a 40hr overrun on an 80hr contract if I have to–especially if it’s an overrun in what I had proposed to do.)

Constants should be constant. Parameters are there for that purpose.

If a design has a startup constant, something it depends upon, then that constant should be set on startup–before the first clock tick is over, and not later.

Some engineers like to specify fixed design parameters via input ports rather than parameters. While there are good reasons for doing this–especially in ASIC designs, those fixed constants should be set before the first clock cycle. If they are supposed to be equivalent to wires that are hardwired to either power or ground, then they should act like it.

Personally, I think this purpose is better served by parameters rather than hard wired constants, but I can understand a need to build an ASIC that can then be reconfigured in the field via hard switches. For example, consider how switches can be used to adjust the FPGA wires controlling the boot source. In other words, there is a time for configuring a design via input wires. Just … make those values constants from startup for simulation purposes.
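For simulation purposes, a sketch of a constant that really is constant might look like the following. The strap name and its meaning are assumptions for illustration.

```verilog
module tb_strap_sketch;
	// Hypothetical strap: should the design boot from memory on startup?
	parameter [0:0]	OPT_BOOT_ON_STARTUP = 1'b1;

	wire	boot_strap;

	// A continuous assignment from a parameter holds its value from
	// time zero, as a pin tied to power or ground would.  No task has
	// to run, and no clocks have to pass, before the value is valid.
	assign	boot_strap = OPT_BOOT_ON_STARTUP;

	// The design under test (not shown) would sample boot_strap
	// during reset, however short that reset might be.

	initial begin
		#1;
		$display("boot_strap = %b", boot_strap);
		$finish;
	end
endmodule
```

Because nothing has to execute before the value is valid, the strap can't depend on how long the reset happens to last.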

Calculated values should be calculated, not set in fixed macros.

This particular design depended upon a set of macros, and one test configuration required one set of macros whereas another test configuration might require another set of macros.

These macros contained all kinds of computed constants. For instance, if the design had 512 byte ECC blocks, then the block boundaries were things like bytes 0-511, 512-1023, 1024-1535, etc–all captured in macros used by the test bench, and all dependent on the device’s page size. Further constants captured things like where the ECC would be located in a page, or how many ECC bytes were used for the given ECC size–which was also a macro.

These constants got even worse when it came time to test the ECC. In this case, there were macros specifying where to place the errors. So, for example, the test bench for a 4bit ECC might generate one error in bytes 0-63, one in bytes 64-127, and macros existed defining these ranges all the way up to the (macro-defined) size of the page which could be 2kB, 4kB, 8kB, etc.

Sadly, the test script would only run a set of 30 test cases for one set of macros. The design then needed to be reconfigured if you wanted to run it with another set of macros. Specifically, every time you needed to change which ECC option you were testing, or which device model you wished to test against, then you needed to switch macro sets. In all, there were over 50 sets of macros, and each macro set contained between 40 and 150 macros the design required in order to operate. Worse, many of those macros were externally calculated. Running all tests required starting and restarting the test driver, one macro set at a time.

Here was the problem: What happens when a macro set configures the IP to run in one fashion, and you need to reconfigure your operations mid-sim-runtime to another macro set? More specifically, what happens when you need to boot with one ECC option (defined as a macro), and then switch to another? In this case, the macro set determined how memory was laid out, and the customer wanted to change the memory layout in the middle of a test run. (He then couldn’t figure out why this was a problem for us …)

Lesson learned? When some configuration points are dependent upon others, use functions and calculate them within the IP. That way, if you switch things around later–or even at runtime, those test-library functions can still capture all the necessary dependencies.
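As a rough sketch of the first lesson, here are the 512 byte ECC block boundaries from the example above, calculated rather than listed. The function names are hypothetical, and the page size is passed as an argument so that it could change in the middle of a simulation run.

```verilog
module layout_sketch;
	// From the example in the text: 512 byte ECC blocks
	localparam	ECC_BLOCK_BYTES = 512;

	// Hypothetical helpers.  The page size is an argument, not a macro.
	function automatic integer ecc_blocks_per_page(input integer page_bytes);
		ecc_blocks_per_page = page_bytes / ECC_BLOCK_BYTES;
	endfunction

	function automatic integer ecc_block_first(input integer blk);
		ecc_block_first = blk * ECC_BLOCK_BYTES;
	endfunction

	function automatic integer ecc_block_last(input integer blk);
		ecc_block_last = ecc_block_first(blk) + ECC_BLOCK_BYTES - 1;
	endfunction

	integer	k;
	initial begin
		// A 2kB page yields blocks 0-511, 512-1023, 1024-1535, 1536-2047
		for(k=0; k<ecc_blocks_per_page(2048); k=k+1)
			$display("Block %0d: bytes %0d-%0d", k,
				ecc_block_first(k), ecc_block_last(k));
		$finish;
	end
endmodule
```

Switching to a 4kB or 8kB page is then a matter of passing a different argument, not of restarting the test driver with a different macro set.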

Second lesson learned? IP should be configured via parameters, not macros, and those parameters should all be able to be scripted by the test driver. Perhaps you may recall how I discussed handling this in an article discussing an upgrade to the ZipCPU’s test infrastructure some time back.

If requirements are in flux, the IP can’t be delivered.

This should be a simple given, a basic no-brainer–it’s really basic engineering 101. If you don’t know what you want built, you shouldn’t hire someone to build it until you have solid requirements. If you want to change things mid-task, any rework that will be required is going to be charged against your bottom line.

In this case, the end customer of this IP discovered how I was intending to meet their requirement by adding a CRC. They then wanted things done in a different manner. Specifically, they wanted the CRCs stored somewhere else. Of course, this didn’t take place until after I’d already proposed a fixed price contract based upon 80 hours of work, and accomplished most of that work. Sure, I can support some changes–if the changes are minor. For example, I initially built a 32b CRC capability and they then wanted a 16b CRC capability. I figured that’d be a cheap change–since the design was (now) well parameterized, only two parameters needed to change to adjust. In this case, however, their simple desire to switch CRC sizes from 32b to 16b now doubled the time spent in verification–since we now needed to run the verification test suite twice–once for a 32b CRC and once again for the 16b CRC they wanted. Their other change request, moving the CRC storage elsewhere, was major enough that it couldn’t be done without starting the entire update over from scratch.

Change is normal. Customers don’t always know what they want. I get that. The problem here was that as long as requirements were in flux I wasn’t going to deliver any capability. Let’s agree on what we’re going to deliver first, then I’ll deliver that.

Then the customer started asking why it was taking so long to deliver the promised changes, when could we deliver the IP, and they had a hard RTL freeze deadline, and … Yes, this became quite contradictory: 1) They wanted me to make a change that would force me to start my work all over from scratch, but at the same time 2) wanted all of my changes delivered immediately to meet their hard deadline.

You can’t make this stuff up.

If a design can fail, then a simulation test case should exist that can trigger that failure.

This is especially true of ASIC designs, and a lesson I’m having to learn the hard way. In my case, I knew that I could properly calculate and detect CRC errors. I had formally proven that.

However, because I didn’t (initially) generate a simulation test to verify what would happen on a CRC failure, no one noticed how complicated the register handling for these CRC failures had become.

Test bench drivers should mirror software

At some point in time, someone’s going to need to build control software. They’ll start with the test bench driver. The closer that test bench driver looks to real software, the easier their task will be.

So what happened?

Okay, ready for the story?

Here’s what happened: I made my changes inside my promised two weeks. I merged and delivered the changes the customer had requested. Everything worked.

Life was good.

Fig 2. Everything fell apart when merging

My client then said, oops, we’re sorry, you made the changes to the wrong version of the IP. The end customer had asked us to make a simple change to allow the software to read a sector from non-volatile memory to boot from on startup. Here’s the correct version to change.

The changes appeared minor, so I merged my changes and re-submitted. This time, many of the tests failed.

What went wrong?

Fig 3. I now use watchdog timers in my test benches

The first problem was the reset. Remember how I removed that 1,000 clock reset, because it wasn’t needed? One of the test cases was waiting 100 clock cycles, and then calling a startup task which would then set the “constant” input values that were only sampled during reset. This value would determine whether the new bootloader capability would be run on startup or not. The test bench would then wait on the signal that the bootloader had completed its task. However, with a 3 cycle reset, the boot on startup constant was never set before the end of the reset period, so the bootloader never started and the test bench then hung waiting for the bootloader to complete. (Waiting on a non-existent boot loader wasn’t a part of the design I started with.)

It didn’t help that the test script (in file #1) called a task (in file #2), that set a value (in file #3), that was checked elsewhere (in file #4), that was … In other words, there was so much indirection on this reset between where it was set and its ultimate consequence that it took quite a bit of time to sort through. No, it didn’t help that I hadn’t written this IP, nor its test bench, nor its test scripts, nor its test libraries in the first place.
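A test bench watchdog is one guard against a hang like the one described above. The following is a sketch only: the timeout, the signal names, and the module are hypothetical, and the “done” signal is deliberately never set in order to mimic the hang.

```verilog
`timescale 1ns/1ps
module tb_watchdog_sketch;
	// Hypothetical limit, in clock cycles, before declaring a hang
	parameter	WATCHDOG_CLOCKS = 1000;

	reg	clk;
	reg	boot_done;	// Stand-in for the bootloader's "done" signal
	integer	watchdog;

	initial	clk = 1'b0;
	always	#5 clk = !clk;

	initial	boot_done = 1'b0;	// Never set: mimics the hang

	initial	watchdog = 0;
	always @(posedge clk)
	begin
		watchdog <= watchdog + 1;
		if (watchdog >= WATCHDOG_CLOCKS)
		begin
			$display("ERROR: watchdog timeout");
			$finish;
		end
	end

	initial begin
		wait(boot_done);	// Would otherwise block forever
		$display("Boot complete");
		$finish;
	end
endmodule
```

A fuller version would restart the counter whenever the bench makes progress, so that a long test isn't mistaken for a stuck one. Either way, a hung test becomes a failure with a message instead of a simulation that never ends.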

Unfortunately, that was only the first problem.

The second problem was due to an implied requirement that, if your test bench reads from memory on bootup, there must be an initial set of valid data in memory for it to boot from–especially if you are checking for valid CRCs and failing a test if any CRC failed. This requirement didn’t exist in either branch, but became an implied requirement once the boot up and CRC branches were merged together. We hadn’t foreseen that one coming either.

A third problem came from how fault detection was handled. In the case of a fault, an interrupt would be generated. The test bench would wait for that interrupt, read the interrupt register from the IP, and then handle each active interrupt as appropriate.

In order to properly handle a CRC failure, I needed to adjust how interrupts were handled in the test library. That’s fair. Let’s look at that logic for a moment.

Interrupts were handled in the test library within a Verilog task. The relevant portion of this task read something like:

	do begin
		wait(interrupt);
		read(interrupt_register);

		if (interrupt_register == 8'h01) begin
			// Handle interrupt #1
		end else if (interrupt_register == 8'h02) begin
			// Handle interrupt #2
		end else if (interrupt_register == 8'h03) begin
			// Handle interrupts #1 and #2
		end // ...
	end while(task_not_done);

This was a hidden violation of the rule of three, since you’d find the same interrupt handler for interrupt #1 following a check for the interrupt register equalling 8'h01, 8'h03, 8'h05, 8'h07, etc.

Worse, the interrupt handlers didn’t just handle interrupts. They would also issue commands, reset the interrupt register, use delays, etc., so that handling interrupt #1 wasn’t the same between a reading of 8'h01 and 8'h05.

My solution was to spend about two days refactoring this, so that every interrupt would be given its own independent handler properly. The result looked something like the logic below.

	do begin
		wait(interrupt);
		read(interrupt_register);

		if (interrupt_register[0]) begin
			// Handle interrupt #1
		end

		if (interrupt_register[1]) begin
			// Handle interrupt #2
		end

		if (interrupt_register[2]) begin
			// Handle interrupt #3
		end // ...

		clear_interrupts; // and adjust the mask if necessary
	end while(task_not_done);

Among other things, I removed all of the register accesses from the interrupt “handling” routines, capturing their needs instead in some registers so the accesses could all happen at the end. As a result, nothing took simulation time during these handlers and things truly could be merged properly.
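To illustrate the idea of deferring the register accesses, here is a minimal sketch. The interrupt bit assignments, the task names, and the bookkeeping variables are all hypothetical, and a $display stands in for the bus functional model.

```verilog
module irq_sketch;
	// Hypothetical bit assignments, for illustration only
	localparam	IRQ_DONE = 0, IRQ_CRC_FAIL = 1;

	reg	[7:0]	clear_mask;
	integer		crc_failures;

	// Stand-in for a bus functional model's write.  In a real bench,
	// this would be the only step that consumes simulation time.
	task automatic write_irq_clear(input [7:0] mask);
		$display("Clearing interrupts: 0x%02x", mask);
	endtask

	task automatic handle_interrupts(input [7:0] irq);
	begin
		clear_mask = 8'h00;

		// Each handler only records what it needs
		if (irq[IRQ_DONE])
			clear_mask[IRQ_DONE] = 1'b1;

		if (irq[IRQ_CRC_FAIL])
		begin
			crc_failures = crc_failures + 1;
			clear_mask[IRQ_CRC_FAIL] = 1'b1;
		end

		// One bus access at the end, covering every active interrupt
		if (clear_mask != 8'h00)
			write_irq_clear(clear_mask);
	end endtask

	initial begin
		crc_failures = 0;
		handle_interrupts(8'h03);	// Both interrupts at once
		$display("CRC failures seen: %0d", crc_failures);
		$finish;
	end
endmodule
```

Since no handler touches the bus, the order in which the interrupt bits are checked no longer changes what any one handler does.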

I was proud of this update. The portion of the test library handling interrupts now “made sense”.

So, I sent the design off to the test team again only to have it come back to me again a couple days later. It had failed another test case. Where? In a second copy of the same broken interrupt handler that I had just refactored.

While I might argue that the rule of three should’ve applied to this second copy, you could also argue that it didn’t simply because it was a second copy of the same interrupt handler and not a third.

I could go on.

As I mentioned in the beginning, a basic 80 hour task became a 270+ hour task. Further, the task went from being on time to late very suddenly. Yes, this was how I spent my Thanksgiving weekend that year.

Conclusion

A good design plus test bench should be easy to adjust and modify.

Building a poor design, a poor test bench, or (worse) both constitutes taking out a loan from your future self. This is often called “technical debt.” If this is a prototype you are willing to throw away later, then perhaps this is okay. If not, then you will end up paying that loan back later, with interest, at a time you are not expecting to pay it. It will cost you more than you want to pay, at a time when you aren’t expecting a delay.

What about formal methods? Certainly formal methods might have helped, no?

I suppose so. Indeed, all of my updates were formally verified. Better yet, everything that was formally verified worked right the first time. What about the stuff that failed? None of it had ever seen a formal tool. Test bench scripts, libraries, and device models, for example, tend not to be formally verified. Further, why would you formally verify a “working” design that you were handed? Unless, of course, it was never truly “working” in the first place.

Remember, well verified, well tested RTL designs are gold in this business. Build them well, and you can sell or re-use them for years to come.