Is formal verification enough, or is simulation required?
While at ORCONF this past weekend, several folks asked me to what extent I used formal methods. Since I found myself answering this question with some stories from my own experiences that I had yet to blog about, I thought I’d write them down and share them here.
Unlike many of my articles, this one will avoid deep dives into code. We’ll have plenty of opportunities to do that another time.
That said, I’d like to discuss four design examples: an instruction cache that wasn’t formally verified, chasing down a flash memory verification error, a data cache built with formal from the beginning, and a recent speed upgrade I worked on for an external PicoRV32 bus interface.
The Instruction Cache
The instruction cache story is one I’ve told before.
The ZipCPU’s instruction cache is actually one of several possible instruction fetch modules it can be configured with. We’ve already discussed the basic prefetch, and a more pipelined prefetch implementation. The cache is a drop-in replacement for either. Then, within the ZipCPU, the prefetch and the load-store unit (also a plug-in with several options) are arbitrated between to determine which gets access to the Wishbone bus at any given time, as shown in Fig. 2 on the left.
This is just basic CPU design and background.
Now, imagine if you had a program running on a CPU, and the program wasn’t working. Whenever you called the first function after a reset, it would fail with a bus error after writing beyond the end of memory.
A desk check of the failing software clearly indicated that the software shouldn’t cause a bus error.
When I added an instruction to trigger an internal scope, so that I could capture a trace, the failure vanished.
This made me wonder, did I have a Heisenbug?
If I then re-ran the broken program, it no longer failed at this same step but instead later on.
Restarting the ZipCPU a second time brought the bug back.
If I stepped through the code, there was no bug.
I think this definitely meets the criteria for a Heisenbug.
It’s also a bug you can’t let go of to come back to on another day, since it isn’t clear that you’d be able to reproduce it some other time. I now had to find it while it was hot.
Okay, I gave up. Maybe simulation would help?
In this case, the design failed in simulation as well as in hardware, with the very same symptoms. I was on my way to a solution!
You can read more about the story here.
The bottom line was that this bug took me way too long to find, and cost me way more project time than I wanted to devote to it.
By using formal methods, I have kept this instruction cache design unit from failing since. Indeed, I’m now convinced that if I had used formal methods from the get-go, I would’ve never had this bug in the first place.
In a moment, we’ll come back and test that assumption–since I’ve since built a data cache using formal methods from the very beginning and there’s a story to share there. First, though, I want to share the story of a design that got missed during my switch to formal methods.
Flash Memory Verification Error
One of my recent projects involved building an FFT accelerator that used the RISC-V PicoRV32 CPU. In this project, like so many others, I needed to write the PicoRV32 software to flash and then have an embedded CPU (i.e. the PicoRV32 in this case) run that program. The first step in this process is to read what’s currently in the flash, and then to compare it to what I want to be in the flash. Depending on the result of that comparison, the flash sector needs to be either erased (bits turn to ones) and then programmed (specific bits turned to zeros), or just programmed. Ideally, if the design already exists in the flash then nothing more needs to be done.
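The erase-or-program decision falls directly out of how flash works: programming can only clear bits (1 to 0), while erasing returns a whole sector to all ones. Here's a quick software sketch of that decision; the function name and interface are my own, not taken from the actual loader:

```python
def flash_update_plan(current: int, desired: int) -> str:
    """Decide what a flash sector needs: nothing, program only, or
    erase then program.  Programming can only turn ones into zeros;
    erasing turns the whole sector back into all ones."""
    if current == desired:
        return "nothing"
    # If every bit the desired image needs set is already set, the
    # sector can be programmed in place (clearing bits only).
    if (current & desired) == desired:
        return "program"
    # Otherwise some bit would have to go from 0 back to 1,
    # which only an erase can do.
    return "erase+program"
```

Applied sector by sector, this is what lets a second load of an identical image complete without touching the flash at all.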
In testing, however, when I tried to load the flash a second time with a particular program, the comparison would fail and the loader would say that the design needed to be loaded again, as shown in Fig. 3 on the left.
This was an indication that the first effort to program the flash had either failed for some reason, or there was a wild pointer error writing strange things to the flash. Either way, it wasn’t how things were supposed to work.
My first thought was that my updated flash controller had a bug in it, or that maybe the current flash memory I was working with had a unique feature that I wasn’t prepared for. Much to my surprise, when I tried to reproduce the bug in simulation, it failed in the exact same place with the exact same symptoms–even though I was using an emulated flash device.
So I turned on the “debug by printf” feature of the emulated flash device, and created a test: I would load the correct design into this emulated flash, and then attempt to load my program into the flash a second time. As before, the second load attempt detected a difference between the original image and the new image, indicating that what was in the flash memory didn’t match–when it was supposed to. However, the debugging statements indicated that the flash memory was never changed from its initial load.
Something else was wrong, and it wasn’t the flash controller.
I restarted the simulation, this time with tracing enabled. Many minutes later (it might’ve even been an hour or two …) I had a 22GB file I could work with.
I shouldn’t need to mention that 22GB is not a trace file size I like to work with. It takes forever to generate and write, and it bogs down my desktop the whole time. (I’ve often noticed my favorite music stuttering while a trace is being written.) Further, anything over about 200MB will cause GTKWave to complain. A trace this size also tends to lock up my local WiFi network anytime I try to debug on my laptop while the trace is kept on my desktop.
I suppose I should point out that I tried using the Verilator FST option to generate a compressed trace. While the result was a shorter trace, it took even longer to generate. That I even have this memory tells you that just running the simulation once wasn’t enough. It rarely is.
The other problem is that it’s really hard to discover where the bug is in a large trace.
Still, I managed to find the Wishbone request that read from the address that was causing the fault. (It was always the same address, and within this section of the design the address only ever incremented.) Much to my surprise, the simulation read the correct value from the flash.
Then why did my load software report that the flash didn’t match the software image, when in fact it actually did?
I needed to trace this result through the design–from where I had found it reading the correct value from the emulated flash device all the way to the PC host software that was receiving the wrong value, as shown in Fig. 4 on the right.
Not all of the trace points were easy to work through. The first trace point was a FIFO. I watched the correct data word go into the FIFO. I looked up the address it was written to, and then watched the correct word come out of the FIFO a significant time later. No errors yet.
The word then went through another processing stage with no change, just a delay. No error there.
Then it went into the compression algorithm.
Ouch. Did I really need to dig into the compression algorithm? Unlike the prior steps, it would not be easy to “see” the correct word work its way back. Compression could fundamentally change the word into something else.
Like any good engineer, I did what I could to avoid digging into the compression algorithm itself. I looked at the metadata. I watched the stb (valid) and ready (not stalled) signals. I quickly noticed data changing while the downstream interface was still stalled. Perhaps there was a bug in the handshaking?

As it turned out, this data change was “normal”–it had just been too long since I’d written and examined the code, and so I wasn’t remembering well what it was supposed to do.
Nevertheless, I stepped into and through all of the steps in the compression algorithm. I had to.
Before continuing, let me tell you a bit about this compression scheme.
First, it’s part of my debugging bus protocol. If you’ve read the introductory post in my series on building a debugging bus, you’ll know that I have two versions I use. One is a simplified version that we built together on the blog. I call that one the hexbus, since it is built around transmitting hexadecimal numbers. The other is my workhorse debugging bus implementation, outlined in Fig. 2 above. This workhorse design includes this compression scheme within it. The compressor is designed to compress 36-bit codewords, where the first 4 bits contain meta-information, such as whether the word contains an address update, write acknowledgment, read-return, etc., and the last 32 bits may contain encoded data. After the compression, the 36-bit words are broken into one or more 6-bit words followed by a synchronization/end-of-word character, and these are then mapped into printable, human-readable characters.
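The final mapping stage can be sketched in software. This is only an illustration: the 64-character alphabet below is hypothetical, not the actual table the debugging bus uses, and it shows only the uncompressed six-symbol case.

```python
# Hypothetical 64-character printable alphabet; the real design's
# mapping may well differ.
ALPHABET = ("0123456789"
            "abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "@%")

def encode_uncompressed(word36: int) -> str:
    """Break a full (uncompressed) 36-bit codeword into six 6-bit
    symbols and map each onto a printable character.  Compressed
    codewords would need fewer symbols."""
    assert 0 <= word36 < (1 << 36)
    return "".join(ALPHABET[(word36 >> shift) & 0x3F]
                   for shift in range(30, -6, -6))
```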
The compression itself is based upon one of three schemes. The first scheme encodes the last data value into a new codeword that can be sent using a single byte. The second scheme encodes any one of the next eight values into what will become a single byte again. The third scheme encodes any of the next 512 values into what will become a two byte sequence.
Sadly, this requires a compression table of 521 elements in length. Since FPGAs don’t offer block RAM lengths of 521 elements, I use a 1024 element table to record the last data words I’d sent.
How does the compression work?
Any time the host (PC) processor sets the address of a bus request, the new address acknowledgment going back to the CPU is used as a signal to reset the compression table to zero length. This synchronizes any external program to the compression algorithm, especially since multiple independent programs may want to interact with the FPGA design over time. (Only one will ever interact with the FPGA at any given time.)
This is a key point. Because I was reading from the flash over a long period of sequential addresses, the PC didn’t need to send new address/synchronization requests. This also allowed the table to fill up, like it was supposed to, but perhaps more than it normally did.
Then, once a 32-bit value is read from the internal Wishbone bus, it is (eventually) sent to the compression algorithm as a 36-bit data word.
This was where I was at, while following the logic through the trace.
The compressor immediately places this 36-bit data word onto the output.
This looked good–the 36-bit word I had been following was now on the output.
However, since the serial port interface tends to be slow, it may take many, many clocks for this output to be accepted. During this time, the compressor goes back through its past memory to see if any of the values it has sent in the past 521 (uncompressed) samples matches this current one.
On a match, the compression algorithm recodes the outgoing data word to indicate a compressed codeword, and the search stops.
This was the reason I had initially thought there was a fault in the compression algorithm–the output was changing even though the output was stalled. In this case, it was the expected behavior.
It’s amazing what you forget about your code between the time you write it and the time years later you have to go back and look at it again to understand what it’s doing when something goes wrong.
If no matching code word is found by the time the outgoing word is ready to be accepted, then the compression algorithm writes this word to the table and knocks the last word out of it. All indices then advance.
On a new word, the process starts over from the point where the 36-bit word is handed to the compression algorithm.
This is sort of the background of how this algorithm is supposed to work.
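The steps above can be modeled in a few lines of software. This is my own simplified sketch, with names and structure of my choosing; the real compressor is Verilog and performs its table search while the output sits stalled:

```python
TABLE_REACH = 521   # farthest back a match may legally be encoded

class CompressorModel:
    """Toy model of the codeword compressor's table search."""

    def __init__(self):
        self.history = []   # most recent uncompressed word first

    def reset(self):
        # A new address acknowledgment resets the table, re-synchronizing
        # any external program with the compression state.
        self.history = []

    def send(self, word):
        """Return ('raw', word) or ('compressed', distance)."""
        for distance, past in enumerate(self.history, start=1):
            if distance > TABLE_REACH:
                break   # matches beyond 521 words back must be dismissed
            if past == word:
                return ('compressed', distance)
        # No match: the word goes out uncompressed and into the table.
        # (The hardware uses a 1024-entry block RAM, since 521 isn't a
        # natural block RAM depth.)
        self.history.insert(0, word)
        del self.history[1024:]
        return ('raw', word)
```

The bug in the story sat exactly on the distance check: with that (broken) comparison removed, a match 522 words back was encoded as though it were still reachable.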
To find my bug, I walked through every step of this algorithm from within the trace file, checking the work at each step. I then got to the point where the algorithm found a match from the table at 522 words ago. (It was easy to find–since that’s where the module’s output codeword changed.) Since the compression table only reaches back 521 words, a match any further back than that should’ve been dismissed.
This was a serious bug. The fix, once found, was easy–but finding it took far too long.
So, why hadn’t I found it earlier?
Let’s look a little bit deeper, and see the real story.
First, understand that this algorithm was built for one of my very first digital designs–long before I started posting all of my designs on GitHub, and even before I was posting them on OpenCores. For that early design, I verified this compression algorithm as part of verifying a larger project in simulation. Once that larger project worked, I figured this algorithm worked as well.
After being used very successfully on that first project, this design was lifted from that project to be used on a next one. It was then lifted from that one to be used on another, and another, and so on. Today, I count its use in twelve separate designs. Indeed, over the years that I have used this module it has become well-honed and well trusted.
Not once during this time did the design ever accumulate any automated tests to verify that changes to the design maintained its functionality.
This was probably my key mistake.
Sometime later, after I had been using this compression design for some time, I noticed the design was creating a synthesis warning. For some reason, the synthesizer was complaining about comparing the found code-word index against the maximum table offset I had an encoding for. The warning said there were too many bits in the comparison. This didn’t make sense to me. It should be a 10-bit comparison, but … perhaps I wasn’t seeing the problem the synthesis tool was seeing. So I removed the comparison to make the warning go away.
Knowing I needed to test the design, I left it in my master debugging bus repository and forgot about it. My intent was to “verify it” when it was later incorporated into another project.
So, here I was, with a broken design and cursing my broken design process.
After way too long staring at this code, I discovered the real bug: I was originally comparing against 10'h521 rather than 10'd521. Look carefully. Do you see the difference?
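If the difference is hard to spot: 10'h521 is a hexadecimal literal, 10'd521 a decimal one. A couple of lines of plain arithmetic show why the synthesizer complained about too many bits in the comparison:

```python
hex_literal = 0x521   # what 10'h521 names: 1313
dec_literal = 521     # what 10'd521 names: the intended constant

assert hex_literal == 1313
assert hex_literal.bit_length() == 11   # too wide for a 10-bit compare
assert dec_literal.bit_length() == 10   # fits a 10-bit width exactly

# Truncated to 10 bits, the hex literal silently becomes a third,
# unrelated value:
assert hex_literal & 0x3FF == 289
```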
This bug could happen to anyone. Had I properly created a unit verification process for this core, I might have caught the bug long ago–or not. I’m not sure I would’ve been rigorous enough to build a test that would’ve loaded the FPGA with this particular sequence just to test index overflow. That’s something only the formal solver would be annoying enough to try.
There’s a lesson I learned from jumping out of airplanes that applies well here: Back in my days as a cadet in the US Air Force Academy, I had the opportunity to jump out of perfectly good airplanes. It was a lot of fun. While we all enjoyed arguing over whether this was a smart or stupid thing to do, no one argued about whether the reserve parachute was a smart or stupid thing to have. Instead, we were all taught that our reserve chute was our last chance at life–our last chance to live. We were to protect that reserve chute with everything we had.
In many ways, the debugging bus serves much the same purpose–although certainly not as dramatically. This, plus the scope and the simulator, are my last lines of defense against FPGA Hell. If any of my designs should’ve been formally verified, it would be these.
As an aside, I recently had the opportunity to formally verify the AXI version of my scope. Much to my surprise, I also found a subtle bug that had been hiding in the clock domain crossing logic of the reset signal in the Wishbone version I’d been using as well.
My point? Verify any debugging logic you use. Getting out of FPGA Hell depends upon it.
It’s designs like these, and hard-earned lessons like this one, that continue to convince me that I need to use formal methods. Why? Because I tend to spend the longest amount of time chasing bugs in prior designs that were never properly verified in the first place.
The Data Cache
The third story centers around building a data cache for the ZipCPU.
Indeed, it wasn’t long after building the ZipCPU that I decided I wanted to build a data cache. Even as far back as when I first presented the ZipCPU at ORCONF, I was using my spare time to scribble out how this data cache would work. I then scribbled that out and started over, again and again. (It didn’t help that my scribbles were being done in pen, or that I kept finding so many bugs in the scribbles ….) I just struggled to get a grasp on the whole algorithm and, as I’m sure you know, details matter.
That was in 2016.
It wasn’t until later, in 2018, when I was preparing to teach my first course in formal methods, that I actually managed to get the time and focus to build it from top to bottom. At that time, I was experimenting with the cover() statement, and getting excited that I could just cover() the data returned from the cache to get almost all the information I needed.
You can see an example of what this trace might have looked like in Fig. 7, below. This particular trace is drawn, however, from the current implementation, rather than the original one we are discussing.
Notice that in order to cover() o_valid, the formal tool needed to generate a trace that showed:

1. The CPU requesting to read data from the cache. You can see this in Fig. 7, where i_request goes high. (Inside the ZipCPU, the wire name used is i_pipe_stb–but we’re trying to make things readable here.)

2. That data being determined to be at a cachable address. You might notice that it takes two clocks before o_wb_cyc goes high: one clock is required to read from the table of cache tags, the second to compare the current request against the result.

3. The cache would’ve been empty at the time, since it was fresh from a restart. Hence, the requested address isn’t found in the cache.

4. The cache logic would then go out to the Wishbone bus and read a cache line into memory. (As shown above in Fig. 7.)

5. Once the cache line was in memory, the data cache would then return the value to the CPU. This again requires several clocks, since it costs at least one clock to write to the cache memory, another one to check that it’s in memory, and another one to select between several possible cache results, etc.
I could examine all of this logic from one simple
Even better, were either the
CPU’s or the
Wishbone bus protocol
ever violated, the
cover() request would fail and I’d get a trace showing
where the failure took place.
Not only that, the proof that I was using verified that the “right” value would always be returned by the cache.
I had similar cover() statements in the data cache, checking that words written to memory would go right through the cache to the bus (since I had chosen a “write-through” cache implementation), and that words read or written from non-cachable memory would also go straight to the bus.
With no properties violated, bus protocols maintained, and a series of cover() statements in hand, I was really excited to go and run Dhrystone to see how my performance might have improved.
Any guesses as to what happened?
Much to my shock and surprise, the CPU with its brand new data cache installed froze when I ran it in simulation. Right out of the gate. After staring at the screen in complete puzzlement, I restarted the simulation with tracing enabled, killed it, and then pulled the trace up into GTKWave.
It didn’t take much work to find when the CPU froze.
As it turns out, I had verified every part of the data cache’s operation except one: I never verified that the busy line would clear after returning its result to the CPU.
Ever since, I no longer cover() a core just returning the “right-answer”. Instead, I will always create a cover() statement showing the core returning to idle after the “right-answer” has been found. Even better, my current cover() checks will look into a minimum of two requests going through the core, so I can examine what sort of overall throughput I might expect.
When trying to answer the question of whether or not formal is enough on its own, this was a valuable lesson learned. Since then, however, I haven’t found any further bugs in this core–in spite of updating it several times. Indeed, looking back over the commit log, it looks like all of the subsequent commits dealt with tool issues rather than any more functional issues.
The New PicoRV32 Front End
The last story focuses on the PicoRV32 CPU. In this case, a customer had asked me to create a basic FPGA accelerator design, such as the one shown in Fig. 8 on the right. I chose to demonstrate this using an FFT accelerator. As built, the board was to receive data from a GbE network, and the RISC-V CPU within the design would then push that data through the on-chip accelerator.
Of course, in the end, nothing works as desired. The first problem I had was the DDR3 SDRAM. Since the goal of the project was to use entirely open source tools, logic, and software, I needed an open source DDR3 controller for an ECP5. Since this wasn’t initially available, the team chose to skip the DDR3 SDRAM. (There’s now a litedram controller that works with the ECP5.)
Sadly, this had consequences.
The first consequence was that the newlib C-library didn’t fit into my 64kB of block RAM. That meant that the CPU’s instructions would need to be placed into flash memory. As you may recall from our discussions about building a flash controller, a QSPI flash memory requires 8 serial port clocks (6 for address, and then 2 dummy clocks) just to start a transaction. Another 8 clocks are needed to read any 32-bit instruction word.
Making matters worse, the ECP5 chip the project ran on didn’t provide first-class access to the flash clock pin. I needed to go through a logic block to reach it, and that block didn’t offer access to the DDR I/O primitives my controller needed for high-speed access. In other words, I’d need to suffer a minimum of 32 clocks per instruction access, since each SPI clock would cost two system clocks. When all was said and done–interfaces written, bus delays added up–the cost ended up being closer to 38 clocks per instruction access.
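As a sanity check on those numbers: QSPI moves four bits per SPI clock, and the constants below come from the figures quoted above (assuming a 24-bit address, which matches the six address clocks mentioned).

```python
ADDRESS_CLOCKS = 24 // 4   # 6 SPI clocks to send a 24-bit address
DUMMY_CLOCKS = 2           # turnaround before data comes back
WORD_CLOCKS = 32 // 4      # 8 SPI clocks per 32-bit instruction word
SYS_PER_SPI = 2            # no DDR primitive: 2 system clocks per SPI clock

start_cost = ADDRESS_CLOCKS + DUMMY_CLOCKS            # 8 SPI clocks
per_fetch = (start_cost + WORD_CLOCKS) * SYS_PER_SPI  # system clocks

assert per_fetch == 32  # the observed 38 adds interface and bus delays
```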
The next piece of this problem centered on the PicoRV32’s interfaces. Unlike the instruction pre-fetch interface I used for the ZipCPU, the PicoRV32’s bus interface involves setting a valid line and an address line. The environment is then expected to go look up whatever is at that address and return it to the CPU. Further, the only difference between reading and writing is that the PicoRV32 sets the write select bits when writing–so the wrapper needs to check these as well. Finally, the PicoRV32 also includes a flag to indicate an instruction read request rather than a memory request.
While I had a block RAM controller and a flash controller sitting on a shelf, I didn’t have a PicoRV32 bus control wrapper that worked with AutoFPGA’s version of Wishbone. Building this was one of the project deliverables.
You can see my first draft of this wrapper here. On any new bus request, the design sends the request to the AutoFPGA generated bus interconnect, and then returns the result to the CPU.
The problem with this design is that every instruction fetch is a separate and independent bus transaction. As a result, when the flash controller sees that there are no further fetches coming, it ends the burst read. Subsequent reads will require sending the address and mode bits again. That’s 12 clocks for the address bits, and 4 more clocks for the mode bits, for a total of 16 extra clocks per instruction that wouldn’t be needed were multiple requests made one after another.
Therefore, once I had completed the requirements of the contract, I took a moment to see if I might build a better PicoRV32 interface.
The new design followed the techniques I discussed here, with just a couple of differences.
First, I required an extra clock to know if the requested address was the result of a branch instruction. In my own pre-fetch interface design the CPU exports a wire indicating any time the program counter has changed as a result of a branch. Not so with the PicoRV32. I needed to detect this in my bus wrapper.
I also built a deeper lookup. Instead of looking up only two instruction words, and then looking up the next word anytime one of those two was accepted, I chose to look up four words regardless of how fast the PicoRV32 consumed them.
This was a draft design after all, to see if this might help, not something final that was going to be part of this contract.
Finally, the PicoRV32 muxes both instruction and data requests together across the same external CPU interface. My updated prefetch routine needed to handle both types of requests. Only the instruction stream, where I could predict the next instruction address with some certainty, would be partially cached.
As with any formal proof, I started with the bus properties. I then added whatever further properties I felt necessary for induction, and also created properties to describe the CPU side of the interface.
When I was confident the design worked using formal methods, I switched to simulation and immediately my “working” programs (i.e. Hello World) stopped working.
In this case, the software called for a data load during one of my sets of four instruction fetches. Instead of waiting for the series of four instruction fetches to finish and then initiating a request for the data, the core returned the value from the ongoing instruction fetch to the data port.
Why? Because I returned a data value on any bus acknowledgment … a simple oversight.
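The oversight is easy to sketch in software (a toy model with names of my own choosing, not the actual Verilog): acknowledgments have to be routed according to which request produced them, not handed to whichever port happens to be waiting.

```python
from collections import deque

class AckRouter:
    """Toy model: tag each outstanding bus request so its acknowledgment
    returns to the correct port."""

    def __init__(self):
        self.outstanding = deque()   # 'insn' or 'data', in issue order

    def issue(self, kind):
        assert kind in ('insn', 'data')
        self.outstanding.append(kind)

    def acknowledge(self, value):
        """Route one bus acknowledgment back to its originating port.
        The buggy draft effectively ignored this tag, returning any
        acknowledgment to the data port whenever a load was pending."""
        kind = self.outstanding.popleft()
        return (kind, value)
```

With four instruction fetches in flight plus one data load, only the fifth acknowledgment should ever reach the data port.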
Couldn’t formal methods have caught this?
Yes, they could’ve, and they should’ve. Why didn’t they? Because I got sloppy. In trying to build this design on the quick, I short-circuited the memory property check on the data bus.
The Project with No Bugs
The past four project stories have all involved finding ugly bugs.
I should mention before going on that I’ve also had the experience where a formally verified project has no logic bugs in it.
Tool issues? Yes. What works with the Yosys parser doesn’t necessarily work with the Verific parser, and even that doesn’t necessarily work with Vivado or Quartus. What about language “standards”? Yeah. However, I should point out that those who have tried my formally verified AXI or AXI-lite slaves have often commented that they’ve worked the first time.
The key takeaway here is that it is possible to fully verify something using formal methods alone. I just need to learn how to reproduce this success in all of my designs.
Every year that I’ve gone to DVCON, there’s been a contractor or two selling “formal sign-off” services. I understand this to mean that by using their services, they will use formal methods to verify your design to the point where it can be trusted enough for an ASIC tape-out.
That’s awesome. I’ll be honest, I look up to anyone who can stand behind such a claim.
However, as illustrated by the examples above, this has not been my experience in general. While it has always been my goal, and while I’ve had a success or two doing this, I haven’t yet been able to fully eliminate simulation from my personal design process.
That said, these four examples above do illustrate some important takeaways:
There’s always a need for simulation in integration testing.
To date, I’ve only ever formally verified design components, never entire designs. The larger the component has been, the more difficult the formal proof becomes. As a result, I’ve always had a need to use simulation to test an entire design.
Most of the simulation testing I’ve done has tended to be rather ad-hoc. My tests tend to be designed to demonstrate only a single working path through my design. Indeed, simulations in general never verify every working path within a design. Only formal does that.
The best evidence for this is the fact that I never sufficiently tested whether element 521 could be properly returned by the compression algorithm, but that element 522 would be properly rejected.
I am looking forward to fixing some of my personal test-script issues by using Symbiotic EDA’s new mutation engine–but we’ll save the discussion on how to use that engine for a later date.
Don’t forget to cover() the design returning to idle.

cover() is great for proving that an operation can complete, or that you haven’t assumed away the answer. However, just verifying that a design will return the correct answer upon request isn’t enough–you have to cover() the return to the beginning or idle state. As I mentioned above, I also like checking 2-3 transactions via a cover() statement, so that I can measure best-case throughput as well.
Even though the two formally verified designs above still needed to be simulated, and even though new bugs were caught in that simulation check, I never found more than one bug in each.
The data cache has now worked for me for over a year without further logic problems. (There were some tool incompatibilities …) The instruction cache has likewise been formally verified, and it has worked for much longer without problems. Indeed, it’s undergone several revisions without suffering from any more bugs caught during simulation.
Individual design components need their own verification infrastructure.
This was one of the painful lessons I learned from the bus return compression algorithm. In this case, I had tested it as part of a larger design and missed the difference between 10'h521 and 10'd521. I then copied it from one design to the next, to the next, to the next, and then needed to make a change to it. Having no method of testing the component, I instead queued up a minor disaster for myself some time later.
Now, sadly, I’ve learned that I need to go back through this entire debugging bus implementation in order to add in proper verification logic–logic that should’ve been created when the design was initially created years ago.
There’s not a single bug listed above that could not have been caught using either formal methods or an appropriate simulation.
This goes both ways. Any bug caught via simulation, but not formal, can be turned into a formal property to be subsequently checked. Any bug caught via formal methods but not simulation, can be used to generate a better test script.
Indeed, my guess is that now that I’ve found bugs that weren’t caught by Xilinx’s AXI VIP, they’ll quickly adjust their product. I certainly would if I were them. I also expect that, once these updates go through, things that pass their AXI VIP will look like they work again, the demonstration designs will get fixed … and all will look good again on the surface. However, the moment you make one change to those designs, everything will be up in the air again.
Just because something “works” in practice, doesn’t truly mean it “works”
The reference to 10'h521 existed in the compression scheme from the time it was written. It was there for many years, waiting until just the right sequence triggered it. Although I hadn’t noticed this bug earlier, the design never truly worked earlier either.
The really ugly lessons learned come only after working with a design for some period of time.
This is a sad reality of IP core development. Just because a given IP core works in some designs, just because it passes the threshold for the instructor to grant an A, or for the customer to accept the design work, doesn’t mean it will work in all designs. This is fundamentally the problem with simulation–you can’t test everything.
The good news is that the longer you use a given core, then (hopefully) the closer it gets to perfection. That is, at least until you need to rewrite it.
Let me share one final thought: if you choose to cut corners in verification, like I did to test the updated Wishbone wrapper for the PicoRV32, then it doesn’t matter whether you are using simulation techniques or formal methods. Either way, you are setting yourself up to get burned later on.
The good news, though, is that by using the updated wrapper I could speed up the PicoRV32 by a factor of 2x.
We’ll leave that thought, though, to be the topic of another article.
And he began to speak unto them by parables. A certain man planted a vineyard, and set an hedge about it, and digged a place for the winefat, and built a tower, and let it out to husbandmen, and went into a far country. (Mark 12:1)