My design works in simulation, but not in hardware. Can formal methods help me?

It’s not uncommon in FPGA design to have a design that “works in simulation”, but fails the ultimate test in hardware. I might argue that within FPGA designs, this is almost cliche. While I’d like to blame the problem on poor test design, it can also happen with formal methods if you haven’t fully verified your design.

Since this happens more often that I’d like to admit, let’s ask the question of whether or not formal methods can help find a bug within an FPGA design that is built, complete, loaded onto a piece of hardware, and yet doesn’t work.

The immediate answer shouldn’t surprise anyone: It depends. Sometimes formal methods can help, sometimes they can’t.

For example, if you look through my list of reasons why a design might pass simulation, but still fail in hardware, you’ll see several situations where formal methods aren’t going to help much. A classic example would be “timing problems.” While formal methods can help when you are crossing clock domains, they can’t do as much to help when logic within your implemented design can’t make it from one FF to the next FF within a single clock period. In a similar manner, formal methods aren’t going to help when you haven’t handled your I/O timing well.

Where formal methods can help is in eliminating possible causes of problems.

For example, I recently tried to run my 4x4x4 Tic-Tac-Toe game on the ZipCPU within an ICO board. Much to my dismay, my design didn’t load onto the board properly. Digging deeper, I discovered that read and write requests of the debugging bus were missing their acknowledgments. I struggled to figure out what was going on. Where should I even look for the problem?

In this example, formal methods were able to help me. Let me outline three ways in which formal methods can help in such a case, and then tell you what I found. Using formal methods, you can …

Build a better test bench

One of the reasons why I got involved in formal methods in the first place was because my test bench authoring methods were just too poor to be complete. I’ve since replaced my individual test benches with proofs using formal methods. These tend to be of a higher quality, and they tend to find more bugs.

While saying you should build a better bench test is sort of like saying you should’ve built in right in the first place, it can help to go back and thoroughly examine a component that you think might have faults within it.
“Cover” the problem

Once you see a problem in your logic within hardware, sometimes it helps to cover the situation. In this case, you would pick a component, and describe within that component a situation that you think is happening in hardware. If you then cover() that situation, you can then get a trace showing how your design might get into that situation.
Assert the problem will never happen

Alternatively, if the state the design gets into is an “illegal state” that it should never get into, then it might help to assert() that it should never happen. For example, if your design outputs values one and three but skips the second value, then you can make an assertion that it will never skip values and see if you can formally prove that assertion.

In the case I outlined above, I had a two sides of an interface I was working with. To illustrate, consider Fig 1 below.

Fig 1: ICO Board Parallel Port bug

One side of this interface had my FPGA logic on it, and the other side contained software on a Raspberry Pi. By inspection, I could see that data was getting dropped in the interface. But which side was at fault?

Perhaps you might remember the module of interest on the FPGA: it was the same debugging bus core we built here on the blog before I discovered formal methods. Since it had never been formally verified, I suspected a fault within it.

If you remember the design from when we built it, the return data path following the bus request consisted of a series of separate stages, each with back pressure. In other words, each stage was given a stall signal from the next stage, and generated a stall signal for the prior stage. Similarly, each stage created a data valid signal I called a “strobe”. Together with this data valid signal was data that should be output from the stage. I’ve written about this pipelining strategy before, calling it the “travelling CE” strategy.

This was one obvious place where data might be lost.

I simply wanted to verify that no data would ever be lost in the pipeline.

Therefore I went through all of the stages within the interface, and added an assertion between the various stages: if the current stage was producing a valid output but the next stage was busy, then the same data should still be valid on the next clock. No changes were allowed, and data was not allowed to be dropped.

always @(posedge i_clk)
if ((f_past_valid)&&($past(i_busy))&&($past(o_valid)))
begin
	assert(o_valid);
	assert($stable(o_data));
end

Of course, this assertion didn’t pass at first. When I dug further, I found some race conditions within my code. I also found a couple of cases where a value would be created by one of the stages, but yet it was designed to be pre-empted should a newer value show up. So I spent a bit of time to clean up the code and my properties, and eventually it passed forma verification through and through.

The only problem was this interface still didn’t work, and hence I still couldn’t play 4x4x4 tic-tac-toe using the ZipCPU on the icoboard.

This time, though, I now knew that my debugging bus would NEVER skip data words.

That meant the problem had to lie within the Raspberry Pi code. Sure enough, I found something similar to the following logic.

data_read = read_data(number_requested, buffer);
while((iterations++ < MAX_ITERATIONS)
	&&(data_read > 0)) {
	//
	// Process the data
	//

	// Read the next data
	data_read = read_data(number_requested, buffer);
}

This was supposed to read the data from my device and process it. To keep the routine from hanging, it would only repeat the loop MAX_ITERATIONS times.

See the bug?

And to think, I was chasing this all over my FPGA design wondering what was going wrong. Once I formally verified through the relevant portions of my FPGA code, I was able to stop chasing phantoms and move quickly to the real problem.

My point is here is simply this: formal methods can still help–even when the design is already implemented in (and failing in) hardware.