CPU based simulation, first thoughts

I’m currently working on building a basic memory controller. It’s a big project, and a fairly fun one. The controller handles multiple hardware data channels, has a scatter-gather DMA, and will (eventually) have full ECC support. The device type involved is rather complex, and so the specification requires complicated instruction sequences to access it.

At issue is, how shall such a design be tested?

Those who have been reading this blog know that I am a strong supporter of formal verification. I’ve been known to try to formally verify everything I can get my hands on, and again I’ve had fun doing this. The obvious problem with this approach is that the design as a whole is far too complicated to fit into a formal solver. So, instead, I’ve only formally verified the parts and pieces. The design as a whole still requires some amount of verification that cannot be done formally.

So, I had a crazy thought: Why not verify the design by using a CPU within the test environment?

Background

To understand what I’m proposing, let me share a bit of the project’s background. The first point is that I didn’t create it. The project was given to me to both maintain and upgrade.

Fig 1. Traditional test bench structure

As the design was given to me, the test bench had a fairly traditional structure. It had a Verilog test script that drove an AXI Bus Functional Model (BFM), sometimes called a Verification IP (VIP). There were also models for the external peripheral(s) the module under test needed to interact with, as well as a separate AXI slave BFM for the design to interact with as it might with a memory.

The original test script contained about 1.2k lines of Verilog, and it referenced another file containing 7.5k lines of non-synthesizable Verilog test-bench tasks. These tasks are essentially Verilog subroutines. Unlike traditional Verilog design, these tasks are written as sequential logic: wait until this happens, set these wires, wait two clock ticks, etc. Because these tasks are sequential in nature, they act more like software than they do hardware.

The original test script depended upon a client-provided AXI VIP module, shown as the AXI BFM in Fig. 1 above. This module contained a set of Verilog tasks which could be called to issue AXI requests to anywhere on the bus.

We’ve already discussed the problems with this kind of VIP module on the blog. Basically, the module can be used to create a form of scripted simulation test. (Read this address, write that address, etc.) Most modules of this type set RREADY and BREADY to 1, so they’ll never test for back pressure. Further, because of the sequential nature of their design, they’ll never issue concurrent read and write requests, or even multiple subsequent burst requests. This VIP was no different. A full check can’t really be done via simulation at all, but that really leads to another discussion for another day.

So, my thought was, why should I build my updated test bench in Verilog, only to rebuild it later in C? Why not build it in C in the first place? I mean, if I have to deliver a working software test bench to the customer, to include sample C software demonstrating how well the design works, then wouldn’t it make sense to do the job once and then only deliver the C code?

Preparation

Fig 2. The alternative: Using software to drive the test bench

My first thought was that the obvious reason why you wouldn’t use a CPU, and thus why you wouldn’t write your test bench in C, would be that you didn’t have a CPU available to you. However, in this case I came to this project with a ZipCPU, an AXI crossbar, a demonstration AXI (full) design which could be used to implement an AXI based block-RAM type device, and an AXI DMA. All of these together would make for a very comprehensive test bench environment. Even better, I reasoned, I could then port the entire test bench (sans external peripheral models) to an FPGA to prove the design at a later time.

The only problem was that the ZipCPU didn’t (originally) support AXI.

So, the first step was to get the ZipCPU to support AXI. As mentioned in the example AXI4 master article, I now have several versions of instruction and data interfaces. The debug interface has also gotten a significant upgrade in the process. Once done, however, AutoFPGA did a nice job of connecting everything together.

Perhaps some of you can guess what happened next. None of what followed should really come as a surprise to anyone. However, I still found the entire exercise very instructive.

Observations

Yes, I decided to try this approach. Why not? I had all the pieces I needed, and it seemed like the right thing to do.

Here are some of the things I learned along the way. We’ll call these some of the observations that I made.

First and foremost, you’ve never wanted to optimize a piece of code more than when running it from within a simulation environment.

printf() is a very convenient way of outputting characters. Actual output can be done via the Verilog $write() command or a UART, but printf() can handle a lot of formatting requirements. (So, too, can $display() …)

However, what you may not expect is that you are paying for simulation time while simulating the standard library call to printf(), and simulation is not necessarily cheap.

Did I mention that the design spends a lot of simulation time in the C library?

At one point, I tried using the rand() function to generate random test data, and I generated a new random data set for each test.

	for(int k=0; k<length; k++)
		buf[k] = rand();

At 3ms of simulated time, I quickly discovered that was a non-starter.

So, I rewrote the routine to use the ZipCPU’s shift and carry facility. Specifically, when the ZipCPU executes a shift instruction, like the logical shift right (LSR) instruction below, the last bit shifted out is placed into the carry flag. This means that a second, conditional, exclusive OR (XOR) instruction can make a nicely implemented Galois linear feedback shift register: shift right, and if the bit shifted out is a one, then exclusive OR the tap mask (TAPS below) into the new shift register value, called fill below.

#define	STEP(F, T)	asm("LSR 1,%0\n\tXOR.C %1,%0" : "+r"(F) : "r"(T))
	unsigned	fill = 1;
	const unsigned TAPS = 0x485b5;
	// char * buf;

	for(int k=0; k<length; k++) {
		STEP(fill, TAPS);
		buf[k] = fill;
	}

Even this was too slow.

However, when I unrolled the loop so that it wrote two full words at a time, rather than issuing eight separate byte-wide bus writes, the run time dropped down to 1.6ms.

Still too long.

Then I reasoned I could just re-use this data set for each of my tests–and that seemed to speed things up enough to be bearable.
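For readers without a ZipCPU handy, the shift-and-conditional-XOR idea above can be sketched in portable C. This is only an illustration: the function names (lfsr_step, fill_bytes, fill_words) are hypothetical, and the choice to pack four LFSR bytes into each word most-significant byte first is an assumption, not necessarily what the unrolled loop described above did.

```c
#include <stddef.h>
#include <stdint.h>

/* One Galois LFSR step: shift right by one, and if the bit shifted out
 * was a one, XOR the tap mask into the result.  This is a portable
 * stand-in for the LSR / XOR.C instruction pair. */
static uint32_t lfsr_step(uint32_t fill, uint32_t taps)
{
	uint32_t carry = fill & 1u;

	fill >>= 1;
	if (carry)
		fill ^= taps;
	return fill;
}

/* Byte-at-a-time fill: one LFSR step and one byte store per output byte.
 * Returns the updated LFSR state. */
static uint32_t fill_bytes(uint8_t *buf, size_t length,
		uint32_t fill, uint32_t taps)
{
	for (size_t k = 0; k < length; k++) {
		fill = lfsr_step(fill, taps);
		buf[k] = (uint8_t)fill;
	}
	return fill;
}

/* Word-at-a-time fill: four LFSR steps, then a single 32-bit store.
 * Bytes are packed most-significant first (an assumption made here so
 * that memory order matches byte order on a big-endian machine). */
static uint32_t fill_words(uint32_t *buf, size_t nwords,
		uint32_t fill, uint32_t taps)
{
	for (size_t k = 0; k < nwords; k++) {
		uint32_t word = 0;

		for (int b = 0; b < 4; b++) {
			fill = lfsr_step(fill, taps);
			word = (word << 8) | (fill & 0xffu);
		}
		buf[k] = word;
	}
	return fill;
}
```

Both fill routines walk the same LFSR sequence. The word version simply trades four byte-wide bus writes for one word-wide write, which is where the simulation time savings would come from.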

Fig 3. Pond scum is known for moving faster than simulated serial ports

Simulated UARTs are slower than molasses in the winter time. I think I’ve even seen faster moving pond scum. My original UART speed was 1MBaud. In the end I bumped that up to 10MBaud and the UART interface still felt slow.

Increasing the serial port buffer size from 16 bytes to 256 bytes helped as well. This kept the CPU from spending its time polling for a space to be available in the serial port’s buffer.

Buffering UART requests means there’s a huge lag between the request to write to the UART and the actual UART output. The design can fail or even complete during this time, before the UART output is completed. As a result, the last line from my simulation often read haltSimulation complete, instead of the two lines starting with halting from the CPU and then Simulation complete from the simulator.

When I first started with this design setup, the CPU was spending a lot of time fetching instructions from the bus. Increasing the cache size to something ridiculously large, such as 256kB for the instruction cache and another 256kB for the data cache, helped.

The ZipCPU is big endian. The AXI4 bus (and everything on it) is little endian. Swapping endianness in simulation seems like a waste of precious time.

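void
byteswapln(uint8_t *dst, uint32_t *src, int ln) {
	// View the source as bytes, so that each index below selects one
	// byte of the current word rather than a whole 32-bit word
	const uint8_t	*bsrc = (const uint8_t *)src;

	// ln counts 32-bit words
	while(ln > 0) {
		register	uint8_t	a,b,c,d;

		a = bsrc[0];
		b = bsrc[1];
		c = bsrc[2];
		d = bsrc[3];

		// Write the four bytes back out in reverse order
		dst[0] = d;
		dst[1] = c;
		dst[2] = b;
		dst[3] = a;

		bsrc += 4;
		dst += 4;
		ln--;
	}
}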

Rewriting this function in assembly helped, since GCC doesn’t (yet) do a good job of exploiting the ZipCPU’s pipelined memory features.
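Short of dropping into assembly, one portable alternative is to swap a whole word at a time using shifts and masks, so that each word costs one load and one store rather than four of each. The sketch below is not the assembly routine mentioned above. The names bswap32 and byteswap_words are hypothetical, and it assumes both buffers are word aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit word using shifts and masks. */
static uint32_t bswap32(uint32_t v)
{
	return (v >> 24)
		| ((v >> 8) & 0x0000ff00u)
		| ((v << 8) & 0x00ff0000u)
		| (v << 24);
}

/* Swap the endianness of nwords 32-bit words from src into dst.
 * Assumes (for this sketch) that both pointers are word aligned. */
static void byteswap_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
	for (size_t k = 0; k < nwords; k++)
		dst[k] = bswap32(src[k]);
}
```

Whether this beats a hand-written routine on any particular CPU depends on how well the compiler schedules the loads and stores, so it would be worth checking in a trace before trusting it.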

Verilator variables associated with the external C++ device model aren’t automatically captured in the trace. This made it a challenge, at times, to debug why the model didn’t do what you wanted. There was a VCD trace file and two debug-by-printf dumps: one from the simulated ZipCPU and another from the host. It took work to synchronize those three traces to discover what was going on.

On other designs I’ve created design inputs and filled them with the state from the external model. I could’ve done that here, and might’ve, if this had turned out to be any harder than it was.

The larger the CPU stack size becomes, the slower the CPU gets.

This was not one I was expecting, although perhaps I should’ve expected this–since I was the one who wrote the ZipCPU’s GCC back end and I seem to recall implementing this “feature”.

The root of the problem stemmed from placing a large 8kB page buffer on the stack. This plus a couple other registers forced the stack size to be larger than the ZipCPU’s maximum load register offset of fourteen signed bits, or -8192 … 8191. GCC then turned commands to move data from the stack to a register into the instruction sequence:

	LDI	#Offset, R0
	ADD	R12,R0
	LW	(R0),Rx

instead of the desired

	LW	#Offset+R12,Rx

Moving the data buffer off the stack and into global memory helped–since all accesses were then based upon shorter offsets from a buffer pointer rather than large offsets of the frame pointer.
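The arithmetic behind that offset limit is easy to check. The helper below is a hypothetical illustration (fits_signed_bits is not a ZipCPU or GCC function) of why an 8kB buffer plus anything else on the stack no longer fits within a fourteen bit signed offset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if offset can be encoded as an nbits-wide two's complement
 * immediate.  This sketch assumes 1 <= nbits <= 31. */
static bool fits_signed_bits(int32_t offset, unsigned nbits)
{
	int32_t lo = -((int32_t)1 << (nbits - 1));
	int32_t hi = ((int32_t)1 << (nbits - 1)) - 1;

	return offset >= lo && offset <= hi;
}
```

With nbits set to 14, an offset of 8191 fits but 8192 does not, so an 8192 byte buffer followed by even one more word on the stack pushes some accesses past the limit.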

The Bible can make a useful (and fun) data source. If you are staring at data coming across your screen, it’s more enjoyable to stare at a Bible passage than raw hex or anything else for that matter. Of course, this does have the drawback that a Bible passage isn’t necessarily a full and complete test suite, since it tends not to test bit 7, but the separate random data check discussed above and below helped to mitigate this issue.

Generating a VCD file can really slow down the simulation. A simple GPIO peripheral, however, can be useful for intelligently controlling (based upon circumstances within the design) whether or not trace recording is enabled in Verilator. With only a minor change to the C++ test script, turning on the trace can then become as simple as the C statement:

	_axilgpio->g_set = SET_TRACE;

Eventually, I got in the habit of something like:

	if (error_condition) {
		// Something bad happened.

		// If the trace hasn't (yet) been turned on, turn it on now.
		if ((_axilgpio->g_set & SET_TRACE)==0)
			_axilgpio->g_set = SET_TRACE;
		else
			// Otherwise this has been captured in a trace, so we
			// can exit the simulation now with a failure.
			fail_simulation = 1;
	}

The fail_simulation variable above isn’t really anything special–it’s just a C integer that’s then used to skip further testing. Were this a C++ test bench, it would’ve been implemented as a boolean.

A basic AXI DMA is vastly superior to memcpy() when it comes to high speed data transfer. Even better, the AXI DMA helps to test more of the design’s AXI capabilities.

Truly testing random data moving across an interface requires a bit of work on both ends of the interface. On one end, you’ll need a source of pseudo-random data to push through the interface. On the other end, you’ll want to be able to compare the received results with the sent data. Altogether, the test requirement looks something like Fig. 4 below.

Fig 4. Many tests require some form of memory comparison

I’m not sure I have a good solution (yet) for this requirement.

The requirement is easy enough to accomplish using memcmp(), but as I noted above the mem*() functions can be notoriously slow and slow simulations are painful. Something similar to an AXI DMA that implements a memcmp() might be really useful here. Perhaps a memory to stream (MM2S) DMA could make this easier?

Fig 5. Here's what a hardware memory comparison might look like

For now, just keep your eyes peeled to the wb2axip repo. If I build such a hardware memcmp(), that’s where I’d put it. The basic design would likely follow Fig. 5 above, and so it would be built around one (or two) MM2S DMA’s followed by a stream comparison.
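As a rough idea of what such a stream comparison would need to report, here is a small software reference model. It is purely hypothetical: the name stream_compare and its return convention are assumptions made for illustration, not an interface from the wb2axip repo.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of a stream comparison: walk two word streams in lock
 * step and return the index of the first mismatching word, or nwords if
 * the two streams match over their whole length. */
static size_t stream_compare(const uint32_t *a, const uint32_t *b,
		size_t nwords)
{
	for (size_t k = 0; k < nwords; k++) {
		if (a[k] != b[k])
			return k;
	}
	return nwords;
}
```

A hardware version built from two MM2S streams could expose the same two pieces of information through a status register: whether a mismatch was seen, and at which word it first occurred.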
A Verilator test bench script can quickly move from one bus request to the next with no simulation time taken between the two requests. The CPU, on the other hand, needs to process instructions between bus requests. This will necessarily slow down any CPU based test when compared to its Verilog counterpart.

I guess that’s the big bottom line here: although the CPU software based test script leads to a more realistic test, it will always require more simulation cycles to accomplish.

Simple changes:

With some simple changes, many of these drawbacks became quite bearable. Here are some more of the changes I made to get things working better.

Boost the cache size. I mentioned this above. Basically, while you might not be able to afford a 16kB cache in any real life FPGA, you can certainly afford a 1MB cache in simulation land. Once the design ends up in actual hardware running at true hardware speeds, then no one will notice that you went back to the smaller cache size.

A quick fix to the ZipCPU’s newlib back end also helped. Rather than checking on every CPU clock cycle whether or not a byte can be output, the device write command now first checks the serial port’s buffer availability, and then sends that many bytes to the serial port before going back to check on availability again. I would’ve never noticed the impact of such a change had I not been running the design in simulation in the first place.
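The difference between the two write strategies can be sketched against a toy UART model. Everything here is hypothetical: uart_model_t stands in for the real memory-mapped registers, the model assumes a nonzero FIFO depth, and it assumes the FIFO has drained completely by the time of each status read. The only point is to count how many status reads each strategy costs.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy UART model (an assumption for illustration, not the real device). */
typedef struct {
	unsigned depth;	/* transmit FIFO depth in bytes, must be > 0 */
	unsigned space;	/* free FIFO entries as of the last status read */
	unsigned polls;	/* number of status reads performed */
	size_t   sent;	/* number of bytes written to the FIFO */
} uart_model_t;

/* Status read: counts as one bus access.  The model pretends the FIFO
 * has fully drained by the time software polls it. */
static unsigned uart_tx_space(uart_model_t *u)
{
	u->polls++;
	u->space = u->depth;
	return u->space;
}

/* Data write: consumes one FIFO entry. */
static void uart_tx_put(uart_model_t *u, uint8_t c)
{
	(void)c;
	u->space--;
	u->sent++;
}

/* Original approach: poll for space before every single byte. */
static void uart_write_polled(uart_model_t *u, const uint8_t *buf, size_t len)
{
	for (size_t k = 0; k < len; k++) {
		while (uart_tx_space(u) == 0)
			;
		uart_tx_put(u, buf[k]);
	}
}

/* Burst approach: read the available space once, then send that many
 * bytes before checking again. */
static void uart_write_burst(uart_model_t *u, const uint8_t *buf, size_t len)
{
	while (len > 0) {
		size_t room = uart_tx_space(u);

		if (room > len)
			room = len;
		for (size_t k = 0; k < room; k++)
			uart_tx_put(u, buf[k]);
		buf += room;
		len -= room;
	}
}
```

Under this model, sending 1000 bytes through a 256 byte FIFO takes 1000 status reads with per-byte polling and 4 with the burst approach. Each status read is a bus transaction the simulator has to grind through.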
Clock gating can speed up simulations … I think. While clock gating is difficult to implement in an FPGA, it is more common in ASIC designs. Along the way, after a measurement or two, I convinced myself that clock gating can speed up simulations.

Fig 6. Does clock gating speed up sim time?

I was so excited about this technique that I started drafting an article about how clock gating could be used to speed up simulation time. To complete this article, and make my point about how awesome clock gating was, I spent some time and made some measurements. The simulation then ran five minutes slower with clock gating enabled. Oops.

What happened? Well, I had upgraded Verilator between the two tests. Might this have affected things?

Looking over Verilator’s documentation, I’ve since found some optimizations that might possibly speed up designs when gated, so stay tuned. I’m not convinced this is the end of the story yet.

A simple nonce at both the beginning and end of a test sequence can help make it easier to verify, from a simulation trace, that the first and last bytes were accurately communicated as desired. In my case, I replaced the first character of what would otherwise be a string of = with a “0”, and the last character with a “Z”. This left me with a test set that started out reading:

0==========================================================================+
|                                                                          |
|  Psalm 1                                                                 |
|                                                                          |
|  Blessed is the man that walketh not in the counsel of the ungodly, nor  |
|    standeth in the way of sinners, nor sitteth in the seat of the        |
|    scornful.                                                             |
|  But his delight is in the law of the LORD; and in his law doth he       |
|    meditate day and night.                                               |
|  And he shall be like a tree planted by the rivers of water, that        |
|    bringeth forth his fruit in his season; his leaf also shall not       |
|    wither; and whatsoever he doeth shall prosper.                        |
|  The ungodly are not so: but are like the chaff which the wind driveth   |
|    away.                                                                 |
|  Therefore the ungodly shall not stand in the judgment, nor sinners in   |
|    the congregation of the righteous.                                    |
|  For the LORD knoweth the way of the righteous: but the way of the       |
|   ungodly shall perish.                                                  |
|                                                                          |
============================================================================

Unfortunately, the block size was longer than the Psalm, so you don’t get to see the “Z” in the output–even though it’s present in the VCD file.

When verifying whether bytes in a message have been gained or lost, it helps to force every line to have the same length and end with the same character. Missing or extra characters then stand out loudly. Check out the test sequence above, and ask what would happen if a space were skipped or some other character inserted. The lines at the edges of the message wouldn’t line up.
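The same visual check can be automated. The sketch below uses a hypothetical helper, check_line_lengths, that scans a received message and reports the first line whose length differs from the first line’s, which is where a dropped or inserted byte would show up.

```c
#include <stddef.h>
#include <string.h>

/* Return 0 if every line of text has the same length as the first line.
 * Otherwise return the 1-based number of the first line that differs.
 * Lines are separated by '\n'; the newline itself is not counted. */
static size_t check_line_lengths(const char *text)
{
	size_t expected = 0, line = 0;

	while (*text) {
		size_t len = strcspn(text, "\n");

		line++;
		if (line == 1)
			expected = len;
		else if (len != expected)
			return line;

		/* Step past this line and its terminating newline */
		text += len;
		if (*text == '\n')
			text++;
	}
	return 0;
}
```

A lost byte shortens exactly one line, so the returned line number points directly at the part of the trace worth inspecting.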
I was worried initially about how I might transition a design that requires my normal TCP/IP based serial port to an all-Verilog simulation model that I might use with Icarus, Xcelium, or some other commercial simulation tool. An all-Verilog simulation model was something my customer wanted, but not something I normally build or use. I eventually found out that I can incorporate Verilog’s $write() command into my serial port controller to achieve a result that’s close enough.

Realistically, the biggest changes came as a result of just staring at the simulation traces showing the ZipCPU running software. The more I did so, the more obvious any slow software became, and therefore the more I wanted to dig into the slow parts to speed them up.

Conclusion

In hindsight, this entire approach arguably violates a fundamental principle of engineering: Tests should be accomplished by dividing the design into components that are known to work and components that aren’t (necessarily) known to work. From there, you should only test one component at a time.

I say arguably because any test requires both trusted components and not so trusted components. The most obvious trusted component is the simulator itself–in my case Verilator (so far). Likewise the most obvious untrusted component is the design or module under test. Other untrusted components tend to include the external device module and the test script itself–whether written in C or Verilog. This distinction is important, because it helps to reveal that the module under test is never the only untrusted component in any design. As a result, we might argue to what extent the basic engineering principle applies here.

This particular design approach treated certain pieces of the design, the ZipCPU, AXI DMA, RAM model, and interconnect as working infrastructure pieces. Indeed, in general, they did “just work”, although the exercise began with a hiccup or two. Remember, this was only the ZipCPU’s second AXI implementation (here’s the first), and this was the first design where the ZipCPU ran from a 64-bit bus. This design, therefore, also tested some new bus width adjustment components [1] [2]. Not only that, but the design also tested the ZipCPU’s (new) clock gating capability. Much as one might expect, the first couple of tests weren’t pretty. For example, AutoFPGA initially got its cachable address markers off by 3-bits, and so even though the RAM was entirely cachable, very few of the transactions were actually ending up in the data cache.

The really big question of this whole exercise is, now that I’ve been through it, whether or not I would do so again. The answer to that question is: I’m not sure. Indeed, I’m now faced with a second, very similar project, and the question before me today is whether or not I should write a Verilog test script for it or just write the script in software as I did for this project.

We’ll see.