Assignment delays and Verilog's wait statement

I’ve now spent more time than I want to admit debugging simulation issues that stem from Verilog’s simulation semantics. Let me therefore share some problems I’ve come across, together with my proposed solutions for them.

The Problems

Today’s problem stems from logic like the following:

always @(trigger_condition)
begin
	if (some_other_condition_determining_relevance)
	begin
		#1;
		state_variable = complex_expression;
		// This then continues for another 50 lines or so
	end
end

In general, this comes to me in “working” simulation code that’s been handed down to me to maintain. The simulations that use this logic often take hours to run, and so debugging this sort of thing can be very time consuming. (Costly too–my hourly rate isn’t cheap.)

Let’s walk through this logic for a moment–before tearing it apart.

Fig 1. Avoid assignment delays

In this example the first condition, the one I’ve called trigger_condition above, is simply some form of data change condition. Sometimes it’s a reference to a clock edge, sometimes it’s a reference to a particular piece of data changing. This isn’t the problem.

The second condition, some_other_condition_determining_relevance, is used to weed out all the times the always block might get triggered when you don’t want it to be. For example, it might be triggered during reset or when the slave device being modeled is currently responsive to some other trigger_condition. This is natural. This is not (yet) a problem either.

So what’s the problem with the logic above? Well, let’s start with the #1 assignment delay. In this case, it’s not representing a true hardware delay. No, the #1 is there in order to schedule Verilog simulation statement execution. Part of the reason why it’s there is because the rest of the block uses blocking assignments (i.e. via the =). Hence, if this block was triggered off of a clock edge, the #1 allows us to reason about what happens after that clock edge but before the next one.

Fig 2. Recipe for trouble

Now, let me ask, what happens five years from now when clock speeds get faster? Some poor soul (like me) will be hired to maintain this logic, and that poor soul will look at the #1 and ask, why is this here? Maybe it was a 1ns delay, and they are now trying to run a clock at 500MHz instead of 100MHz. That 1ns delay will need to be understood, and replaced–everywhere it was used. It doesn’t help that the 1ns doesn’t come with any explanations, but that may be specific to the examples I’m debugging.

Here’s a second problem, illustrated in Fig. 2: what happens when you use this one nanosecond delay in multiple always blocks, similar to this one, all depending on each other? Which one will execute first?
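To make the ordering question concrete, here is a minimal sketch of two such blocks. The module and signal names are made up for illustration. Both blocks wake at exactly the same simulation time, and Verilog does not define which one runs first, so the value that ends up in b may differ from one simulator (or one simulator release) to the next.

```verilog
`timescale 1ns / 1ns
// Hypothetical demonstration module, not taken from any real design
module race_demo;
	reg		clk = 0;
	reg	[7:0]	a = 0, b = 0;

	always #5 clk = !clk;

	// Both blocks below resume at (posedge clk) + 1ns.  The simulator
	// is free to run them in either order.
	always @(posedge clk)
	begin
		#1;
		a = a + 1;
	end

	always @(posedge clk)
	begin
		#1;
		b = a;	// The old a, or the new a?  The language doesn't say.
	end

	initial begin
		repeat (4) @(posedge clk);
		#2 $display("a=%0d b=%0d", a, b);
		$finish;
	end
endmodule
```

Dropping the #1 and using non-blocking assignments in both blocks would make b reliably pick up the value a held before the clock edge.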

The third problem often follows this one, and it involves a wait statement of some type. To illustrate this, let me modify the example above a bit more.

always @(trigger_condition)
begin
	if (some_other_condition_determining_relevance)
	begin
		#1;
		state_variable = complex_expression;
		// Continue for a while ...
		@(negedge clk);	// Time is consumed here, waiting on the clock
		output_value = other_complex_expression;
	end
end

In this case, the user wants to make certain his logic is constant across the clock edge, and so he sets all his values on the negative edge of the clock. This leads to two problems: what happens when the #1 delay conflicts with the clock edge? And what happens when the output value depends upon other inputs that are set on the negative clock edge?

Fig 3. Giant case statement dispatching tasks

Fig. 3 shows another problem, this time when using a case statement. In this case, it’s an attempt to implement a command structure within a modeled device. The device can handle one of many commands, so depending on which one is received you go and process that command. The actual example this is drawn from was worse, since it depended not only on commands but also on command sequences, and the command sequences were found within case statements within case statements.

What’s wrong with this? Well, what happens when the original trigger takes place a second time, but the logic in the always block hasn’t finished executing? Perhaps this is erroneous. Perhaps it finishes just barely on the wrong side of the next clock edge. In my case, I find the bug four hours later, and that’s on a good day. It doesn’t help that simulations tend to run rather slowly.
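A small sketch of this failure, using made-up names and a made-up 100ns processing time: the always block below is still sitting inside its delay when the second command arrives, so that second rising edge should never be seen by the block at all. Nothing in the simulation flags the loss.

```verilog
`timescale 1ns / 1ns
// Hypothetical demonstration module
module missed_trigger_demo;
	reg	cmd = 0;
	integer	started = 0, finished = 0;

	// A long-running command handler that consumes time
	always @(posedge cmd)
	begin
		started = started + 1;
		#100;		// "Processing" the command
		finished = finished + 1;
	end

	initial begin
		#10 cmd = 1;	// First command, at t=10
		#10 cmd = 0;
		#10 cmd = 1;	// Second command, at t=30, while the
		#10 cmd = 0;	//   handler is still busy with the first
		#200 $display("started=%0d finished=%0d", started, finished);
		$finish;
	end
endmodule
```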

Fig 4. FSMs are often easier to debug than long-running tasks

A better approach would’ve been to use a state machine rather than embedded tasks. Why is this better? Well, if for no other reason, a state machine’s case statement would be driven by state variables which could be seen in the trace file. That means that you could then find and debug what would (or should) happen when/if the new command trigger shows up before a prior command completes.
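As a rough sketch of what such a state machine might look like, consider the following. The port names, the two commands, and their durations are all assumptions chosen for illustration. The point is only that state shows up in the trace, and that a command arriving while another is in progress gets flagged rather than silently dropped.

```verilog
// Hypothetical command handler, modeled as an FSM
module cmd_fsm_model #(
		// Assumed command durations, in clock cycles
		parameter [7:0]	CYCLES_A = 8'd4,
		parameter [7:0]	CYCLES_B = 8'd9
	) (
		input	wire		clk, reset,
		input	wire		cmd_valid,
		input	wire		cmd_is_b,
		output	reg	[1:0]	state,	// Visible in the trace
		output	reg		overrun
	);

	localparam [1:0]	S_IDLE  = 2'd0,
				S_CMD_A = 2'd1,
				S_CMD_B = 2'd2;

	reg	[7:0]	remaining;

	always @(posedge clk)
	if (reset)
	begin
		state     <= S_IDLE;
		remaining <= 0;
		overrun   <= 1'b0;
	end else case(state)
	S_IDLE: if (cmd_valid)
		begin
			state     <= cmd_is_b ? S_CMD_B  : S_CMD_A;
			remaining <= cmd_is_b ? CYCLES_B : CYCLES_A;
		end
	default: begin
		// A command arriving while busy is flagged, not lost
		if (cmd_valid)
			overrun <= 1'b1;
		if (remaining <= 1)
			state <= S_IDLE;
		remaining <= remaining - 1;
	end
	endcase
endmodule
```

No time is consumed inside the always block. The passage of time is counted in clock cycles instead, so every trigger is evaluated on every clock edge.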

Fig 5. Repeat LLC logic

These problems are only compounded when this logic is copied. For example, imagine a device that can do tasks A, B, and C, but requires one of two IO protocols to accomplish task A, B, or C. Now, if that IO protocol logic is copied and embedded into each of the protocol tasks, then all three will need to be updated when the IO protocol is upgraded. (I2C becomes I3C, SPI becomes Quad SPI, etc.)

While some of these problems are specific to hardware, many are not. Magic numbers are a bad idea in both RTL and software. Design reuse and software reuse are both very real things. Even a carpenter will build a custom jig of some type when he has to make fifty copies of the same item.

The good news is that better approaches exist.

Defining terms

Before diving into some better approaches, let me take just a couple of moments to introduce the terms I will be using. In general, a test bench has three basic (types of) components, as illustrated in Fig. 6.

Fig 6. Test bench components

The Device Under Test (DUT): This is the hardware component that’s being designed, and for which the test has been generated.

Since the DUT is intended to be synthesizable, Verilog delay statements are inappropriate here.
The Hardware Device Model, or just model: Our hardware component is being designed to interact with an external piece of hardware. This component is often off-chip, and so our “model” is a simulation component designed to interact with our IP in the same way the actual hardware would.

Although I’ve called these “models” “emulators” in the past, these aren’t truly “emulators”. An “emulator” would imply a description of the actual hardware existed, such as an RTL description, yielding an additional level of realism in simulation. Barring sufficient information from the external device’s manufacturer to actually and truly “emulate” the device, the test designer often settles for a “model” instead.

Hardware models may naturally require Verilog delays in order to model the interfaces they are designed for. For example, a signal may take some time to transition from a known value to an unknown one following a clock transition. As another example, a hardware device may become busy following a command of some kind. The good news is that Verilog can model both of these behaviors nicely.

How to handle these delays “properly” will become part of the discussion below.
The Test Script, or driver: This is the component of the design that interacts with the device under test, sequencing the commands given to it to make sure all of the capabilities of the DUT are properly tested.

This component of the Verilog test script often reads more like it is software than hardware. Indeed, we’ve already discussed the idea of replacing the test script with a piece of software compiled for a soft-core CPU existing in the test environment, and then emulating that CPU as part of the simulation model. The benefit of this approach is that it can test and verify the software that will be used to drive the hardware under test. The downside is that simulations are slow, and adding a CPU to the simulation environment can only slow it down further.

For the purposes of our discussion today I’ll simply note that the test script commonly interacts with the design in a synchronous manner. Any delays, therefore, need to be synchronized with the clock.

There is another problem with the driver that we won’t be discussing today. This is the simple reality that there’s no way to test all possible driver delays. Will a test driver accurately test if your DUT can handle back to back requests, requests separated by a single clock cycle, by two clock cycles, by N clock cycles? You can’t simulate all of these possible delays, but you can catch them using formal methods.
Not shown in Fig. 6, but also relevant, is the Simulation Environment: While the DUT and model are both necessary components of any simulation environment, the environment might also contain such additional components as an AXI interconnect, CPU, DMA, and/or RAM, none of which are the test script, DUT, or model.

Ideally these extra components will have been tested and verified in other projects prior to the current one, although this isn’t always the case.

Now that we’ve taken a moment to define our terms, we can return to the simulation modeling problem we began with.

Better practices

The good news is that Verilog was originally written as a language for driving simulations.

Even better, subsets of Verilog exist which can do a good job of modeling synthesizable logic. This applies to both asynchronous and synchronous logic. The assignment delay problems that I’ve outlined above, however, arise from trying to use Verilog to model a mix of logic and software when the goal was to create a hardware device model.

Here are some tips, therefore, for using delays in Verilog:

Write synthesizable simulation logic where possible.

This is really only an issue for test bench or modeling logic. It’s not really an issue for logic that was meant to be synthesizable in the first place.

The good news about writing test bench logic in a synthesizable fashion is that you might gain the ability to synthesize your model in hardware, and then run tests on it just that much faster. You could then also get a second benefit by formally verifying your device model–it’d save you that much time later when running integrated simulations.

As an example, compare the following two approaches for verifying a test chip:

ASIC Test chip #1: Has an SPI port capable of driving internal registers. This is actually a really good idea, since you can reduce the number of wires necessary to connect to such a test chip. The problem, however, was that the SPI driver came from encrypted vendor IP. Why was this a problem? It became a problem when the test team tried to connect to the device once it had been realized in hardware. They tried to connect their CPU to this same SPI port to drive it–and then didn’t drive it according to protocol properly.

The result of testing ASIC test chip #1? I got a panicked call from a client, complaining that the SPI interface to the test chip wasn’t working and asking if I could find the bugs in it.

ASIC Test chip #2: Also has a SPI port for reading and writing internal registers. In this chip, however, the SPI port was formally verified as a composition of both the writer and the reader–much as Fig. 7 shows below.

Fig 7. Sometimes, you'll have both RTL pieces available to you

I say “much as Fig. 7 shows” because the verification of this port wasn’t done using the CPU as part of the test script. However, because both the SPI master and SPI slave were verified together, and even better because they were formally verified in an environment containing both components, the test team can begin its work with a verified RTL interface.

You can even go one step farther by using a soft-core CPU to verify the software driver at the same time. This is the full extent of what’s shown in Fig. 7. As I mentioned above, the formal verification for ASIC test chip #2 stopped at the AXI-lite control port for the SPI master. When testing this chip as part of an integrated test, a test script was used to drive a Bus Functional Model (BFM), rather than actual CPU software. However, if you just read the test script’s calls to the BFM, you would have the information necessary to build a verified software driver.

Use always @(*) for combinatorial blocks, and always @(posedge clk) (or negedge) or always @(posedge clk or negedge reset_n) for synchronous logic.

While I like using the positive edge of a clock for everything, the actual edge you need to use will likely be determined by the device and protocol you are modeling. The same is true of the reset.

I would discourage the use of always @(trigger), where trigger is some combinatorial signal–lest you forget some required trigger component. I would also discourage the use of any always @(posedge trigger) blocks where trigger wasn’t a true clock–lest you create a race condition within your logic. I use the word discourage, however, because some modeling contexts require triggering on non-clocked logic. If there’s no way around it, then you do what you have to do to get the job done.
Synchronous (clocked) logic should use non-blocking assignments (<=), and combinatorial logic should use blocking assignments (=).

It seems like my debugging problems began when the prior designer used a delay and blocking assignments instead of proper non-blocking assignments.

	always @(posedge clk)		// SYNCHRONOUS block
	begin
		#1;			// MAGIC NUMBER, doesn't model H/W, etc
		value = expression;	// BLOCKING LOGIC!
		// ...
	end

Just … don’t do this. When you start doing things like this, you’ll never know if (whatever) expression had finished evaluating, or be able to keep track of when the #1 delay needs to be updated.
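For contrast, here is a minimal sketch of the same kind of block written with non-blocking assignments and no delay. The module and signal names are placeholders. Every right-hand side is evaluated using the values from before the clock edge, so there’s no question of what had or hadn’t finished evaluating, and nothing to retune when the clock rate changes.

```verilog
// Hypothetical example module
module sync_stage (
		input	wire		clk,
		input	wire	[7:0]	a, b,
		output	reg	[7:0]	sum, prev_sum
	);

	always @(posedge clk)		// SYNCHRONOUS block
	begin
		sum      <= a + b;	// Non-blocking, no delay
		prev_sum <= sum;	// The value sum held before this edge
	end
endmodule
```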

Device models aren’t test drivers. Avoid consuming time within them–such as with a wait statement of any type. Let the time be driven elsewhere by external events.

This applies to both delays and wait conditions within always blocks, as well as any tasks that might be called from within them. Non-blocking assignment delays work well for this purpose.

Ideally, device models should use finite state machines, as in Fig. 4, to model the passing of time if necessary, rather than consuming time with wait statements or ill defined assignment delays, as in Fig. 3.
When driving synchronous logic from a test script, synchronize any test driven signals using non-blocking assignments.

I have now found the following simulation construct several times over:

	initial begin
		@(posedge ARESETN);
		@(posedge ACLK);
		@(posedge ACLK);
		@(posedge ACLK);

		ARVALID = 1;
		@(posedge ACLK);
		ARVALID = 0;

		// the problem continues on ...
	end

Sometimes the author uses the negative edge of the clock instead of the positive edge here to try to “schedule” things away from the clock edge. Indeed, I’ve been somewhat guilty of this myself. Sadly, this causes no end of confusion when trying to analyze a resulting trace file.

A better approach would be to synchronize this logic with non-blocking assignments.

	initial begin
		ARVALID = 0;			// Any initial value
		// Other initial AR* values may be reset here as well
		@(posedge ARESETN);		// Wait for the (active low) reset to release
		@(posedge ACLK);
		@(posedge ACLK);
		@(posedge ACLK);

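		@(posedge ACLK)
		begin
			ARVALID <= 1;
			// Set the rest of the AR* values
		end

		// Hold ARVALID high until the edge on which ARREADY is seen.
		// (Testing ARVALID itself here would read the old value,
		// since the non-blocking update above hasn't landed yet.)
		@(posedge ACLK);
		while(!ARREADY)
			@(posedge ACLK);
		ARVALID <= 0;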

		// Script continues further
	end

This will avoid any delta-time cycle issues that would otherwise be very difficult to find and debug. Note that this also works because this block is the only block controlling ARVALID from within the test bench. Should you wish to control ARVALID from multiple test bench clocks, you may run into other concurrency problems.

While you can still do this sort of thing with Verilator, I’ll reserve my solution for how to do it for another post.

Pick a clock edge and use it. Don’t transition on both edges–unless the hardware protocol requires it.

As I alluded to above, I’ve seen a lot of AXI modeling that attempts to set the various AXI signals on the negative edge of the clock so that any and all logic inputs will be stable later when the positive edge comes around. This approach is all well and good until someone wants to do post–layout timing analysis, or some other part of your design also wants to use the negative edge, and then pain ensues.

Sadly, this means that the project may be turned in and then rest in a “working” state for years before the problem reveals itself.

In a similar fashion, what happens when you have two always blocks, both using a #1 delay as illustrated in Fig. 2 above? Or, alternatively, what happens when you want the tools to put real post place-and-route delays into your design for a timing simulation? You may find you’ve already lost your timing slack due to a poor simulation test bench or model. Need I say that it would be embarrassing to have to own up to a timing failure in simulation, due to your own simulation constructs?
There is a time for using multiple always blocks–particularly when modeling DDR devices.

Fig 8. Example DDR simulation logic

In today’s high speed devices, I’ve often found the need for multiple always blocks, triggered off of different conditions, to capture the various triggers and describe the behavior I want. One, for example, might trigger off the positive edge, and another off the negative edge. This is all fine, well, and good for simulation (i.e. test bench) logic. While this would never work in hardware, it can easily be used to accurately model behavior in simulation.
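As a simple sketch of the idea, a model that captures double data rate input might use one block per edge. The names and widths here are assumptions, and a real DDR model would involve a good deal more than this.

```verilog
// Hypothetical DDR capture logic for a simulation model
module ddr_capture_model (
		input	wire		clk,
		input	wire	[7:0]	dq,
		output	reg	[15:0]	word
	);

	reg	[7:0]	rise_data;

	// Capture the first half of the word on the rising edge ...
	always @(posedge clk)
		rise_data <= dq;

	// ... and the second half on the falling edge
	always @(negedge clk)
		word <= { rise_data, dq };
endmodule
```

Each register is still written from only one block, and on only one edge.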

Use assignment delays to model physical hardware delays only.

For example, if some event will cause the ready line to go low for 50 microseconds, then you might write:

	parameter	tWAIT = 50_000;

	always @(posedge clk)
	if (event_and_not_reset && ready)
	begin
		ready <= 1'b0;
		ready <= #tWAIT 1'b1;
	end

Notice how I’ve carefully chosen not to consume any time within this always block, yet I’ve still managed to create something that will capture the passage of time. In this case, I’ve used the Verilog <= together with a delay statement to schedule the transition of ready from zero back to one tWAIT ns later.
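If you’d like to convince yourself that the block really isn’t stalled while ready is low, a small test bench along these lines might help. It’s only a sketch: tWAIT has been shortened, and the start signal and edge counter are made up for the demonstration. The counter should keep advancing on every clock edge during the busy period.

```verilog
`timescale 1ns / 1ns
// Hypothetical demonstration module
module ready_delay_demo;
	parameter	tWAIT = 50;	// Shortened for the demo

	reg	clk = 0, start = 0, ready = 1;
	integer	edges_seen = 0;

	always #5 clk = !clk;

	always @(posedge clk)
	begin
		// Runs on every edge, even while ready is low
		edges_seen <= edges_seen + 1;

		if (start && ready)
		begin
			ready <= 1'b0;
			ready <= #tWAIT 1'b1;
		end
	end

	always @(ready)
		$display("%0t: ready=%b edges_seen=%0d",
			$time, ready, edges_seen);

	initial begin
		@(posedge clk) start <= 1;
		@(posedge clk) start <= 0;
		#100 $finish;
	end
endmodule
```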

I’ve now used this approach on high speed IO lines as well, with a lot of success. For example, if the data will be valid tDVH after the clock goes high and remain valid for tDV nanoseconds, then you might write:

	always @(posedge clk)
	if (chip_enable)
	begin
		pre_output_value <= some_expression;

		output_valid <= #tDVH 1'b1;
		output_valid <= #(tDVH+tDV) 1'b0;
	end

	assign	output_value = (output_valid) ? pre_output_value : 1'bz;

I’ve even gone so far in some cases to model the ‘x values in this fashion as well. That way the output is properly ‘x while the voltage is swinging from one value to the next.
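A sketch of what that might look like follows. The parameter names tOH and tCO and their values are assumptions for illustration, not numbers from any data sheet, and the approach assumes tOH is less than tCO and that both are shorter than the clock period.

```verilog
`timescale 1ns / 1ns
// Hypothetical output model with an unknown window between values
module output_x_model #(
		parameter	tOH = 2,	// Output hold time (assumed)
		parameter	tCO = 6		// Clock to valid output (assumed)
	) (
		input	wire		clk, chip_enable,
		input	wire	[7:0]	next_value,
		output	reg	[7:0]	data_out
	);

	always @(posedge clk)
	if (chip_enable)
	begin
		// The old value is only guaranteed for tOH ns ...
		data_out <= #tOH 8'hxx;
		// ... and the new value isn't guaranteed until tCO ns
		data_out <= #tCO next_value;
	end
endmodule
```

Anything that samples data_out inside the unknown window will then pick up an ‘x, which tends to be a lot easier to spot in a trace than a value that just happens to look right.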

No magic numbers! Capture hardware delays in named parameters, specparams, and registers, rather than using numeric assignment delays.

For example, were I modeling a flash memory, I might do something like the following to model an erase:

`timescale	1ns / 1ns
// ...
	parameter real	tERASE = 500_000;	// 500 microseconds

	// Decode the SPI interface.  We start by counting clocks
	//  from the negative edge of CSN
	always @(posedge SCK or posedge CSN)
	if (CSN)
		clock_counts <= 0;
	else if (!clock_counts[5])
		// Count clock ticks
		clock_counts <= clock_counts + 1;
	else
		// Once clock_counts[5], we're past 32.  We
		// can keep counting, but the results will be
		// irrelevant for this example.
		clock_counts[4:0] <= clock_counts[4:0] + 1;

	// With each new clock tick, we capture one more bit
	// from the interface.
	always @(posedge SCK)
		sreg <= (sreg << 1) | MOSI;

	// An erase command takes place after 32 SCK clock edges: the
	// first 8 contain the command, the next 24 contain the address
	// for the command.  Yes, this assumes 24-bit addressing.
	always @(*)
	begin
		erase_command = !CSN && (clock_counts == 31)
				&& (sreg[30:23] == CMD_ERASE);
		erase_address = { sreg[22:11], 12'h0 };
	end

	// We only issue and act on the command once we get to the final
	// SCK clock edge of the command sequence--the 32nd clock edge after
	// CSn activates (lowers)
	always @(posedge SCK)
	if (erase_command && !busy)
	begin
		// Set an internal busy bit.  We'll remain busy for
		// tERASE ns.
		busy <= 1;
		busy <= #tERASE 1'b0;

		// Actually erase the memory in question
		for(k=0; k < BLOCK_SIZE; k=k+1)
			mem[k + erase_address] <= 8'hff;
	end

Notice the use of tERASE rather than some arbitrary erase time buried among the logic. Placing all such device dependent times in one location (at the top of the file) will then make it easier to upgrade this logic for a new and faster device at a later time.

We can also argue about when the actual erase should take place. As long as the user can’t interact with the device while it’s busy, this probably doesn’t make a difference. Alternatively, we could register the erase address and set a time for later when the erase should take place.

	initial	erase_memory_flag = 0;
	always @(posedge SCK)
	if (erase_command && !busy)
	begin
		busy <= 1;
		busy <= #tERASE 1'b0;

		erase_memory_flag <= 1'b1;
		r_erase_address <= erase_address;

		// Render the memory in question unknown
		for(k=0; k < BLOCK_SIZE; k=k+1)
			mem[k + erase_address] <= 8'hx;
	end

	always @(negedge busy)
	if (erase_memory_flag)
	begin
		// Actually erase the memory in question
		for(k=0; k < BLOCK_SIZE; k=k+1)
			mem[k + r_erase_address] <= 8'hff;

		// Clear the flag
		erase_memory_flag <= #tCK 1'b0;
	end

Even this isn’t perfect, however, since we now have a transition taking place on something other than a clock. Given that the interface clock isn’t continuous, this may still be the best option to create a reliable edge.

The rule of 3 applies to hardware as well as software: if you have to write the same logic more than twice, then you are doing something wrong. Refactor it. Encapsulate it. Make a module to describe it, and then reuse that module.

Fig 9. If you have to build it more than twice, refactor it

Remember our example from Fig. 5 above? Fig. 9 shows a better approach to handling three separate device tasks, each with two separate protocols that might be used to implement them.

For protocols that separate themselves nicely between the link layer control (LLC) protocol and a media access control (MAC) layer, this works nicely to rearrange the logic so that each layer only needs to be written once, rather than duplicated within structures implementing both MAC and LLC layers together.

Fig 10. The rule of Gold

Remember: fully verified, well tested, well written logic is pure re-usable gold in this business. Do the job right the first time, and you’ll reap dividends for years to come.

Today’s story

A client recently called me to ask if I could modify an IP I had written so that it would be responsive on an APB slave input with a different clock frequency from the one the rest of the device model used.

The update required inserting an APB cross clock domain bridge into the IP. This wasn’t hard, since I’d built (and formally verified) such a bridge two months prior–I just needed to connect the wires and do a bit of signal renaming for the case when the bridge wasn’t required.

That was the easy part.

But, how shall this new capability be tested? It would need an updated test script and more.

Thankfully, this was also easy.

Because I had built the top level simulation construct using parameters, which could easily be overridden by the test driver, the test suite was easy to update: I just had to set an asynchronous clock parameter, create a new parameter for the clock speed, adjust the clock speed itself, and away I went. Thankfully, I had already (over time) gotten rid of any inappropriate delays, so the update went smoothly.
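The following is only a sketch of how such a parameterized top level might generate its clocks. The parameter names and default values are assumptions for illustration, not the ones from the client’s project.

```verilog
`timescale 1ns / 1ps
// Hypothetical top level test bench clock generation
module tb_clocks #(
		parameter [0:0]	OPT_ASYNC_APB = 1'b0,
		parameter real	CLK_PERIOD    = 10.0,	// ns
		parameter real	PCLK_PERIOD   = 27.0	// ns, if asynchronous
	);

	// With OPT_ASYNC_APB clear, both clocks share the same period
	// and toggle at the same simulation times
	localparam real	PCLK_HALF = OPT_ASYNC_APB
				? (PCLK_PERIOD / 2.0) : (CLK_PERIOD / 2.0);

	reg	clk = 0, pclk = 0;

	always #(CLK_PERIOD / 2.0) clk  = !clk;
	always #(PCLK_HALF)        pclk = !pclk;

	// The DUT and device models would be instantiated here, each
	// receiving OPT_ASYNC_APB along with clk and pclk
endmodule
```

A test driver can then override these parameters per test case, without touching either the model or the DUT.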

Smoothly? Indeed, the whole update took less than half an hour. (This doesn’t include the time it took to originally build and verify a generic APB cross-clock domain bridge.)

… and that’s what you would hope for from well written logic.

Well, okay, it’s not all roses–I still have to go back and update the user guide, update the repository, increment the IP version, update the change log, and then bill for the task. Those tasks will take longer than the actual update, but such is the business we are in.

Conclusion

Let’s face it, this article is a rant. I know it. Perhaps you’ll learn something from it. Perhaps I’ll learn something from any debate that will ensue. (Feel free to comment on Reddit …)

Yes, I charge by the hour. Yes, messes like these will keep me gainfully employed and my family well fed for years to come. However, I’d rather charge for doing the useful work of adding new capabilities to a design rather than fixing up someone else’s mess.