It’s always fun to design something simple every now and then–something that doesn’t take too much thought, yet still fills a needed place in something you are building.

Fig 1: A Signal Delay Element
Block Diagram showing the concept of a delay element: the incoming data line is split, one line gets delayed, both go through flip flops

Today, let’s look at a delay element. This is a fundamental signal processing operation that takes a single stream and creates two streams–with the second stream delayed by some programmable amount of samples from the first one.

This is actually a very common signal processing need. Imagine if you will that you had one piece of processing code that was applied to the input, took many samples (N) to accomplish, and that the result of this processing told you how to lock onto the signal that began N samples ago.

A classic example of this would be a burst preamble–a known sequence that occurs at the beginning of a burst transmission to help you to synchronize to that transmission. However, once synchronized, you then want to go back and process any samples immediately following that preamble. Should you have any delay in your preamble processing chain, then you’d need to go “back in time” to start processing your signal immediately following this preamble. This is one purpose of a delay element.

So, just for fun and to have a change-up from some more serious and complex topics, let’s examine a simple delay element.

Pseudocode

At first blush, the logic for a delay element seems quite simple: just delay the incoming samples by some variable amount. Indeed, you might wish to start putting the algorithm together immediately (I did). You’d start with a delay of zero, and then build the logic for a delay of one.

always @(posedge i_clk)
begin
	if (i_delay == 0)
	begin
		o_word <= i_word;
		o_delayed <= i_word;
	end else if (i_delay == 1)
	begin
		o_word <= i_word;
		o_delayed <= o_word;
	end else begin

Then you’d get stuck.

It’s right here that you need to transition to a block RAM based delay, and so you’ll need a value read from block RAM. We’ll call this value memval.

		o_word <= i_word;
		o_delayed <= memval;
	end
end

Ok, so we’ll need a memory. That means we are going to want to write our data into memory.

always @(posedge i_clk)
	mem[wraddr] <= i_word;

We are also going to want to read it back out.

always @(posedge i_clk)
	memval <= mem[rdaddr];

And in order to make this all work, we’re going to need some memory address manipulation code. Most of this is straight boilerplate.

always @(posedge i_clk)
begin
	if (i_reset)
	begin
		wraddr <= 0;
		rdaddr <= ... // Something ... but what?
	end else if (i_ce)
	begin
		wraddr <= wraddr + 1'b1;
		rdaddr <= rdaddr + 1'b1;
	end
end

The read address, though, is not boilerplate. It needs to be related to the write address. Indeed, this is perhaps the only difficult part of building a signal delay element such as this.

So how should the read address relate to the write address?

The first answer in this case would be that the read address should be less than the write address by i_delay elements. When you then try this code within a test bench, you’ll find that this choice just doesn’t work.

So let’s think this through a touch more.

Scheduling the Memory Pipeline

To get the read and write address correct, let’s examine how our signals would move through this pipeline. We can build a pipeline schedule as we’ve done before on this blog. You can see the schedule for our delay logic shown in Fig 2.

Fig 2: Scheduling the Delay Pipeline
The stages of the delay pipeline

The basic concept of this diagram is that variables valid at one time step lead to new variables valid on the next. So if i_word is valid on one time step, o_word will be valid on the next. Likewise, if we write i_word to memory using the wraddr signal on one time step, then the memory element mem[wraddr] will hold that value on the next time step.

Let’s follow what happens to this memory a touch further. If after writing to memory we immediately read from it into memval, that will require a read address, rdaddr. We can then place this memval into our output delay element, o_delayed and be done.

So how many clocks did that take? Two. Count the difference between when o_delayed was produced and when o_word was produced. This is then our minimum delay when using memory: two clocks.

If you’ve been following this blog, you may remember going through this same exercise when we built a moving average filter.

From here, we can work out how the read address corresponds to the write address. In particular, if rdaddr == wraddr-1, then we are delaying by two. So what we want, then, is to have rdaddr = wraddr+1-i_delay and that’s all the missing logic required to make this work.

Ok, I’ll admit … I didn’t put any time into figuring out how to schedule the pipeline. I just built it wrong, and then adjusted the relationship between wraddr and rdaddr in the test bench until I got things right. That should help illustrate for you, though, the power of building a test bench and simulating–rather than just implementing something and then wondering what went wrong later.

Building this

So let’s build our final delay element!

Much of this logic is the logic you might expect from our discussion above.

For example, we need to increment the write address on every sample.

	initial	wraddr = 0;
	always @(posedge i_clk)
		if (i_ce)
			wraddr <= wraddr + 1'b1;

You may notice that this write address doesn’t depend upon a reset signal. The reason is simply that it doesn’t need to: as long as it increments by one on every accepted sample, it will work no matter what address it starts from.

Likewise we are going to want to write our incoming samples into memory.

	always @(posedge i_clk)
		if (i_ce)
			mem[wraddr] <= i_word;

The difficult trick from above was that we need to make certain that the read address equals the write address plus one minus the delay. Making this happen in clocked logic is a touch more difficult–particularly because of the i_ce pipeline control signal.

To keep the read address a fixed distance from the write address any time the delay changes–we’ll call the delay w_delay here; you’ll see why in a bit–we’ll violate the rules of the global CE bit and set this register on every clock. When CE is high, we set the read address to the write address minus the delay plus two–not one. The extra one compensates for the fact that the write address is also incrementing on this clock. When the CE line is low, the write address isn’t changing, and the logic may appear more intuitive. (The one and two below are nothing more than the constants 1 and 2, sized to match the address width.)

	initial	rdaddr = one;
	always @(posedge i_clk)
		if (i_ce)
			rdaddr <= wraddr + two - w_delay;
		else
			rdaddr <= wraddr + one - w_delay;

Now that we have our read address, we can simply read from memory.

	always @(posedge i_clk)
		if (i_ce)
			memval <= mem[rdaddr];

With all this information, we can now make our delay logic. You might recognize this from before–the delay-of-zero and delay-of-one cases are identical to the pseudocode above.

	always @(posedge i_clk)
	if (i_ce)
	begin
		if (w_delay == 0)
		begin
			o_word <= i_word;
			o_delayed <= i_word;
		end else if (w_delay == 1)
		begin
			o_word <= i_word;
			o_delayed <= o_word;
		end else begin

Even the delay logic, which is implemented using memory, reads just about the same as it did before.

			o_word <= i_word;
			o_delayed <= memval;
		end
	end
endmodule

Pretty simple, right?

Well, okay, so let’s get one touch fancier. Right now this delay element works off of a variable, user-selectable delay. Suppose instead that you wanted this delay element to use a fixed delay instead. You could just feed a constant value to i_delay and allow the optimizer within the synthesizer to handle everything that follows. We’ll take a separate approach here. We’ll capture this desired fixed delay with a FIXED_DELAY parameter, and then use this parameter to determine the delay any time FIXED_DELAY != 0.

Remember that w_delay item I said we’d touch on later? This value is set to i_delay when the parameter isn’t forcing the delay amount, and FIXED_DELAY when it is.

	assign	w_delay = (FIXED_DELAY != 0) ? FIXED_DELAY : i_delay;

That’s a nice improvement to our delay component.

Still, the overall design isn’t all that different from the one we started out with–even with the details filled in.

Building a Test Bench

Since this is a fairly simple component, we can discuss the test bench before we finally conclude–rather than separating the test bench into a separate post. The test bench for this delay element follows from the same principles I laid out earlier, when we examined Verilator. Basically, when you are using Verilator your test bench is a C++ program that interacts with your design, and then compares the responses from the design to known responses that we might expect.

We’ll capture our parameters before starting, since our test will be dependent upon them.

const int	DW = 12, LGDLY = 4, NTESTS = 512;

Setting up the main program itself is fairly boilerplate. You need to make certain you call the commandArgs function to initialize Verilator. We’ll then declare our test class–wrapping it within the TESTB class so that we can get clock ticks, resets, and VCD file generation code for free.

int	main(int argc, char **argv) {
	Verilated::commandArgs(argc, argv);
	TESTB<Vdelayw>	tb;
	unsigned	mask = 0, wptr = 0;
	unsigned	*mem;

Our first task will be to open a VCD trace file so that we can debug any problems later.

	tb.opentrace("delayw.vcd");

Then we’ll reset our core, so that we can start this test in a known state.

	tb.m_core->i_ce    = 0;
	tb.m_core->i_delay = 0;
	tb.reset();

You may recall from our first formal methods post the problem associated with testing a reset in a test bench: that there are more combinations of when a reset can happen with respect to this logic than I have the creativity to imagine. It’s a problem we’re going to ignore here, but a valid one and hence one worth remembering.

We’re going to need our own copy of the delay memory, so that we can also create our own delay here in C++ to compare the unit under test to.

	mem = new unsigned[(1<<LGDLY)];
	mask = (1<<LGDLY)-1;

Let’s run our test across every delay that this delay element may produce. We’ll loop through each possible delay, testing and validating the results along the way.

	tb.m_core->i_ce = 1;
	for(int dly=0; dly<(1<<LGDLY); dly++) {
		tb.m_core->i_delay = dly;

The first step, following any change in the delay value, is to load enough samples into memory to fill the new delay, without yet testing any of the outputs.

		for(int k=0; k<dly+1; k++) {

To do this, we’ll generate a random number,

			unsigned	v = rand() & ((1<<DW)-1);

and then write it to our core.

			tb.m_core->i_word = v;
			tb.tick();

We’ll also record that number into our own memory copy at the same time.

			mem[wptr] = v;
			wptr += 1; wptr &= mask;
		}

Having filled the delay line, we can now come back and test whether or not the output was properly delayed. We’ll check NTESTS (512) samples for each possible delay.

		for(int k=0; k<NTESTS; k++) {

As before, each test consists of creating a random value,

			unsigned	v = rand() & ((1<<DW)-1);

writing that value to the core,

			tb.m_core->i_word = v;
			tb.tick();

and recording a copy of it for ourselves.

			mem[wptr] = v;
			wptr += 1; wptr &= mask;

Now we can check whether or not the output from the core is the value from dly samples ago.

			assert(tb.m_core->o_word == mem[(wptr-1)&mask]);
			assert(tb.m_core->o_delayed == mem[(wptr-1-dly)&mask]);
		}
	}

At this point, the tests are complete and all we need to do is close nicely.

You may notice that, in the closing lines of the test bench, there’s no possibility for failure. The reason is simply because a failure to match will cause a failure above in the assert() statements, and so on any failure we’ll never reach this point.

	printf("\n\nSimulation complete: %ld clocks\n", tb.m_tickcount);
	printf("SUCCESS!!\n");
}

That’s it! We’re all done with our test bench.

If you choose to look through the actual test bench, you will notice one more capability that we haven’t discussed here: a certain amount of fuzzing of the i_ce line. Specifically, I ticked the clock once with i_ce high, and then ticked it some (random) number of additional clocks with i_ce low–just to see if it affected the behavior of the core. (It didn’t.)

All of this put together gives us confidence that this delay element works as designed.

Conclusion

We’ve still got lots of other problems and examples to work through, but it’s always fun to pick a simple one to go over that everyone can understand.

For now, let’s think about what can be done with a delay element. We’ve already discussed one example above: synchronizing to a packet based upon a preamble. That wasn’t my purpose in building this element today, though. My own purpose is to allow me to measure the Power Spectral Density (PSD) in a waveform input—but we’ll leave that discussion for another day.