Handling multiple clocks with Verilator

For some reason, whenever I’ve worked with video, I’ve never been fortunate enough to have the same clock rate for both the pixel clock and the memory. The closest I came was using a 25MHz pixel clock on the Basys3 board which I could create by dividing a 100MHz clock by four in logic. While that probably wasn’t the best way to do it, I did manage to successfully create a 640x480 image on my test display.

Fig 1. A Nexys Video Board

When I moved on to the more serious pixel clock of 148.5 MHz in my VideoZip project using the Nexys Video board, I could no longer manipulate my 100MHz system clock in logic to generate a 148.5MHz pixel clock. Xilinx’s DDR3 memory controller insisted on a clock of 100MHz, so I was stuck needing to deal with two dissimilar clocks.

Up until that project, I had never used more than one clock with Verilator. Many of my designs were based upon just a single clock. How was I going to handle multiple clocks? This turned into one of the biggest challenges I had when developing VideoZip. (VideoZip remains a work in progress.)

The pixel clock on the Nexys Video board isn’t the only problem for VideoZip. The Gb Ethernet port (RGMII) wants to run at 125 MHz, reasoning about 8-bits at a time. If this weren’t bad enough, the I2S audio interface wants an outgoing clock rate near 49.152 MHz. While logical and ugly kludges to this problem exist (which I may yet write about), the appropriate way to deal with this is to use a PLL or digital clock manager to generate these dissimilar clocks.

Fig 2: Multiple Clocking Needs

The unfortunate consequence was that I needed a multiple clock simulation capability. Ouch.

The solution I eventually chose crosses multiple project boundaries, but it is worthwhile enough that I’ll share it here. It involves not only modifying my prior Verilator test bench wrapper, but also a test-bench clock helper class. While the updated test bench wrapper can be created manually, I’ll show you in the end how to use AutoFPGA to tie each piece together into your design.

Reasoning about clocks

If you remember how I use Verilator, you’ll remember that I like to wrap a Verilated design in a test bench class I call TESTB. Among other things, this test bench class has a tick() method that I can call any time I want the clock within my design to tick once. In my AutoFPGA enabled projects, this TESTB class is created via AutoFPGA. The class also has some nice capabilities for opening and closing VCD trace files–but those are not a part of today’s story.

tick() works by:

1. Leaving the clock at zero and dumping the design state to a VCD file (if so enabled).
2. Setting the clock to one, and dumping the design state to a VCD file again.
3. Setting the clock back to zero, and dumping the design state again. This time, though, the VCD trace is flushed to disk.
4. The module is then allowed to read any inputs that may have changed, and adjust any outputs that may need to be changed.
5. Return to step one and repeat until the simulation is done.

This works great for synchronous designs with only one clock. Using this method I can not only test my own design, but also incorporate co-simulation tests: Serial port, I2C, video, you name it, all of that can fit in this context.

The problem is that while this tick() method works well for designs with only one clock, it is entirely insufficient when dealing with multiple clocks. The limitation isn’t in Verilator, which can handle multiple clocks easily–as long as you can properly drive them. Verilator’s interface requires the caller to generate inputs at whatever rate they wish to do so. This was what I needed to do.

Fig 3: TBCLOCK enabled Verilator simulation structure

My first step was to create a class to describe a clock to my test bench. I call this class TBCLOCK, or “test bench clock”. Its purpose is primarily to help me reason about time, and about one specific clock. To understand the next step, let’s first take a moment to understand this class and its methods. We can then look at how TBCLOCK can help us adjust our TESTB with multi-clock aware information.

TBCLOCK

TBCLOCK has four basic methods: time_to_edge, which returns the number of picoseconds to the next clock edge; advance, which advances the clock by some number of picoseconds; and rising_edge, which can be used to tell if the clock is currently on its rising edge. The fourth method, falling_edge, is identical to rising_edge, but for the clock’s falling edge.

Put together, these methods work like this: the TESTB object queries the TBCLOCK objects to determine the amount of time to skip forward to get to the next clock edge. This looks sort of like Fig 4 below.

Fig 4: Time to next clock edge

TBCLOCK compares the current time to when the next edge will take place, and returns that amount of time in picoseconds. (Why picoseconds? It was an arbitrary decision based upon the reality that nanoseconds weren’t precise enough for the application(s) shown above, and femtoseconds were overkill.)
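
To make the picosecond choice concrete, here is a minimal sketch that converts a clock frequency in Hz into a whole number of picoseconds or nanoseconds. It is not part of TBCLOCK: the names period_ps and period_ns are hypothetical, and uint64_t is used only because one second is 10^12 ps, which would overflow a 32-bit integer.

```cpp
#include <cassert>
#include <cstdint>

// Clock period in whole picoseconds for a frequency given in Hz.
// Integer division truncates any fractional picosecond.
// (Hypothetical helper, for illustration only.)
uint64_t period_ps(uint64_t freq_hz) {
	assert(freq_hz > 0);
	return UINT64_C(1000000000000) / freq_hz;
}

// The same period in whole nanoseconds, for comparison
uint64_t period_ns(uint64_t freq_hz) {
	assert(freq_hz > 0);
	return UINT64_C(1000000000) / freq_hz;
}
```

For the 148.5MHz pixel clock, period_ps gives 6734ps, which is within about one part per million of the true period. Whole nanoseconds would truncate the same clock to 6ns, or roughly 166.7MHz.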

The TESTB enhanced logic then advances all of the TBCLOCK objects to the time of this next edge, adjusts the clock input(s) and calls Verilator’s eval() function to update any logic dependent upon that clock.

When viewed across three separate clocks, the result might look like Fig 5.

Fig 5: Multiple clocks

You can see the resulting step sizes as events in the bottom trace in Fig 5. As a result, Verilator doesn’t step forward uniformly by the greatest common divisor of all clock steps, but rather in a non-uniform fashion–so that it is only ever called to evaluate logic following a clock edge.
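
The non-uniform stepping can be sketched without any of the test bench classes. The function below is a hypothetical stand-alone helper (edge_steps is not a name from the project): given each clock's half-period in picoseconds, and assuming every clock starts aligned at time zero, it lists the sizes of the first few steps between consecutive edges of any clock.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Given each clock's half-period in picoseconds, return the first
// `count` step sizes between consecutive edges of any clock.
// Assumes all clocks are aligned at time zero.
std::vector<unsigned long> edge_steps(
		const std::vector<unsigned long> &half_period_ps,
		std::size_t count) {
	std::vector<unsigned long> next_edge(half_period_ps);
	std::vector<unsigned long> steps;
	unsigned long now = 0;

	if (half_period_ps.empty())
		return steps;
	for (unsigned long hp : half_period_ps)
		assert(hp > 0);

	while (steps.size() < count) {
		// Find the earliest pending edge among all clocks
		unsigned long soonest = next_edge[0];
		for (unsigned long t : next_edge)
			if (t < soonest)
				soonest = t;

		steps.push_back(soonest - now);
		now = soonest;

		// Schedule the following edge of every clock that just toggled
		for (std::size_t k = 0; k < next_edge.size(); k++)
			if (next_edge[k] == now)
				next_edge[k] += half_period_ps[k];
	}
	return steps;
}
```

With half-periods of 5000ps (100MHz) and 4000ps (125MHz), the first eight steps are 4000, 1000, 3000, 2000, 2000, 3000, 1000 and 4000ps. That is eight evaluations over 20ns, where stepping uniformly by 1000ps would have required twenty.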

Creating a TBCLOCK is fairly straightforward. Or, rather, it should be. I got it wrong many times over while just trying to get the basics below right. To create an object of this class, just declare one with the number of picoseconds per clock tick.

class	TBCLOCK	{
	// ....
public:
	TBCLOCK(unsigned long increment_ps) {
		// ....
	}

The initialization routine uses increment_ps to create an internal stepping interval m_increment_ps which is half of the original increment_ps. This allows the TBCLOCK object to reason about both positive and negative edge going clocks.

The next capability the test bench clock offers is the ability to return the number of picoseconds until the next clock edge. This was what Fig 4 was showing above. We’ll use this in the next section in our inner clock loop. The next clock edge will come m_increment_ps picoseconds after the last clock edge. If you subtract the current time from this future time, you’ll get the number of picoseconds remaining until the next clock edge.

	unsigned long	time_to_edge(void) {
		if (m_last_posedge_ps + m_increment_ps > m_now_ps)
			// Next edge is a negative edge
			return m_last_posedge_ps + m_increment_ps - m_now_ps;
		else // if (m_last_posedge_ps + 2*m_increment_ps > m_now_ps)
			// Next edge is a positive edge
			return m_last_posedge_ps + 2*m_increment_ps - m_now_ps;
	}

Once the clock generator has been queried for the time to the next edge, the test-bench driver can then determine which clock edge comes next. From here, each clock can be advanced until that next edge. That’s the purpose of the advance() function: given a step size (in ps), advance the global clock time maintained within this test bench support clock.

Well, not quite. advance() has one other purpose. It also returns the value of the clock, either 1 or 0, at this new time instant.

	int	advance(unsigned long itime) {

		m_now_ps += itime;

		if (m_now_ps >= m_last_posedge_ps + 2*m_increment_ps) {
			// Advance to the next positive edge, and return
			// a positive valued clock
			m_last_posedge_ps += 2*m_increment_ps;
			m_ticks++;
			return 1;
		} else if (m_now_ps >= m_last_posedge_ps + m_increment_ps) {
			// Negative half of the clock's duty cycle
			return 0;
		} else
			// Positive half of the clock's duty cycle
			return 1;
	}

In the next section, we’ll use the result of advance() to set the clock input value to the main Verilog test bench function.

There are two other helper functions to determine if the current time is a rising or a falling edge, but that’s the basics of the first part.

	// Return true if this is a rising clock edge
	bool	rising_edge(void) {
		if (m_now_ps == m_last_posedge_ps)
			return true;
		return false;
	}

	// Return true if this is a falling clock edge
	bool	falling_edge(void) {
		if (m_now_ps == m_last_posedge_ps + m_increment_ps)
			return true;
		return false;
	}

The primary work in this class is done within the time_to_edge method. We’ll see how this helps in the next section.
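
For reference, the pieces above can be assembled into one class that compiles by itself. This is only a sketch: the methods follow the snippets shown above, but the constructor body and the ticks() accessor are my assumptions, since the original elides them. In particular, this version assumes the clock starts at time zero as though a rising edge had just occurred, which may not match the real TBCLOCK.

```cpp
#include <cassert>

class TBCLOCK {
	unsigned long	m_increment_ps, m_now_ps, m_last_posedge_ps, m_ticks;
public:
	// ASSUMED constructor: keep half of the period as the stepping
	// interval, and start at t=0 on a rising edge
	TBCLOCK(unsigned long increment_ps)
		: m_increment_ps(increment_ps / 2), m_now_ps(0),
		  m_last_posedge_ps(0), m_ticks(0) {
		assert(m_increment_ps > 0);
	}

	// Picoseconds from now until the next edge of either polarity
	unsigned long	time_to_edge(void) const {
		if (m_last_posedge_ps + m_increment_ps > m_now_ps)
			return m_last_posedge_ps + m_increment_ps - m_now_ps;
		return m_last_posedge_ps + 2*m_increment_ps - m_now_ps;
	}

	// Move time forward by itime ps, returning the new clock level
	int	advance(unsigned long itime) {
		m_now_ps += itime;
		if (m_now_ps >= m_last_posedge_ps + 2*m_increment_ps) {
			m_last_posedge_ps += 2*m_increment_ps;
			m_ticks++;
			return 1;
		} else if (m_now_ps >= m_last_posedge_ps + m_increment_ps)
			return 0;
		return 1;
	}

	bool	rising_edge(void) const {
		return m_now_ps == m_last_posedge_ps;
	}

	bool	falling_edge(void) const {
		return m_now_ps == m_last_posedge_ps + m_increment_ps;
	}

	// Hypothetical accessor: rising edges counted so far
	unsigned long	ticks(void) const { return m_ticks; }
};
```

Given a 10000ps (100MHz) clock, the first call to time_to_edge() returns 5000. Advancing by that amount returns a clock value of 0 and leaves falling_edge() true, and advancing by another 5000ps returns 1 with rising_edge() true.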

Updating the inner testbench class, TESTB

The TBCLOCK class we discussed above is only a helper in the scheme of things. Most of the actual logic takes place within the updated tick() function found within the test bench object, TESTB, used to drive the Verilator inputs.

As you may recall, I started creating a test bench class wrapper once I noticed that I kept using the same code for every Verilator based test bench. The code to open a trace file was the same. The code to capture data to that trace file was the same. The code to toggle the clock was the same. I found myself copying these pieces of code from one simulation wrapper to another. Rather than just duplicate the same code, I created the test bench wrapper class, TESTB.

One of the primary functions of the test bench wrapper object is to advance the clock. Verilator requires that the clock toggle from low to high in order to call the positive edge logic within your design. The clock needs to then return low, and all of these transitions require calls to the Verilator tracing methods if you want a VCD file when you are done.

I found this cumbersome, so I wrapped all of that logic with a tick() method. This is the same tick() method I discussed above. The tick() method of TESTB would capture inputs to the core in a trace,

	virtual	void	tick(void) {
		m_tickcount++;

		// Step one--don't skip this one!
		// This step is necessary to make certain any combinatorial
		// logic settles prior to the positive edge of the clock, and
		// following any adjustments to design's inputs
		//
		// m_core->i_clk = 0; // (This is implied)
		eval();
		if (m_trace)
			m_trace->dump(10*m_tickcount-2);

toggle the clock high,

		// Step two
		m_core->i_clk = 1;
		eval();

capture the results in a trace,

		if (m_trace)
			m_trace->dump(10*m_tickcount);

then toggle the clock low

		// Step three
		m_core->i_clk = 0;
		eval();

and capture the results in the trace again–this time flushing the trace file. (Flushing is important–I’ve had too many designs fail some C-assertion in their associated logic, and without the flush you may not get the state of your variables at that last clock.)

		if (m_trace) {
			m_trace->dump(10*m_tickcount+5);
			m_trace->flush();
		}
	}

Before moving on, let me foot-stomp here that all three calls to eval() are essential! While it may look like the last step and the first step are identical since they both leave the clock at zero, they are not the same. Between these two steps, co-simulation logic might change inputs to the design. Unless you call eval() following any co-simulation updates to design inputs, combinational logic depending upon these inputs may not settle. This is a painful bug to search for, so I recommend you learn the lesson here.

In this single clock paradigm outlined above, I could read any outputs and adjust any inputs after calling this one tick() method. I could also call the C assert function if something had gone wrong–the flush() command above guaranteed that the relevant portion of the trace was in the file. This approach was simple enough, and I’ve used this pattern for many of my designs. (You can read more about it here.)

Sadly, this initial approach didn’t work when dealing with multiple clocks.

Instead, let’s walk through how this tick() method can be updated to deal with multiple clocks. In the example below, drawn from the VideoZip project, I have four clocks: hdmi_out, hdmi_in, net_rx_clk, and my default clk.

The first step when calling tick() is to check the number of picoseconds till the next clock edge. This is the minimum time to the next edge among all clocks.

	virtual void	tick(void) {
		// m_clk describes the system clock
		unsigned long	mintime = m_clk.time_to_edge();

		// m_hdmi_out_clk describes the HDMI output clock
		// This is at 148.5MHz for this design
		if (m_hdmi_out_clk.time_to_edge() < mintime)
			mintime = m_hdmi_out_clk.time_to_edge();

		// m_hdmi_in_clk describes the HDMI input clock
		// This is identical to the HDMI output clock in this design
		if (m_hdmi_in_clk.time_to_edge() < mintime)
			mintime = m_hdmi_in_clk.time_to_edge();

		// m_net_rx_clk describes the 125MHz ethernet RGMII interface
		// clock
		if (m_net_rx_clk.time_to_edge() < mintime)
			mintime = m_net_rx_clk.time_to_edge();

Once we know this amount of time, we’ll call eval() once out of an abundance of caution. This makes sure, before any clock edges change, that all of the combinational logic associated with any potentially changed input wires has settled.

		eval();

Once done, each of the various clock objects may be advanced by this amount of time, and our global estimate of the current time can advance as well.

		m_core->i_hdmi_out_clk = m_hdmi_out_clk.advance(mintime);
		m_core->i_hdmi_in_clk = m_hdmi_in_clk.advance(mintime);
		m_core->i_clk = m_clk.advance(mintime);
		m_core->i_net_rx_clk = m_net_rx_clk.advance(mintime);

		m_time_ps += mintime;

Finally, using these new clock values, we can call Verilator to evaluate our design in this new interval–adjusting any edge triggered logic.

		eval();

If we are recording a trace at this time, we’ll then call Verilator to dump the current state of the design to a trace file.

		if (m_trace) {
			m_trace->dump(m_time_ps);
			m_trace->flush();
		}

Don’t forget to flush it! There’s been more than one time when I’ve checked the outputs of a core after ticking the clock, decided there was a problem and aborted, only to find the relevant signals hadn’t ended up in the trace file.

Finally, we’ll call any external simulation logic depending on clock edges. In my single clock designs, I do this about mid-way through the low period of the clock, so you can “see” the transformation. I also did it between calls to tick(). This doesn’t work with multiple-clocks, since peripherals are often defined by the clock the logic is associated with. For this reason, we’ll have to call separate functions for each clock to allow these co-simulations to update. We’ll do this on the falling edges of their respective clocks. This includes possibly updating the video simulation, checking for simulated network packets, and more.

		if (m_hdmi_out_clk.falling_edge()) {
			m_changed = true;
			sim_hdmi_out_clk_tick();
		}
		if (m_hdmi_in_clk.falling_edge()) {
			m_changed = true;
			sim_hdmi_in_clk_tick();
		}
		if (m_net_rx_clk.falling_edge()) {
			m_changed = true;
			sim_net_rx_clk_tick();
		}
		if (m_clk.falling_edge()) {
			m_changed = true;
			sim_clk_tick();
		}
	}

For example, in my spectrogram demo project, the sim_clk_tick() function advances the A/D simulation and so updates i_adc_miso, and the sim_pixclk_tick() advances the simulated video on the screen using the outgoing pixel, and the various outgoing synch signals. (Ref)
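
The tick() above names each of its four clocks explicitly. The same sequence (find the minimum time to an edge, eval(), advance every clock, eval() again, then run co-simulation on falling edges) can also be written as a loop over a list of clocks. The sketch below is one way that might look, and it is not how TESTB is written: MINICLOCK is a compact stand-in for TBCLOCK so the example compiles by itself, and CLOCKED and tick_all are hypothetical names. The clock wires are assumed to be 8-bit values, as Verilator uses for single-bit ports.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Stand-in for TBCLOCK: tracks time since the last rising edge
struct MINICLOCK {
	unsigned long	half_ps, since_posedge_ps;

	unsigned long	time_to_edge(void) const {
		return (since_posedge_ps < half_ps)
			? half_ps - since_posedge_ps
			: 2*half_ps - since_posedge_ps;
	}
	int	advance(unsigned long dt) {
		since_posedge_ps = (since_posedge_ps + dt) % (2*half_ps);
		return since_posedge_ps < half_ps;
	}
	bool	falling_edge(void) const {
		return since_posedge_ps == half_ps;
	}
};

// Hypothetical pairing of a clock with its design input and co-sim hook
struct CLOCKED {
	MINICLOCK		*clock;
	unsigned char		*wire;		// e.g. &m_core->i_clk
	std::function<void()>	sim_tick;	// called on falling edges
};

// Step every clock to the next edge of any clock; returns the step in ps
unsigned long tick_all(std::vector<CLOCKED> &clocks,
		const std::function<void()> &eval) {
	assert(!clocks.empty());

	unsigned long mintime = clocks[0].clock->time_to_edge();
	for (const CLOCKED &c : clocks) {
		unsigned long t = c.clock->time_to_edge();
		if (t < mintime)
			mintime = t;
	}

	eval();		// Let combinational logic settle first
	for (CLOCKED &c : clocks)
		*c.wire = c.clock->advance(mintime);
	eval();		// Now evaluate any edge triggered logic

	for (CLOCKED &c : clocks)
		if (c.clock->falling_edge() && c.sim_tick)
			c.sim_tick();
	return mintime;
}
```

In a real test bench, eval would wrap the design's eval() call, and the trace dump and flush would follow the second eval() just as in the tick() above.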

The conclusion here is that if you want to use this technique, you’ll want to copy the TBCLOCK class (or build your own), and then create a test bench wrapper that references your TBCLOCK objects and gets all the pieces right.

Alternatively, you could use AutoFPGA to handle all of this busy work for you.

Using AutoFPGA to build the testbench

If you are not familiar with AutoFPGA, then in quick sum: it is a Verilog-based code generator based upon a copy and paste concept with minimal substitution capability. You specify the code snippets associated with each design component or peripheral in an AutoFPGA configuration file, and then when you call AutoFPGA specifying that configuration file (among many others), AutoFPGA will create your top level (device dependent) design, your main design (device independent) file, and several other bus related files associated with the peripherals you are making or using.

If you are interested in this, consider reading about AutoFPGA’s design goals, or the primer on how to connect simple register-based components to a debugging bus using AutoFPGA.

The neat thing about using AutoFPGA for a purpose like this one, is that when you no longer need the extra clock or the logic that uses it, you can just remove the reference to the configuration file describing those components of your design from the AutoFPGA command line. If you want to see how this works, consider examining a project that uses AutoFPGA, and then looking in the AutoFPGA configuration file directory for the Makefile. In there, you’ll find some lines similar to:

DATA := global.txt bkram.txt buserr.txt clock.txt                       \
	dlyarbiter.txt flash.txt rtclight.txt   rtcdate.txt             \
	pic.txt pwrcount.txt                                            \
	version.txt busconsole.txt zipmaster.txt sdspi.txt

AUTOFPGA := ../../../autofpga/trunk/sw/autofpga

.PHONY: data
data: $(AUTOFPGA) $(DATA)
	$(AUTOFPGA) -o . $(DATA)

This captures, in the $(DATA) variable, a list of configuration files that are given to AutoFPGA.

Then in the main project Makefile the created code files will be copied to their various parts of the project tree if running AutoFPGA had changed them–but not otherwise. As an example from Zbasic, these Makefile lines would look like:

.PHONY: autodata
autodata: check-autofpga
	$(MAKE) --no-print-directory --directory=auto-data
	$(call copyif-changed,auto-data/toplevel.v,rtl/toplevel.v)
	$(call copyif-changed,auto-data/main.v,rtl/main.v)
	$(call copyif-changed,auto-data/regdefs.h,sw/host/regdefs.h)
	$(call copyif-changed,auto-data/regdefs.cpp,sw/host/regdefs.cpp)
	$(call copyif-changed,auto-data/board.h,sw/zlib/board.h)
	$(call copyif-changed,auto-data/board.ld,sw/board/board.ld)
	$(call copyif-changed,auto-data/rtl.make.inc,rtl/make.inc)
	$(call copyif-changed,auto-data/testb.h,sim/verilated/testb.h)
	$(call copyif-changed,auto-data/main_tb.cpp,sim/verilated/main_tb.cpp)

and a little later, you’ll see the definition of this copyif-changed function.

define  copyif-changed
	@bash -c 'cmp $(1) $(2); if [[ $$? != 0 ]]; then echo "Copying $(1) to $(2)"; cp $(1) $(2); fi'
endef

Basically, if files $(1) and $(2) differ, then $(1) is copied on top of $(2). This keeps make from rebuilding things that depend upon files that haven’t changed.

But that’s not my point here and now.

What I want to share right now is how easy it is to teach AutoFPGA about your multiple clocks.

First, you’ll want to define each of your clocks. A clock, in terms of AutoFPGA, has three components: a name, the name of the wire that contains this clock, and the frequency of the clock in Hz. For example, you might have a clock clk contained in the wire i_clk, that runs at 100MHz. You’d then define this as:

CLOCK.NAME= clk
CLOCK.WIRE= i_clk
CLOCK.FREQUENCY= 100000000

This alone is all that is needed to create the clock in the AutoFPGA generated TESTB file.

What about simulating a component requiring this clock?

Let’s consider simulating a video display. You can find a video display simulator here. Let’s assume your design has outputs o_vga_vsync, o_vga_hsync, o_vga_red, o_vga_grn, and o_vga_blu–such as this one does. Then, you’d want to declare a VGA simulator in your Verilog design component,

SIM.DEFNS=
	VGASIM	*m_vga;

You’d then want to initialize this component. Here, we’ll set it up for an 800x600 display mode.

SIM.INIT=
		m_vga = new VGASIM(800,600);

We can then call this co-simulation component on every clock tick, with,

SIM.TICK=
		(*m_vga)(m_core->o_vga_vsync, m_core->o_vga_hsync,
			m_core->o_vga_red, m_core->o_vga_grn,
			m_core->o_vga_blu);

Don’t forget to define the clock! For an 800x600 display mode, you’ll need a 40MHz clock.

CLOCK.NAME= pixclk
CLOCK.WIRE= i_pixclk
CLOCK.FREQUENCY= 40000000

Ideally, you could just add this updated configuration file to your design to add this component, or remove it from your design to remove the component. At this point, this would work for a Verilator simulation. If you wanted to go beyond simulation, you’d need to actually add and configure the PLL in the toplevel design component. You’d use the TOP.INSERT AutoFPGA tag for that purpose. AutoFPGA would then copy the contents of that tag into your toplevel.v design file. No, AutoFPGA doesn’t configure the PLL itself (yet)–you still have to give it the code for that (with the TOP.INSERT). Still, AutoFPGA will put that code in place for you, making reconfiguration simpler.

Conclusion

Perhaps that seems like a lot of work. It’s not really. We’re primarily talking about 20-40 lines of code in total. It’s just a different way of thinking. The only sad and complicated part is that all of these lines of code take place over many design files. Having AutoFPGA manage this for me has helped to keep all of the changes to support multiple clocks within one or two files only.

In the end, we now have a Verilator-based design that runs using multiple clocks. Not only that, you can generate a VCD file showing all of these various clocks and their respective traces.

While this capability does not (yet) allow the generation of multiple clocks with a known phase relationship, such as one might use with an ISERDES or an OSERDES, upgrading the tools to do so would be fairly trivial. I’m sure I’ll get around to that when I have a need for it.

Perhaps some of you are wondering to yourselves, “Verilog offers a capability to generate multiple clocks already. Why aren’t you using Verilog’s test bench capability to do this?”

My answer to that is simple: I know how to interface a C++ module with my computer’s Windowing system using GTKMM. I don’t know the Verilog system call to do that.

What can you use this for? I’ve already mentioned video, Ethernet, and audio applications. There’s no reason why you can’t use this for custom applications as well. For example, I’m still looking forward to completing the differential pmod challenge … but that’s really another topic for another day.