Breaking all the rules to create an arbitrary clock signal

Have you ever needed a clock speed that wasn’t easy to generate? What if you wanted to build and run your own PLL? Within an FPGA??

As an example, many of my favorite FPGA boards have 100MHz clocks coming into them. What if you wanted to output an audio signal via I2S at 48kHz? 48kHz is a common audio sample rate associated with broadcast audio, sort of like 44.1kHz is associated with CD audio. I have an ADAU1761 24-bit Audio Codec available to me on my Nexys Video board, so I should be able to use it to generate quality sound. This chip, however, requires an incoming clock signal of 49.152MHz in order to produce samples at 48kHz. Any suggestions on how you might multiply a 100MHz signal up to somewhere between 800 and 1600 MHz, and then divide it down in order to get 49.152MHz?

There’s no way to do it.

Worse, what if you had to create a clock that tracked the audio sample rate of an incoming signal–but only when it was present?

As another example, I can use a PMod GPS to measure the clock rate of the board I’m using. It should be possible to use this signal to create a true 49.152MHz clock–true enough to be used as an audio frequency standard.

That’s audio, but what about video? Another common example of when you might need to create a clock at an arbitrary frequency would be when trying to generate a pixel clock for video. With modern monitors, the video driver is expected to query what video modes the monitor is capable of accepting via an EDID protocol transaction (a form of I2C bus). The video driver is then expected to generate a pixel clock based upon what the monitor is capable of receiving. Without external clock generation hardware, how can you create an arbitrary pixel clock?

Yes, I know you can often get away with being “close enough” in many of these examples. For example, if you wanted to feed a monitor wanting a 25.172MHz pixel clock you might still be able to drive it with a 25MHz pixel clock instead. (I’ve done it.) But what about 88.75MHz? How might you generate that signal?

Given that there’s a reason to need something like this, let’s discuss today how you might generate a clock at an arbitrary frequency when using an FPGA.

Breaking all the rules

In order to generate a clock at an arbitrary rate, we’re going to need to break all of the rules. Specifically, I wrote in my rules for new FPGA designers: never use a logic generated clock.

Build your design with only one clock.

Do not transition on the positive (rising) edge of anything other than your system clock.

I discourage anyone from using a logic generated clock for a few basic reasons:

1. Logic generated clocks tend not to be placed onto the system clock backbone.

   This will cause significant skew in your clock from one end of the chip to another. This skew can easily be bad enough to make your design fail for seemingly inexplicable reasons.

2. Beginners tend not to realize that you still need to use a proper clock domain crossing between the clock domain that generated the clock and the generated clock domain.

3. Most FPGA tool chains don’t know how to handle logic clocks, assuming that they are recognized at all.

   This leads to logic that isn’t properly constrained to guarantee operation at the clock rate of interest.

4. If your clock isn’t generated via a flip-flop, there might be glitches on it.

My rule has always been: Clocks should only ever be created or adjusted within an FPGA using a hardware device clock management resource, such as a PLL.

Today, we’re going to break this rule.

We also want to break it “safely”, so that this step won’t keep our logic from acting “normally”.

To do this, I built and experimented with the architecture shown in Fig. 1 below.

Fig 1. Arbitrary clock generation, hardware setup

Let’s walk through the steps of how this might work.

The first step is to generate a new clock. I used a basic fractional clock divider for this purpose.

	always @(posedge i_clk)
		counter <= counter + i_delay;

	assign	o_clk = counter[MSB];

Well, not quite, but that’s the basic idea initially. We’ll come back and improve upon this in a moment.
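For reference, the fragment above can be wrapped into a complete module. This is only a minimal sketch: the module name fracdiv, the BW parameter, and the initial statement are assumptions added for illustration, and this is not the final design.

```verilog
// Minimal sketch of a basic fractional divider (names are illustrative)
module fracdiv #(
		parameter	BW = 32	// Counter (phase accumulator) width
	) (
		input	wire			i_clk,
		input	wire	[(BW-1):0]	i_delay, // Phase step per clock
		output	wire			o_clk
	);

	reg	[(BW-1):0]	counter;

	// Give simulation a known starting phase
	initial	counter = 0;
	always @(posedge i_clk)
		counter <= counter + i_delay;

	// The top bit toggles at roughly f(i_clk) * i_delay / 2^BW
	assign	o_clk = counter[BW-1];
endmodule
```

The only tuning knob is i_delay: a larger step wraps the counter more often, and so produces a faster o_clk.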

There are several problems with using a basic fractional clock divider such as this one. One of the worst problems is the phase noise. Imagine you wanted to divide your clock by three. You would add to your counter some number on every clock tick in an effort to get a divide by three. If your counter was N bits wide, then clearly after 2^N clock ticks, this pseudo clock generator would have wrapped some integer number of times, and that integer is just the step we add on each tick. If we make this step close to 2^N/3, we can get close to a division by three.

Perhaps a picture would help. Suppose we used a 4 bit counter and added a delay value of 5 to it on every tick–in order to get close to 1/3–while also picking off the top bit for our new clock. If you plotted this out, you might see a trace similar to Fig. 2 below.

Fig 2. A Fractionally Generated Clock Signal

If you ignore the fact, for a moment, that this may be about the ugliest clock signal you’ve ever seen, you’ll notice that this clock signal completes five cycles in every 16 system clock periods, which is a rough divide by three.
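If you’d like to check this example in simulation, a small test bench can count the rising edges of the top bit over 16 ticks. This is a sketch only: the test bench name and its signals are made up for this illustration, and the 10ns clock period is arbitrary.

```verilog
// Test bench sketch: a 4-bit counter stepping by 5, as in the example above
module fracdiv_tb;
	reg		clk;
	reg	[3:0]	counter;
	reg		last_msb;
	integer		ticks, rising;

	initial begin
		clk = 0;
		counter = 0;
		last_msb = 0;
		ticks = 0;
		rising = 0;
	end

	always #5 clk = !clk;

	always @(posedge clk)
	begin
		counter  <= counter + 4'd5;
		last_msb <= counter[3];

		// A rising edge: the top bit is now high, but wasn't before
		if (counter[3] && !last_msb)
			rising <= rising + 1;

		ticks <= ticks + 1;
		if (ticks == 16)
		begin
			$display("%0d rising edges in 16 ticks", rising);
			$finish;
		end
	end
endmodule
```

Stepping through the counter values by hand (0, 5, 10, 15, 4, 9, ...), the top bit rises five times before the sequence repeats, so this should report five rising edges.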

If we add more bits to our counter and step, we’ll be able to represent more frequencies. For example, if we had used a 32-bit counter, we might step by 32'h55555555. No, that’s still not quite 1/3rd, but it’s much closer than we were before.

While that’s better, the clock still looks awful–it just looks awful about a frequency closer to the one we want.
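Picking the step for some other frequency is just a matter of scaling: step = round(2^32 * f_out / f_clk) for a single 32-bit counter. The sketch below works this out at elaboration time for the 100MHz to 49.152MHz example from the introduction. The module and parameter names are hypothetical, and the two frequencies are just example values.

```verilog
// Sketch: compute a 32-bit step for the single counter divider
module stepcalc;
	// Assumed example values
	localparam [63:0]	CLOCK_HZ  = 64'd100_000_000;
	localparam [63:0]	TARGET_HZ = 64'd49_152_000;

	// step = round(2^32 * TARGET_HZ / CLOCK_HZ), using 64-bit arithmetic
	// The result is truncated to 32 bits on assignment
	localparam [31:0]	STEP
			= ((TARGET_HZ << 32) + (CLOCK_HZ >> 1)) / CLOCK_HZ;

	initial	$display("step = 32'h%h (%0d)", STEP, STEP);
endmodule
```

By my arithmetic, this step works out to 2111062325. For the 8:1 divider discussed later, where the counter advances by eight steps per system clock, my reading is that CLOCK_HZ would need to be the serializer’s sample rate (eight times the system clock) instead.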

We need a way to clean this up.

Enter the reason for using an OSERDES in Fig. 1. By using an OSERDES, we can get closer to the clean clock we wanted. For example, the same clock from Fig. 2 above, now upsampled by a factor of 8, would produce a waveform looking closer to Fig. 3 below.

Fig 3. Upsampling the fractionally generated clock

That is starting to look like a clock signal.

Of course, we’ll still have jitter on even this upsampled clock signal. While our upsampling helped, it could only do so much. We’ll always have a signal that’s going to be within a “sample” of the right value. This rounding to the nearest sample will always create phase noise on our clock. Fixing this is the purpose of the PLL in Fig. 1 above.

That leaves only a couple of key details remaining.

For example, why did we leave the FPGA on a clock capable pin only to immediately come back in again? I did this for a couple of reasons. First, we needed the OSERDES, that 8:1 serializer, in order to create the cleaner clock signal. OSERDES components are only found connected to the I/O pins going directly off-chip. Second, many FPGAs require that you enter the chip from a clock-capable pin in order to get into the clock infrastructure within the FPGA. Doing otherwise will result in a design error on many architectures.

The last question is, what clock rate do we tell the tools this input has since we can vary it as often as we want? For this, we’ll use the maximum clock rate it can have. I’ll leave the decision of what this maximum rate is to you. I’ve used 200MHz, 100MHz, and 50MHz successfully.

Finally, let’s return to our reasons never to use a logic generated clock. Have we dealt with all of the reasons “why not” so that we now can?

1. Logic generated clocks tend not to be placed onto the system clock backbone

   By starting with a clock capable pin, we go directly into the clock infrastructure on the chip.

2. Beginners tend not to realize that you still need to use a proper clock domain crossing between the clock domain that generated the clock and the generated clock domain.

   We’ll be smart and use proper clock domain crossing techniques, right?

3. Most FPGA tool chains don’t know how to handle logic generated clocks

   In this case, the tools will treat this as an externally generated clock, at the maximum rate it can produce, so we’re good here too.

4. If your clock isn’t generated via a flip-flop, there might be glitches on it

   We’ve solved this by generating our clock using (several) flip-flops, and the OSERDES helps as well.
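As a reminder of what a proper clock domain crossing can look like in its simplest form, here is a sketch of a two flip-flop synchronizer for bringing a single-bit, slowly changing level into the generated clock domain. The module and port names are assumptions for illustration only. Multi-bit values need something more, such as a handshake or an asynchronous FIFO.

```verilog
// Sketch: two flip-flop synchronizer for a single-bit level signal
module sync2ff(
		input	wire	i_dst_clk,	// Destination (generated) clock
		input	wire	i_async,	// Level from the other clock domain
		output	reg	o_sync		// Safe to use in the i_dst_clk domain
	);

	reg	pipe;

	initial	{ o_sync, pipe } = 2'b00;
	always @(posedge i_dst_clk)
		{ o_sync, pipe } <= { pipe, i_async };
endmodule
```

The first flip-flop may go metastable when i_async changes too close to the clock edge. The second gives it a full clock period to settle before anything else sees the value.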

Really, the only issue left is whether or not this clock will be “clean enough”. For that, we’ll need to build it and test it.

Building the 8:1 Fractional Divider

Building this divider is really easy. It’s basically,

	localparam	UPSAMPLE = 8;

	reg	[BW-1:0]	counter [0:(UPSAMPLE-1)];

	always @(posedge i_clk)
	begin
		counter[1] <= counter[0] +     i_delay;
		counter[2] <= counter[0] + 2 * i_delay;
		counter[3] <= counter[0] + 3 * i_delay;
		counter[4] <= counter[0] + 4 * i_delay;
		counter[5] <= counter[0] + 5 * i_delay;
		counter[6] <= counter[0] + 6 * i_delay;
		counter[7] <= counter[0] + 7 * i_delay;
		counter[0] <= counter[0] + 8 * i_delay;
	end

Notice that each of these counters is created by adding an offset to a single counter, counter[0]. That keeps them all synchronized with each other.

Finally, the MSB from each counter is used as the outgoing clock signal.

	always @(posedge i_clk)
		o_word <= {
			counter[1][BW-1], counter[2][BW-1],
			counter[3][BW-1], counter[4][BW-1],
			counter[5][BW-1], counter[6][BW-1],
			counter[7][BW-1], counter[0][BW-1]
		};

There’s really not much more to it than that. We’ll still do a walk through the actual code below.

module	genclk(i_clk, i_delay, o_word, o_stb);
	parameter	BW=32;		// The bus width
	localparam	UPSAMPLE=8;	// Upsample factor
	input	wire				i_clk;
	input	wire	[(BW-1):0]		i_delay;
	output	reg	[(UPSAMPLE-1):0]	o_word;

One of the things we haven’t discussed is how you might synchronize this clock with operations carried out on the current clock. For this, I’ve envisioned using a strobe signal, o_stb shown below.

	output	reg				o_stb;

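	reg	[(BW-1):0]	counter [0:(UPSAMPLE-1)];
	// Registered copies of i_delay and its multiples, used further below
	reg	[(BW-1):0]	r_delay, times_three, times_five, times_seven;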

This is how I would normally handle generating an internal signal at a different rate–something I could use without ever needing a clock domain crossing.

While there will be a fairly uncontrolled delay between o_stb and the outgoing clock, o_stb will at least maintain the proper clock to clock relationship.

Coming back to the basic implementation above, perhaps you’ve noticed that the big problem with this implementation is the requirement for the multipliers. Let’s see if we can get rid of them. Multiplication by 1, 2, 4, and 8 is easy–they can be accomplished with a simple left shift. What about multiplication by 3, 5, or 7? Those will be harder. Six is easy, though, if we can already multiply by three.

In this case, we’ll cheat since all of these values can be created with some creative addition–sparing us the multiply.

Multiplying i_delay times three, for example, is just a matter of adding it to itself times two.

	always @(posedge i_clk)
		times_three <= { i_delay[(BW-2):0], 1'b0 } + i_delay;

Multiplying i_delay times five is the same as adding it to a copy of itself times four.

	always @(posedge i_clk)
		times_five  <= { i_delay[(BW-3):0], 2'b0 } + i_delay;

For seven, we can subtract i_delay from i_delay times eight.

	always @(posedge i_clk)
		times_seven <= { i_delay[(BW-4):0], 3'b0 } - i_delay;

Of course, it will take a clock in order for these values to be valid. To keep things consistent, let’s also delay i_delay by one clock tick as well.

	always @(posedge i_clk)
		r_delay <= i_delay;

The rest is just bookkeeping.

	always @(posedge i_clk)	// Times one
		counter[1] <= counter[0] + r_delay;

	always @(posedge i_clk)	// Times two
		counter[2] <= counter[0] + { r_delay[(BW-2):0], 1'b0 };

	always @(posedge i_clk) // Times three
		counter[3] <= counter[0] + times_three;

	always @(posedge i_clk) // Times four
		counter[4] <= counter[0] + { r_delay[(BW-3):0], 2'b0 };

	always @(posedge i_clk) // Times five
		counter[5]  <= counter[0] + times_five;

	always @(posedge i_clk) // Times six
		counter[6]  <= counter[0] + { times_three[(BW-2):0], 1'b0 };

	always @(posedge i_clk) // Times seven
		counter[7] <= counter[0] + times_seven;

	always @(posedge i_clk) // Times eight---and generating the next clk wrd
		{ o_stb, counter[0] }  <= counter[0] + { r_delay[(BW-4):0], 3'h0 };

Our final result is just the collection of all of the most-significant bits.

	always @(posedge i_clk)
		o_word <= {	// High order bit is "first"
			counter[1][(BW-1)],	// First bit
			counter[2][(BW-1)],
			counter[3][(BW-1)],
			counter[4][(BW-1)],
			counter[5][(BW-1)],
			counter[6][(BW-1)],
			counter[7][(BW-1)],
			counter[0][(BW-1)]	// Last bit in order
		};
endmodule

How wide should BW be? I’ve chosen to make it 32-bits wide. Why? Well, because my Wishbone bus implementation is 32-bits wide. It’s kind of an arbitrary choice. You’ll get more frequency accuracy (relative to the system clock) the more bits you have, although I tend to think 32 bits is enough. With a 32-bit counter, you can generate an arbitrary clock with frequency control in steps of SAMPLE_RATE / 2^32, where the sample rate is the rate at which i_delay gets accumulated. Since the 8:1 design above advances by eight steps per system clock, that sample rate is eight times the system clock, giving steps of about 186 mHz for a system clock of 100MHz. (Yes, that is milliHertz!) I figure that’s good enough.
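As a sketch of how o_stb might be used, the fragment below treats it as a clock enable within the system clock domain, so nothing ever has to cross clock domains. The module and port names here (stb_user, i_stb, o_count) are hypothetical, chosen just for illustration.

```verilog
// Sketch: using genclk's o_stb as a clock enable in the i_clk domain
module stb_user(
		input	wire		i_clk,	// Same system clock as genclk
		input	wire		i_stb,	// Connect to genclk's o_stb
		output	reg	[15:0]	o_count	// Counts generated clock cycles
	);

	initial	o_count = 0;
	always @(posedge i_clk)
	if (i_stb)
		o_count <= o_count + 1'b1;
endmodule
```

Since o_stb is the carry out of the times-eight addition, it can be true at most once per system clock. That means this approach only tracks the generated clock as long as its rate stays below the system clock rate.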

Xilinx Specific I/O

In general, I try to keep this blog hardware agnostic, while just discussing Verilog design and verification. This particular design, however, needs some help from the hardware, so let’s take a look at how we might handle the I/O architecture for a Xilinx 7-series device.

The first rule is to separate anything that is vendor specific into its own section of the design. This will allow us to keep using Verilator on the rest of the design.

module	xgenclk(i_clk, i_hsclk, i_word, io_pin, o_clk, o_locked);
	parameter	[0:0]	USE_PLL=1'b1;
	input	wire		i_clk;
	input	wire		i_hsclk;
	input	wire	[7:0]	i_word;
	inout	wire		io_pin;
	output	wire		o_clk;
	output	reg		o_locked;

	wire	[5:0]	ignored_data;
	wire	[1:0]	slave_to_master;

	wire	pll_input, w_pin, high_z; // fb_port;

Note the USE_PLL parameter above. Since this clock generator can generate anything up to the system clock rate with sub-Hz resolution, it can generate clocks so slow that Xilinx’s PLLs can’t lock onto them. For this reason, I have a flag to select whether or not to use a PLL. I suppose the PLL isn’t strictly necessary, but without it there will be a bit of jitter on the resulting clock.

The OSERDESE2 itself is something of a black box, present only on Xilinx 7-series parts. If you are working on another part, check your documentation. You’re likely to find other SERDES capabilities on other FPGAs, although they are likely to go by different names and have different interfaces. Most of the setup below is fairly boilerplate. Even so, it’s possible to get it wrong. Perhaps you remember when I was struggling to get this design to work?

Sadly, the only way I know how to debug output primitives like this OSERDESE2 is to read the fine manual, use an oscilloscope, read the manual again, fiddle with the setup, check the oscilloscope, read the manual some more, and finally fiddle with the setup until it works. There is one other way that I know of, and that is to find an online example (such as this one) and to compare it to your design to see what you might be missing.

So let’s look at how this is configured.

	// Verilator lint_off PINCONNECTEMPTY
	OSERDESE2	#(

The first noteworthy item is that OSERDESE2 needs to be set up for DDR output mode. While we might use SDR mode, we’d be limited to a high speed clock of only 600MHz, whereas when using the DDR mode you can go up to 950MHz.

This also means that our high-speed clock, i_hsclk, need only be 4x the speed of our system clock. Of course, the two clocks, i_hsclk, and i_clk, must also be generated by the same PLL. This creates a bit of a hassle when working with Xilinx’s Memory Interface Generator (MIG) generated cores, since they produce a system clock for you to use and applying any PLL to this clock will require a clock domain crossing to move between the MIG clock and the newly generated one. This is in spite of the reality that the MIG uses an internally generated 4x clock that would be perfect for our purposes here–it’s just not an output of the MIG core.

		.DATA_RATE_OQ("DDR"), // DDR goes up to 950MHz, SDR only to 600
		.DATA_RATE_TQ("SDR"),
		.DATA_WIDTH(8),
		.SERDES_MODE("MASTER"),
		.TRISTATE_WIDTH(1)	// Really ... this is unused
		) lowserdes(
			.OCE(1'b1),
			.TCE(1'b1),	.TFB(), .TQ(high_z),
			.CLK(i_hsclk),	// HS clock
			.CLKDIV(i_clk),	// Divided clock input (lowspeed clock)
			.OQ(w_pin),	// Data path to IOB *only*
			.OFB(),	// Data path output feedback to ISERDESE2 or ODELAYE2

The next big confusing question is over which bit gets transmitted first. I’ll admit, I got this wrong at first. The fact that the Xilinx xSERDESE2 components swap which bit is first between them only makes things more confusing. I was able to generate the following ordering using an Oscilloscope. In this case, D1 goes “first”, then D2, etc.

			.D1(i_word[7]),
			.D2(i_word[6]),
			.D3(i_word[5]),
			.D4(i_word[4]),
			.D5(i_word[3]),
			.D6(i_word[2]),
			.D7(i_word[1]),
			.D8(i_word[0]),
			.RST(1'b0),
			.TBYTEIN(1'b0), .TBYTEOUT(),
			.T1(1'b0), .T2(1'b0), .T3(1'b0), .T4(1'b0),
			.SHIFTIN1(), .SHIFTIN2(),
			.SHIFTOUT1(), .SHIFTOUT2()
		);
	// Verilator lint_on  PINCONNECTEMPTY

Yes, many of these pins are not used. I’ve kept these unused pins within my own code more to remind myself of them than anything else.

What about that // Verilator lint_on PINCONNECTEMPTY comment? Yes, I have tried to Verilate this code. No, I don’t have a Verilator model for an OSERDESE2, but I was hoping to use Verilator’s linting capabilities to find bugs when things weren’t going well.

The last item to notice of this OSERDESE2 configuration is that the output is placed into a w_pin wire. This wire now needs to be placed through a bi-directional I/O buffer, while holding the high-impedance flag, T, low. This makes certain that the output of this pin will always come back in on the input.

	IOBUF	genclkio(.I(w_pin), .IO(io_pin), .O(pll_input), .T(1'b0));

Finally, we can use a PLL clock resource to clean up any mess we’ve left behind.

	generate
	if (USE_PLL)
	begin
		wire	pll_fb, pll_fb_unbuffered, pll_locked, pll_output;

		// Verilator lint_off  PINCONNECTEMPTY
		PLLE2_BASE	#(
				.BANDWIDTH("LOW"),
				.CLKFBOUT_MULT(32),	// 800 MHz
				.CLKFBOUT_PHASE(0),
				.CLKIN1_PERIOD(25),	//  40 MHz
				.CLKOUT0_DIVIDE(16),
				.REF_JITTER1(0.19)	// Sim only parameter
				) pll (
				// .CLKOUT5_DIVIDE(1),
				.CLKFBIN(pll_fb),
				.CLKFBOUT(pll_fb),
				.CLKIN1(pll_input),
				// .CLKIN1(io_pin),
				.LOCKED(pll_locked),
				.PWRDWN(!r_ce[1]),	// r_ce is not declared in this listing
				.RST(1'b0),
				.CLKOUT0(pll_output),
				.CLKOUT1(),
				.CLKOUT2(),
				.CLKOUT3(),
				.CLKOUT4(),
				.CLKOUT5());
		// Verilator lint_on  PINCONNECTEMPTY
	
		
		BUFG pllbuf(.I(pll_output), .O(o_clk));
	
		reg	[4:0]	r_locked;
		always @(posedge i_clk)
			r_locked <= { r_locked[3:0], pll_locked };
		always @(posedge i_clk)
			o_locked <= (&r_locked[4:2]);

Success for my experiments with this core was indicated when this PLL locked. I used that as a binary indicator that the quality of result was “good enough”.

Finally, if the clock frequency needs to be so low that we cannot use the PLL, then we’ll just place the I/O pin as an input directly into a clock buffer and move on.

	end else begin
		always @(*)
			o_locked = 1'b0;	// Blocking assignment for combinatorial logic
		BUFG clkbuf(.I(io_pin), .O(o_clk));
	end endgenerate
	
endmodule

I might come back and update this core later to optionally remove the BUFG elements with a parameter setting. These elements are important in order to place your newly generated clock into the clock circuitry of the FPGA. Without them, you should be able to skip the PLL and instead go into a PLL you configure external to this module.
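Since I don’t have a Verilator model for the OSERDESE2, one option for simulation is a simple behavioral stand-in. What follows is only a sketch under my own assumptions: the module name simserdes is made up, i_fastclk is assumed to run at eight times the word rate (single data rate, unlike the DDR setup above), and the point at which a new word gets loaded is arbitrary. It makes no attempt to model the real part’s latency or timing. It does send the high order bit first, to match the D1 through D8 ordering above.

```verilog
// Simulation-only stand-in for an 8:1 serializer (not an OSERDESE2 model)
module simserdes(
		input	wire		i_fastclk,	// Assumed 8x the word clock
		input	wire	[7:0]	i_word,		// High order bit goes first
		output	reg		o_pin
	);

	reg	[2:0]	idx;
	reg	[7:0]	sreg;

	initial	{ idx, sreg, o_pin } = 0;
	always @(posedge i_fastclk)
	begin
		idx <= idx + 1'b1;	// Wraps from 7 back to 0
		if (idx == 0)
		begin
			// Load a new word, sending its top bit immediately
			o_pin <= i_word[7];
			sreg  <= { i_word[6:0], 1'b0 };
		end else begin
			// Shift out the remaining seven bits, one per tick
			o_pin <= sreg[7];
			sreg  <= { sreg[6:0], 1'b0 };
		end
	end
endmodule
```

Feeding the o_word output of genclk into i_word should then give a serial waveform along the lines of Fig. 3, good enough for looking at in a trace viewer if not for timing analysis.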

Conclusion

No, this design has never been formally verified. Sorry. Were I to run this through the formal tools, I’m sure I would discover the lack of a whole lot of initial statements–but this will still work without those. It will just have a bit of a glitch on start-up.

Instead, this design was verified using a Digital Discovery logic analyzer, a Nexys Video board, and a lot of patience. Further, the “clock capable pin” that I used was the output bit used to control the fan. (My board has a heat sink and no fan, so this pin is otherwise unused.)

A better test might’ve measured the quality of this clock using dedicated clock measurement hardware. I haven’t done this. I only know that I can generate a clock within an FPGA and then run this same clock through a PLL. The PLL locks, and I can then use the new clock within my design.

I personally draw two conclusions from this work:

Sometimes you need to use an oscilloscope.
Sometimes you can break all the rules–and still get away with it.