Breaking all the rules to create an arbitrary clock signal
Have you ever needed a clock speed that wasn’t easy to generate? What if you wanted to build and run your own PLL? Within an FPGA??
As an example, many of my favorite FPGA boards have 100MHz clocks coming into them. What if you wanted to output an audio signal via I2S at 48kHz? 48kHz is a common audio sample rate associated with broadcast audio, sort of like 44.1kHz is associated with CD audio. I have an ADAU1761 24-bit Audio Codec available to me on my Nexys Video board, so I should be able to use it to generate quality sound. This chip, however, requires an incoming clock signal of 49.152MHz in order to produce samples at 48kHz. Any suggestions on how you might multiply a 100MHz signal up to somewhere between 800 and 1600 MHz, and then divide it down in order to get 49.152MHz?
There’s no way to do it.
Worse, what if you had to create a clock that tracked the audio sample rate of an incoming signal–but only when it was present?
As another example, I can use a PMod GPS to measure the clock rate of the board I’m using. It should be possible to use this signal to create a true 49.152MHz clock–true enough to be used as an audio frequency standard.
That’s audio, but what about video? Another common example of when you might need to create a clock at an arbitrary frequency would be when trying to generate a pixel clock for video. With modern monitors, the video driver is expected to query what video modes the monitor is capable of accepting via an EDID protocol transaction (a form of I2C bus transaction). The video driver is then expected to generate a pixel clock based upon what the monitor is capable of receiving. Without external clock generation hardware, how can you create an arbitrary pixel clock?
Yes, I know you can often get away with being “close enough” in many of these examples. For example, if you wanted to feed a monitor wanting a 25.172MHz pixel clock you might still be able to drive it with a 25MHz pixel clock instead. (I’ve done it.) But what about 88.75MHz? How might you generate that signal?
Given that there’s a reason to need something like this, let’s discuss today how you might generate a clock at an arbitrary frequency when using an FPGA.
Breaking all the rules
In order to generate a clock at an arbitrary rate, we’re going to need to break all of the rules. Specifically, I wrote in my rules for new FPGA designers: never use a logic generated clock.
Build your design with only one clock.
Do not transition on the positive (rising) edge of anything other than your system clock.
I discourage anyone from using a logic generated clock for a few basic reasons:
-
Logic generated clocks tend not to be placed onto the system clock backbone
This will cause significant skew in your clock from one end of the chip to another. This skew can easily be bad enough to make your design fail for seemingly inexplicable reasons.
-
Beginners tend not to realize that you still need to use a proper clock domain crossing between the clock domain that generated the clock and the generated clock domain.
-
Most FPGA tool chains don’t know how to handle logic clocks, assuming that they are recognized at all.
This leads to logic that isn’t properly constrained to guarantee operation at the clock rate of interest.
-
If your clock isn’t generated via a flip-flop, there might be glitches on it
My rule has always been: Clocks should only ever be created or adjusted within an FPGA using a hardware device clock management resource, such as a PLL.
Today, we’re going to break this rule.
We also want to break it “safely”, so that this step won’t keep our logic from acting “normally”.
To do this, I built and experimented with the architecture shown in Fig. 1 below.
Let’s walk through the steps of how this might work.
The first step is to generate a new clock. I used a basic fractional clock divider for this purpose.
Well, not quite, but that’s the basic idea initially. We’ll come back and improve upon this in a moment.
There are several problems with using a basic fractional clock divider such as this one. One of the worst problems is the phase noise. Imagine you wanted to divide your clock by three. You would add to your counter some number on every clock tick in an effort to get a divide by three. If your counter was N bits wide, then clearly after 2^N clock ticks, this pseudo clock generator would have wrapped some integer number of times. If we make this integer close to 2^N/3, we can get close to a division by three.
Perhaps a picture would help. Suppose we used a 4-bit counter, to which we add a delay value of 5 on every clock tick in order to get close to 1/3 (5/16 is about 0.31), while also picking off the top bit for our new clock. If you plotted this out, you might see a trace similar to Fig. 2 below.
If you ignore the fact, for a moment, that this may be about the ugliest clock signal you’ve ever seen, you’ll notice that this clock signal goes through five periods in every 16 system clock ticks, which is a rough divide by three.
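The 4-bit example is small enough to check by hand, or with a few lines of Python. What follows is only a behavioral sketch of the idea, not the Verilog used in this design, and the function names are made up for the illustration.

```python
def fractional_divider(step, bits, ticks):
    """Return the MSB of a `bits`-wide phase accumulator for each tick."""
    mask = (1 << bits) - 1
    counter = 0
    clock = []
    for _ in range(ticks):
        clock.append(counter >> (bits - 1))   # top bit is the new "clock"
        counter = (counter + step) & mask     # wrap modulo 2^bits
    return clock


def count_rising_edges(clock):
    """Count 0 -> 1 transitions, treating the trace as periodic."""
    return sum(1 for i in range(len(clock))
               if clock[i - 1] == 0 and clock[i] == 1)


trace = fractional_divider(step=5, bits=4, ticks=16)
print("".join(str(b) for b in trace))   # 0011011011001001
print(count_rising_edges(trace))        # 5
```

Five rising edges in sixteen ticks gives an average rate of 5/16 of the system clock, but the individual periods are a mix of three and four ticks long. That unevenness is the phase noise.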
If we add more bits to our counter and step, we’ll be able to represent more frequencies. For example, if we had used a 32-bit counter, we might step by 32'h55555555. No, that’s still not quite 1/3rd, but it’s much closer than we were before.
While that’s better, the clock still looks awful–it just looks awful at a frequency closer to the one we want.
We need a way to clean this up.
Enter the reason for using an OSERDES in Fig. 1. By using an OSERDES, we can get closer to the clean clock we wanted. For example, the same clock from Fig. 2 above, now upsampled by a factor of 8, would produce a waveform looking closer to Fig. 3 below.
That is starting to look like a clock signal.
Of course, we’ll still have jitter on even this upsampled clock signal. While our upsampling helped, it could only do so much. We’ll always have a signal that’s going to be within a “sample” of the right value. This rounding to the nearest sample will always create phase noise on our clock. Fixing this is the purpose of the PLL in Fig. 1 above.
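To put rough numbers on that rounding, here is a small Python sketch comparing edge spacing with one sample per system clock tick against eight samples per tick. It models the 8x case as a single accumulator running at eight times the sample rate with three extra bits, which is an assumption made to keep the model short, and the helper name is hypothetical.

```python
def edge_intervals(step, bits, samples):
    """Spacing, in samples, between successive rising edges of the MSB."""
    mask = (1 << bits) - 1
    counter, prev, last_edge, gaps = 0, 0, None, []
    for n in range(samples):
        msb = counter >> (bits - 1)
        if msb == 1 and prev == 0:          # rising edge on this sample
            if last_edge is not None:
                gaps.append(n - last_edge)
            last_edge = n
        prev = msb
        counter = (counter + step) & mask
    return gaps


# One sample per system clock tick: 4-bit accumulator, step of 5.
coarse = edge_intervals(step=5, bits=4, samples=160)
# Eight samples per tick at the same output frequency: three more
# accumulator bits, same step, and the gaps converted back to ticks.
fine = [g / 8 for g in edge_intervals(step=5, bits=7, samples=160 * 8)]

print(sorted(set(coarse)))  # [3, 4]
print(sorted(set(fine)))    # [3.125, 3.25]
```

The ideal period is 3.2 ticks in both cases. At 1x the edges land within a whole tick of where they belong; at 8x they land within an eighth of a tick. The error never goes away, it just gets smaller.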
That leaves only a couple of key details remaining.
For example, why did we leave the FPGA on a clock capable pin only to immediately come back in again? I did this for a couple of reasons. First, we needed the OSERDES, that 8:1 serializer, in order to create the cleaner clock signal. OSERDES components are only found connected to the I/O pins going directly off-chip. Second, many FPGAs require that you enter the chip from a clock-capable pin in order to get into the clock infrastructure within the FPGA. Doing otherwise will result in a design error on many architectures.
The last question is, what clock rate do we tell the tools this input has since we can vary it as often as we want? For this, we’ll use the maximum clock rate it can have. I’ll leave the decision of what this maximum rate is to you. I’ve used 200MHz, 100MHz, and 50MHz successfully.
Finally, let’s return to our reasons never to use a logic generated clock. Have we dealt with all of the reasons “why not” so that we now can?
-
Logic generated clocks tend not to be placed onto the system clock backbone
By starting with a clock capable pin, we go directly into the clock infrastructure on the chip.
-
Beginners tend not to realize that you still need to use a proper clock domain crossing between the clock domain that generated the clock and the generated clock domain.
We’ll be smart and use proper clock domain crossing techniques, right?
-
Most FPGA tool chains don’t know how to handle logic generated clocks
In this case, the tools will treat this as an externally generated clock, at the maximum rate it can produce, so we’re good here too.
-
If your clock isn’t generated via a flip-flop, there might be glitches on it
We’ve solved this by generating our clock using (several) flip-flops, and the OSERDES helps as well.
Really, the only issue left is whether or not this clock will be “clean enough”. For that, we’ll need to build it and test it.
Building the 8:1 Fractional Divider
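Building this divider is really easy. It’s basically eight copies of the counter above, one for each of the eight samples sent out on every system clock tick.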
Notice that each of these counters is created by adding an offset to a single counter, counter[0]. That keeps them all synchronized with each other. Finally, the MSB from each counter is used as the outgoing clock signal.
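A behavioral model of this arrangement might look like the Python below. It is a sketch under assumptions: the phase offsets are taken to be one through eight times the step, `genclk_word` and `serial_reference` are hypothetical names, and the tuning word is arbitrary. The point is only to show that eight offset counters produce the same bit stream as one accumulator stepped eight times as often.

```python
BW = 32
MASK = (1 << BW) - 1


def genclk_word(counter0, step):
    """Return (next counter0, eight output bits in time order) for one tick."""
    # Eight phases, each formed by adding an offset to the one running counter.
    phases = [(counter0 + k * step) & MASK for k in range(1, 9)]
    bits = [p >> (BW - 1) for p in phases]   # MSB of each phase
    return phases[-1], bits                  # counter0 advances by 8 * step


def serial_reference(step, samples):
    """The same accumulator stepped once per output sample, for comparison."""
    counter, out = 0, []
    for _ in range(samples):
        counter = (counter + step) & MASK
        out.append(counter >> (BW - 1))
    return out


step = 0x1F751050          # hypothetical tuning word
counter0, parallel = 0, []
for _ in range(100):       # 100 system clock ticks, 800 output samples
    counter0, bits = genclk_word(counter0, step)
    parallel.extend(bits)

print(parallel == serial_reference(step, 800))  # True
```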
There’s really not much more to it than that. We’ll still do a walk through the actual code below.
One of the things we haven’t discussed is how you might synchronize this clock with operations carried out on the current clock. For this, I’ve envisioned using a strobe signal, o_stb, shown below.
This is how I would normally handle generating an internal signal at a different rate–something I could use without ever needing a clock domain crossing.
While there will be a fairly uncontrolled delay between o_stb and the outgoing clock, o_stb will at least maintain the proper clock to clock relationship.
Coming back to the basic implementation above, perhaps you’ve noticed that the big problem with this implementation is the requirement for the multipliers. Let’s see if we can get rid of them. Multiplication by 1, 2, 4, and 8 is easy–they can be accomplished with a simple left shift. What about multiplication by 3, 5, or 7? Those will be harder. Six is easy, though, if we can already multiply by three.
In this case, we’ll cheat since all of these values can be created with some creative addition–sparing us the multiply.
Multiplying i_delay times three, for example, is just a matter of adding it to itself times two.
Multiplying i_delay times five is the same as adding it to a copy of itself times four.
For seven, we can subtract i_delay from i_delay times eight.
Of course, it will take a clock in order for these values to be valid. To keep things consistent, let’s also delay i_delay by one clock tick as well.
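The shift-and-add identities are easy to sanity check outside of the FPGA. Here is a short Python sketch that does so with BW-bit wraparound; the helper names and the test value are made up for the illustration.

```python
BW = 32
MASK = (1 << BW) - 1


def times_three(d):
    return ((d << 1) + d) & MASK          # 2*d + d


def times_five(d):
    return ((d << 2) + d) & MASK          # 4*d + d


def times_seven(d):
    return ((d << 3) - d) & MASK          # 8*d - d


def times_six(d):
    return (times_three(d) << 1) & MASK   # 2 * (3*d)


i_delay = 0x7DD4413F   # hypothetical step value
for k, f in ((3, times_three), (5, times_five),
             (6, times_six), (7, times_seven)):
    assert f(i_delay) == (k * i_delay) & MASK
```

Each result agrees with a true multiply modulo 2^BW, which is all the counters need since they wrap at BW bits anyway.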
The rest is just bookkeeping.
Our final result is just the collection of all of the most-significant bits.
How wide should BW be? I’ve chosen to make it 32 bits wide. Why? Well, because my Wishbone bus implementation is 32 bits wide. It’s kind of an arbitrary choice. You’ll get more frequency accuracy (relative to the system clock) the more bits you have, although I tend to think 32 bits is enough. With a 32-bit counter, you can generate an arbitrary clock with frequency control in steps of SYS_CLOCK_FREQUENCY / 2^32, or about 23 mHz for a system clock of 100MHz. (Yes, that is milliHertz!) I figure that’s good enough.
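As a worked example of that arithmetic, the Python sketch below picks the step for a 49.152MHz output. It assumes the accumulator advances by the step once per tick of a 100MHz clock, matching the resolution figure above; if your step is instead applied once per 8x output sample, pass that higher sample rate as the second argument. The function names are hypothetical.

```python
BW = 32


def tuning_word(f_out_hz, f_acc_hz, bw=BW):
    """Nearest integer step for a bw-bit accumulator advancing at f_acc_hz."""
    return (f_out_hz * (1 << bw) + f_acc_hz // 2) // f_acc_hz


def actual_frequency(step, f_acc_hz, bw=BW):
    """Average output frequency produced by a given step."""
    return step * f_acc_hz / (1 << bw)


f_sys = 100_000_000
step = tuning_word(49_152_000, f_sys)
resolution = f_sys / (1 << BW)
error = actual_frequency(step, f_sys) - 49_152_000

print(step)         # 2111062325
print(resolution)   # about 0.0233 Hz
print(error)        # smaller in magnitude than half of one step
```

The rounding error in the step leaves the average frequency within half a resolution step of the target, here roughly a hundredth of a Hertz.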
Xilinx Specific I/O
In general, I try to keep this blog hardware agnostic, while just discussing Verilog design and verification. This particular design, however, needs some help from the hardware, so let’s take a look at how we might handle the I/O architecture for a Xilinx 7-series device.
The first rule is to separate anything that is vendor specific into its own section of the design. This will allow us to keep using Verilator on the rest of the design.
Note the USE_PLL parameter above. Since this clock generator can generate anything up to the system clock rate in sub-Hz resolution, it can generate clocks so slow that Xilinx’s PLLs can’t lock onto them. For this reason, I have a flag to select whether or not to use a PLL. I suppose the PLL isn’t strictly necessary, but without it there will be a bit of jitter on the resulting clock.
The OSERDESE2 itself is something of a black box, present only on Xilinx 7-series parts. If you are working on another part, check your documentation. You’re likely to find other SERDES capabilities on other FPGAs, although they are likely to go by different names and have different interfaces.
Most of the setup below is fairly boilerplate. Even so, it’s possible to get it wrong. Perhaps you remember when I was struggling to get this design to work?
Sadly, the only way I know how to debug output primitives like this OSERDESE2 is to read the fine manual, use an oscilloscope, read the manual again, fiddle with the setup, check the oscilloscope, read the manual some more, and finally fiddle with the setup until it works. There is one other way that I know of, and that is to find an online example (such as this one) and to compare it to your design to see what you might be missing.
So let’s look at how this is configured.
The first noteworthy item is that OSERDESE2 needs to be set up for DDR output mode. While we might use SDR mode, we’d be limited to a high speed clock of only 600MHz, whereas when using the DDR mode you can go up to 950MHz.
This also means that our high-speed clock, i_hsclk, need only be 4x the speed of our system clock. Of course, the two clocks, i_hsclk and i_clk, must also be generated by the same PLL. This creates a bit of a hassle when working with Xilinx’s Memory Interface Generator (MIG) generated cores, since they produce a system clock for you to use and applying any PLL to this clock will require a clock domain crossing to move between the MIG clock and the newly generated one. This is in spite of the reality that the MIG uses an internally generated 4x clock that would be perfect for our purposes here–it’s just not an output of the MIG core.
The next big confusing question is over which bit gets transmitted first. I’ll admit, I got this wrong at first. The fact that the Xilinx xSERDESE2 components swap which bit is first between them only makes things more confusing. I was able to work out the following ordering using an oscilloscope.
In this case, D1 goes “first”, then D2, etc.
Yes, many of these pins are not used. I’ve kept these unused pins within my own code more to remind myself of them than anything else.
What about that // Verilator lint_on PINCONNECTEMPTY comment? Yes, I have tried to Verilate this code. No, I don’t have a Verilator model for an OSERDESE2, but I was hoping to use Verilator’s linting capabilities to find bugs when things weren’t going well.
The last item to notice in this OSERDESE2 configuration is that the output is placed into a w_pin wire. This wire now needs to be placed through a bi-directional I/O buffer, while holding the high-impedance flag, T, low. This makes certain that the output of this pin will always come back in on the input.
Finally, we can use a PLL clock resource to clean up any mess we’ve left behind.
Success for my experiments with this core was indicated when this PLL locked. I used that as a binary indicator that the quality of result was “good enough”.
Finally, if the clock frequency needs to be so low that we cannot use the PLL, then we’ll just place the I/O pin as an input directly into a clock buffer and move on.
I might come back and update this core later to optionally remove the BUFG elements with a parameter setting. These elements are important in order to place your newly generated clock into the clock circuitry of the FPGA. Without them, you should be able to skip the PLL in this module and instead go into a PLL you configure external to this module.
Conclusion
No, this design has never been formally verified. Sorry. Were I to run this through the formal tools, I’m sure I would discover the lack of a whole lot of initial statements–but this will still work without those. It will just have a bit of a glitch on start-up.
Instead, this design was verified using a Digital Discovery logic analyzer, a Nexys Video board, and a lot of patience. Further, the “clock capable pin” that I used was the output bit used to control the fan. (My board has a heat sink and no fan, so this pin is otherwise unused.)
A better test might’ve measured the quality of this clock using dedicated clock measurement hardware. I haven’t done this. I only know that I can generate a clock within an FPGA and then run this same clock through a PLL. The PLL locks, and I can then use the new clock within my design.
I personally draw two conclusions from this work:
-
Sometimes you need to use an oscilloscope.
-
Sometimes you can break all the rules–and still get away with it.
Whoso keepeth the commandment shall feel no evil thing: and a wise man's heart discerneth both time and judgment. (Eccl 8:5)