Clocks for Software Engineers

If you have a software background and want to pick up digital design, one of the first things you need to learn is the concept of the clock. To many software engineers turned beginning Hardware Description Language (HDL) designers, the clock is an annoyance. Without using a clock, they can turn HDL into a programming language–with $display’s, if’s, and for loops like any other programming language. Yet the clock that these beginning designers ignore is often the most fundamental part of any digital design.

This difficulty is never more apparent than when reviewing some of the first designs that beginning HDL developers produce. I’ve now talked with several of these individuals who have posted questions on the forums I participate in. When I’ve then drilled down into what they are doing, I’ve had to cringe at what I’ve found.

As an example, one student came to me struggling to understand why no one on-line seemed to think much of his Advanced Encryption Standard (AES) HDL implementation. I’ll spare him the embarrassment of being named, or of linking to his project. Instead, I’m just going to call him a student. (No, I’m not a professor.) This “student” had created a Verilog design to do not one round of AES encryption, but every round, all in combinatorial logic with no clocks in between. I can’t remember if he was doing AES-128, AES-192, or AES-256, but AES requires between 10 and 14 rounds. As I recall, his encryption engine worked perfectly in the simulator, yet it only used one clock to encrypt or decrypt his data. He was proud of his work, but couldn’t understand why those who looked at it told him he was thinking like a software engineer, and not like a hardware designer.

Fig 1: Software is Sequential

Indeed, I’ve now had the opportunity to counsel many of these software engineers, new to HDL, like this “student”. Many of them like to treat HDL like just another software programming language. Having programmed before, they go and look for the basics in any software programming language: how to declare variables, how to make an if statement, a case statement, how to write loops, etc. They then write their code like a computer program–where everything needs to run sequentially (Fig 1), yet completely ignoring the reality of digital design which is that everything runs in parallel.

Sometimes these programmers will find a simulator, such as Verilator, iverilog, or the EDA playground. They’ll then use a bunch of $display commands in their logic, treat them like sequential “printf”s and use them to get their code to work–without using a clock. Their design then “runs” in the simulator using combinatorial logic alone.

These students then describe their designs to me, and explain to me that their design “works without a clock”.

Say what?

The reality is that no digital logic design can work “without a clock”. There is always some physical process creating the inputs. These inputs must all be valid at some start time–this time forms the first clock “tick” in their design. Likewise, the outputs are then required from those inputs some time later. The time when all the outputs are valid for a given set of inputs forms the next “clock” in a “clockless” design. Perhaps the first clock “tick” is when the last switch on their board is set and the last clock “tick” is when their eye reads the result. It doesn’t matter: there is a clock.

The result is that someone who claims that their design “has no clock” is either stating that he is using the simulator in an unrealistic fashion, or that the design has an external clock setting the inputs and reading the outputs–which is another way of saying that the design really does have a clock.

If you find yourself struggling to understand the necessity of having a clock when working in digital logic, or if you know someone who might be struggling with this concept, then this post is for you.

Let’s spend a moment or two discussing the clock, and why it is so important to build and design your logic around a clock.

Lesson #1: Hardware design is parallel design

The first and perhaps most difficult part of learning hardware design is to learn that all hardware design is parallel design. Things don’t take place serially, as in one instruction after another (Fig 1), like they do in a computer. Instead, everything happens at once, as in Fig 2.

Fig 2: Hardware logic runs in Parallel

This changes a lot of things.

Fig 3: A software loop

The first thing that changes needs to be the developer. You need to learn to think in parallel.

Perhaps a good example of this difference would be a hardware loop.

In software, a loop consists of a series of instructions, as Fig 3 illustrates. These instructions create a set of initial conditions. Logic is then performed within the loop. A loop variable is often used to define this logic, and it is typically incremented each time through the loop. Until the loop variable reaches the termination condition, the CPU continues to repeat the instructions and logic within the loop. The more times the loop runs, the longer it takes to run the program.

HDL-based hardware loops are not like this at all. Instead, the HDL synthesis tool uses the loop description to make several copies of the logic, all running in parallel. The logic used to manage the loop, such as the logic to hold the index, to increment it, and to check it against the final condition, doesn’t need to be synthesized–so it is usually removed. Further, since the synthesis tool is creating physical wires and logic blocks, the number of times through the loop cannot change after synthesis time. After that time, the amount of hardware is fixed and can no longer be changed.

The structure that results, shown in Fig 4 below, is very different from the structure of a software loop in Fig 3 above.

Fig 4: An HDL generated loop

This has several consequences. For example, loop iterations can’t necessarily depend upon the output of prior loop iterations like they could in software. As a result, it’s hard to run a loop of logic across all of the data in a set and have an answer in the next clock.
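To put some rough numbers on this, here is a small Python model (not HDL) of what unrolling does to timing. The function name and the 1 ns per-copy delay are my own assumptions for illustration; real delays depend on your logic and your part.

```python
# Toy model: how long the unrolled copies of a loop take to settle.
# All delays are made-up illustrative numbers, not measurements.

def unrolled_delay_ns(iterations: int, stage_delay_ns: float, dependent: bool) -> float:
    """Settling time of a loop after the synthesis tool unrolls it.

    Independent iterations become side-by-side copies that all settle
    together, so the delay is that of a single copy.  Dependent iterations
    form a chain: copy k cannot settle until copy k-1 has.
    """
    if iterations < 1:
        raise ValueError("a loop needs at least one iteration")
    if dependent:
        return iterations * stage_delay_ns
    return stage_delay_ns


if __name__ == "__main__":
    for n in (1, 8, 32):
        side_by_side = unrolled_delay_ns(n, 1.0, dependent=False)
        chained = unrolled_delay_ns(n, 1.0, dependent=True)
        print(f"{n:2d} iterations: independent {side_by_side:.1f} ns, "
              f"chained {chained:.1f} ns")
```

In this simplified picture, a loop whose iterations feed one another gets no faster from being unrolled. The chain just gets longer, and all of it has to fit before the next clock.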

But … now we’ve come back to the concept of the clock again.

The clock is central to any FPGA design. Everything revolves around it. Indeed, I would argue that all of your logic development should start with the clock. It’s not an afterthought, but rather the clock forms the structure of how you think about digital design in the first place.

Why the clock is important

Step one is to understand that everything within a digital logic design takes time to do in hardware. Not only that, but different operations take different amounts of time. Traveling from one part of the chip to another also takes time.

Perhaps the way to visualize this is with a chart. Let’s place the inputs to our algorithm on the top, the logic in the middle, and the outputs on the bottom. Time, as an axis, will run from top to bottom, from one clock to the next. The result of this visualization might look something like Fig 5, below.

Fig 5: Logic takes time, three operations

Fig 5 shows several different operations: an addition, a multiply, and several rounds of AES–although for discussion purposes it could be several rounds of any algorithm. I’ve used the size of the operation boxes, in the vertical direction, to indicate notionally how much time each operation might require. Further, operations that depend upon other operations stack up. Hence, if you want to do many rounds of AES within one clock, you’ll need to know that the second round cannot begin until the first is complete. Fitting this logic in, therefore, will increase the amount of time between clock ticks and slow down your overall clock rate.

Now let’s look at the pink boxes.

The pink boxes represent the wasted capacity in your hardware circuit–times when you might have been able to do something, but since you had to wait for the clock, or perhaps wait for your inputs to be processed first, you couldn’t do anything. For example, in our notional diagram above the multiply doesn’t take as long as one round of AES, neither does the addition. However, you can’t do anything with the results of those two operations while the AES calculations are taking place since those operations need to wait for the next clock to get their next inputs. This is what the “pink” boxes represent in Fig 5: idle circuitry. Further, because all of the AES rounds are pushing the next clock into the distance, there’s a lot of idle circuitry presented in Fig 5. This design, therefore, will not run as fast as the hardware would allow.
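A quick way to get a feel for the pink boxes is to compute them. The sketch below is a toy Python model, and the path names and delays in it are invented for illustration rather than taken from any real design.

```python
# Toy model of the "pink boxes": every operation waits for the same clock,
# so the clock period is set by the slowest path and the rest sit idle.
# Delays are assumed, illustrative values in nanoseconds.

def idle_fractions(path_delays_ns: dict[str, float]) -> dict[str, float]:
    """Fraction of each clock period a path spends finished and waiting."""
    period = max(path_delays_ns.values())  # slowest path sets the clock
    return {name: 1.0 - delay / period for name, delay in path_delays_ns.items()}


if __name__ == "__main__":
    paths = {"add": 2.0, "multiply": 4.0, "aes_10_rounds": 50.0}
    for name, idle in idle_fractions(paths).items():
        print(f"{name}: idle {idle:.0%} of each clock")
```

With numbers like these, the adder and the multiplier spend most of every clock period doing nothing at all.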

If all we did was pipeline the AES algorithm, so that one round could be calculated on every clock, we could then get the entire design to run faster with less wasted capacity.

Fig 6 shows this idea.

Fig 6: Breaking up the operations speeds up the clock

As a result of breaking our operation up into smaller operations, each of which could be accomplished between clock ticks, our design now has much less wasted capacity. Even better, instead of encrypting only one block of data at a time, we can pipeline the encryption algorithm. The resulting logic won’t encrypt a single block any faster than Fig 5 above, but if you can keep the pipeline full you should be able to increase your AES encryption throughput by somewhere between 10x and 14x.
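The arithmetic behind that claim can be sketched in a few lines of Python. The 5 ns round delay and the per-clock overhead parameter are assumptions of mine, chosen only to make the numbers easy to follow.

```python
# Toy comparison of one big combinatorial block versus a pipeline with one
# round per clock.  Round delay and per-clock overhead are assumed numbers.

def combinatorial(rounds: int, round_ns: float, overhead_ns: float) -> tuple[float, float]:
    """Return (latency_ns, blocks_per_us) with all rounds in one clock."""
    period = rounds * round_ns + overhead_ns
    return period, 1000.0 / period


def pipelined(rounds: int, round_ns: float, overhead_ns: float) -> tuple[float, float]:
    """Return (latency_ns, blocks_per_us) with one round per clock.

    A block still passes through every round, so latency does not improve,
    but a full pipeline finishes one block on every clock tick.
    """
    period = round_ns + overhead_ns
    return rounds * period, 1000.0 / period


if __name__ == "__main__":
    for rounds in (10, 12, 14):
        lat_c, thr_c = combinatorial(rounds, 5.0, 0.0)
        lat_p, thr_p = pipelined(rounds, 5.0, 0.0)
        print(f"{rounds} rounds: latency {lat_c:.0f} ns vs {lat_p:.0f} ns, "
              f"throughput gain {thr_p / thr_c:.1f}x")
```

With the overhead set to zero, the gain equals the number of rounds. Any nonzero per-clock overhead (routing, setup time and so on) eats into that figure, so treat the 10x to 14x as an upper bound.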

This is therefore a better design.

Can we do better? Indeed we could! If you are familiar with AES, then you know that each round of AES has discrete steps within it. These can be broken up, allowing us to increase our clock speed until the AES round logic takes less time than the multiply. This will increase the number of adds and multiplies you can do, and it will micro-pipeline your encryption engine so that you can run even more data through it on a per clock basis.

Not bad.

Fig 6 above, though, shows a couple of other things as well.

First, let’s consider the arrows to be routing delays. (The figure is not drawn to scale. It is an illustration for a notional discussion only.) Every piece of logic needs to have the results of the last piece of logic routed to it. This means that even if a piece of logic requires no time to accomplish–such as if it just reorders wires or some such–moving the data from one end of the chip to another will still take time. Hence, even if you make your operations as simple as possible, there will still be delays for moving the data around.

Second, you may notice that none of the arrows actually started at the clock tick. Neither did any of them go all the way up to the next clock tick. This was to illustrate the concept of setup and hold timing. Flip-flops, the structures that capture and synchronize your data to the clock, need the data to be constant and determined for some amount of time before the clock arrives (setup), and to remain so for a short time afterwards (hold). In addition, although the clock is often treated as though it arrives everywhere at the same instant, it never does. It arrives at different parts of your chip at different times. This again requires a bit of a buffer between operations.
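The setup side of this budget is simple arithmetic, sketched here in Python. This is a simplified textbook form of the check, not what any particular vendor tool computes, and every number and parameter name below is an assumed placeholder. Hold timing is not modeled.

```python
# Simplified setup-timing check for one register-to-register path.
# Every number here is an assumed placeholder; real values come from
# your FPGA vendor's timing report.

def setup_slack_ns(period_ns, clk_to_q_ns, logic_ns, routing_ns,
                   setup_ns, uncertainty_ns=0.0):
    """Positive slack means the data arrives early enough to be captured."""
    # When the data becomes valid at the capturing flip-flop's input
    arrival = clk_to_q_ns + logic_ns + routing_ns
    # Latest moment it may arrive and still meet the setup requirement
    required = period_ns - setup_ns - uncertainty_ns
    return required - arrival


if __name__ == "__main__":
    # A 100MHz clock has a 10 ns period
    ok = setup_slack_ns(10.0, 0.5, 6.0, 2.0, 0.5, 0.25)
    bad = setup_slack_ns(10.0, 0.5, 8.0, 2.0, 0.5, 0.25)
    print(f"6 ns of logic: slack {ok:+.2f} ns")
    print(f"8 ns of logic: slack {bad:+.2f} ns")
```

Notice that in this example the logic itself gets well under the full 10 ns. The remainder goes to routing, the flip-flops, and clock uncertainty.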

So what conclusions can we draw from this lesson?

Logic takes time to accomplish.

More logic takes more time.

Your clock speed is limited by the amount of time it takes to accomplish whatever logic you place between clock ticks (plus routing delays, setup and hold timing, clock uncertainty, etc.). The more logic you stuff between your clocks, the slower your clock rate will be.

The speed of your fastest operation will be limited by the clock speed required to accomplish your slowest operation. This was the example of the addition above. While it could run faster than the multiply and any single round of AES, the add was slowed down by the rest of the logic within the design.

There is a hardware-defined limit to clock speed. Even operations requiring no logic still take time.

Hence, a balanced design tries to place roughly the same amount of logic between clocks all the way across the design.

How much logic to place between clocks?

So now that you know that you have to deal with a clock, how should you modify or build your design in light of this information? The answer is that you should limit the amount of logic between clock ticks. But, by how much, and how will you know that answer?

One way to find out how much logic you can do between clock ticks is to set your clock speed to an arbitrary rate, and then to build your design within a tool-suite that understands your hardware. Anytime your design fails to meet its timing requirements, you will need to either go back and split up the components within your design, or slow down your clock rate. You should be able to use your design tools to find the longest path, which is the one limiting your clock rate.

If you do this, you will learn for yourself a series of heuristic rules that you can then use to figure out how much logic you can place between clocks on the hardware you are designing for.

For example, I’ve tended to build my designs for 100MHz clock rates within the Xilinx 7-series parts. These designs then typically run at about 80MHz within a Spartan-6, or 50MHz within an iCE40–although these are not hard and fast relationships. What works on one chip may have excess capacity on another, or it might fail its timing checks on another.

Here are some rough heuristics I’ve used regarding clock usage. Since they are only heuristics, they are not likely to be appropriate for all designs:

I can usually do a 32-bit addition, together with a mux of 4-8 items within a clock.

Were you to use a faster clock, such as a 200MHz clock, you may then need to separate the addition(s) from the multiplexer.

The ZipCPU’s longest path actually runs from the ALU output to the ALU input.

This sounds simple enough. It even matches the heuristic above as well.

The problem the ZipCPU struggles with, at higher speeds, is routing this output back into the ALU.

Let’s trace that path for a moment: Following the ALU, the logic path first goes through a 4-way multiplexer to decide whether the ALU, memory, or divide output needs to be written back. This write-back result is then fed into a bypass circuit to determine if it needs to be immediately fed into the ALU as one of its two inputs. Only at the end of this multiplexer and bypass path do the ALU operation and multiplexer take place. Hence, all of these logic steps can stress the path through the ALU. However, because of the ZipCPU’s construction, any clocks placed into this path will likely slow the ZipCPU proportionally. This means that this longest path is likely to remain the ZipCPU’s longest path for some time.

Were I interested in running the ZipCPU at a higher speed, this is the first logic path that I would attempt to break up and optimize.

16x16-bit multiplies take one clock.

Sometimes, on some hardware, I can get 32x32-bit multiplies to take place on a single clock. On other hardware, I need to break these in pieces. For this reason, if I ever need a signed 32x32-bit multiply, I use a pipelined routine I built for that purpose. The routine contains several multiplication approaches within it, allowing me to select from the options appropriate for the hardware I’m currently working on.

Your hardware may support 18x18-bit multiplies natively as well. Some FPGAs also support a multiply and accumulate within one optimized hardware clock. Know your hardware, and you’ll know what you can do here.

Any block RAM access takes one clock. Avoid adjusting the index during that clock period if you can. Likewise, avoid doing anything with the output during this clock as well.

While I’m going to argue that this is a good rule, I have violated both parts of it successfully and without (serious) consequence at 100MHz on a Xilinx 7-series device. (iCE40 devices have problems with this.)

For example, the ZipCPU reads from its registers, adds an immediate to the result, and then selects whether the result should be the register plus the immediate, the PC plus the immediate, or the condition code register plus the immediate–all in one clock.

As another example, for a long time the Wishbone Scope determined the address for reading back from within its buffer based upon whether or not a read from memory was taking place on the current clock. Breaking this dependency required adding another clock of latency, so the current version doesn’t break this (self-imposed) rule any more.

These rules are no more than heuristics that I have used over time to gauge how much logic can take place between clock ticks. They are device and clock speed dependent, so they may not work for your application. My recommendation would be that you develop your own heuristics for what you can do between clock periods.

Next Steps

Perhaps the best closing advice I can offer to any new FPGA developer is to recommend that you learn HDL while practicing on real hardware rather than just within a simulation. The tools associated with actual hardware components are known for their ability to check your code and the timing it requires. Further, while building your design for a high speed clock is good, it isn’t the end-all for hardware design.

Remember, hardware design is parallel. It all starts with the clock.

Finally, feel free to write me and let me know if this helps you understand HDL better, or even if it leaves you more confused. That will help me know if I need to come back and address this topic again at a later date. Thanks!