One of the things that new FPGA students struggle with is the fact that everything in digital logic takes place in parallel.
Many of these students come from a computer science background. They understand how an algorithm works, and how one thing must take place after another in a specific sequence. They tend to struggle, though, with the idea that every step in an algorithm occupies a piece of digital logic that will act on every clock tick–whether used or not.
One solution to sequencing operations is to create a giant state machine. The reality, though, is that an FPGA tends to create all the logic for every state at once, and then only select the correct answer at the end of each clock tick. In this fashion, a state machine can be very much like the simple ALU we’ve discussed.
On the other hand, if the FPGA is going to implement all of the logic for the operation anyway, why not arrange each of those operations into a sequence, where each stage does something useful? This approach rearranges the algorithm into a pipeline. Pipelining tends to be faster than the state machine approach for accomplishing the same algorithm, and it can even be more resource efficient, although it isn’t necessarily so.
The difficult part of a digital logic pipeline is that the pipeline runs and produces outputs even when the inputs to the pipeline are not (yet) valid.
So, let’s discuss several different strategies for handling the signaling associated with pipeline logic. In general, there’s no one size fits all strategy. The strategy you pick will depend upon the needs of your algorithm, and its data source (input) and destination (output).
We’ll work our way through several different strategies from the simplest to most complex.
The global valid signal for sampled data
The first strategy for handling pipelining that we’ll discuss is to use a
global valid signal.
At each stage, the data coming into the pipeline is valid when the
global valid signal is true. Likewise, each stage may take no more clocks
to finish then there are between valid signals. I like to use the
“clock enable” signal to represent this valid logic. Hence,
Fig 1 shows a block diagram of this sort communication.
The basic rules are this:
There is a global
CEline, synchronous with the clock. This line is true anytime new data is ready.
Logic is allowed to transition only when this
CEline is true
This leads to a trace diagram similar to Fig 2 below.
The neat thing about this approach is that there’s no real pipelining logic required per se. Each stage just waits for the global valid signal to be true, and then applies its logic.
You may find very quickly, though, that this approach doesn’t handle all pipelining needs. However, while it doesn’t handle all pipelining needs, it does handle the needs of a crucial problem set: signal processing.
In a typical
application, data either comes into an
from an analog to digital (A/D) converter, or it goes out of the
to a digital to analog (D/A) converter, or perhaps it does both. The sample
rate defines when the
CE signal needs to be high, and any transient
states are flushed through the system.
The digital logic engineer’s job is to manipulate the samples along the way.
Because the data moves through the FPGA at a fixed rate, and because it never suddenly changes speeds, the global valid signal works quite well for it.
Applications include digital filtering, digital phase lock loops, numerically controlled oscillators, and more. Indeed, anything that works at a fixed data rate is usually a good candidate for this method of pipelining.
Why, even our resamplers have worked off of the concept of a global valid signal, they’ve just had to deal with two different valid signals: one that holds for one clock per input sample, and another that holds for one clock for every output sample.  
The traveling CE to reduce latency
The global valid signal we discussed above, though, has two basic problems.
The first problem is that there’s no way to know if an output sample is
“valid” or not”.
The second is that the whole operation depends upon a uniform clock creating
CE signal. What
happens if the data is produced in a bursty fashion, and you want to know
not only when the output is valid but also if the output is valid? In
this case, another approach is required.
I’m going to call this second approach the “traveling CE” approach. Basically, each stage in the pipeline propagates the CE forward, as in Fig 3.
The basic rules to this approach are:
CEsignal is true, the data associated with it must also be valid.
At the end of every stage of processing, a
CEsignal must be produced, together with the output data for that stage.
CEsignal must be initialized to zero. Further, if any reset is to be used, the
CEmust be set to zero on any reset. (The data is a don’t care on reset, but the
CEline must be set to zero.)
Nothing is allowed to change except on a
CEsignal. Hence, the only time the incoming data is referenced is when
CEline to a pipeline stage) is high.
Finally, every piece of logic must be ready, at all times, for a new value to enter into the pipeline. This particular pipeline strategy cannot handle stalls.
A trace of this type of logic might look like Fig 4.
In Verilog, the approach looks like:
This approach works very well when the pipeline can be separated into stages that take no more than a single input valid signal. Likewise, it works well when none of the stages depends upon any feedback from future results. In other words, if nothing ever needs to wait, then this approach to pipelining works fairly well.
Applications of this pipelining approach include logic multiplies, Fourier transform processing (example), video processing, gear-boxes, and more. Indeed, we used this approach within our hexbus debugging bus to hook the input processing chain together. You may notice, however, that this approach didn’t work on the output processing chain. The problem there was that the UART transmitter took longer than a single clock to transmit a character, and so another pipeline signaling approach was needed–one that allowed the end of the pipeline to control the rate of the pipeline.
The simple handshake
The biggest problem with the traveling CE approach to pipelining is that there’s no way to handle the case where the listener isn’t ready. To use a UART transmitter as an example, you can create a pipeline to fill the transmitter, but what do you do when the transmitter is busy? This requires a simple handshake approach, one that I will describe here in this section.
The basic hand shake relies on a pair of signals–one from the current
device and another from the next one in the pipeline. We’ll call these
STB (or valid) and
BUSY, although the wires go by a variety of
other names depending upon the interface. Fig 5 shows a simple pipeline,
having only two stages, with the handshaking signals working through it.
The basic rules associated with the simple handshake are:
A transaction takes place any time the
STBline is true and the
BUSYline is false.
The receiving pipeline stage needs to be careful not to ever lower the
BUSYline, unless it is ready to receive data on the next clock.
STBline should be raised any time data is ready to send. The data source must not wait for
BUSYto be false before raising the
The issue with not waiting for
BUSYto be true has to do with avoiding deadlocks. By setting
BUSY, the dependence between the two is removed.
BUSYline should idle in the not busy condition.
While many AXI demonstration implementations idle with the
AXI_*READYline false (their equivalent of a
BUSYline being true), this will only slow down your interaction by an unnecessary clock. Remember, one of the goals of pipelining logic is speed. Making
BUSYtrue when it doesn’t need to be will slow down the pipeline.
STBis raised, the data being transferred cannot be changed until the clock after the transaction takes place. That is, use
(STB)&&(!BUSY)to determine if things need to change.
The data lines are in a “don’t care” condition any time
BUSYlines must be initialized to zero. If you have a need for a reset or a clear pipeline operation, these signals need to be returned to zero on either of these signals.
Since the data lines will be placed into a “don’t care” condition, they don’t need to have any value on reset.
If you look at this handshake from the standpoint of the logic involved, a trace would look like Fig 6.
Pay close attention to the “(Transaction)” line. This line is the key
to understanding the trace. It is formed from the combinational
(STB)&&(!BUSY). This line is the analog of the
CE line in the
CE approach from before. When the (Transaction) line is high, the
data is valid (since
STB was high), and the processing can step forward one
I’ve used this approach many times when building controllers for slow hardware. In those cases, the receiver generally looks like:
As you might have guessed by now, my UART transmitter uses this approach. You can also find several examples of interacting with such a transmitter among the bench tests for the UART transmitter. Perhaps closer to home, you may find this approach to pipelining used by the transmit half of the hexbus module.
The problem with the UART example is that it doesn’t really capture the logic required in any mid-point pipeline stage, only the final stage.
At the midpoint, there are two choices for how to handle things. You can
either register the
BUSY signal and suffer a pipeline stall in between
any two transactions, or you can create a
BUSY signal using combinational
logic. The combinational
BUSY (shown in the example code below) has the
problem that the time required for
BUSY determination accumulates as you
move backwards through the pipeline. This can slow down your logic, so
when this combinational path approaches your clock period it becomes
undesirable. On the other hand, if it’s a
or any other slow peripheral at the end of the logic pipeline
then you might not care about any clocks lost in the
BUSY signal calculation.
All that is to say, here’s an example of how to build a pipeline component with this handshake as both the input and the output to the component:
The final step is to set the output busy line,
o_busy. We use
to record any time our own component is busy. The final busy is set up so
that no empty delay cycles will be necessary, even though it requires some
combinational logic (i.e. borrowed clock time) to do.
This example is easily modified to to remove the combinational accumulation
by simply setting the
o_busy line via the
This will create an idle cycle between pipeline stages, but it will also fix the combinational time accumulation problem.
Examples of this type of handshaking abound. For example, the
has a form of interaction that uses this form of handshaking, although it
changes the signal names a touch. While the
STB name remains the same, the
STALL as the name for its
BUSY line. Likewise, the
AXI bus specification
uses this form of handshaking. Indeed, it uses this form across
five separate hand-shaking channels. AXI uses the terms
*AXI_*READY instead of
We’ve also already discussed the transmit half, i.e. return processing
chain, of the main dbgbus module as
The buffered handshake
If you are trying to go for high speed, for example if you wished to run the
ZipCPU at 200MHz instead of its more
natural 100MHz speed, then the simple handshake method can suffer from severe
timing problems as the pipeline grows in length. This problem is twofold.
First, any time a
BUSY signal needs to pass combinationally from the
stage back to the first stage,
the time required increases at each stage. This can slow down your logic.
The logic chain line can cross large sections of digital logic, incurring
timing delays from one end of the chip to another, while the logic elements
along the way that this signal needs to pass through just contribute to
the pain. Second, while it is
possible to slow the
BUSY signals down, by inserting a stall between
pipeline stages, this can slow the pipeline down by a factor of two.
Another approach is needed.
The way to mitigate this problem to set the
BUSY value with a clocked
register. This means that when a subsequent pipeline stage isn’t ready
(a pipeline stall), it will cost a clock until the
BUSY line can be true.
To avoid losing any data, the data that arrived before the
could go high will need to be stored in a buffer.
For this reason, I’m going to call this a “buffered handshake”.
The “buffered handshake” is going to use the exact same signals as
we showed in Fig 5 above. Further, although Fig 6 shows what the signaling
might look like, it doesn’t capture the concept of the
propagating from the end to the beginning, with the data bunching up in the
middle like an accordion.
Perhaps some pictures would help this explanation. Consider a four stage pipeline, such as the one shown in Fig. 7.
This figure shows four separate pipeline stages, from the data source or
generator, to the ultimate consumer of the data. The pipeline in this
figure is currently full, with each stage having a payload value within it.
As a result, each of the
STB signals is valid going into the next block.
Fig 8 shows what happens if the final stage in this pipeline stalls on the next clock.
Were this the simple handshake we discussed above, all of the
would go true at once.
In the case of the buffered handshake, only one
BUSY flag becomes true.
This means that
Stage hasn’t had an opportunity to set it’s
It has no choice but to ingest
payload #4 or risk dropping it (this would
be bad). Therefore,
Stage has a buffer which it uses to store
payload #4 on the next clock, as shown in Fig 9.
Stage in the position
Stage was in on the last clock.
It cannot push
payload #5 forward, and yet it hasn’t had the opportunity to
BUSY line. It must accept
payload #6. It does so by placing it
into its buffer.
If at this time the data sink now becomes available, it will lower its busy line, yielding an image looking like Fig 10.
As this pipeline clears,
payload #3 and clears its
BUSY flag. It’s now ready to transmit
payload #4 on the next clock,
as well as to received
payload #5 on that same clock, as in Fig 11.
At this point,
Stage can now flush its buffer and the pipeline is clear
again as in Fig 12.
You may notice that this pipeline uses the same
that we’ve used for the simple handshaking approach to pipelining. The
difference with this approach is that the
BUSY signal is registered,
and must wait on a clock to propagate.
The rules defining this behavior are very similar to those for the simple handshake above:
A data transfer takes place any time
(STB)&&(!BUSY)whether the given pipeline stage is ready for it or not.
If the output
(STB)&&(BUSY)are true, but the input
(STB)&&(!BUSY)is true, the data must be stored into a buffer.
BUSYis true on the input, but
BUSYisn’t true on the output, then the buffer’s values can be released and sent forwards and we can set
BUSYfor the incoming data to be false.
So, now that you know the concept, how shall we set up the logic necessary to implement this? We’ll look at how to design the logic for one stage in this pipeline only, since the other stages will use similar logic.
The following logic is rather confusing when it comes to naming, since both
input and output ports share names. I’ll use the
i_ prefix to reference
a wire coming into a stage, whether the
i_stb line coming from the previous
stage or the
i_busy line coming from the subsequent stage. In a similar
manner, I’ll use the
o_ prefix to reference logic leaving this stage,
o_stb sent to the subsequent pipeline stage to indicate that
this stage has something to pass on, or the
o_busy line to send to the
previous stage to indicate that this stage is now busy. We’ll also use the
r_ prefix to reference values within our register, both
r_stb to indicate
that something valid is in it as well as
r_data to indicate the value of
what’s in it.
The first requirement is that the pipeline be empty on any reset.
Next, let’s deal with the case where the next or subsequent stage isn’t
This should be the normal pipeline flow case.
Under normal flow, we’ll want to copy the input strobe
i_stb to the output
o_stb. Further, some data may need to be applied to
create the output
o_data. We’ll use the notation
indicate this. The
logic() function is not intended to be valid Verilog,
but rather to convey the concept of what’s taking place.
If, on the other hand, some data was in the buffer, then
also have been true on this clock.
o_stb must also be true and
(o_stb)&&(!i_busy), a transaction has taken place and
can be copied to
o_data–flushing our buffer.
The next case is the case where
o_stb is false (and
i_busy is true). This
case wasn’t shown in
the diagram series above. It’s basically the case where a pipeline stage
has no data payload within it at all. In that case, we’ll keep
we’ll accept any data, and then set the output
o_stb value to indicate to
the next stage that we have something ready to be read. Who knows, the
i_busy flag might be de–asserted on the next clock and we might not need
The last case to deal with is the case where
i_busy is true,
indicates we have a payload loaded, and we now need to store our input into
our buffer. Hence, we’ll set
r_stb and mark this stage of the pipeline
That ends our giant always block, but we still have one value that we haven’t
r_data needs to be set based upon the input data,
If you need to apply a
logic() transform to
i_data, you can do that here to
That was a lot harder than the simple handshake, now, wasn’t it?
Be aware, the code above hasn’t been tested. Although I copied it from a (working) data width bus translation module, I found some bugs and made some changes along the way. Hence, if you try this and find any bugs, then please please write me at the e-mail address in the postscript below.
You may also notice that
o_busy above are the same signal. I’ve
kept them separate for conceptual understanding, but these two can be
combined into a single signal.
The overall approach, though, is a clear example of how logic resources can be traded to achieve pipeline speed and throughput. Indeed, it is only one of many examples, but its a worthwhile lesson to take away from this exercise. While I haven’t done so, I think that if you count the LUTs used by this routine, you’ll find that all this extra logic has at least doubled the number of LUTs required.
Although this is a useful approach to pipelining, it may easily be more logic than your problem requires. Indeed, very few of my own routines have ever needed to use this buffered handshaking approach. The routines that have needed to use it tend to be wishbone peripherals with complex logic within them–such as my SDRAM controller, my attempt at a DDR3 SDRAM controller, or a wishbone bus width expansion module that I put together for an HDMI video project.
We’ve now walked through several examples of the signaling associated with
pipeline logic. These examples have gone from the simple global
all the way to a buffered handshake approach. Which type of pipeline signaling
you use will be specific to your problem and your needs. However, these
approaches should handle most of the problems you might have.
Let’s come back to this topic of pipelining at least one more time, though, and look at how the ZipCPU handles its pipeline signaling within the CPU proper. This is a more complicated environment, as lots of events can stall the CPU along the way. Indeed, handling the pipeline needs of a CPU can be quite a challenge.
That will then be our next post on this topic.
For the time is come that judgment must begin at the house of God: and if it first begin at us, what shall the end be of them that obey not the gospel of God? (1Pet 4:17)