Building a Skid Buffer for AXI processing
I’m currently in the process of building and verifying several AXI cores, primarily for the purpose of proving that my AXI formal Verification IP core works. Some examples of these cores include:
- Xilinx’s demonstration core
- My own AXI slave core
- My own AXI crossbar
- A Bus Fault Isolator
- A WB to AXI converter, and
- An AXI to WB converter. This latter one will actually be two converters, AXI-write to WB bridge and AXI read to WB bridge just like the AXI-lite to WB converter was also split into an AXI-lite (write) to WB converter and an AXI-lite (read) to WB converter. The two will then be connected together with a WB arbiter, just like it was with the AXI-lite to WB converter.
Most of these cores have already passed a formal verification check. None of these, however, have passed the FPGA check save perhaps the Xilinx core–which I’m assuming others have used although I haven’t used it myself.
These cores are all currently kept in my Wishbone to AXI bridge repository. They aren’t there because they really belong there, but rather for a lack of a better place.
I’ve already blogged about formally verifying Xilinx’s AXI demonstration core. I’ve even blogged about formally verifying Xilinx’s AXI-lite demonstration core, as well as demonstrating how to build a bug free AXI-lite core. I’d like to do the same for my AXI (full) slave core as well.
Indeed, I’d like to blog about some or all of these other cores. They each have some very fascinating and useful features.
-
For example, the AXI slave core is designed to be able to sustain 100% throughput on both read and write channels. Xilinx’s core, for comparison, was only able to achieve just less than a 50% read throughput, and something close to 100% on the write channel, although it didn’t quite get there.
-
The AXI crossbar is unusual in several respects. First, it is unusual in that it is a public, open source, formally verified formally verified crossbar. Every tried to simulate both halves of a design, both the master and the slave side of Xilinx’s interconnect? An open source AXI crossbar, capable of being Verilated, would be very powerful for this purpose.
Further, if either Xilinx’s AXI demo core, or their AXI-lite demo core is any indication, then this crossbar will have over twice the throughput. Similarly, both of these cores had latent bugs within them, having not been formally verified. Xilinx’s crossbar may have similar latent bugs or limitations within it. While I’d love to know, I don’t have access to the logic within their crossbar to find out.
-
Finally, my brand new bus fault isolator will allow you to connect an unverified AXI design to a larger system, knowing that the bus fault isolator will identify any descrepancies between your core’s AXI interface and the formal properties we’ve discussed that can be used to verify and arbitrary slave, and return a bus error in the case of any error.
Just think about that for a moment. When I worked with the Cyclone-V, I had a bug in my own design where two Wishbone bus responses got collapsed into one. The ARM on the Cyclone-V then hung waiting for that response. It never came. No matter what I tried, I couldn’t get access into the design to see what had happened. Had I had this bus fault isolator, the fault in my broken design would have been detected and a bus error returned. I could have then used logic to dig into what was going wrong to find the bug. Even better, the bus fault isolator now has a recovery mode, allowing access to the slave after a reset period.
Does this sound like a “get out to FPGA-Hell for free” card at all?
There is one key component, however, in all of these designs. Without this key, I wouldn’t be able to make any high performance AXI designs. That key component is the skid buffer.
I know, I called these “double buffers” some time ago, but I am really starting to like the term “skid buffer”. It captures the idea much better, and so I’m going to switch terminology and start calling these things “skid buffer” from now on.
If you are going to build or otherwise work with an AXI design, you really need to understand a basic skid buffer. Indeed, that’s really the whole occasion for this post: I was going to post about my AXI slave core, and I realized that I was either going to need to spend a long section explaining skid buffers, or I would need to separate that material into its own post.
The basic concept of the skid buffer
So, just what is a skid buffer? A skid buffer arises from the need to create a stall signal in a registered data only context.
Just to illustrate with an example, the ZipCPU doesn’t use skid buffers. As a result, if the CPU needs to wait on a long instruction, such as a memory load or a divide, then the read-operands stage needs to stall lest an instruction be lost. If the read-operands stage stalls, then the decode stage needs to stall. If the decode stage needs to stall, the prefetch needs to stall and so on.
Because the ZipCPU uses combinatorial stall signals internally, by the time the stall signal gets to the prefetch stage there’s not much slack left in the signal before the next clock edge. Indeed, this was one of the problems I had when I tried to run the ZipCPU at higher clock rates. (It wasn’t the only problem …)
If I had used skid buffers instead, the stall signal could have then been registered, breaking the timing accumulation.
The problem is that if the stall signal is registered, then the previous processing stage in the pipeline doesn’t know about the stall until it finishes its processing and registers its values at the next set of flip-flops. At this point, the data needs to go somewhere or get dropped.
Enter a skid buffer, such as the one shown in Fig. 1 at the right.
The goal of the skid
buffer
in Fig. 1 is to bridge the divide between combinatorial logic on the one side
and the registered logic on the other–given that the outgoing stall signal
(i.e. !o_ready
) can only be a registered signal.
In this case, I’ve used the AXI
signaling convention, so this skid
buffer
has VALID
and READY
signals on both incoming and outgoing interfaces.
There are two big criteria this
skid buffer
must meet. First, if ever
VALID & !READY
, the respective data values must remain constant into
the next clock cycle. Second, no piece of data may be lost along the way.
The first of these is very easy to express formally.
In all of my AXI cores, I have wanted to use the data as soon as possible. This means that I don’t want to add any more buffers or path logic to my incoming data than necessary.
Hence, when all is going well and nothing is stalled, then the skid buffer needs to operate like a pass through device, as illustrated in Fig. 2.
In this case, both the incoming valid and data signals pass through the core and go directly to the output.
If the output port is stalled however, then we need to copy everything to an
internal buffer, i.e. the “skid” buffer, lest the input data values get lost
on the next cycle. This is the meaning of Fig. 3 on the right. In this
figure, the incoming valid and data lines are copied directly to the buffer.
This gives the buffer its own internal valid and data lines. Further down,
when we start discussing the implementation of this
core
we’ll name these r_valid
and r_data
respectively–but I’m getting ahead of
myself.
On the next clock cycle, the
core
can output the incoming values from the last
cycle–the ones it just buffered. Further, the incoming interface may
move on to its next value, but by now the incoming o_ready
signal has
fallen, so the module feeding this one now knows that it needs to wait.
This is shown in Fig. 4 on the left.
The neat thing about all of this is that the logic necessary to implement a skid buffer is fairly straightforward, so let’s take a peek at what it takes build one of these today.
Optional Enhancements
After building this first skid buffer, I quickly realized there was a need for a similar skid buffer, but with the output sides reversed. That is, could a skid buffer be created where the incoming side was combinatorial and the outgoing side was registered?
Our skid buffer might then look like Fig. 5.
In this case, the biggest difference are the set of flip-flops used to register the outputs.
That sounds simple enough. But what if I wanted both? What if I could use
combinatorial logic to create the outgoing interface whenever the
skid buffer
was used on an incoming AXI
channel, and combinatorial logic on the incoming interface whenever the
skid buffer
was used to drive an outgoing AXI
channel? To support both purposes, I created a parameter which could be used
to select between them, OPT_OUTREG
. When this parameter is set, the
skid buffer
will register all outputs. When it isn’t set, the outputs can
then be combinatorially driven. In all cases, the upstream stall signal
will be registered.
Surely that’s simple enough to build, right?
Not quite. I wanted more.
I’ve noticed that with a lot of my
Wishbone bus
designs, that the
Wishbone signals
have a very high fan out. They send data all across the chip. When such
high fanout signals get set in a manner so as to minimize LUTs, they might
end up transitioning often–even when the valid signal (STB
for
Wishbone)
is low and so no reader is listening on the other end.
Further, every one of these high-fanout transitions takes power.
What if I instead wanted to force any unused data lines to be zero when the valid signal indicated the output was inactive? This might prevent unnecessary transitions, and perhaps even lower the power usage of my designs. (This remains to be determined, but this design is part of my preparation for an experiment or two to see if this is so.)
For this, I assigned another options, OPT_LOWPOWER
. OPT_LOWPOWER
is
really defined by a set of
formal
properties. If OPT_LOWPOWER
is set, then anytime !o_valid
is true then
o_data
should be zero.
Supporting both of these options also means that we’ll be essentially designing four separate skid buffer designs, depending upon the options chosen. These options will be aintermingled throughout the design to control logic generation as well. Therefore, when it comes to verification we’ll also need to make certain that we verify all four of the designs found within this code.
Verilog code
As is my normal practice, I’ll skip most of the introductory code, and jump right into the skid buffer example itself.
Or perhaps we’ll just mostly skip this introductory code.
I would like to point out that both OPT_LOWPOWER
and OPT_OUTREG
are
single bit parameters, making true/false logical tests simpler and keeping
us from accidentally trying to set a value to something other than true (1)
or false (0).
The internal skid buffer
itself is captured by two signals. The first, r_valid
, just
indicates that the internal buffer has valid data within it. This is shown
in Fig. 3 above, which shows data going into the internal buffer, as well as
in Fig. 4 above, which shows data coming out of the internal buffer.
The basic logic for this r_valid
signal is that we want it to go high any
time there’s a valid incoming signal but the outgoing path is stalled.
We can then return to normal operation once the incoming ready signal returns to normal, acting again as a pass through device.
The data logic is even simpler. Any time the outgoing combinatorial side is ready, we’ll just quietly copy the incoming value into our buffer.
However, the logic above doesn’t preserve our OPT_LOWPOWER
property, shown
below, that r_data
should be zero whenever r_valid
is false.
To implement this low power logic, we’ll need to make certain that r_data
starts at zero. Not only that, but any time the design is reset then r_valid
will be reset to zero, and so we’ll need to set r_data
to zero in both of
those cases–but only if OPT_LOWPOWER
is set.
This also means that any time the outgoing side isn’t stalled we’ll need
to hold r_data
at zero as well.
Finally, we can copy the data any time the outgoing/upstream side isn’t stalled, just like before.
Or rather, we can’t because that’s not quite right. If we are in both
OPT_LOWPOWER
mode, and we are registering our output, then we need to make
certain we only set this value when i_valid
is true. Otherwise if
OPT_LOWPOWER
is true, the input and r_valid
properties would force the
output to be zero.
While I could write this as,
I prefer expanding the logic out using De Morgan’s laws. The condition below, therefore, captures the same logic.
There’s also one very profound key feature to a skid buffer implementation that I missed for the first several years I used them: the output stall signal is given by the internal buffer’s valid signal. The two are signals completely equivalent. Ok, I’ll admit I didn’t believe it myself until I ran the formal proof, but that’s beside the point. In this case, since we are using AXI READY/VALID notation, this means that the outgoing READY (not stalled) signal is the opposite of our VALID signal.
I built and implemented many skid buffers before realizing this. Even once I first saw this equivalency, it still took some time (and formal proofs) to believe it. That said, it nicely simplifies any implementation.
Now that we’ve dealt with the internal buffer, we can move on to the outgoing interface. We’ll need to split this logic into two sections, though: One section for the simpler case where the outgoing registers are not buffered, and another for the case where they are.
In the unregistered case, we’ll want our output port to be valid any time either the input port is valid, or if there’s data in our skid buffer.
This is also the combinatorial side of the interface, so you may note the use
of the always @(*)
.
As for our output data, we’ll want that to come from the buffer any time the buffer is active, or be a pass through otherwise.
Well, almost. What if the incoming i_data
didn’t observe the low-power
property? In that case we’d need to only set o_data
to the incoming
i_data
value if i_valid
were also set, otherwise we’d want to force the
output to be zero.
Otherwise the outgoing interface logic seems simple enough. But what about the case where we register the outgoing data?
That one is just a touch trickier.
Perhaps the valid line isn’t any different, save for the reality that it can be reset.
That said, I’ve been burned by this kind of logic before, so I’ve gotten to
the point where I always build it with the structure above. Notice the
if (!o_valid || i_ready)
condition. This is the piece that’s caught me up
a couple of times. It’s basically the same as saying
if (!(o_valid && !i_ready))
but rewritten using De Morgan’s
laws, and so it describes
any time the outgoing interface is not stalled.
My problem is that I keep wanting to add other logic to channels like this, much like we discussed in the article about the most common AXI mistake.
Be forwarned: if you play with registered signals using this basic handshake, you will want to use this pattern and no more! How do I know this? Because every time I do something different, the formal tools correct me. This seems to be the only valid approach to signals subject to the rules of this type of handshake.
Does that mean that this format will apply to the o_data
signal as well?
Absolutely!
We start with resetting o_data
anytime OPT_LOWPOWER
is set, and then
we refuse any further logic if the output is stalled.
The key to making this work in a registered context is now found within the next tidbit of logic. First, if there’s something in the buffer, then that needs to move to the output port. If not, but if something is coming in on the input port, then we’ll set to that output instead.
As before, though, there’s an optimization we can take, but not if we are
in the OPT_LOWPOWER
mode.
Before I leave this topic, notice the key feature of how I’ve used
OPT_LOWPOWER
: If it isn’t set, then all of the OPT_LOWPOWER
logic
(save the initial statement) just goes away. Since OPT_LOWPOWER
is a
constant, the synthesizer can handle optimizing this logic away
if OPT_LOWPOWER
is ever clear. The same is also basically
true of OPT_OUTREG
, but there’s more going on with that signal.
That’s all there is to the implementation of a skid
buffer.
As you can see, the logic is really simple, and there are really only two
internal registers associated with it: r_valid
and r_data
. If the outputs
are registered as well, then o_valid
and o_data
will also be registered.
That leads us to the next step: proving that this implementation works and that it does what it is supposed to.
Formal Verification
We’ve already seen several of the formal properties above. I’ll repeat those again below in a moment. For now, let’s start with the reset properties.
Following any reset, all of the valid lines need to be cleared. We can assume this of our input signal.
In this case, I declared this as a named property–a feature of the SystemVerilog Assertion language. I’ll come back to this in a moment and either assert or assume it.
We’ll also want to assume that any time the incoming interface is stalled,
that is any time there’s valid data at the input but o_ready
is low, the
valid signal needs to continue into the next clock cycle and the data isn’t
allowed to change.
Now here’s why I’m declaring these as named properties: when I went to verify my AXI slave core using this buffer, I realized that the assumptions might void the proof. They had to be converted to assertions for that proof, while left as assumptions within.
To handle this, I created a SKIDBUFFER
macro to determine if the properties
should be assumed or asserted. Using this macro, I can choose to assume
or assert as necessary.
Those are the only two assumptions describing the incoming interface.
On the outgoing side, we’ll quickly repeat the reset property: following any reset, both valid signals need to be cleared.
We can now start walking through both our internal and output signals.
The big rule we want to preserve is that any time there’s an outstanding
request on the output port that’s stalled, i.e. o_valid && !i_ready
,
then the request must remain the same on the next clock.
The disable iff (i_reset)
just means that we won’t check this test if the
reset is ever high. Personally, I think this goes without saying, however,
the formal
tools have been known to disagree with me from time to time.
The second rule tries to capture the “no data shall be dropped” policy. Specifically, if there’s data on the incoming port, then it either needs to go to the output or it needs to be buffered.
What about the other cases? Well, if either !i_valid
or
i_valid && !o_ready
, then nothing happens on the input port that we need
to worry about. Since r_valid
is equivalent to !o_ready
, we know that
the only interesting case is the one in which r_valid
is low. If r_valid
is low and i_ready
is high, the core is a simple pass through and a quick
code inspection will prove that works. That leaves the case where r_valid
is
low and i_ready
is also low–the case we covered above.
This doesn’t quite capture everything, though. We’ve now discussed how information should flow through this design, but not how the design should return to idle. That’s important, and I’ve been burned by not checking the return to idle before. Hence we want to make certain that the design will return to idle.
So any time i_ready
is true on the outgoing interface then everything should
be cleared. On the next clock, o_valid
should only be true if i_valid
is
also true.
But what if we are registering the ports on our outgoing interface?
In that case two rules shall apply. First, any time an input is accepted,
then o_valid
should be high on the next clock.
Second, any time i_ready
is true and there’s nothing on either the input
or in the buffer, then o_valid
should clear on the next clock.
That checks both the rise and the fall of o_valid
. Seems simple enough.
But what about r_valid
?
Well, if r_valid
is ever true while the outgoing port is i_ready
, then
the skid buffer gets copied to the outgoing port and r_valid
must be
deasserted on the next clock. This is the case from Fig. 4 above.
What if something was also coming in on the incoming interface? It won’t
happen. Remember, o_ready = !r_valid
. Therefore if r_valid
is high,
the incoming interface is stalled and so we can ignore it.
That leaves the two special low power properties that we discussed above. We
only want to enforce those if OPT_LOWPOWER
is set, and we want to ignore
them otherwise. Therefore, we’ll use a generate block to capture these
checks. That means that if OPT_LOWPOWER
isn’t set, the synthesis tool
(i.e. yosys) won’t even create the
logic to support these checks.
Those are all of the properties we need to know this works, but does it really work?
For that, we’ll turn to cover.
Cover
Unlike the safety (assert/assume) properties above which can be proved true if no trace can be found that makes an assert false while keeping all of the assumes true, cover only succeeds if at least one trace can be found. Cover is very useful for finding faults in your assumptions, proving that particular operations can take place and more.
It’s also very valuable when you just want a trace showing that the design works.
So let’s build such a trace that starts and ends with the core idle.
In the middle, we’ll insist that the i_ready
line toggle from high to low
twice before settling back on high.
Oh, and did I mention that we only want to check cover if this unit is
verified in isolation? Otherwise it may be that the parent module doesn’t
ever make this cover()
statement true–something that wouldn’t be a fault.
The resulting trace is interesting, but it could be better. In particular, the incoming data is all zeros. While that’s valid, it’s not very revealing. I’d rather be able to “see” from the trace that the various data lines were properly progressing. Perhaps if we cound insist that the incoming data be a counter?
The easy way to do this is to add another register, and some logic associated
with it. Let’s call this f_changed_data
, and use it to indicate that our
data “properly” changes throughout the cover trace. That is, f_changed_data
will capture if the incoming data counts up–just so we can visualize
the trace and what is going on easier.
In many ways, this isn’t your typical “formal” property. It doesn’t use any
of the formal language features (except $past()
). However, if the whole
world starts to look like a Verilog problem, then the solution is easily a
simple piece of Verilog logic.
We’ll start by setting this f_changed_data
flag to true, and the clear the
flag if any rule of what we want to see for our cover statement changes.
i_data
is only allowed to change on the clock following
!i_valid || o_ready
.
In this case, we’ll just clear f_changed_data
any time i_data
doesn’t
increment.
Similarly, we want clear this value any time the input hasn’t been valid constantly.
Voila! We now have a wonderful trace in Fig. 6 showing how this core works.
Yes, this trace has been edited, but only minimally.
SymbiYosys Script
In most cases the SymbiYosys script used to drive a proof like this is very basic. In this case, however, there are a couple of key features of the script that are worth discussing.
First, you may recall above that I mentioned we would need to do four separate
proofs in order to try all of the combinations of our OPT_LOWLOGIC
and
OPT_OUTREG
parameters. Each of these can be separated into its own
proof using the [tasks]
section of the SymbiYosys configuration
file. We’ll
also create a single task for our cover proof.
If you’ve never seen a [tasks]
section of a SymbiYosys
file
before, then you are in for a treat. Each line within this section
defines a separate
formal
run. The first identifier on the line gives the name of the run, and the
subsequent identifiers are tags that are then applied to the run and useful
when configuring it.
The second section, the [options]
section, shows the first reason to use
tasks: All of our induction proofs can be accomplished within 3 steps, whereas
the cover pass will take 20. Here you’ll notice that every task above had
either the prf
tag or the cvr
name (also a tag). This allows us to
set different depths for each of the passes.
This design is simple enough that it doesn’t really matter what engine we use, so we’ll use the default.
The real action, however, is in the [script]
section. This section contains
a series of commands to be given to yosys
to control how the design is processed. That also means that you can use the
yosys help
command if you come across
anything confusing.
We’ll start the script off by defining the SKIDBUFFER
macro, and then
reading
our code into yosys.
Often my designs consist of many parts. For these designs, I’d spend one line here reading in each of the input files.
The next step, however, is to control the parameters externally.
yosys has a new extension to the hierarchy
command to make this easier.
Basically, the hierarchy
command finds the top level module and instantiates
all of the logic below it. In our case, we want to instantiate specific
logic depending upon the proof. Therefore we are going to use the chparam
option to hierarchy
to set these parameters.
This is different from the approach I discussed before. Before, I would have written,
This usage had some problems, and so it has now been deprecated. One of those
problems was that yosys would re-elaborate
the module with every call to chparam
. Any elaboration errors due to
incompatible parameter settings along the way might cause the whole process
to halt.
Instead, using the hierarchy
command, every parameter is set at once. Hence
we might want to use,
The problem with this approach is simple: what if you have twenty different tasks, all with different repeats of the same options?
In this case, SymbiYosys’ python interface can come to our rescue.
The following script will check our two parameters independently, and create
a string variable, called cmd
, containing the hierarchy
line with the
appropriate values in it. Then, when the output(cmd);
call is issued at
the end, the cmd
string will be written into the individual
yosys scripts driving each of their
respective proofs.
The final [files]
section is fairly unremarkable. It just lists the files
used in this proof. In this case, it is only the
skidbuffer.v
file.
SymbiYosys will copy this file into its processing directory prior to running the proof.
The whole proof can now be run using,
Alternatively, you can integrate it into your formal verification Makefile, and just run
Feel free to take a loot at the Makefile I’m using for these AXI projects should you need an example.
Conclusion
Skid buffers are very powerful, and very useful, especially when using AXI. Indeed, I’m using skid buffers throughout almost all of my various AXI designs. Over and over it’s the same logic, so it makes sense to create a single file to capture this logic and simplify the design. They really are just that useful.
I just wish I’d separated this logic into its own module earlier, since now I have many, many copies of the same logic that need to be maintained. To this end, I’d like to thank Eric LaForest for setting a better example for me to follow. I would also commend his blog article on skid buffers to you for further reading.
For those of you who are not familiar with the SystemVerilog’s concurrent assertion language, or who do not have access to the commercial SymbioticEDA Suite, you might find this discussion of alternatives and equivalents to concurrent assertions valuable.
The desire accomplished is sweet to the soul: but it is abomination to fools to depart from evil. (Prov 13:19)