I’ve now already shared the story of the contract I was on some time ago, where I failed to achieve the performance I promised because I placed too many bus bridges between the ARM and the FPGA that shared the same die.
The easy way around this problem would be to design an AXI interface to my data FIFO. At the time of the project, AXI was too complicated for me to understand. I had tried to build an AXI slave controller several times over, and failed every time.
I recently returned to the task, but this time using formal methods. My first step was to build a formal property file to describe an AXI4 interaction, similar to the one we built for AXI-lite together. I learned several things along the way, too:
According to the AXI specification, “On master and slave interfaces there must be no combinatorial paths between input and output signals.”
Fig 3. Reading from a data FIFO
If you aren’t following me on twitter, then you’ve missed the discussion regarding the meaning of this sentence. I have taken it to mean that all AXI4 signals must be registered. Others have since pointed out that this reading is probably too strict. Perhaps a better reading is just that “combinatorial paths between input and output signals” are disallowed.
Since I already started with the interpretation that all AXI signals must be registered, I’ll stick with it here. Indeed, registering all I/O’s is not a bad thing as Eric LaForest of GateForge Consulting wrote,
Although having a combinatorial path will work (I have seen it on some Xilinx AXI IP), the resulting round-trip time on a ready/valid handshake worsens designs: Shaving a cycle here is not worth the reduced Fmax and harder P&R.
So there are good reasons to do this, even if I was mistaken that the specification required it.
This is then unlike the Wishbone designs I’ve done where the stall signal isn’t supposed to be registered. This also means that it will be really hard to touch any AXI signal without losing a clock due to delay.
I also discovered the problem I had been suffering from: write transactions don’t return one
BVALIDper element written, but rather one
BVALIDresponse per burst transaction. In other words, for every
AWVALID & AWREADYor equivalently for every
WVALID & WREADY & WLASTthere should be one
BVALID. I had thought, erroneously, that it was supposed to be one
WVALID & WREADY.
BVALIDbehavior is very different from write-acknowledgments under Wishbone. It’s also different from the AXI read channel. In the case of the read channel, there is one
RVALID & RREADYfor every requested beat in any transaction.
I’ve also discovered that in spite of all of Xilinx’s training material, their own example core can’t handle high throughput: 1) reads take a minimum of two clocks per beat of data, 2) the core will crash if multiple subsequent requests are made, 3) it can’t handle any significant back pressure, and 4) their core won’t handle reads and writes at the same time.
This is crazy poor performance.
We’ll do better today.
I’ve also read several other blog articles about how to build AXI slaves. Most of these articles focus on building a state machine to handle one transaction at a time. While this kind of performance might be better than Xilinx’s, designing an AXI slave from a state-machine standpoint tends to introduce unnecessary stall signals. My application needed throughput.
I needed to do better. I needed something that worked, and worked well.
So let’s take a look today at what I managed to come up with and why.
I had four primary goals in this exercise. The first was that my new AXI slave core needed to be AXI compliant. Realistically, that should be a given. However, we’ve already discussed how even Xilinx’s example code wasn’t truly AXI compliant, so I needed something new. My next goal was that this new core had to have maximum throughput. This second goal was so important to me that my third and fourth goals were identical: I wanted throughput!
For me, high throughput means that the core can handle one data beat per clock in both directions at the same time. Further, I also wanted to make certain that this high throughput core would never stall any input requests unless it was already processing a transaction on the same channel.
Is it possible?
Somewhere around the bottom of my page of goals, I actually had another interface goal. My previous example/demonstration AXI-lite core wasn’t very easy to integrate into other parts of my design. Instead, I wanted something that was easier to integrate with other logic. So, for this design, I wanted to use this AXI core as a bridge to a simpler interface that would be kept in another file–the one that actually had my logic within it.
That meant I needed an interface that looked like Fig. 4.
This interface was designed to support a very simple slave that could handle every bus transaction in a single clock cycle, but had no capability of stalling.
How does this simplify the logic of the downstream non-AXI slave? Consider the simplest slave write logic I can think of,
and similarly for a read,
Could this get any simpler?
Indeed, there’s a large class of slaves that could handle an interface like this. Some examples include my SD card controller, PWM audio controller, GPIO controller, I2C controller, button/switch controller, real-time clock, interrupt controller, ZipTimer, and much more.
However, most buses include write strobes as well and AXI is no different. It would be a shame not to support them. So we’ll add support for the write strobes to our slave interface as well. To support these, the slave will need to update its logic to look something like the following.
Not all slave devices support write strobes, but we’ll leave that decision up to the downstream implementation and simply include them here.
Of course, the illustration above only demonstrates this interface applied to a block RAM. In actuality, I’d would expect to drive a series of control register and/or buffers instead of this example block RAM, but that’s a discussion to be had whenever we build whatever interface we’ll connect this to.
What this core doesn’t support are stalls on the downstream interface, nor does it support responses that take longer than a single cycle to determine. Those extra features would require FIFO support within the channel, and that’s going to be part of a different development, perhaps even an AXI to AXI-lite implementation.
So how shall we design this high speed AXI slave core?
Designing the waveform
As we’ve discussed before, AXI consists of five channels implementing between them both a write interface (3 channels) and a read interface. Unlike the Wishbone bus, the logic for the two interfaces can be written and processed separately. For now, we’ll start by looking at the read interface, just because it’s the simpler of the two.
Fig. 5 on the right shows an example of a single basic AXI read transaction that we can use for discussion.
The first key requirement of any high performance
AXI slave is that
ARREADY line must be high when the slave isn’t busy.
This keeps us from suffering from a stall signal when a read request is made.
Indeed, the AXI specification
AWREADY) high while waiting for a
request. This will also be our first departure from Xilinx’s demo AXI
The transaction then starts with the request on the read address channel,
as indicated by
ARVALID. This will tell us the address
we want to read,
ARADDR, and the number of items to be read,
Each of these items will take at least a separate clock period, and so
they are often called “beats”. Further, since combinatorial paths are
forbidden between inputs and outputs, we can’t start responding before
the next clock. However, we can respond as early as the next clock
tick. Then, after
S_AXI_ARLEN clock ticks, we’ll set
and return our last item.
Don’t miss this key fact: there are
ARLEN+1 beats in a read transaction,
AWLEN+1 beats in a write transaction. It’s not
AWLEN. That extra
+1 is important, and more than one individual
missed this key detail on one of my weekly formal verification
If we drew this read transaction out into a state machine diagram, we might get something looking like Fig. 6 below.
In this diagram, the read channel starts in the idle state where the
in the bottom corner of the
Idle box indicates that
ARREADY is true. Upon
receiving a burst read request, we’d move to either the middle state,
or the end state,
End, depending upon whether or not
ARLEN==0. In both of these states, the
V in the lower right corner
RVALID is true. From the middle state, once the next to
the last item has been transferred, we’d move to the end state where
RLAST is also true. From the final ending state we can
return to idle. Further, the design cannot advance from either the middle
or the ending state unless
RREADY is also high as noted by the
The neat thing about this design, as I’ve just outlined by the notes in the corners of the states, is that the finite state machine (FSM) signals are the AXI output signals. (See my tutorial if you would like more information on finite state machines.) This allows us to register the output signals without needing a separate set of state registers. Likewise the incoming signals from the AXI master are shown on the transitions.
Now let’s see if we can pack these operations together so that we can do a read with 100% return path utilization. What might that look like?
It would look like two of these read bursts jammed together until their various data beats are just touching with no overlaps, just like Fig. 7 on the left.
From an FSM
point of view, 100% throughput requires a couple of changes.
First, we’ll need to be able to loop from the end state back to the end
state any time we receive a request where
RREADY is set.
This risks the problem of what to do if the
RREADY input from the master
isn’t set, since we can’t change our
ARREADY signal after we notice
RREADY is low without making an output,
ARREADY combinatorially dependent
upon an input,
RREADY. In order to solve this problem, we’ll create a holding
RVALID is high but
ARREADY is low in order to deal with
this possibility. Other than this new
Hold state, our updated
diagram in Fig. 8 is (roughly) the same as Fig. 6 above.
We’ll know we’ve been successful if we can get our design to pass a cover check describing this transaction.
For those not familiar with the SystemVerilog Assertion (SVA)
states that we want to find some way to make the given sequence pass–to
cover() it as it’s called. The
disable iff (!S_AXI_ARESETN) on the
second line means that if, at any time during
the following sequence,
S_AXI_ARESETN becomes active (low), then the sequence
is disabled and any result doesn’t count. Following the
disable iff you
have a series of expressions, looking like
A ##1 B ##1 C etc.
This means that we want to see a clock cycle when
A is true, followed by
B is true, followed by one where
C is true. But what about that
[*3], as in
A ##1 B [*3] ##1 C? That refers to a repetition of
In this case,
B gets repeated
3 times before
C becomes true on the next
clock. So we’re asking the tool to find a time when
A is true, then
true starting on the next cycle but for three separate clocks, and finally
That should very much match our diagram in Fig. 7 above. Hence, when we can make this cover statement true, while still matching all of the formal properties required by the bus, then we’ll know we can run at high speed.
Even better, we can trace these steps through the states and transitions in our FSM diagram, shown in Fig. 8 above. Did you notice that the cover statement didn’t check that the design could properly enter (and leave) every state? Indeed, this cover statement is now starting to appear inadequate. A better cover statement should check every state and transition.
We may have to come back to that later.
We could also trust the formal property checker to check all of these paths instead.
For now, let’s turn our attention to the write channel. A basic AXI write transaction looks like Fig. 9 on the right.
There are four important sections of logic in this figure. First, when the
design is idle we’ll want
AWREADY to be high. That way we won’t need to wait
an extra cycle after
AWVALID goes high before
AWREADY & AWVALID are high
together like Xilinx’s design
did. Second, there’s the
write address transaction, where
AWVALID & AWREADY. Third, there’s a series
of “beats” where
WVALID is true and valid data are sent across the channel.
BVALID is set to true on the clock cycle after
WVALID & WREADY & WLAST.
You may notice from Fig. 9 that I’ve kept the
WREADY line low until after
AWVALID & AWREADY. That just simplified the design below. If I need to
press for lower latency later, I may need to come back and adjust this later.
If you were to place these states into an initial FSM, you might get something looking like Fig. 10 below.
In this figure, you can see all of the four stages of the transaction we just discussed above. Sadly, however, this FSM, diagram is far from complete.
Let’s work through some of the problems. First, my entire goal was
throughput. Fig. 9 illustrates a transaction that takes
N+2 clock periods,
and our state machine in Fig. 10 doesn’t allow back to back transactions. So
let’s start adding some more states and transitions.
- In order to avoid stalls, we’ll need to make sure to raise
AWREADYon the same clock that
WVALID & WLASTis true. Worse, in order to register everything, we’ll need to set
AWREADYon the clock before
WVALID & WLASTare true.
What should we do, though, if
AWREADY & WREADY & !WVALID?
That still leaves us with the condition where we’ll need to transition from
AWVALID & WVALID & WLASTto
WVALID & WREADY & BVALID(and possibly
WLAST). This will require two more states, one for
WREADY & BVALIDand another for
WREADY & WLAST & BVALID.
This last state, however, gives us another problem. What happens when
WVALID & !BREADY? In that case, we’ll need to store our new response in a skid buffer, drop
AWREADY, and wait for
I then tried to diagram out this state machine for you, to help you understand all the various parts and pieces of such a transaction. What I came up with was Fig. 12 below.
Sadly, it’s not very understandable, much less complete. Even worse, this state diagram wasn’t helping me. Instead, I gave up in frustration. There are just too many states and transitions to make this understandable, much less correct.
It’s certainly not simple enough to understand from examination.
So, let me let you in on a secret: I didn’t use a state diagram to build this core.
What?? Why not?
Because, in my mind, this core was never an FSM. It didn’t have the typical FSM structure. In my mind, it was always a series of pipeline stages, together with the logic necessary to handle the transitions from one stage to the next. As I designed this core, I envisioned three write stages, separated by a clock, as shown in Fig. 13 below.
The first stage was the
AW* stage where the
AW* signals would (or wouldn’t)
be valid. In Fig. 13, you can see how this stage will either take a clock
cycle or not, as provided by the
The second stage was the
WREADY stage, ending in the
Timing was driven by this write-data stage, and no more than one request was
ever allowed to be within it. Once
WVALID & WLAST were true, the response
would either go into the
r_bvalid in Fig. 13, or directly to the
BVALID output stage.
Further, a request could always be accepted any time:
r_bvalidstages were empty, a request could always be accepted.
Otherwise, accepting a request required several other conditions to be true together:
BVALIDwas false, or
BREADYwas true, indicating that nothing was blocking at the final stage, and …
b. If something was in the
S_AXI_WREADYstage, versus in the
r_bvalidskid buffer, then
WVALID && WLASTmust both be before accepting a new request.
Note that, in this design,
r_bvalidwill never be true unless
BVALIDis also true. This is a common skid buffer, property, but we can still use a formal assertion if we aren’t confident this will be true.
So, instead of using the
in Fig. 12 that left me more confused, I chose
to focus instead on a signal trace diagram to understand what needed to be
done. My goal, as you recall, was to pack all of these parts and pieces
of a write transaction together as tightly as possible, so that
WVALID being true without interruption across multiple bursts. The
result should look something like Fig. 14 on the left.
The second trick, which has already separated our approach today from many
other approaches to building an AXI
slave, is that we’ve chosen to use the various output signals in place of
any separate FSM
state. This will keep us from making the same mistakes Xilinx
made, when their logic
AxVALID & AxREADY & something_else.
The Skid Buffers
The key to high throughput when using the AXI bus is really the skid buffers. If you are going to do anything with AXI, whether the full AXI standard, AXI-lite, or even AXI stream, you need to understand skid buffers.
Why are skid buffers so important? Because unlike Wishbone, AXI output signals must not depend upon any combinatorial paths from the inputs according to the specification. This is great for maintaining high speed signaling when using high fanout signals, but it becomes a challenge for the designer and they can be an ugly challenge as well for the engineer who wants to verify his design with induction.
I’ve now discussed skid buffers on this blog several times. First, under the name of “double buffers” when discussing pipeline strategies, and then again when we discussed building a functioning AXI-lite slave. More recently, I devoted a whole article to the topic. Eric LaForest also has posted a wonderful article describing them under the name “skid buffers”.
For these reasons, I won’t discuss skid buffers further here, but I will encourage you to go back and examine one or more of these articles before reading further.
The write channel
Are you ready to dive into the code?
The first thing I do when building something to process
signals is to rename the parameters
Xilinx provides. This isn’t because I dislike their
naming convention, on the contrary, I like it: with the exception of the
their convention is quite descriptive. I understand therefore that
C_S_AXI_ADDR_WIDTH is the number of address bits associated with the
slave port to my design. It’s just that using this long identifier means
that either I need to extend the width of my text editor and thus lose the
monitor size gains of the last decade, or I need to abbreviate these values.
Here are my chosen abbreviations:
AW for address width,
DW for data width,
IW for transaction ID width.
These shorthand parameters are defined as
localparams, so Vivado
should still be able to recognize this file.
Further, you may remember that
are not like
WB addresses. In
the address doesn’t reference a word of memory, but rather an octet
of memory. This means
that there will almost always be some number of low order address bits that
aren’t really necessary to any word-level transaction. This is the purpose
LSB short-hand above. When we send our address
to our downstream non-AXI port, we’ll only send the
[AW-1:LSB] bits to
simplify the processing there. Sub-word handling by the external slave
will be done using the
The next step is to handle the incoming write address
I’m going to use the
m_aw prefix to reference the values coming out of
and going into this skid
signals going into the skid
buffer will maintain
their names, and denote a separate stage of processing.
m_aw* signals are the equivalent of their
with the exception being that
m_awready is combinatorially determined.
Similarly, there’s combinatorial logic within the
to generate the other
m_aw* signals. That way this
won’t cost us any clock delays.
The next piece of logic is, in many ways, the key to understanding the write logic. This is where we’ll control the ready lines for both the write address and the write data channel.
As mentioned above, we’ll idle with
AWREADY high and
Following Xilinx’s convention, I’m using the
axi_* as a prefix to
describe registered values that will then drive the outgoing
signals. The exception to this rule is
axi_awready will be the registered component of the ready signal,
m_awready, going into the combinatorial side of the
above. Therefore, imagine, as you go through this, that we are setting
m_awready anytime we set
axi_awready, but remember that we may also
override this registered value to set
m_awready high when accepting
subsequent packed bursts across the channel.
From there, the basic starting logic is simple: once
are true, move to the write data stage of our burst processor. That means
axi_awready until the the packet clears the
Do you remember the post about the most common AXI
It’s a common mistake
to check for anything other than
*ready at the step above.
Were we to do that, we might miss and therefore drop an incoming request.
axi_awready is set so we can accept another packet if this is the
last data value in the burst and if there’s an open position in our outgoing
We don’t need to check for two positions, just one. However,
WVALID & WREADY & WLAST will consume one of the two positions.
Similarly, we’ll drop
axi_wready and leave this processing stage once we
the last data value is accepted.
This gets a bit more interesting, though, in the next step. The
only looks like it is checking for
S_AXI_WVALID & S_AXI_WREADY. Remember,
this is the second
if in a cascaded
if block. Therefore, we are now
S_AXI_WVALID && S_AXI_WREADY and
!m_awvalid || !m_awready. How is this not violating the basic rule of AXI
Originally, when I wrote this code, it didn’t violate the rule because I
S_AXI_AWREADY would be false during the data stage of this
transaction. That was arranged by the code above. This worked well, but it
cost me a clock of throughput on every burst.
Now, and we’ll get to this in a moment, there’s the possibility that
m_awready might be true when
S_AXI_WVALID && S_AXI_WREADY. In particular,
it may be true when
WVALID & WREADY & WLAST, and so we’d be ready at that
time to accept another burst. Hence, I’m not violating the rule, rather
I’m combining two possibilities in the first part of this cascaded if.
We can now move on to the rest of this key logic block. At this point, though,
most of the critical work has been done. All that’s left is to make
axi_awready, once lowered to deal with the write data, is
properly raised again once done.
Specifically, we’ll want to keep
axi_awready clear as long as we are in the
write data stage (the
S_AXI_WREADY stage) of our processing or if the
outgoing skid buffer
is full, i.e.
r_bvalid is true.
r_bvalid will only be true, however, if
then the outgoing skid
buffer is stalled.
Remember, if the outgoing
is full, that is if the internal buffer within it has valid data within it,
S_AXI_BVALID signal at the output must also be true, so we
don’t need to check for it here.
In all other cases, we’ll clear the stall condition so we can accept another write address request.
We’ll also need to copy and cache the values we’ll need to calculate
the burst: the burst type, virtual bus width,
wsize, and the burst length.
In this logic,
next_wr_addr is created by applying our
AXI address logic to the current write address
to get the next address.
The write address
itself now becomes really simple, given the
module we built earlier.
m_awvalid & m_awready, we set
waddr according to the incoming address.
Then, on every accepted write value, it gets set to
All that’s left is to call
axi_addr to get that next write address.
Before we move on to the
for the return channel, let’s not forget to write this
AXI data to our external slave interface.
The neat thing about this is that, by this point in our processing,
this last bit is easy. We can just set these outputs to the
AXI inputs and our registered address,
waddr, and use one little piece of combinatorial logic,
WVALID & WREADY,
to indicate that it is time to write.
This should give our external AXI-unaware slave component plenty of slack to actually process this request. Indeed, by this point, we’ve removed all of the AXI complexity from what this external core needs to process.
It’s now time to start looking at the skid buffer for the return response.
Unlike the earlier
I share the
r_bid signal with the
S_AXI_WREADY stage, which forces this
to have a bit of a different structure–enough so that I’ve written this
write return skid buffer
into the module’s logic rather than referencing a submodule.
The critical part of this
is the valid signal, indicating whether or not something is within it.
In this case, we’re calling this signal
As might be expected,
r_bvalid is cleared on any reset.
r_bvalid is set, however, if two conditions are true, as shown in
Fig. 13 above. To follow, consider that any time
WVALID & WREADY & WLAST we want to set the outgoing
BVALID. The only time
when we cannot set
BVALID is if
BVALID is already set, but
clear (indicating a stall condition). In that case, we still need to note
that we are ready for a second response somewhere else–i.e. in the
skid buffer. Hence, the
is only activated any time a value would be placed into the outgoing
position, but when the outgoing position is already full and stalled..
Once the stall is clear, the contents of the
will always move forward to the outgoing position, so we can then clear the
internal valid line,
I broke the basic mold of the skid
with the transaction ID. We need to record the transaction ID anytime
m_awvalid & m_awready are both true. Since we are guaranteeing that there
will always be a place for any accepted write transaction in either the
skid buffer at the end
or the return response position, we can then just copy the ID from
to the skid buffer’s
data store. If the skid
isn’t valid yet, its data is at least still allocated as part of our scheme
to keep from needing to drop
We can even simplify this somewhat by only checking
m_awready instead of both
Next, anytime the return transaction position is idle we can forward this
transaction identifier to our return register,
This is one of those cases where the logic cannot be simplified further.
(I know, I tried.) What we want is to forward to the outgoing response
position any time
!(S_AXI_BVALID && !S_AXI_BREADY). If we only check for
S_AXI_BREADY, however, the bus master isn’t required to keep this ready
while waiting for a response. We might find that
BID then never gets set
You may also note that this handling of the
ID requires a minimum of two
three clock periods. On the first clock,
m_awvalid & m_awready will be
m_awid. On the second clock period,
r_bid will contain
the current transaction ID. This is also the earliest
WREADY will be true for this transaction. It may also be
the last period of the transaction, were
WVALID & WLAST to be true as well.
Then, on the third clock period,
axi_bid would contain the correct
write burst identifier.
Now that we have the
in please, we can discuss the return
We’ll want to set this signal true any time
WVALID & WREADY & WLAST.
BVALID were already true, then setting it again–even if the
channel were stalled–won’t hurt.
Otherwise, any time
S_AXI_BREADY is true, we can move the
r_bvalid, into the output position.
That leaves one big piece of write logic we haven’t yet addressed,
As I mentioned above, this piece of logic is built combinatorially off of
The one piece of combinatorial logic is required in order to be able to accept a new write address on the last clock cycle of the write data–especially because we can only do so if there is, or is about to be, an open position in the outgoing skid buffer.
First, we know that if
WREADY is true, then there’s at least one position
in the skid buffer–the
one for the write data once it completes. That was
one of the design rules we’ve chosen–one I tried to illustrate in Fig. 13
We’ll now need to make certain, if we set the ready line for the write address
channel, that a second output position is also available. Since our outgoing
buffer only holds only one position and that’s already taken,
the final outgoing
BVALID position must be clear or clearing in order to
accept a new transaction. Hence, we’ll need to check that the
current write transaction can fit in the
S_AXI_BVALID output stage on the
All that’s left of the write channel side is to clean up. Specifically,
following Xilinx’s convention, we’ve set register values
axi_* but the
actual I/O values are named
S_AXI_*. We’ll need to copy our values to
the actual I/O ones.
We also need to set the
BRESP return code.
AXI offers four return code possibilities: OKAY, EXOKAY, SLVERR, and DECERR. Since this simple slave doesn’t produce any errors, neither SLVERR nor DECERR make sense to return. EXOKAY is only allowed if this slave supports exclusive access, which we don’t.
exclusive access protocol is rather complex. It involves a first
request to read a value with
ARLOCK set. If the slave responds with
the slave then needs to record the
of the transaction. If
at some time later, the same register is written to with
AWLOCK high and
no intervening writes, then the slave may return
EXOKAY and adjust the
register. If the slave returns
OKAY instead, the register is not allowed to
Since we aren’t supporting this exclusive access portion of the protocol,
we’ll set the
2'b00 to indicate an
OKAY response. I may need
to come back and add exclusive access support later, but that will be as part
of a separate blog article.
That’s the write channel, and a high speed one at that! Remember, we managed to get 100% throughput (one write burst per clock) once fully loaded. That’s a bit of performance Xilinx’s demo code never achieved.
If this is performance you want in your design, you can find this AXI slave core here.
The Read Channel
Our basic approach to handling any read request will be to aggressively
read any time and as often as we can, pushing as much logic to the left
in this operation as we can. That also means we’ll issue a slave read
request on the same clock that
S_AXI_ARVALID is true.
We’ll start with
axi_arready. As with the write address ready, this one
also idles high.
Then on any request, we’ll drop
Well, not quite. If we did that, then we would require a minimum of two
clocks for every one-burst read: One clock for
ARVALID & ARREADY and a second
!ARREADY. This would prevent us from being able to do back-to-back
reads for short bursts.
So let’s adjust this so that on any read address request, we drop
if and only if
o_rd, our outgoing sub-slave read signal, is also true.
This sub-slave read signal is a combinatorial signal that will only be true
if the outgoing read data channel isn’t stalled. To support burst
transactions, we’ll need to only let
ARREADY stay high if the burst
length is one, or equivalently if
ARLEN==0. Remember, in
the burst length is always one more than
AxLEN, so we’ll go high
immediately on any single item burst request where
o_rd is also high.
Finally, any time the output channel is not stalled, we can check whether we are about to place the last value.
This really doesn’t make much sense without knowing what
axi_rlen is the name of a counter I’m using to store the number of items
currently remaining in this burst. It is initially set to
Ever after, on any read,
axi_rlen is decremented. Once
zero, the read is complete and
AXI_RVALID should be low.
That’s why we can check for
axi_rlen == 2 above. If
axi_rlen == 2, then
on the next cycle,
axi_rlen will be one and
RLAST will then be set. We
then need to be able to accept a new read request, as shown in Fig. 7 above,
or we won’t be able to continue our high speed transaction.
Here’s what the logic for
axi_rlen looks like. Because we are using this
to know our position in a read burst, it must idle at zero. That way
axi_rlen == 0 can also be used as an indication that we are no longer within
On any read request, we’ll set
axi_rlen to the number of items remaining to
be returned, or
Well, that’s the basic idea. That’s not quite it though. The problem is
that, if there’s already an item stalled in the return channel path, then
we still need to return that item too. We’ll
therefore merge that last value, the one that is waiting for
into our “to-be-returned”
This works because the value that’s stalled in the return position has already been read. It won’t get updated by this new read address request.
Any time a read value is returned, we’ll then simply decrement our counter.
Notice that we didn’t check whether
axi_rlen was greater than zero or not.
S_AXI_RVALID is equivalent to
axi_rlen > 0, and a basic assertion (below)
proves that to be the case. Hence we only need to check
How about our read address?
This one’s a little tricky. Because we stuffed our first read operation into
the same clock cycle as
ARVALID & ARREADY, the
we record needs to be not the address given, but rather the
we read on the same cycle. This also means that the values we feed to our
module will need to
depend upon whether the read address is sourced from
Actually calculating the next address requires registering and keeping track of several values from the AXI address packet: the burst type, the transfer size per beat, and the total number of beats. While we’re at it, we’ll copy the read ID as well.
Once we have all these values, we can then calculate the next read address.
This may be one of the more delicate parts of this operation. A quick yosys check,
reveals an image showing that the logic is four levels deep. Adding this address check, therefore, might make it difficult to meet timing. That means that I might need to revisit this step later if/when I ever run into timing problems here.
But I digress.
Now that we know
we’re able to drive the read ports of our (simplified) external slave. These
o_rd, the flag that indicates when to read, and
o_raddr to indicate the address to read from.
We’ll want to read on the same clock as any incoming request, or equivalently
ARREADY is false, indicating that we are still working through
the middle of a burst.
However, if the read return channel is ever stalled, such as when
!S_AXI_RVALID || S_AXI_RREADY then we cannot perform our read lest
we cause our outgoing data to change before it has been accepted.
We’ve also discussed generating the read address above. On any given clock
cycle, this address will be given by
raddr if we are in the middle of a
burst, such as when
!S_AXI_ARREADY, and then by
S_AXI_ARADDR on the first
beat of any burst. We can use
S_AXI_ARREADY to tell the difference between
these two choices.
Notice also that we’ve used
LSB above to drop the subword addressing.
I’ve done this for the simple reason that I don’t have any modules that
would use the subword address.
That gets us past accepting the read request from the address channel, storing the details of that request and updating those details from one beat to the next. We’ve also issued the read request from our external peripheral. We just haven’t set the read output return values yet.
Let’s do that now.
Typically, the most difficult part of any of these channels is either the valid or the ready signal–whichever one you have control over, whether master or slave. These signals are just the type of signals that you can’t afford to mess up. If you mess up the data, the bus will still work. If you mess up the response, your program might fail but the bus won’t lock up. On the other hand, if you mess up the valid/ready signals, the return ID signal, or even the last signal, you might well lock everything up hard.
So let’s keep this simple.
Any time we read from memory, we’ll set the outgoing valid signal.
Keep in mind, this isn’t quite as simple as it looks since we’ve already put
a some combinatorial logic into the
o_rd signal. One of the critical pieces
of logic there is the piece that forces
o_rd to be zero any time the bus is
o_rd is low, then we can clear the return valid signal any time
S_AXI_RREADY is true.
Let’s now turn our attention to the return ID signal next. This particular signal can only change if the return channel isn’t stalled.
In that case, the new ID can either be the ID from a newly accepted burst,
or alternatively the ID we saved in our buffer when the new burst was accepted.
In many ways, the
RLAST signal uses the same logic as the return ID. As
with the ID signal, it cannot change if the return channel is stalled.
The return channel will be stalled anytime
RVALID && !RREADY. The
above expression is just this stall logic negated according to De Morgen’s
If the return channel is not stalled, then we have a couple of cases to deal
with. The first is if we just accepted a request, indicated by
ARVALID & ARREADY. In that case, we’ll set
axi_rlast if the burst
only had a single beat to it.
Otherwise we need to set
axi_rlast so that
axi_rlast == (axi_rlen == 1).
Ideally, we’d just set
axi_rlast <= (axi_rlen == 1) but this ignores the
axi_rlen changing on this clock as well due to a return
value being accepted. Hence, we set
axi_rlen is about
to become one, or if the interface is stalled then we simply set it based
upon the current value of
As for setting
RDATA, the external sub-slave did that for us. Remember
how we insisted that it have the following read logic?
Notice here that the
i_rd signal is key! This allows us to just set the
RDATA output combinatorially based upon this
If the slave were to set
o_rdata on any other criteria, then it might
RDATA while the output channel was stalled, as indicated
S_AXI_RVALID & S_AXI_RREADY. This would be a protocol violation, so it
is important that the slave only set
RDATA whenever the
o_rd is also
The rest of the design simply involves setting our various output signals to their internally registered counterparts.
Well, yes, but the key question is, does it work? We’ve already seen that there were both several bugs in Xilinx’s design, as well as noting that it couldn’t handle any significant throughput. How can we be sure that this design is any better?
For that we’ll turn to formal verification.
The majority of the formal verification work we need will be done by a formal property checker. That will make a lot of this verification work easy.
Including the property file by itself is usually enough to find errors within the design. Indeed, just adding the property file alone to someone else’s core, with whatever strange logic within it that I’m not familiar with, has been all I’ve done to find most of the errors I’ve shared on twitter.
It’s not enough to prove that the design has no errors.
The following are several examples of the properties I’ve used to do this.
We’ll start with the write side.
The formal property file counts the number of outstanding address write requests. In the case of this design, there will never be more than two outstanding burst writes.
Now let’s look at each of those three possibilities, whether zero, one, or two outstanding write bursts.
If there are two outstanding requests, then one of them must be in the
On the other hand, if only one request is outstanding, then it must be in
either the return
BVALID pipeline stage or in the
data pipeline stage.
Finally, if nothing is outstanding on the write channel, then neither
WREADY should be true.
This same logic is then repeated to check burst IDs. We’ll skip that here and move on.
I mentioned above that any time a value was in the outgoing write
that is any time
r_bvalid was true, then
BVALID must also be true.
Let’s double check that here.
We’ll also need to check the
axi_awready signal. As you may recall, this
was the registered component of
m_awready, the incoming
read signal. Specifically, we want to make absolute certain that we’ll
be ready to read any new write address any time there’s nothing in the
write data stage, where
S_AXI_WREADY, or in the outgoing skid
as marked by
r_bvalid. Indeed, this property is so strong, it’s a two
sided implication that we’ll want to check both ways. For that reason,
I’m using an equality assertion to check it below.
One of the neat parts of the AXI formal property set is that one of the outputs from it is the current write address. You can use this to double check your own address computation. We’ll do that here, and so double check that our outgoing address truly does match the address it should be.
With that, we can now turn our attention to verifying the read side. As before, we’ll focus on those properties that are necessary to validate that this works when using induction, since the basic AXI formal properties will take care of the rest.
As with the write half, there will never be more than two outstanding read requests when using this core. Within the formal property set, there’s a counter capturing the number of read bursts–just like the counter for writes above. Here, we’ll just double check that this number is never out of bounds with the logic above.
Now we have to define what that means within our core. In this case, and
for this core, if the number of outstanding bursts is ever more than one,
then there must be a value in the output position waiting to be returned.
Not only that, it must be the last item from the previous burst. Further,
ARREADY signal must be low because we’ll only end up in this situation
if the response path were stalled on the last cycle.
Not only will the formal property set count the number of bursts that are
outstanding, it will count the total number of outstanding values that need
to be returned. This number should match the
axi_rlen value within our core.
Since the external sub-slave core will never stall in this design, we also
RVALID must be true any time any request is outstanding.
Similarly, if there’s anything outstanding, then either
low blocking any incoming requests, or
S_AXI_RVALID & S_AXI_RLAST are high
indicating that we are on the last beat of the burst.
Likewise, if nothing is outstanding, then we want to make certain that
S_AXI_ARREADY is set so that we’ll be ready for the next burst.
The property set also provides for some return ID checking, which I’ll skip
here. There’s also a read address check, which we can use to make certain
o_raddr is properly set. That check requires tracking the address
o_raddr to the return position where it would be checked.
Instead of working through those two tests, and for the sake of keeping things
short, I’d like to share one final read property. Specifically, any time
S_AXI_RLAST becomes true, that is if
S_AXI_RLAST & !$past(S_AXI_RLAST),
then we want
S_AXI_ARREADY to be high as well. We can use
this purpose to check this transition.
While I initially wanted to check for more than that, such as when
S_AXI_ARREADY is false when
S_AXI_RLAST is true, that much has
already been captured above.
Finally, we need one basic assumption, beyond those provided by the AXI property set.
We need to assume that the external sub-slave will behave properly. That is, it won’t change its value unless we’ve requested a read from it.
` That should give you a basic, yet brief, overview of the formal proof.
Normally I spend a section discussing cover properties. In this case, we’ve discussed one such cover property above. I figured it was important to point out that SymbiYosys generated a trace demonstrating that this cover point could be hit.
For some strange reason, it seems like I’m always fielding questions on whatever forum about the AXI-GPIO core provided by Vivado. In particular, there’s a lot of individuals complaining on-line about how slow this core is.
Consider for example, the trace shown in Fig. 16 below, drawn from a
property placed external to the GPIO
Now, consider this: if all you did was connect your own basic I/O core to the output of this core, you could create your own GPIO controller running six times faster than this one. Even better, your new GPIO controller could then even handle both reads and writes at the same time. That’s not all. If you used this core, you wouldn’t need to waste more clocks going through Xilinx’s AXI to AXI-lite bridge.
That is, it would be 6x+ faster if only Xilinx’s interconnect supported such a speed. We’ll save that discussion for another day. Until then, it’s worth knowing that faster AXI interconnects exist.
Sadly, very few of the AXI slave cores I’ve examined can handle this kind of throughput. It’s a shame, too, since most people purchase FPGAs in order to get high speed performance for their own special application–whatever it is. On the other hand, if the logic you use within your FPGA is going to cripple your performance, then why purchase such an expensive component or go through the pain of digital design?
Perhaps the difference is the formal verification software, SymbiYosys as provided by SymbioticEDA. Because of the formal tool, and more specifically because of the formal AXI property file, I was able to build this design with confidence–even though I couldn’t figure out how to build a proper FSM state transition diagram for it.
Yes, formal methods truly make digital design a lot easier.
For wisdom is a defence, and money is a defence: but the excellency of knowledge is, that wisdom giveth life to them that have it. (Ecc 7:12)