Since writing my first two articles on AXI-Lite, the first discussing how to verify an AXI-Lite interface and the second how to build an AXI-Lite slave, I’ve had the opportunity to build not just one but several AXI-Lite slaves. The cool part is, I’ve come across some really easy ways to do it that I thought might be worth sharing.
Before we start, one warning:
If you are interested in building an AXI-Lite slave the easy way, don’t start with vendor IP! It’s broken. Xilinx’s AXI-Lite demo code has been broken since at least 2016. They’ve promised fixes in 2020, but I haven’t seen them yet. Intel’s designs are also broken (as is their forum or I might’ve reported the bugs).
No, the place to start is with a formal property file. From there, you can either use a skid buffer or not, your choice, depending upon the performance you want from your AXI-lite slave. In both cases, though, we’re going to look today at how easy we can make building an AXI-Lite slave.
As you follow along below, consider the chart showing the various AXI signal names shown in Fig. 1 on the right. The chart is organized into columns by channel: there’s the write address channel with signals prefixed by AW, the write data channel with signals prefixed by W, the write return channel with signals prefixed by B, the read address channel with signals prefixed by AR, and the read return channel with signals prefixed by R. In our slave below, we’ll follow Xilinx’s example and add the additional S_AXI_ prefix to each of these signal names.
The top row of this chart shows the pair of handshaking signals, *VALID and *READY, required for controlling data flow on each channel. The next row shows the AXI-Lite signals we’ll be working with today. The three rows below that show AXI signals that aren’t a part of the AXI-Lite protocol.
For the sake of today’s discussion, let’s allow our slave to have four registers. We’ll call them r0, r1, r2 and r3.
Please, before we go further though, don’t embarrass me. If you copy this logic for your own designs (and I expect you might), rename these registers! I’ve just seen too many folks starting with Xilinx’s AXI-Lite demonstration that then leave their registers named stupid things like slv_reg0, slv_reg1, etc. While that might be great for a demonstration design, it’s completely inappropriate for any practical designs.
Indeed, if you look at some of my own examples, you’ll see I’ve given my own registers names that match closer to their meaning. Examples include cmd_abort (user has commanded an abort), r_busy (the core is busy working), the address to write to, cmd_length_w (length command, in words), and a flag for whether or not to increment the address. Register values don’t need to be 32-bits in length either. In one particular example, a register is composed of many little bits of information, r_continuous among them. These fields are then all concatenated together into w_status_word. For now, just do yourself tomorrow a favor today, by making your code more readable than my example below.
The only reason why I’m using registers r0 through r3 is because I’m trying to create a generic example that will be applicable for all purposes. (Yeah, I know, do as I say not as I do … but trust me on this one.)
Let’s give each of these four registers a default value of zero,
and allow them to be written to any time the signal
axil_write_ready is true.
The registers you set in your core may have some other default values. That’s okay.
We’ll come back and discuss two separate ways of setting axil_write_ready further down. This will be the signal we use internally to determine when and if we actually want to write to one of our registers.
For now, note that wskd_data is the data we wish to write to the register. We’ll discuss how to set that later as well. For example, it might either be the output of a skid buffer, or be S_AXI_WDATA itself–but we’ll get to that in a moment. In a similar fashion, awskd_addr is that portion of the write address that can be used to distinguish between write registers.
You might also note that we haven’t used the write strobes, S_AXI_WSTRB. While I suppose we might ignore them, that’s probably not the greatest idea, especially since the specification states explicitly that a master wishing to abort a transaction should set S_AXI_WSTRB to zero. Hence, our implementation should really support these strobe signals.
Sadly, the logic required to support a write strobe is … annoying. Inside Xilinx’s demo, for example, you find all this Verilog code per register:
This just looks complicated, and it’s certainly much harder to read.
Let’s clean that up instead, shall we?
Let’s instead create a Verilog function to apply our write strobes to a prior piece of data, producing a new piece of data. Remember, if S_AXI_WSTRB[0] is true, then we want to adjust bits 7:0. If S_AXI_WSTRB[1] is true, then we’d want to adjust bits 15:8, and so on. If none of the strobes are true, then nothing should be changed. This little function below captures all of that.
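A minimal sketch of such a strobe function might look like the following. The function name apply_wstrb, the parameter C_AXI_DATA_WIDTH, and the wrapper module wstrb_demo with its ports are names assumed for this illustration only; the wrapper exists just so the function can be compiled by itself.

```verilog
module wstrb_demo #(
	parameter	C_AXI_DATA_WIDTH = 32	// Assumed name; AXI-Lite uses 32
) (
	input	wire [C_AXI_DATA_WIDTH-1:0]	i_prior,	// Current register value
	input	wire [C_AXI_DATA_WIDTH-1:0]	i_new,		// Data from the bus
	input	wire [C_AXI_DATA_WIDTH/8-1:0]	i_strb,		// One strobe per byte
	output	wire [C_AXI_DATA_WIDTH-1:0]	o_result
);

	// For every byte lane, pick the new byte if its strobe is set and
	// keep the prior byte otherwise
	function [C_AXI_DATA_WIDTH-1:0] apply_wstrb;
		input [C_AXI_DATA_WIDTH-1:0]	prior_data;
		input [C_AXI_DATA_WIDTH-1:0]	new_data;
		input [C_AXI_DATA_WIDTH/8-1:0]	wstrb;
		integer	k;
		begin
			for(k=0; k<C_AXI_DATA_WIDTH/8; k=k+1)
				apply_wstrb[k*8 +: 8] = wstrb[k]
					? new_data[k*8 +: 8]
					: prior_data[k*8 +: 8];
		end
	endfunction

	assign	o_result = apply_wstrb(i_prior, i_new, i_strb);
endmodule
```

With all strobes low, the result is simply the prior data, which is the "nothing should be changed" case described above.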
As I’ve mentioned before, I don’t typically use Verilog functions. As with most things in hardware, functions in hardware don’t do the same things that they do in software. Just as loops in Verilog create more hardware, functions in Verilog specify how to create more hardware. Further, submodules can also be used for much the same thing–so functions aren’t really all that useful in Verilog contexts. They do have their place, and I think this one will help us quite nicely while still keeping all of our logic within a single module.
Once we have a function for updating our register available to us, applying the write strobes to a lot of registers gets a whole lot easier. Here, we’ll take a series of 32-bit registers, and apply the write strobes to each.
These wskd_rN registers now contain what would be the result of applying the write strobe to rN on every clock tick. That means we can use wskd_rN instead of wskd_data when setting our registers below.
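Putting those pieces together, a sketch of the register write logic might look like this. The module name regwrite_demo, the strobe name wskd_strb, and the two-bit width of awskd_addr are assumptions for illustration; the strobe function is repeated in compact form only so that the example compiles by itself.

```verilog
module regwrite_demo (
	input	wire		S_AXI_ACLK, S_AXI_ARESETN,
	input	wire		axil_write_ready,	// True when a write is accepted
	input	wire [1:0]	awskd_addr,		// Word address: selects r0..r3
	input	wire [31:0]	wskd_data,
	input	wire [3:0]	wskd_strb,		// Assumed name for the strobes
	output	reg  [31:0]	r0, r1, r2, r3
);
	// Byte-wise merge of new data into prior data under strobe control
	function [31:0] apply_wstrb;
		input [31:0]	prior_data, new_data;
		input [3:0]	wstrb;
		integer	k;
		begin
			for(k=0; k<4; k=k+1)
				apply_wstrb[k*8 +: 8] = wstrb[k]
					? new_data[k*8 +: 8]
					: prior_data[k*8 +: 8];
		end
	endfunction

	// What each register would become if it were written on this clock
	wire [31:0]	wskd_r0, wskd_r1, wskd_r2, wskd_r3;
	assign	wskd_r0 = apply_wstrb(r0, wskd_data, wskd_strb);
	assign	wskd_r1 = apply_wstrb(r1, wskd_data, wskd_strb);
	assign	wskd_r2 = apply_wstrb(r2, wskd_data, wskd_strb);
	assign	wskd_r3 = apply_wstrb(r3, wskd_data, wskd_strb);

	initial	begin r0 = 0; r1 = 0; r2 = 0; r3 = 0; end
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
	begin
		// Default values of zero, as chosen for this example
		r0 <= 0; r1 <= 0; r2 <= 0; r3 <= 0;
	end else if (axil_write_ready)
	begin
		case(awskd_addr)
		2'b00: r0 <= wskd_r0;
		2'b01: r1 <= wskd_r1;
		2'b10: r2 <= wskd_r2;
		2'b11: r3 <= wskd_r3;
		endcase
	end
endmodule
```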
That’s a whole lot simpler to understand than Xilinx’s demonstration code, now isn’t it? Admittedly, the comparison isn’t really all that fair, since my copy of Xilinx’s example implements 32 registers and the demonstration logic above only implements 4, but I still think the example above is a lot easier to read.
You may even find that this structure is too complex for your needs. Don’t
be afraid to split this logic block up into one block per register, such as
the code for
r0 below, if it’s appropriate to do so for your design.
Sadly, this only mostly covers the task of setting registers. You may still have registers that can only be set if certain conditions hold. For example, in one of my own cores, I set an error flag, r_err.
Since the register makes a good example of what could be done, and how following the script above isn’t always the right thing to do, let’s take a really quick look at how this r_err value was set. First, the register is cleared on any reset.
This r_err register was drawn from a stream data to memory copy core. I wanted to know if the FIFO within was ever overrun–even when the core isn’t busy. Hence, we’ll set the error on any overflow.
The next step is the hidden write–matching our example code above: If the write address matches the address of the control register, the write strobe for the byte containing r_err is set, and the user writes to that bit to clear the error, then the error flag can be cleared.
This lets the user know if the FIFO was ever overrun, or if the bus is so slow that it can’t keep up with the stream and data gets dropped.
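As a rough sketch of that kind of conditional register, consider the following. Everything specific here is an assumption made for illustration: the module name rerr_demo, the fifo_overflow input, the CTRL_ADDR and ERR_BIT parameters, the wskd_strb name, and the choice of clearing the flag by writing a one to it.

```verilog
module rerr_demo #(
	parameter [1:0]	CTRL_ADDR = 2'b00,	// Hypothetical control register address
	parameter	ERR_BIT   = 0		// Hypothetical bit position of r_err
) (
	input	wire		S_AXI_ACLK, S_AXI_ARESETN,
	input	wire		fifo_overflow,	// Hypothetical: FIFO written while full
	input	wire		axil_write_ready,
	input	wire [1:0]	awskd_addr,
	input	wire [31:0]	wskd_data,
	input	wire [3:0]	wskd_strb,
	output	reg		r_err
);
	initial	r_err = 1'b0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		// Step one: clear the flag on any reset
		r_err <= 1'b0;
	else if (fifo_overflow)
		// Step two: any overflow sets the flag, busy or not
		r_err <= 1'b1;
	else if (axil_write_ready && awskd_addr == CTRL_ADDR
			&& wskd_strb[ERR_BIT/8] && wskd_data[ERR_BIT])
		// Step three: a strobed write of one to the flag's bit in
		// the control register clears it (assumed convention)
		r_err <= 1'b0;
endmodule
```

Giving the overflow priority over the clearing write means an error arriving on the same clock as the clear is not lost.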
No, this isn’t part of today’s easy AXI-Lite core. I’m just showing this tidbit of complex AXI-Lite logic to illustrate that this approach to setting and adjusting registers can be much more complex than we are showing here–and it’s not all that hard to do. Indeed, today’s basic “easy logic” lesson applies to equally to the more complex cores.
There’s one final step common to all AXI-Lite slave components: reading from the registers. Now that we’ve written to our registers above, we can now read from them. Today, we’ll read from the register indicated by the read address any time axil_read_ready is true.
Did you notice that we didn’t use our axil_read_ready signal at all? It’s not really required when reading. Instead, we adjusted our outgoing read data any time the outgoing read channel allowed us to.
That’s not necessarily a low power solution. Wires that toggle when they don’t need to consume unnecessary power, so let’s adjust this logic again so that the outgoing read data is zero any time we aren’t reading. Further, since not all designs need this sort of low-power treatment, we’ll create a parameter, OPT_LOWPOWER, which (if set) will then be used to control whether or not the read data should be zero whenever there’s no data being read.
This adjusted logic starts off a touch different, since we now need to clear our read data register on any reset–something we didn’t have to do before.
After the reset, though, our logic looks familiar again.
That is, it looks familiar until we get to the end.
Here, at the end, we’ll set our outgoing value to zero if ever OPT_LOWPOWER is true, and we aren’t currently reading (i.e. axil_read_ready is low). The OPT_LOWPOWER part of this is key. Since it’s a 1-bit parameter, if the parameter is ever set to zero, the synthesis tool will quietly remove this logic from our design–making it a no-cost “feature” when it isn’t used.
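The read logic just described might be sketched as follows. The names regread_demo, arskd_addr and axil_read_data are assumptions for this illustration, and the registers are brought in as inputs only to keep the example self-contained.

```verilog
module regread_demo #(
	parameter [0:0]	OPT_LOWPOWER = 1'b1
) (
	input	wire		S_AXI_ACLK, S_AXI_ARESETN,
	input	wire		axil_read_ready,	// True when a read is accepted
	input	wire [1:0]	arskd_addr,		// Assumed name: read word address
	input	wire [31:0]	r0, r1, r2, r3,
	input	wire		S_AXI_RVALID, S_AXI_RREADY,
	output	reg  [31:0]	axil_read_data		// Assumed name: drives RDATA
);
	initial	axil_read_data = 0;
	always @(posedge S_AXI_ACLK)
	if (OPT_LOWPOWER && !S_AXI_ARESETN)
		// Only the low-power version needs to clear the data on reset
		axil_read_data <= 0;
	else if (!S_AXI_RVALID || S_AXI_RREADY)
	begin
		// The outgoing channel isn't stalled, so we may change RDATA
		case(arskd_addr)
		2'b00: axil_read_data <= r0;
		2'b01: axil_read_data <= r1;
		2'b10: axil_read_data <= r2;
		2'b11: axil_read_data <= r3;
		endcase

		// Last assignment wins: hold the bus at zero when idle
		if (OPT_LOWPOWER && !axil_read_ready)
			axil_read_data <= 0;
	end
endmodule
```

With OPT_LOWPOWER set to zero, both the reset branch and the final override are constant false, leaving just the case statement.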
How much did this little OPT_LOWPOWER excursion cost us? About 96 logic elements out of a 51 element design. How much did it gain us? Well, the jury’s still out on that one–I’m just adding in these tests to my cores now, and I haven’t gotten to the point yet where I can verify that doing so is valuable (or not).
We’ve now gotten to the point where we can write to and read from our four registers, except that we didn’t really handle the bus signaling yet. That’s next.
Let’s now turn our attention to that portion of a simple AXI-lite slave that would be common between any of our implementations: the return channels.
The xRESP signals are easy: we’ll just leave them at the OKAY response. That means that there will never be any errors when attempting to interact with this simple core.
From here, we’ll move on to the BVALID signal. This signal needs to be set following any successful write to our core, and it needs to remain set until S_AXI_BVALID && S_AXI_BREADY are both true together. We can simplify clearing this register to just checking S_AXI_BREADY, since clearing BVALID when it is already low does no harm.
The read return handshaking logic is almost identical to the write logic. There are only superficial changes here, so this should look really familiar to what we just did above for the write return channel.
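Both return channels might be sketched together like this. The module name axil_return_demo and the internal register names axil_bvalid and axil_rvalid are assumptions for illustration. The sketch also assumes that axil_write_ready and axil_read_ready are only ever raised when the matching return channel isn't stalled.

```verilog
module axil_return_demo (
	input	wire		S_AXI_ACLK, S_AXI_ARESETN,
	input	wire		axil_write_ready, axil_read_ready,
	input	wire		S_AXI_BREADY, S_AXI_RREADY,
	output	wire		S_AXI_BVALID, S_AXI_RVALID,
	output	wire [1:0]	S_AXI_BRESP, S_AXI_RRESP
);
	reg	axil_bvalid, axil_rvalid;

	// Write return: set on any accepted write, cleared once BREADY is seen
	initial	axil_bvalid = 1'b0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		axil_bvalid <= 1'b0;
	else if (axil_write_ready)
		axil_bvalid <= 1'b1;
	else if (S_AXI_BREADY)
		axil_bvalid <= 1'b0;

	// Read return: the same structure with the read signals swapped in
	initial	axil_rvalid = 1'b0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		axil_rvalid <= 1'b0;
	else if (axil_read_ready)
		axil_rvalid <= 1'b1;
	else if (S_AXI_RREADY)
		axil_rvalid <= 1'b0;

	assign	S_AXI_BVALID = axil_bvalid;
	assign	S_AXI_RVALID = axil_rvalid;

	// 2'b00 is the AXI encoding of the OKAY response
	assign	S_AXI_BRESP = 2'b00;
	assign	S_AXI_RRESP = 2'b00;
endmodule
```

Checking the new request before the ready signal lets a response stay valid on back to back requests, which matters once the front end can accept a request on every clock.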
That’s all the easy work. It’s also the signaling that would stay the same no matter how you implemented the front-end of this AXI-Lite core: with or without skid buffers. Now it’s time to dive into the part that would change depending upon how you wanted to implement the front-end.
Without Skid Buffers
Let’s take a look at how we might handle the incoming valid/ready handshaking. Specifically, this includes how to handle S_AXI_AWREADY, S_AXI_WREADY, and S_AXI_ARREADY. These are also the signals Xilinx messed up when they built their demonstration core.
The difficult part about these ready signals is backpressure. If the master holds BREADY low, the slave must know to lower AWREADY and WREADY. The same is true on the read side: if the master holds RREADY low, then the slave needs to know to lower ARREADY. Because these aren’t cases people normally think of when writing simulation scripts, these signals are easy to get wrong when testing via simulation alone.
In general, there are two ways to deal with the incoming channels: either with or without skid buffers. With skid buffers, your slave will be able to achieve lower latency and higher throughput. Without the skid buffers, your slave will have less logic and only 50% throughput, but it will still be a valid AXI-Lite slave.
In this section, we’ll examine how to handle these handshakes the easy way–without using any skid buffers.
Let’s start with the write side again. We’ll follow (and fix) Xilinx’s approach and only raise S_AXI_AWREADY when both S_AXI_AWVALID and S_AXI_WVALID are true. This will synchronize the two channels together–an important part of any AXI slave.
A first draft of this logic might look like,
While this might work most of the time, it won’t work all the time.
Indeed, if we were to leave this logic like this, then we’d be making the same (rough) mistake that Xilinx made with their core. The problem is that we didn’t check for backpressure. So, let’s add that check in to our logic, and make certain that axil_awready is low if ever the output channel is stalled. That is, we aren’t ready if ever BVALID is high while BREADY is still low.
While this is closer to what we want, we’re still not there. With this logic alone, it is still possible that axil_awready might be true on the same cycle that BVALID && !BREADY were also true. (Remember, axil_awready is registered, and so it has to be set one clock earlier!) Were axil_awready to ever be true while BVALID && !BREADY, a write response would get lost and our bus would hang–much like Xilinx’s demo will hang.
Let’s fix this by throttling our writes down to one write every other clock
cycle. We’ll also clear
awready following any reset for good measure.
We can now set both of our write ready signals to be equal to this one, and know that they’ll be properly synchronized.
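The throttled ready logic described above might be sketched like so. The module name axil_awready_demo is an assumption for illustration; the signal names follow the text.

```verilog
module axil_awready_demo (
	input	wire	S_AXI_ACLK, S_AXI_ARESETN,
	input	wire	S_AXI_AWVALID, S_AXI_WVALID,
	input	wire	S_AXI_BVALID, S_AXI_BREADY,
	output	wire	S_AXI_AWREADY, S_AXI_WREADY
);
	reg	axil_awready;

	initial	axil_awready = 1'b0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		axil_awready <= 1'b0;
	else
		// Become ready for exactly one clock when:
		//  - we weren't ready on this clock (one write every other clock),
		//  - both the address and the data are waiting on us, and
		//  - the write return channel isn't stalled
		axil_awready <= !axil_awready
			&& (S_AXI_AWVALID && S_AXI_WVALID)
			&& (!S_AXI_BVALID || S_AXI_BREADY);

	// Both write channels share the one ready signal, which keeps
	// them synchronized
	assign	S_AXI_AWREADY = axil_awready;
	assign	S_AXI_WREADY  = axil_awready;
endmodule
```

Because axil_awready was low on the clock where the check was made, no new BVALID can be generated in between, so BVALID is guaranteed to be low on the clock where the write is actually accepted.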
Next, in order to match our logic above and to be able to use the same logic both with and without a skid buffer, we’ll rename some of our signals.
Let me back up for a moment and discuss ADDRLSB. This is another one of those values Xilinx got wrong. It’s supposed to be equal to the lowest address bit of the word address. So, for a 32-bit word, this should be bit 2, leaving bits 1:0 to be used to specify which byte within the word a read is supposed to start from.
AXI supports sub-word accesses nicely via the AxSIZE signal. Using that signal, we might be able to tell if a read or write was for 8-bits (AxSIZE==3'b000) or 16-bits (AxSIZE==3'b001) instead of all 32-bits (AxSIZE==3'b010). AXI-Lite doesn’t have this signal. Instead, AXI-Lite only has the WSTRB signal and even that only applies to writes. In other words, these sub-32-bit address bits really aren’t that useful for us, so we can simply drop them.
How many bits should be dropped? Given that AXI-Lite is only ever a 32-bit data width, the answer is an easy 2-bits. But what if you wanted a 16-bit data width, or a 64-bit width? Then you might consider writing something like Xilinx tried,
The only problem is that I’ve seen this code copied into AXI (full) cores. That’s right, into cores that don’t have a fixed 32-bit width where this calculation doesn’t match reality. (In one particular example, someone used this calculation on a 128-bit bus, only to struggle with the fact that his core only ever wrote every other word …) The correct setting for ADDRLSB is the base-two logarithm of the number of bytes in a bus word. This will then evaluate to 2 for a 32-bit bus, as we would want.
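One way to write that calculation, sketched here under a few assumptions, uses Verilog's $clog2. The module name addrlsb_demo and the parameter names C_AXI_ADDR_WIDTH and C_AXI_DATA_WIDTH are assumed for illustration. ADDRLSB would normally be a localparam; it is declared as a parameter here only so the port list can use it.

```verilog
module addrlsb_demo #(
	parameter	C_AXI_ADDR_WIDTH = 4,	// Four 32-bit registers
	parameter	C_AXI_DATA_WIDTH = 32,
	// log2(bits per word) - log2(8 bits per byte) = log2(bytes per word)
	//   32-bit bus -> 2,  64-bit bus -> 3,  128-bit bus -> 4
	parameter	ADDRLSB = $clog2(C_AXI_DATA_WIDTH)-3
) (
	input	wire [C_AXI_ADDR_WIDTH-1:0]		S_AXI_AWADDR,
	output	wire [C_AXI_ADDR_WIDTH-ADDRLSB-1:0]	awskd_addr
);
	// Drop the sub-word (byte select) bits, keeping only the word address
	assign	awskd_addr = S_AXI_AWADDR[C_AXI_ADDR_WIDTH-1:ADDRLSB];
endmodule
```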
The address bits from ADDRLSB on up can now be used to specify which word we wish to transfer–whether r0, r1, r2, or r3. These will be the address bits we focus on.
Finally, we need to create a signal to indicate that a value is ready to be written, axil_write_ready. The easy answer here is to use the same signal we are using to accept the write request into our core. This can then be used by all of our write logic above to tell us when to write a new value to one of our registers.
Sometime after Vivado 2016.3, Xilinx fixed the write bug in their AXI-Lite demonstration core. (As of this writing, they have yet to fix the read bug.) Their updated core can handle one write every three clocks. You’ll find that the logic above is much simpler, and it will even handle one write every two clocks–a nice throughput improvement–as shown in Fig. 5 below.
Simplicity is a good thing.
Reads are even easier to accomplish than writes. In this case, we can just set S_AXI_ARREADY to be the complement of S_AXI_RVALID. This allows us to hold S_AXI_ARREADY high until a read request comes in, and then immediately drop it until the read response has gone out.
The neat part about complementing logic like this on a LUT-based architecture, is that the complement can often (not always) be folded into the LUT that would read this signal, and so this becomes a zero cost signal.
You can see how well we can handle reads in Fig. 6 below.
Finally, we’ll read from this IP any time S_AXI_ARVALID && S_AXI_ARREADY are true, and we’ll read from an address given in S_AXI_ARADDR.
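The read side without a skid buffer might then be sketched as below. The module name axil_arready_demo, the arskd_addr name, and the parameter names are assumptions for illustration.

```verilog
module axil_arready_demo #(
	parameter	C_AXI_ADDR_WIDTH = 4,
	parameter	ADDRLSB = 2		// 32-bit bus: drop two byte-select bits
) (
	input	wire					S_AXI_ARVALID,
	input	wire [C_AXI_ADDR_WIDTH-1:0]		S_AXI_ARADDR,
	input	wire					S_AXI_RVALID,
	output	wire					S_AXI_ARREADY,
	output	wire					axil_read_ready,
	output	wire [C_AXI_ADDR_WIDTH-ADDRLSB-1:0]	arskd_addr
);
	// Ready for a request any time there's no response waiting to go out
	assign	S_AXI_ARREADY = !S_AXI_RVALID;

	// A read takes place on any accepted request
	assign	axil_read_ready = S_AXI_ARVALID && S_AXI_ARREADY;

	// The word address of the register to be read
	assign	arskd_addr = S_AXI_ARADDR[C_AXI_ADDR_WIDTH-1:ADDRLSB];
endmodule
```

Since S_AXI_RVALID is itself a register, S_AXI_ARREADY is still only one inverter away from a flip-flop.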
With Skid Buffers
The problem with the really simple AXI-Lite logic above is simply throughput performance. The most it will ever perform is one transfer every other clock tick. If you want performance from an AXI-Lite core, you’ll want to add skid buffers to your design.
You should also realize, however, that you’ll be fighting an uphill battle. Xilinx’s infrastructure isn’t built for AXI-Lite performance. Just fixing your AXI-Lite core won’t fix their crippling AXI to AXI-Lite bridge, but I still do it as a matter of pride in my own workmanship. That said, there are AXI to AXI-Lite bridges that will preserve the 100% throughput of AXI, and there are AXI-Lite crossbars if these are things you are interested in. You just have to know where to look to find them.
A skid buffer is a really simple piece of logic that converts a combinatorial ready signal to a registered one, as shown in Fig. 7 above.
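To make the idea concrete, here is a minimal skid buffer sketch with unregistered outputs. The module name skidbuffer_sketch, the port names, and the DW parameter are all assumptions made for this illustration.

```verilog
module skidbuffer_sketch #(
	parameter	DW = 8			// Width of the data being buffered
) (
	input	wire		i_clk, i_reset,
	// Incoming (upstream) side
	input	wire		i_valid,
	output	wire		o_ready,
	input	wire [DW-1:0]	i_data,
	// Outgoing (downstream) side
	output	wire		o_valid,
	input	wire		i_ready,
	output	wire [DW-1:0]	o_data
);
	reg		r_valid;	// True if the buffer holds an item
	reg [DW-1:0]	r_data;

	initial	r_valid = 1'b0;
	always @(posedge i_clk)
	if (i_reset)
		r_valid <= 1'b0;
	else if ((i_valid && o_ready) && (o_valid && !i_ready))
		// An item arrived while the output was stalled: hold on to it
		r_valid <= 1'b1;
	else if (i_ready)
		// The held item (if any) has been accepted downstream
		r_valid <= 1'b0;

	// Always capture the incoming data while we are accepting it
	always @(posedge i_clk)
	if (o_ready)
		r_data <= i_data;

	// The upstream ready comes straight from a register
	assign	o_ready = !r_valid;
	// Present the held item first, otherwise pass the input through
	assign	o_valid = i_valid || r_valid;
	assign	o_data  = r_valid ? r_data : i_data;
endmodule
```

The downstream i_ready may be combinatorial, yet the upstream o_ready depends on nothing but r_valid.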
The key to getting high performance from any AXI slave is to place a skid buffer on all the incoming channels, AW, W, and AR, as shown in Fig. 8 on the left. As you may recall from our earlier skid buffer discussion, this allows the various READY signals to be registered, even though the ready logic we need is combinatorial.
Here’s how easy this gets. First, place a skid buffer on both the AW and W channels. They’ll need to have an appropriate width for the write address, write data, and write strobes.
Since we have two channels, and two sets of handshaking signals (one for each channel), we’ll need two skid buffers.
Then we’ll accept a write request (and unstall the skid buffers above) any time there’s both a write address and write data available. That is, unless the outgoing interface is stalled.
It’s that simple.
Now we can just repeat the process for the read channel. First, we’ll add the skid buffer.
Then we accept a read request any time one is present, and the outgoing
R channel isn’t stalled.
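Those two acceptance conditions might be sketched as follows. The names awskd_valid, wskd_valid and arskd_valid (the valid outputs of the three skid buffers) and the module name axil_skid_ready_demo are assumptions for illustration. In a full design, these two ready signals would also be fed back to the skid buffers as their downstream ready inputs.

```verilog
module axil_skid_ready_demo (
	// Valid outputs of the AW, W, and AR skid buffers (assumed names)
	input	wire	awskd_valid, wskd_valid, arskd_valid,
	input	wire	S_AXI_BVALID, S_AXI_BREADY,
	input	wire	S_AXI_RVALID, S_AXI_RREADY,
	output	wire	axil_write_ready, axil_read_ready
);
	// Accept a write when both address and data are available, unless
	// the write return channel is stalled
	assign	axil_write_ready = awskd_valid && wskd_valid
			&& (!S_AXI_BVALID || S_AXI_BREADY);

	// Accept a read when a request is present, unless the read return
	// channel is stalled
	assign	axil_read_ready = arskd_valid
			&& (!S_AXI_RVALID || S_AXI_RREADY);
endmodule
```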
Here in this context, the skid buffers seem like less work than without. This isn’t quite the case. The reality is instead that the skid buffers hide the complexity of the AXI channel signaling within them, making things look simple here. As a result, instead of 51 logic elements, we’ll now be using closer to 114. It’s still small beans, but it is over twice the size.
How well does this core perform with the skid buffers attached? Check out the write performance in Fig. 9, where 4 write beats are accomplished in 9 clock periods in spite of three stall clocks and the write data being misaligned by a clock period.
If you look carefully at Fig. 9 above, you’ll notice that certain values disappear for a time. For example, the A0 (white) value vanished for a clock period before the white BVALID response was generated. Similarly, the D2 values vanished while the yellow transfer was stalled. Those values were maintained for us within the skid buffers, making sure that we didn’t lose them in spite of the fact that the various interfaces have stalled.
The read performance, shown in Fig. 10, is also similar in that 4 read requests are returned in 8 clock periods in spite of 3 stall cycles.
The write performance would have been as fast as this read performance, if I had chosen to issue the write address and write data to the core on the same clock cycle–something the master could easily choose to do.
How about verifying this core?
You should now be able to pass a bounded model check of any length.
How about an unbounded model check?
In this case, all we need to do is to correlate the three counters,
1) the number of write address requests outstanding,
2) the number of write data requests outstanding,
3) the number of read requests outstanding,
against what our logic expects.
For example, if we aren’t using the skid buffers, there should never be more than one item outstanding. (We don’t have storage for more …) Not only that, but any time BVALID is true there should be exactly one write address or write data item outstanding. The same is true of RVALID, making the proof easy.
The proof is a bit harder for the skid buffer case. In this case, we need to count what’s in the skid buffer against our number of counts. Hence, if the skid buffer ever drops the outgoing ready signal, then there’s an item sitting in the skid buffer waiting to be accepted by the core. (Feel free to check out Figs. 9 and 10 above to see this in action.) We can count these extra items with something like (S_AXI_AWREADY ? 0:1)–knowing that AWREADY will only ever be low if there’s something in the buffer. Other than that change, the counter checks should look the same.
At this point, we’ve proven that the bus works. We haven’t really proven that our core works, so you might want to consider adding some logic to check that the design actually reads from your registers as you might like.
Finally, there’s one last check. That is, we wanted to make certain that RDATA was zero if there was nothing to return–but only if OPT_LOWPOWER was set. This is easily checked and verified.
I’ve put the whole proof together into five parts. Two verify that the AXI-lite signaling is handled properly, first without skid buffers and then with the skid buffers. The next two double check that we can actually write four values to the core and read four values from it–letting the solver pick which four values. This was the purpose of the traces shown in Figs. 9 and 10 above–specifying that we wanted to be able to check how fast four values might be returned. The last part checks the OPT_LOWPOWER logic.
It’s unfortunate that the vendor AXI-Lite examples are so broken, because building and verifying that this slave was protocol compliant wasn’t really all that hard to do. The trickiest parts involved handling any potential backpressure and guaranteeing that all outgoing signals were properly registered.
I realize I haven’t really used all of the AXI-Lite signals. For example, we haven’t used the low order address bits nor have we used the AxPROT signals. The reason why not is simply because we didn’t need to. Indeed, there’s a strong argument to be made that AXI is way more complicated than it needs to be, but we can leave that discussion for another day.
Until then, feel free to modify this core for your own purposes. Don’t forget to check out how easy it is to formally verify that it works along the way.
Lo, this only have I found, that God hath made man upright; but they have sought out many inventions. (Eccl 7:29)