For those who’ve been following me on twitter, you’ll know that I’ve been slowing working through understanding the AXI standard. My goal has been to create formal verification IP for either an AXI peripheral or an AXI master. I’ve made a lot of progress, and so I’d like to share some of that with you today.
My journey with AXI actually started some time ago, under a government contract. I needed to move data from the DSP code I had written within the FPGA side of an FPGA+ARM chip onto the Ethernet. I had software running on the ARM processor side that read from a FIFO within the FPGA logic, as shown in Fig. 1 above. Since it was easy to do, I connected to the low-speed interconnect of the Cyclone-V, and then converted that to an Avalon bus, then to a WB bus, and finally to a 64-bit WB bus in order to get to my peripheral. I was later surprised that my design wasn’t working nearly as fast as I wanted.
Once I finally managed to put a wishbone scope into the design to probe the bus interaction, I realized that every time a transformation took place from one bus protocol to another, another clock cycle was consumed to do it. Reading from my FIFO therefore took a clock to convert from AXI to Avalon, another one to convert from Avalon to WB, and then four to convert from my 32-bit WB bus to WB with a 64-bit data width. The return path was similarly slow. I lost a clock within my AutoFPGA generated WB interconnect, another clock clock converting from WB back to Avalon, and then another clock converting from Avalon back to AXI. Or rather, I assume there was a clock lost between AXI and Avalon and back, but I never had access to that part of the design in order to probe it. Either way, my code couldn’t meet the real-time requirements I had given it and I was wasting precious clock cycles in useless bus transformations.
Normally I don’t pay much attention to bus latency. What’s a clock or two
between bus components, right? Usually, this doesn’t hurt me. Since I use
pipelining on the WB bus,
I can send up to one request per clock cycle, and a two-clock latency on an
N item burst brings the total time to
N+2 clocks. This works nicely
for most applications.
Not this one.
In this case, the ARM processor was only ever reading one item at a time. Nothing was pipelined. Every read therefore took the full latency of about ten clocks to process.
No wonder I wasn’t meeting my real time requirements.
The easy way to solve this problem would be to get rid of all the bus bridges, and to switch to the low speed to the high speed AXI bus connection.
Prior to that project, I had tried to create an AXI slave myself. I spent several weeks on it, ending up unsuccessful. So, with government dollars paying for my time, I tried again. I never even made it through the design process, much less verification and simulation. Eventually I gave up rather than to impact the critical path of the project. Several weeks after giving up in frustration, I realized my confusion, but by then it was too late.
In the end, I never met my real-time promises to my customer.
Ever had something like that that just burns you up? Where you made a promise to deliver something to someone and then didn’t deliver?
So now I’m re-examining AXI again. Perhaps if I had a formal property file this time, it would help me through the design process?
So that’s what I’m working on, and that’s what led me to today’s post.
To test whether the property file “works” or not, I’m also building a variety of AXI peripherals. Perhaps you’ve read my ramblings on twitter about all of the bugs found in Xilinx’s example AXI slave? Perhaps you’ve seen the (formally verified) AXI demonstration slave I’ve been able to put together? Or the (formally verified) bus fault isolator, which can be used with a (possibly) faulty slave to keep from needing breaking the bus so badly that only a power cycle can rescue a design? These have been my early successes.
In all of this work, I’ve managed to get far enough along in my understanding of the AXI protocol to be able to discuss a piece of it with you today.
Specifically, I’d like to discuss how addressing works in AXI burst transactions.
Addressing made simple
If you know nothing more about AXI addressing, you need to know this: The AXI address represents the address of the byte, not the word. This is unlike Wishbone, where the address represents the word and not the byte within it. To convert from Wishbone to AXI, add zero bits. To convert from AXI to Wishbone, drop the low order bits.
The rest of the full AXI protocol isn’t quite that easy.
Let’s start at the top. an AXI transaction begins on the address write/read channel.
A write transaction begins when the bus master describes the burst of
information to be written on the write address channel. This includes the
starting address of the transaction, the length of the transaction, and more.
The master then sends the data associated with the transaction to the slave.
Once accomplished, the slave will return a single acknowledgment. You can see
an example of this, drawn from a
cover() statement in the proof of this
slave in Fig. 3.
Reads are similar, as shown in Fig. 4–also drawn from a
applied to the same
core, in that
they also begin on the read address channel. However, instead of being
followed by a channel of data from the master to the slave, the slave
responds instead by returning the data it has read. The last item of data
from the returned by the slave is marked with an
RLAST flag and it concludes
Since both the write address and read address packets are very similar, I’ll
describe both together and just use the
Ax prefix to refer to an address
signal that could be on one or the other of the two channels. When
is true, a transaction request has been placed on the channel. Several other
values associated with this request will tell you the size and length of
the requested transaction.
AxADDR. We just discussed this above. The AxADDR lines reference the byte within the burst.
Just for a fun reference here, I grew up thinking that the word “byte” meant 8-bits of data. It’s not. That’s the word “octet”. A “byte” is the smallest addressable unit of data on a bus. For POSIX compliant CPUs, that’s 8-bits. Other CPUs, to include the original ZipCPU, can have different sized bytes. In my case, I started out supported 32-bit bytes.
The AXI bus supports 8-bit bytes, and each byte can be read or written separately using the
WSTRBsignal–but we’re now getting ahead of ourselves. The key takeaway here is that bytes and octets aren’t necessarily the same thing, but we can use them interchangably when discussing the AXI protocol.
AxLEN. When I first examined AXI, the
AxLENfield appeared to be the biggest reason to use it. With a single request on the address channel, you can request anywhere between one and 256 values by just setting
AxLENto 0-255 respectively.
But how should the address of each of those values be calculated? That’s the purpose of the
AxBURSTis a two-bit value. It describes whether the address is to be fixed (
AxBURST == 2'b00), incremented (
AxBurst == 2'b01), or wrapping (
AxBurst == 2'b10).
In general, the address of each beat of the burst will increment by the number of bytes in a bus word. Only … it’s never that simple. Let’s come back to this in a moment.
AxSIZEis a three bit value referencing the size of the data transfer. The size can be anywhere between an octet,
AxSIZE == 3'b000, two octets,
AxSIZE == 3'b001, four octets,
AxSIZE==3'b010, all the way up to 128 octets when
AxSIZE == 3'b111.
The rule is that
AxSIZEcan only ever be less than or equal to your bus size. Since I tend to work with 32-bit busses, that means any
AxSIZEmust be less than or equal to
These are the four addressing signals we’ll look at today:
AxSIZE. Let’s walk through how to use these as a function
of the burst type.
Types of Burst Addressing
As we mentioned above, there are three basic types of burst addressing: FIXED,
INCREMENT, and WRAP. An
AxBURST value of
2'b11 is reserved, and so illegal.
Let’s look at each of these in turn.
This is perhaps the simplest of all of the addressing modes. The first address
in the burst is given by
AxADDR, and it never changes from there. This
means that, within a burst, we can calculate the next address, we’ll call this
o_next_address, from the last address, we’ll call that
i_last_addr and the
That’s pretty easy.
If I ever get a chance to do a similar contract again, this is how I’d do it.
Today, though, I’m working on formal verification properties. That kind of forces me to support every mode, so I’m not yet done.
Increment addressing is the kind of addressing you may be more familiar with. It’s the type you’d use when doing a memcpy. In its simplest form, we might write:
Only that’s not quite right. AXI allows you to transfer multiple bytes per transaction, and the AXI address references the first byte in each burst. Hence, if we have a 32-bit data bus, we’d want to increment our address by four bytes at a time.
Even this isn’t quite right. While AXI
AxADDR to be an aligned address, all of the subsequent
addresses must be aligned.
Aligning the addresses isn’t all that hard for a 32-bit bus. Indeed, we might simply write,
While we’re making good progress, it’s still not this simple. Remember the
AxSIZE determines our increment, not our bus size.
AxSIZE==0, the increment is one. When it’s
3'b001, we increment by
AxSIZE==3'b010 we increment by four and so on.
i_size to represent
AxSIZE in our pseudocode. We can then write:
That handles our increment. The problem now is that alignment is more painful,
since our ultimate alignment depends upon the
AxSIZE of the transaction.
Voila! We’ve enforced the alignment we need, and so we can now calculate the next address during an increment.
I discovered the problem with this approach when I tried to verify the
AXI example code from Vivado.
I built a slave with only 5 bits of address. In this case,
weren’t defined. So I added in a test of the address width.
The code now no longer referenced any undefined addresses, since the
check found and fixed that.
Only … one of my favorite (unnamed) simulation tools couldn’t handle this.
Perhaps this is because I’m using a very old version of the tool. For whatever
reason, I had to do something to keep this design from referencing bits I didn’t
have. I chose the following solution, therefore. It’s not pretty, but it
AW = 6.
It’s at least good enough that we can move on.
Many memories support address wrapping, and the AXI bus even has a burst type which will support wrapping addresses.
The idea behind address wrapping usually comes from cache accesses. An address request from the CPU might land anywhere within a given cache line. The CPU needs to invalidate and then fill the whole cache line. During this time, the CPU will be stalled until the value it needs is returned from memory. If the value lands towards the middle of the cache line, the CPU would have to be stalled during the entire read. On the other hand, if the memory can return the middle value of the burst first and then wrap around to return the first part of the burst, the CPU doesn’t need to be stalled nearly as long.
This is why address wrapping is so important. Indeed, many SDRAM chips will support address wrapping of some type.
Faster CPUs is a good thing, right?
Well … that may be true, but it certainly makes the bus more complicated.
The basic idea of wrapping is that the address bits act as though only some
bits increment, and others don’t. We can capture this with a bit-wise mask
I’m going to call
wrap_mask. Using this mask, we can write our address
wrapping code as,
Now the only trick is to figure out how to generate
There are also a couple of rules associated with address wrapping that play
into this as well. For example, when using address wrapping the first burst
must be aligned. Similarly, only burst lengths of 2, 4, 8, and 16 are
supported, corresponding to
AxLEN values of 1, 3, 7, and 15 respectively.
We’ll write these rules out in a moment later.
For now, we can calculate our wrap address via,
All put together
The problem with this code, however, is that it takes up way too much logic. You can see this by running:
If you look at the tail end of the output, you’ll find that this design requires the following logic elements:
We can do better. Indeed, we can do much better.
So, while preparing this article, I spent some time trying to optimize this address calculator to both make it simpler, and to lower the amount of logic it uses. Along the way, I made some amazing optimizations and managed to get this core down to fewer than 67 elements. After beating my chest for way to long, I went back back to try to formally verify some of my cores that used this address calculator.
Now I was stuck. Which optimization failed, and how do I roll back just the right ones?
It was time to turn to formal verification.
Formal verification in this context was a bit different from many of the other proofs I’ve done, primarily because what I wanted to do was to make sure that when I optimized my work, the result doesn’t change. Ideally, what I’d like to be able to do is something like comparing two cores, one optimized and one not, to prove that the optimized one still does the same thing as the reference design.
Yes, SymbiYosys can do this too.
The basic setup is shown in Fig. 7 on the right. First, I created a miter
with both the reference address calculation code as well as the optimized code
I was testing.
We’ll call these
ref for reference address
uut for the unit under
Let me pause here for a moment, since
DSZ is going to become a very important
part of our logic in a moment.
If you ever look through a piece of
code like this
you’ll see things like
might read these as the address width and data width respectively associated
with the slave interface. I personally try to avoid these parameter names
if I can, just because I try to fit my logic on an 80-column display and these
long names make my task more difficult. So, instead of using very expressive
terms like these, I regularly use
AW for address width and
DW for data
width. In this case, the two are the same. (Normally I’d using
reference a Wishbone
address width, which as we’ve discussed above doesn’t include the
subword address bits.)
DSZ is a value derived from the data width. It’s designed so that a
DSZ of 0 is equivalent to a
DSZ==1 is equivalent to a
DW==16 and so forth. It has the same representation as
i_size, but rather
describes the maximum
i_size the bus can hold. Two
DSZ values that will
be of high interest to me are
DSZ==2, corresponding to a bus width of
DSZ==3, corresponding to a bus width of 64-bits.
DSZ often in the discussion that follows.
and our uut, or unit under test.
At this point, I could’ve created my two assertions and been done.
The only problem with this quick approach is that the inputs need to be constrained. Many of the possible input values are disallowed by the AXI specification. What we really want to do is to assert that the two circuits produce identical results for valid AXI burst transactions values.
Let’s work through some possible restrictions.
We already known that bursts may be fixed, incrementing or wrapped. An
i_burst value of
2'b11, the third possibility, is illegal as per the spec.
In a similar fashion, the
i_size input can only ever specify a transaction
width equal to or smaller than the current bus size.
We also know that, for wrapping bursts, the length of the burst may only ever
be 2, 4, 8, or 16 beats. Since
AxLEN is one less than the total number of
transfers, this turns into a restriction that
i_len must be 1, 3, 7, or 15.
The specification also says that wrapped bursts must be aligned. So lets check for alignment on the incoming address. Remember the master rule of formal verification: assume inputs, assert any local state and outputs.
Now that we can tell whether or not the address is aligned, we can assume that any wrapped addresses are properly aligned on entry.
There’s one final bit of tweaking required. AXI addresses are not allowed to wrap over 4kB boundaries. That means that only the bottom twelve bits will ever be allowed to change.
This suggests a quick and valuable simplification, one that would force any address bits above 12 to be constant across a burst:
The problem with doing this is that my reference address calculator doesn’t have this code.
The reason why the reference doesn’t make certain that the top addresses are constant is that I’d like to use an assertion in my new AXI4 formal property file to make certain the address pointer never wraps across a 4kB. In other words, my reference requires that addresses be able to cross 4kB boundaries so I can detect a problem, whereas my optimized code won’t let things wrap.
A basic simplifying assumption will keep us from testing the 4kB wrap.
As we’ve written it, our address calculator is highly parameterizable across
a wide variety of bus widths,
DW. To capture this, we’ll define a series
of tasks within our SymbiYosys
We’re now ready to test any optimizations. So let’s take a new look at the code above to se if we can optimize the basic address calculator to use fewer than 392 logic cells, while maintaining the same logic.
Back at ORCONF 2017 in Hebden Bridge, I had the wonderful opportunity to meet Jan Gray, an individual who had managed to place many 600?-LUT RISC-V soft-cores inside a Xilinx FPGA. I asked him about how he managed to get his soft-cores so small. Among other pieces of advice he gave, one which we’ll work with today is to, “Make every LUT within your design justify its existence.”
So let’s apply that attitude to our code today, to see what or how we might optimize it.
Let’s start with the increment.
What if we knew that, in a valid transaction,
i_size would be less than
3'h2 for a 32-bit bus?
Why examine all three bits of
i_size, when you know the top bit will always
be zero? Similarly, if the bus size is already 8-bits, then you already know
the increment can only ever be one.
The following code captures some of these assumptions. While it takes more lines and appears more complex, the logic it generates may well be smaller.
DSZ is equal to
$clog2(DW)-3–essentially our current bus size
but in the units of
i_size. Further, as per the
AXI spec, we know that
i_size <= DSZ.
You’ll notice that if
DSZ is high then we use the same logic we had before.
Now let’s turn our attention to the wrap mask.
Since we’re only going to use this value if
i_burst == 2'b10, and ignore
it otherwise, we can set it for all cases.
My initial approach to set the lower mask bits,
wrap_mask[i_size:0] = -1,
didn’t work with all of the Verilog parsers I use, so I decided to try using
I normally discourage the use of
for loops within Verilog. Most beginning
designers who use them treat them like they would a piece of C/C++ code,
when in actuality the synthesis tool will unroll every loop creating more
logic with each
When used poorly, loops end up creating much more
hardware logic than a good design would in practice.
In this case, I wanted to create a different piece of logic to describe
each bit. Therefore, I tried using a for loop to define each of the bits in
After staring at this for a while, I realized there was no optimizations
within it for the maximum possible
i_size value. Since
DSZ is a parameter,
and since the loop index value,
iB, is essentially a constant once the loop
logic is implemented, I could add a check into the for loop.
DSZ > iB, would allow the synthesis tool to remove any of the
unused possibilities, leaving
wrap_mask unchanged in this case.
We still need to add in the next three bits to the mask, those dependent upon
the length of the burst. When I did this before, I used a series of
If you look a bit closer, though, you’ll notice that
i_len will only
be high if all of the other
i_len[2:0] bits are already ones. Similarly,
i_len will only be high if
i_len[1:0] is high, and so on down. That
means we can calculate the next several bits of the
wrap_mask by simply
i_len[3:1] up to the right location.
Why not check all of
i_len[3:0]? Because we already know that
will be true for all possible wrap bursts.
I also know that the addresses within a burst can never cross a 4kB boundary. These upper address bits must be constant.
Then I got to thinking … I already know that the initial address of any burst with wrap addressing must be aligned from start to finish, so it really doesn’t matter what the lower bits of this value are set to.
This allowed me to back up and remove the
for loop entirely. I then replaced
the shift line with,
Notice throughout these optimizations that we’re primarily using logic based upon constant values. This helps to keep us from increasing the amount of necessary logic, rather than decreasing it.
The next block actually calculates the next address, as we’ve discussed above. The first step, shown below, has really already been optimized.
This brings us to our alignment code, where we force the next address to be
word aligned based upon both
My first attempt to align the next address was to switch our cascaded if statement to a case statement.
This helped, and performed okay, but I thought I might be able to do better.
i_size must always be
i_size <= DSZ, I optimized this
for each of the possible ranges of bits in
i_size. For example, if
DSZ==0, the bus is then only
DW=8-bits wide, and nothing needs to be
done to align an address.
DSZ is less than two, then our bus size must be either
DW=8 bits or
DW=16 bits. In that case, alignment requires only checking
i_size and clearing
o_next_addr or leaving it alone.
While we’ll get to a point of diminishing returns soon, we can still apply this
optimization to both
DSZ < 4 and then just leave things as they were
The final couple of lines then are nearly unchanged. The only difference is
that instead of checking for whether or not
i_burst == 2'b10, we can simplify
this into just checking for
i_burst, since we know that
is already an invalid input.
What about the logic count? After all of this work tweaking and adjusting things, how did we do?
Given that we started with 392 cells, 49 is a tremendous improvement.
But how many LUTs
does this logic consume? A 7-series
has 6 input LUTs,
LUT3s, and so on. To make scoring logic usage a touch
more difficult, two
LUT5s can be combined into a
LUT6–provided that they
share the same five bit inputs. Therefore, I’ll often summarize a logic count
by adding the number of
LUT6s to the number of
LUT5s, to the maximum of
the number of
LUT1s, to the maximum of the number of
LUT2s. While this most certainly
overestimates the score, it does manage to capture some of the effects of
packing into our metric.
Here’s the final logic score, therefore: before optimization we needed 171 Xilinx 6-LUTs. After optimization, our new design can perform an AXI addresses calculation using only 24 LUTs on a bus having 32-data bits and 32-address bits. That’s a reduction of over 7x, so it was definitely worth our time.
AXI is certainly one of the most complicated buses I’ve ever worked with. It’s so complicated that I’m only discussing AXI addressing today. However, in order to properly respond to a bus request from an AXI master, a slave core will need to properly generate the addresses of any transaction it is responding to.
One fascinating discovery I made along the way is that Xilinx’s example AXI
doesn’t even implement all of this logic. Instead, it only implements
a 32-bit bus width and
i_size == 3'b010 addressing. You’d think for the cost
of only 14 LUTs,
the cost of implementing all of the
i_size options, they might have included
Of course, since AXI is so complicated, there’s still plenty more to discuss. If the Lord wills, I’d like to return and present the bugs I’ve found in Xilinx’s example AXI slave core, the formal property file’s I’ve managed to create in order to formally verify a (full) AXI4 slave core, as well as some of the other cores that became easy to build once I had the property files to work with.
These topics, however, will all need to wait for another day.
Remember now thy Creator in the days of thy youth, while the evil days come not, nor the years draw nigh, when thou shalt say, I have no pleasure in them (Ecc 12:1)