Understanding AXI Addressing
For those who’ve been following me on twitter, you’ll know that I’ve been slowing working through understanding the AXI standard. My goal has been to create formal verification IP for either an AXI peripheral or an AXI master. I’ve made a lot of progress, and so I’d like to share some of that with you today.
My journey with AXI actually started some time ago, under a government contract. I needed to move data from the DSP code I had written within the FPGA side of an FPGA+ARM chip onto the Ethernet. I had software running on the ARM processor side that read from a FIFO within the FPGA logic, as shown in Fig. 1 above. Since it was easy to do, I connected to the low-speed interconnect of the Cyclone-V, and then converted that to an Avalon bus, then to a WB bus, and finally to a 64-bit WB bus in order to get to my peripheral. I was later surprised that my design wasn’t working nearly as fast as I wanted.
Once I finally managed to put a wishbone scope into the design to probe the bus interaction, I realized that every time a transformation took place from one bus protocol to another, another clock cycle was consumed to do it. Reading from my FIFO therefore took a clock to convert from AXI to Avalon, another one to convert from Avalon to WB, and then four to convert from my 32-bit WB bus to WB with a 64-bit data width. The return path was similarly slow. I lost a clock within my AutoFPGA generated WB interconnect, another clock clock converting from WB back to Avalon, and then another clock converting from Avalon back to AXI. Or rather, I assume there was a clock lost between AXI and Avalon and back, but I never had access to that part of the design in order to probe it. Either way, my code couldn’t meet the real-time requirements I had given it and I was wasting precious clock cycles in useless bus transformations.
Normally I don’t pay much attention to bus latency. What’s a clock or two
between bus components, right? Usually, this doesn’t hurt me. Since I use
pipelining on the WB bus,
I can send up to one request per clock cycle, and a two-clock latency on an
N
item burst brings the total time to N+2
clocks. This works nicely
for most applications.
Not this one.
In this case, the ARM processor was only ever reading one item at a time. Nothing was pipelined. Every read therefore took the full latency of about ten clocks to process.
No wonder I wasn’t meeting my real time requirements.
The easy way to solve this problem would be to get rid of all the bus bridges, and to switch to the low speed to the high speed AXI bus connection.
The high speed bus on the Cyclone-V, however, uses AXI. It doesn’t use AXI-lite, it uses AXI. If I wanted to use AXI-lite, I’d need another bus translator.
Prior to that project, I had tried to create an AXI slave myself. I spent several weeks on it, ending up unsuccessful. So, with government dollars paying for my time, I tried again. I never even made it through the design process, much less verification and simulation. Eventually I gave up rather than to impact the critical path of the project. Several weeks after giving up in frustration, I realized my confusion, but by then it was too late.
In the end, I never met my real-time promises to my customer.
Ever had something like that that just burns you up? Where you made a promise to deliver something to someone and then didn’t deliver?
So now I’m re-examining AXI again. Perhaps if I had a formal property file this time, it would help me through the design process?
So that’s what I’m working on, and that’s what led me to today’s post.
To test whether the property file “works” or not, I’m also building a variety of AXI peripherals. Perhaps you’ve read my ramblings on twitter about all of the bugs found in Xilinx’s example AXI slave? Perhaps you’ve seen the (formally verified) AXI demonstration slave I’ve been able to put together? Or the (formally verified) bus fault isolator, which can be used with a (possibly) faulty slave to keep from needing breaking the bus so badly that only a power cycle can rescue a design? These have been my early successes.
I’m still working on an AXI to WB bridge, as well as an example (open-source, and formally verified) AXI crossbar interconnect.
In all of this work, I’ve managed to get far enough along in my understanding of the AXI protocol to be able to discuss a piece of it with you today.
Specifically, I’d like to discuss how addressing works in AXI burst transactions.
Addressing made simple
If you know nothing more about AXI addressing, you need to know this: The AXI address represents the address of the byte, not the word. This is unlike Wishbone, where the address represents the word and not the byte within it. To convert from Wishbone to AXI, add zero bits. To convert from AXI to Wishbone, drop the low order bits.
That’ll get you past any AXI-lite issues. It will also get you past any single-address transactions using AXI.
The rest of the full AXI protocol isn’t quite that easy.
Let’s start at the top. an AXI transaction begins on the address write/read channel.
A write transaction begins when the bus master describes the burst of
information to be written on the write address channel. This includes the
starting address of the transaction, the length of the transaction, and more.
The master then sends the data associated with the transaction to the slave.
Once accomplished, the slave will return a single acknowledgment. You can see
an example of this, drawn from a cover()
statement in the proof of this
slave in Fig. 3.
Reads are similar, as shown in Fig. 4–also drawn from a cover()
statement
applied to the same
core, in that
they also begin on the read address channel. However, instead of being
followed by a channel of data from the master to the slave, the slave
responds instead by returning the data it has read. The last item of data
from the returned by the slave is marked with an RLAST
flag and it concludes
the transaction.
Since both the write address and read address packets are very similar, I’ll
describe both together and just use the Ax
prefix to refer to an address
signal that could be on one or the other of the two channels. When AxVALID
is true, a transaction request has been placed on the channel. Several other
values associated with this request will tell you the size and length of
the requested transaction.
-
AxADDR
. We just discussed this above. The AxADDR lines reference the byte within the burst.Just for a fun reference here, I grew up thinking that the word “byte” meant 8-bits of data. It’s not. That’s the word “octet”. A “byte” is the smallest addressable unit of data on a bus. For POSIX compliant CPUs, that’s 8-bits. Other CPUs, to include the original ZipCPU, can have different sized bytes. In my case, I started out supported 32-bit bytes.
The AXI bus supports 8-bit bytes, and each byte can be read or written separately using the
WSTRB
signal–but we’re now getting ahead of ourselves. The key takeaway here is that bytes and octets aren’t necessarily the same thing, but we can use them interchangably when discussing the AXI protocol. -
AxLEN
. When I first examined AXI, theAxLEN
field appeared to be the biggest reason to use it. With a single request on the address channel, you can request anywhere between one and 256 values by just settingAxLEN
to 0-255 respectively.But how should the address of each of those values be calculated? That’s the purpose of the
AxBURST
value. -
AxBURST
is a two-bit value. It describes whether the address is to be fixed (AxBURST == 2'b00
), incremented (AxBurst == 2'b01
), or wrapping (AxBurst == 2'b10
).In general, the address of each beat of the burst will increment by the number of bytes in a bus word. Only … it’s never that simple. Let’s come back to this in a moment.
-
AxSIZE
is a three bit value referencing the size of the data transfer. The size can be anywhere between an octet,AxSIZE == 3'b000
, two octets,AxSIZE == 3'b001
, four octets,AxSIZE==3'b010
, all the way up to 128 octets whenAxSIZE == 3'b111
.The rule is that
AxSIZE
can only ever be less than or equal to your bus size. Since I tend to work with 32-bit busses, that means anyAxSIZE
must be less than or equal to3'b010
.
These are the four addressing signals we’ll look at today: AxADDR
, AxLEN
,
AxBURST
, and AxSIZE
. Let’s walk through how to use these as a function
of the burst type.
Types of Burst Addressing
As we mentioned above, there are three basic types of burst addressing: FIXED,
INCREMENT, and WRAP. An AxBURST
value of 2'b11
is reserved, and so illegal.
Let’s look at each of these in turn.
Fixed Addressing
This is perhaps the simplest of all of the addressing modes. The first address
in the burst is given by AxADDR
, and it never changes from there. This
means that, within a burst, we can calculate the next address, we’ll call this
o_next_address
, from the last address, we’ll call that i_last_addr
and the
burst type, i_burst
.
That’s pretty easy.
This would be the perfect addressing mode for reading from a FIFO as I described above. Since a FIFO can be represented with a single address, this is perfect.
If I ever get a chance to do a similar contract again, this is how I’d do it.
Today, though, I’m working on formal verification properties. That kind of forces me to support every mode, so I’m not yet done.
Increment Addresses
Increment addressing is the kind of addressing you may be more familiar with. It’s the type you’d use when doing a memcpy. In its simplest form, we might write:
Only that’s not quite right. AXI allows you to transfer multiple bytes per transaction, and the AXI address references the first byte in each burst. Hence, if we have a 32-bit data bus, we’d want to increment our address by four bytes at a time.
Even this isn’t quite right. While AXI
doesn’t require AxADDR
to be an aligned address, all of the subsequent
addresses must be aligned.
Aligning the addresses isn’t all that hard for a 32-bit bus. Indeed, we might simply write,
While we’re making good progress, it’s still not this simple. Remember the
AxSIZE
parameter? AxSIZE
determines our increment, not our bus size.
When AxSIZE==0
, the increment is one. When it’s 3'b001
, we increment by
two. When AxSIZE==3'b010
we increment by four and so on.
We’ll use i_size
to represent AxSIZE
in our pseudocode. We can then write:
That handles our increment. The problem now is that alignment is more painful,
since our ultimate alignment depends upon the AxSIZE
of the transaction.
Voila! We’ve enforced the alignment we need, and so we can now calculate the next address during an increment.
I discovered the problem with this approach when I tried to verify the
AXI example code from Vivado.
I built a slave with only 5 bits of address. In this case, o_next_addr[6:5]
weren’t defined. So I added in a test of the address width.
The code now no longer referenced any undefined addresses, since the AW>x
check found and fixed that.
Only … one of my favorite (unnamed) simulation tools couldn’t handle this.
Perhaps this is because I’m using a very old version of the tool. For whatever
reason, I had to do something to keep this design from referencing bits I didn’t
have. I chose the following solution, therefore. It’s not pretty, but it
worked for AW = 6
.
It’s at least good enough that we can move on.
Wrapping Addresses
Many memories support address wrapping, and the AXI bus even has a burst type which will support wrapping addresses.
The idea behind address wrapping usually comes from cache accesses. An address request from the CPU might land anywhere within a given cache line. The CPU needs to invalidate and then fill the whole cache line. During this time, the CPU will be stalled until the value it needs is returned from memory. If the value lands towards the middle of the cache line, the CPU would have to be stalled during the entire read. On the other hand, if the memory can return the middle value of the burst first and then wrap around to return the first part of the burst, the CPU doesn’t need to be stalled nearly as long.
This is why address wrapping is so important. Indeed, many SDRAM chips will support address wrapping of some type.
Faster CPUs is a good thing, right?
Well … that may be true, but it certainly makes the bus more complicated.
The basic idea of wrapping is that the address bits act as though only some
bits increment, and others don’t. We can capture this with a bit-wise mask
I’m going to call wrap_mask
. Using this mask, we can write our address
wrapping code as,
Now the only trick is to figure out how to generate wrap_mask
.
There are also a couple of rules associated with address wrapping that play
into this as well. For example, when using address wrapping the first burst
must be aligned. Similarly, only burst lengths of 2, 4, 8, and 16 are
supported, corresponding to AxLEN
values of 1, 3, 7, and 15 respectively.
We’ll write these rules out in a moment later.
For now, we can calculate our wrap address via,
All put together
We’ve now learned how to calculate the next address of an AXI transaction from the last one. If you put all of this code together, you’ll get something looking like the following:
The problem with this code, however, is that it takes up way too much logic. You can see this by running:
If you look at the tail end of the output, you’ll find that this design requires the following logic elements:
We can do better. Indeed, we can do much better.
Formal properties
So, while preparing this article, I spent some time trying to optimize this address calculator to both make it simpler, and to lower the amount of logic it uses. Along the way, I made some amazing optimizations and managed to get this core down to fewer than 67 elements. After beating my chest for way to long, I went back back to try to formally verify some of my cores that used this address calculator.
Now I was stuck. Which optimization failed, and how do I roll back just the right ones?
It was time to turn to formal verification.
Formal verification in this context was a bit different from many of the other proofs I’ve done, primarily because what I wanted to do was to make sure that when I optimized my work, the result doesn’t change. Ideally, what I’d like to be able to do is something like comparing two cores, one optimized and one not, to prove that the optimized one still does the same thing as the reference design.
Yes, SymbiYosys can do this too.
The basic setup is shown in Fig. 7 on the right. First, I created a miter
circuit–one
with both the reference address calculation code as well as the optimized code
I was testing.
We’ll call these ref
for reference address
calculator
and uut
for the unit under
test.
Let me pause here for a moment, since DSZ
is going to become a very important
part of our logic in a moment.
If you ever look through a piece of
Vivado
generated AXI
code like this
one,
you’ll see things like C_S_AXI_ADDR_WIDTH
and C_S_AXI_DATA_WIDTH
. You
might read these as the address width and data width respectively associated
with the slave interface. I personally try to avoid these parameter names
if I can, just because I try to fit my logic on an 80-column display and these
long names make my task more difficult. So, instead of using very expressive
terms like these, I regularly use AW
for address width and DW
for data
width. In this case, the two are the same. (Normally I’d using AW
to
reference a Wishbone
address width, which as we’ve discussed above doesn’t include the
subword address bits.)
DSZ
is a value derived from the data width. It’s designed so that a
DSZ
of 0 is equivalent to a DW==8
, whereas DSZ==1
is equivalent to a
DW==16
and so forth. It has the same representation as i_size
, but rather
describes the maximum i_size
the bus can hold. Two DSZ
values that will
be of high interest to me are DSZ==2
, corresponding to a bus width of
32-bits, and DSZ==3
, corresponding to a bus width of 64-bits.
You’ll see DSZ
often in the discussion that follows.
Now we can return to the miter circuit, and show the instantiations of our two designs, both reference,
and our uut, or unit under test.
At this point, I could’ve created my two assertions and been done.
The only problem with this quick approach is that the inputs need to be constrained. Many of the possible input values are disallowed by the AXI specification. What we really want to do is to assert that the two circuits produce identical results for valid AXI burst transactions values.
Let’s work through some possible restrictions.
We already known that bursts may be fixed, incrementing or wrapped. An
i_burst
value of 2'b11
, the third possibility, is illegal as per the spec.
In a similar fashion, the i_size
input can only ever specify a transaction
width equal to or smaller than the current bus size.
We also know that, for wrapping bursts, the length of the burst may only ever
be 2, 4, 8, or 16 beats. Since AxLEN
is one less than the total number of
transfers, this turns into a restriction that i_len
must be 1, 3, 7, or 15.
The specification also says that wrapped bursts must be aligned. So lets check for alignment on the incoming address. Remember the master rule of formal verification: assume inputs, assert any local state and outputs.
Now that we can tell whether or not the address is aligned, we can assume that any wrapped addresses are properly aligned on entry.
There’s one final bit of tweaking required. AXI addresses are not allowed to wrap over 4kB boundaries. That means that only the bottom twelve bits will ever be allowed to change.
This suggests a quick and valuable simplification, one that would force any address bits above 12 to be constant across a burst:
The problem with doing this is that my reference address calculator doesn’t have this code.
The reason why the reference doesn’t make certain that the top addresses are constant is that I’d like to use an assertion in my new AXI4 formal property file to make certain the address pointer never wraps across a 4kB. In other words, my reference requires that addresses be able to cross 4kB boundaries so I can detect a problem, whereas my optimized code won’t let things wrap.
A basic simplifying assumption will keep us from testing the 4kB wrap.
As we’ve written it, our address calculator is highly parameterizable across
a wide variety of bus widths, DW
. To capture this, we’ll define a series
of tasks within our SymbiYosys
file.
Each task corresponds to a different bus width, which we can set using the chparam command within the Yosys script section of our SymbiYosys configuration file.
We’re now ready to test any optimizations. So let’s take a new look at the code above to se if we can optimize the basic address calculator to use fewer than 392 logic cells, while maintaining the same logic.
Making Optimizations
Back at ORCONF 2017 in Hebden Bridge, I had the wonderful opportunity to meet Jan Gray, an individual who had managed to place many 600?-LUT RISC-V soft-cores inside a Xilinx FPGA. I asked him about how he managed to get his soft-cores so small. Among other pieces of advice he gave, one which we’ll work with today is to, “Make every LUT within your design justify its existence.”
So let’s apply that attitude to our code today, to see what or how we might optimize it.
Let’s start with the increment.
What if we knew that, in a valid transaction, i_size
would be less than
3'h2
for a 32-bit bus?
Why examine all three bits of i_size
, when you know the top bit will always
be zero? Similarly, if the bus size is already 8-bits, then you already know
the increment can only ever be one.
The following code captures some of these assumptions. While it takes more lines and appears more complex, the logic it generates may well be smaller.
Remember, DSZ
is equal to $clog2(DW)-3
–essentially our current bus size
but in the units of i_size
. Further, as per the
AXI spec, we know that i_size <= DSZ
.
You’ll notice that if DSZ[2]
is high then we use the same logic we had before.
Now let’s turn our attention to the wrap mask.
Since we’re only going to use this value if i_burst == 2'b10
, and ignore
it otherwise, we can set it for all cases.
My initial approach to set the lower mask bits, wrap_mask[i_size:0] = -1
,
didn’t work with all of the Verilog parsers I use, so I decided to try using
a for
loop.
I normally discourage the use of for
loops within Verilog. Most beginning
designers who use them treat them like they would a piece of C/C++ code,
when in actuality the synthesis tool will unroll every loop creating more
logic with each
iteration.
When used poorly, loops end up creating much more
hardware logic than a good design would in practice.
In this case, I wanted to create a different piece of logic to describe
each bit. Therefore, I tried using a for loop to define each of the bits in
the wrap_mask
.
After staring at this for a while, I realized there was no optimizations
within it for the maximum possible i_size
value. Since DSZ
is a parameter,
and since the loop index value, iB
, is essentially a constant once the loop
logic is implemented, I could add a check into the for loop.
This check, DSZ > iB
, would allow the synthesis tool to remove any of the
unused possibilities, leaving wrap_mask
unchanged in this case.
We still need to add in the next three bits to the mask, those dependent upon
the length of the burst. When I did this before, I used a series of if
s.
If you look a bit closer, though, you’ll notice that i_len[3]
will only
be high if all of the other i_len[2:0]
bits are already ones. Similarly,
i_len[2]
will only be high if i_len[1:0]
is high, and so on down. That
means we can calculate the next several bits of the wrap_mask
by simply
shifting i_len[3:1]
up to the right location.
Why not check all of i_len[3:0]
? Because we already know that i_len[0]
will be true for all possible wrap bursts.
I also know that the addresses within a burst can never cross a 4kB boundary. These upper address bits must be constant.
Then I got to thinking … I already know that the initial address of any burst with wrap addressing must be aligned from start to finish, so it really doesn’t matter what the lower bits of this value are set to.
This allowed me to back up and remove the for
loop entirely. I then replaced
the shift line with,
Notice throughout these optimizations that we’re primarily using logic based upon constant values. This helps to keep us from increasing the amount of necessary logic, rather than decreasing it.
The next block actually calculates the next address, as we’ve discussed above. The first step, shown below, has really already been optimized.
This brings us to our alignment code, where we force the next address to be
word aligned based upon both DSZ
and i_size
.
My first attempt to align the next address was to switch our cascaded if statement to a case statement.
This helped, and performed okay, but I thought I might be able to do better.
Remembering that i_size
must always be i_size <= DSZ
, I optimized this
for each of the possible ranges of bits in i_size
. For example, if
DSZ==0
, the bus is then only DW=8
-bits wide, and nothing needs to be
done to align an address.
Similarly, if DSZ
is less than two, then our bus size must be either
DW=8
bits or DW=16
bits. In that case, alignment requires only checking
i_size[0]
and clearing o_next_addr[0]
or leaving it alone.
While we’ll get to a point of diminishing returns soon, we can still apply this
optimization to both DSZ < 4
and then just leave things as they were
otherwise.
The final couple of lines then are nearly unchanged. The only difference is
that instead of checking for whether or not i_burst == 2'b10
, we can simplify
this into just checking for i_burst[1]
, since we know that i_burst==2'b11
is already an invalid input.
With each change, we can now run both Yosys, to see if the change reduced the amount of logic, as well as SymbiYosys to see if our change broke our design or not.
What about the logic count? After all of this work tweaking and adjusting things, how did we do?
Given that we started with 392 cells, 49 is a tremendous improvement.
But how many LUTs
does this logic consume? A 7-series
Xilinx
FPGA
has 6 input LUTs,
not LUT1
s, LUT2
s, LUT3
s, and so on. To make scoring logic usage a touch
more difficult, two LUT5
s can be combined into a LUT6
–provided that they
share the same five bit inputs. Therefore, I’ll often summarize a logic count
by adding the number of LUT6
s to the number of LUT5
s, to the maximum of
the number of LUT4
s and LUT1
s, to the maximum of the number of LUT3
s
and LUT2
s. While this most certainly
overestimates the score, it does manage to capture some of the effects of
packing into our metric.
Here’s the final logic score, therefore: before optimization we needed 171 Xilinx 6-LUTs. After optimization, our new design can perform an AXI addresses calculation using only 24 LUTs on a bus having 32-data bits and 32-address bits. That’s a reduction of over 7x, so it was definitely worth our time.
Conclusion
AXI is certainly one of the most complicated buses I’ve ever worked with. It’s so complicated that I’m only discussing AXI addressing today. However, in order to properly respond to a bus request from an AXI master, a slave core will need to properly generate the addresses of any transaction it is responding to.
One fascinating discovery I made along the way is that Xilinx’s example AXI
slave
doesn’t even implement all of this logic. Instead, it only implements
a 32-bit bus width and i_size == 3'b010
addressing. You’d think for the cost
of only 14 LUTs,
the cost of implementing all of the i_size
options, they might have included
the rest.
Of course, since AXI is so complicated, there’s still plenty more to discuss. If the Lord wills, I’d like to return and present the bugs I’ve found in Xilinx’s example AXI slave core, the formal property file’s I’ve managed to create in order to formally verify a (full) AXI4 slave core, as well as some of the other cores that became easy to build once I had the property files to work with.
These topics, however, will all need to wait for another day.
Remember now thy Creator in the days of thy youth, while the evil days come not, nor the years draw nigh, when thou shalt say, I have no pleasure in them (Ecc 12:1)