The hard part of building a bursting AXI Master
This article continues our series on building AXI based components. So far, we’ve discussed what it takes to verify and then build an AXI-lite slave, and then an AXI (full) slave. We’ve examined what it takes to calculate the next address within a burst, and looked at the most common AXI mistakes along the way. More recently, we discussed how to build a basic AXI master–one that could issue multiple outstanding singleton read or write requests.
What we haven’t discussed is how to build an AXI master that will issue burst requests.
It’s not easy.
Today, let’s dig into one of the harder challenges involved in building a bursting AXI master: how to handle setting the various AxVALID, AxADDR, and AxLEN bus signals. In particular, the AXI bus protocol imposes several constraints on these signals that all need to be met at the same time. The challenge is figuring out how to generate bursts, meeting all of these constraints, without slowing down any transfers.

How then should an AXI bus master be built so that it can fill the bus with as many requests as possible?
So let’s start by looking at what these constraints are. I’ll then share several examples of open source AXI masters that can generate burst requests meeting these constraints, progressing from simple to more complex examples along the way. My examples will include a virtual FIFO, a memory-backed logic analyzer, a video frame buffer reader, and a stream-to-memory DMA. Each of these designs solves the multiple constraint problem in a slightly different way, so they form a useful set of examples to learn from.
AxVALID, AxLEN, and what makes this difficult
Let’s start by taking a peek at the logic required when setting AxLEN, and then at what’s required to set AxVALID. Here, I’m using the Ax prefix to reference either the AW (write address) channel or the AR (read address) channel interchangeably. Specifically, there are four challenging requirements when driving AxLEN, and then some other requirements when driving AxVALID. These requirements also impact the addresses we might choose to send, and so also the AxADDR signal. Getting all of these requirements right, without impacting the maximum frequency of the design, tends to be one of the most challenging parts of generating AXI bursts.

So let’s start out by looking over the four requirements of AxLEN.
- AxLEN can’t be any larger than one less than your maximum burst size. For AXI4, the maximum burst length is 256 beats. Since AxLEN is one less than the requested burst size, that means AxLEN must be no greater than 255. AXI3 is similar, but with a maximum burst size of 16 beats, so the AxLEN signal in AXI3 can be no larger than 15. I’ve tried to capture this basic protocol difference using a parameter containing the log of this maximum burst size, LGMAXBURST, which will be set to either 8, for a 256-beat burst, or 4, for a 16-beat burst, or to any user configurable value less than the protocol maximum. This allows us to express the maximum burst size as (1<<LGMAXBURST), and the maximum AxLEN value as (1<<LGMAXBURST)-1. To illustrate these various constraints, let’s build a draft of the logic necessary to calculate this AxLEN value, updating it for each constraint; a consolidated sketch of that draft follows this list. To start out, we’d obviously like to move as much data as we can, so our first constraint is simply that we want to set the next AxLEN value to the size of a full burst.
- If you want to read or write a burst from or to a fixed address, then the size of the maximum length burst drops from 256 beats down to 16 beats for both AXI4 and AXI3. This also applies to bursts using WRAP addressing, although I haven’t (yet) found a good use for them–and that includes even after building a CPU cache. I suppose I could make all of my bursts 16 beats, but I’m also aware that several vendor AXI components have a per-burst overhead of a couple of clock cycles. My goal when building any bus component is throughput, and that requires minimizing any overhead. That then means that I’ll need to use 256-beat bursts when writing to incrementing (subsequent) addresses, and 16-beat bursts when writing to fixed (identical) addresses. This forces an if statement, choosing between the two limits, in the sketch following this list.
- It’s illegal in AXI to cross 4kB boundaries. This comes directly from the AXI4 specification. While I’m not quite certain why this is, my guess is that it’s to guarantee that bursts will not cross either MMU page or device boundaries. This certainly simplifies the design of any interconnect, since it prevents the interconnect from needing to check whether or not a particular burst needs to be split when crossing device boundaries. Either way, this requirement is going to mean that we’re going to need to limit our AxLEN field again. This time it will need to be limited such that AxADDR[11:ADDRLSB] + (AxLEN + 1) never steps past a 4kB boundary.
At this point, we might be done if all bursts were to be multiples of the maximum burst size.
Sorry, but no. That’s not good enough. While that might work for the virtual FIFO, that strategy won’t work for the more common (arbitrary) DMA case.
- You don’t want to request a transfer of more data than the total amount you have left to transfer. Yes, this sounds obvious. It is just about as obvious as it sounds, too. However obvious it might be, though, we’ll still need to pay for the logic required to check it. Basically, in many data moving applications–including two of our later examples–the amount of memory that needs to be transferred is chosen by the user at run-time. That means you are going to need a counter that counts down as each burst is requested, so you always know how much data you have left to send. Then, using this counter, you can ask whether or not the amount remaining is less than the maximum burst length, and if so you would only request what remains. That is, if you have LEN words remaining in your transfer, you don’t want to transfer more than those LEN words. We’ll use remaining_transfer_length as this value in the sketch below. Further, for a full featured data mover, checking this value requires a comparison across as many bits as you are using to represent LEN. Since I tend to be a perfectionist, that can be a 32-bit comparison.
That’s our four criteria.
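To make this concrete, here’s a rough, unoptimized sketch of an AxLEN calculation that folds all four constraints together. The names w_increment, beats_to_4kb_boundary, and remaining_transfer_length are placeholders of my own choosing, not necessarily the names any of the designs below will use.

```verilog
// A draft AxLEN calculation, applying all four constraints at once.
// As written, this implies a long combinatorial path, which is exactly
// the problem discussed next.
reg	[7:0]	next_axlen;

always @(*)
begin
	// 1 & 2. Ask for a maximum length burst, where the maximum depends
	// upon the burst type: 256 beats incrementing, 16 beats FIXED
	if (w_increment)
		next_axlen = (1<<LGMAXBURST)-1;
	else
		next_axlen = 4'hf;

	// 3. Never cross a 4kB boundary.  beats_to_4kb_boundary counts the
	// beats remaining before that boundary.
	if (next_axlen >= beats_to_4kb_boundary)
		next_axlen = beats_to_4kb_boundary - 1;

	// 4. Never request more beats than remain in the overall transfer
	if (next_axlen >= remaining_transfer_length)
		next_axlen = remaining_transfer_length - 1;
end
```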
Now let’s add to this mess my rule of thumb that any 32-bit operation takes
one clock cycle. That means that calculating the next AxLEN
value alone
is going to require two clock cycles.
This is still unacceptable.
We have a second and similar problem with AxVALID, although it’s not nearly as bad. The problem with AxVALID is that you don’t want to set AxVALID unless you have next_axlen+1 data items available to be sent, in the case of writing, or next_axlen+1 spaces available in your FIFO if you are reading. Sure, the AXI bus allows you to stall the bus in both directions if the data isn’t quite ready yet, but do you really want to slow the rest of your design down with this component? Indeed, the rule of thumb here should be that once a burst has been requested, then either WVALID, for writes, or RREADY, for reads, should remain asserted until all of the beats of the burst transfer are complete.
So let’s build up some generic logic for starting a burst. We’ll call this
start_transaction
here, and we’ll make it combinatorial–since many things
might depend upon it. So, in general, we’ll start a burst as soon as we
either have data or space available for the transfer.
We’re going to need to be careful not to start a new transaction while the last transaction request remains outstanding and stalled.
Moreover, we don’t want to start a transaction while a burst write is in process.
This will also align the write address and write data channels. While this isn’t specifically required by the AXI specification, it simplifies the masters: This way, you can go about generating the AXI length once, and use it for both channels without requiring a FIFO in between them to keep their lengths synchronized. Be aware, though, that AXI doesn’t require this in general, so while this is a nice place to start when generating bursts, it’s not something you can depend upon in a slave when processing them.
Let’s add in two other basic criteria as well. For example, you want to be able to abort any transfer on a soft reset, such as might be caused by an error or external user reset request.
You also want to be able to guarantee that you won’t start an operation until the user has requested it. We’ll use an r_busy signal here to indicate that the AXI master is in its “transfer data” state. This yields another start condition.
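Pulling those pieces together, a generic start condition might look something like the following sketch. The names data_or_space_available, write_burst_in_progress, and soft_reset are placeholders standing in for whatever a particular design actually uses.

```verilog
// A generic, combinatorial "start a new burst" condition (sketch only)
assign	start_transaction = r_busy		// The user has started us,
		&& !soft_reset			//   and hasn't aborted,
		&& data_or_space_available	// enough data (writes) or space (reads),
		&& (!AxVALID || AxREADY)	// the last request isn't still stalled,
		&& !write_burst_in_progress;	// and any current write burst is done.
```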
If we’re not careful, all of this logic will eat into our timing slack, slowing down our overall data transfer rate. The worst offender in this chain is the check for whether or not the data (or space) available is greater than the amount we want to transfer. Worse, it depends upon knowing the amount of data to be transferred, so this test can’t take place until we’ve finished the two clocks above. That might slow our burst-to-burst issue time down from two clocks to three clocks.
Again, unacceptable.
If the goal is to be able to achieve 100% bus throughput, then we’re going to have to figure out a way to do better.
Simplifying the problem
Looking at these criteria, I wasn’t ready to settle for a three clock delay when building my AXI masters. So, I looked around to see if the problem could be simplified first. Sure enough, there are plenty of simplifications available to you.
AXI-Lite bursts
The easy way to handle this whole problem would be to use a burst length of 1. In this case, the logic would get really simple.
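As a sketch (with placeholder register names), every request would then be a single beat:

```verilog
// AXI-lite style bursts: every request is exactly one beat, so none of
// the length or boundary checks above are needed.
always @(posedge S_AXI_ACLK)
if (!axi_awvalid || M_AXI_AWREADY)
begin
	axi_awvalid <= start_transaction;
	axi_awlen   <= 8'h00;	// Always a single beat
	// axi_awaddr would simply step forward one word at a time
end
```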
If you are going straight into a MIG generated memory controller, this would be good enough. If you have to go into an ARM, an AXI crossbar or Xilinx’s AXI block RAM controller, on the other hand, this might cripple any throughput you might’ve otherwise had. The AXI slave controller we built on this blog, as well as my own crossbar would be fine both ways.
I’ll admit, sometimes I wonder why the designers of AXI didn’t just leave everything that simple. It would’ve made the bus so much easier to work with. Indeed, under conditions like this I was able to generate and verify the stream to Wishbone master I mentioned above in just a half a day. It was that easy.
Alas, AXI is not so simple, so let’s look at some other ways to simplify this problem.
Burst Alignment
One suggestion I came across early on was to align every burst to the size of the maximum burst. If we did that, then the first burst would need an alignment check, but nothing following would need to check for crossing any boundaries.
Now, for example, the first burst’s length computation might look like the sketch below.
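This is a sketch only; initial_axlen, start_addr, and ADDRLSB are placeholder names.

```verilog
// First burst only: run up to the next burst-aligned address, so that
// every following burst starts on a (1<<LGMAXBURST) beat boundary.
// (This would still need to be clamped by the remaining transfer
// length, just as before.)
always @(*)
	initial_axlen = ~start_addr[ADDRLSB +: LGMAXBURST];
	// i.e. ((1<<LGMAXBURST)-1) - start_addr[ADDRLSB +: LGMAXBURST]
```

If the starting address happens to be aligned already, this produces a full-length burst.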
That gets us down to three constraints–down from four. Wait, though, it gets better. On subsequent bursts, we can now set AxLEN based upon the remaining transfer size alone.
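For example (again a sketch, with placeholder names):

```verilog
// Later bursts are aligned by construction, so only the remaining
// transfer length matters.
always @(*)
if (remaining_transfer_length >= (1<<LGMAXBURST))
	next_axlen = (1<<LGMAXBURST)-1;			// A full burst
else
	next_axlen = remaining_transfer_length[7:0] - 1; // The (short) final burst
```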
Even though the initial length will still take two cycles, the subsequent length calculation might now just fit within a single clock cycle.
Hiding Computations
Let’s take a step back for a moment, though. Our goal is to be able to maintain 100% throughput across our bus. That means that we should be able to transfer one beat of information, whether WVALID && WREADY or RVALID && RREADY, on every clock cycle. Our goal is also to issue bus requests as soon as we either have the data available for any write requests, or alternatively as soon as we have the space available for any read requests. If our burst length is going to be anything more than 2 beats, does it really matter if we take a clock or two to calculate these values, as long as they are calculated and issued early enough so as not to impact performance?
Let’s therefore take one clock cycle to start the transaction, and then a
second clock cycle between any two AxVALID
signals to do our work above.
Fig. 4 below shows what this might look like for a write process.
The burst would start by setting both AWVALID and WVALID at the same time. The core could then take one clock cycle to be able to regenerate the next AWVALID. However, we wouldn’t set it until after sending WLAST. Therefore, as long as the burst is longer than two beats, we won’t suffer any loss.
I’ve tried to show this by making the AWVALID
signal in Fig. 4 an unknown,
just to mark that it could go high early, but the fact is that it isn’t an
unknown: it’s zero until the last beat is sent. There’s also the less likely
possibility of needing to send a burst that’s just one beat away from
a 256-beat boundary, but that’s a rare case and even then we’d only lose one
clock cycle in this setup.
Reads are similar, with the primary difference being that read requests are not synchronized with the read data. If we can keep our processing down to every other clock cycle, then we should be able to issue multiple read requests before the first result is ever returned. Further, after issuing some (user design dependent) number of read requests, we’d have to pause anyway to wait for uncommitted space to become available for more read returns.
As with the writes, however, the slave at the far end can ultimately only process one beat at a time. Therefore, this will also only cost us a delay if the burst is less than two beats.
That could buy us some time.
What would happen, though, if the slave didn’t accept our AxVALID signal immediately? The answer is that it could impact our throughput if we waited for the burst to be accepted before calculating the next burst’s parameters.

This is why I started using what I call phantom signals. You’ll see them throughout all of my AXI bursting master designs–all named something like phantom_read, phantom_write, or perhaps even phantom_start. The idea is that all of our burst calculation logic can take place when the phantom signal is true. We can then hide this logic inside any potential stall signals.
Let’s walk through how this might work. (A sketch of the resulting logic follows the list.)
- First, we’d have our start_transaction signal–whatever it is. This is a combinatorial signal, and simply tells us that it’s time to start a new burst.
- Then, on the next cycle, AxVALID would be high–indicating a registered transaction start. On this same cycle, the phantom_write or phantom_read signal would also be high–but for one cycle only. This is a signal internal to the design indicating that registered values (not the actual AXI protocol signals) can be adjusted as though the burst request had actually taken place.
- On the third cycle, the phantom signal would be low again–even though AxVALID might stay high until the channel was no longer stalled. This is the clock we’ll take to recycle our addressing. It’s also the first cycle where the start_transaction combinatorial signal might be high again. This is partially what’s being shown in Fig. 6 above. The phantom_read signal is only high for one clock tick, but on the first clock of any new read request cycle. That way, if it takes a couple of cycles for the read to be acknowledged, we’ll be ready for the next request by that time.
- Once AxVALID && AxREADY indicates a request has been accepted, we can then set start_transaction on the next cycle, and the process repeats until the transfer is complete.
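Here’s a rough sketch of how the AxVALID and phantom signals might be registered together. I’m using generic axi_avalid and axi_aready names here; each of the designs below does this in its own way.

```verilog
initial	{ axi_avalid, phantom_start } = 2'b00;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	{ axi_avalid, phantom_start } <= 2'b00;
else if (!axi_avalid || axi_aready)
begin
	// Safe to change the request: nothing is pending, or the last
	// request was just accepted
	axi_avalid    <= start_transaction;
	phantom_start <= start_transaction;	// True for this one cycle only
end else
	// The request is stalled: AxVALID holds, but the phantom drops
	phantom_start <= 1'b0;
```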
I’ve found this phantom starting signal to be quite useful. It decouples the AxLEN constraint logic from needing to wait on AxREADY. Better yet, it allows us to issue burst requests as fast as one burst request every other clock cycle–even when we are requesting full length bursts.
That’s useful.
Space available
There’s also another criterion we haven’t discussed yet, and we’ll only touch on it below: you don’t want to start a transaction until you know you can finish it. For writes, this means you don’t want to start the transaction until you have the data available somewhere–likely in a local FIFO. For reads, you don’t want to issue the read request until there’s somewhere for the returned data to go.
In general, this means you need to keep track of a counter of either the data available to be written, or the space available to be read into.
Only … it’s not quite so simple. In particular, after issuing a burst request, even if nothing else in the FIFO changes, the amount of data (or space) available changes just due to the fact that we’ve requested the transfer.
That leads to something closer to the following (for writes).
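Here’s one possible shape for that counter. fifo_write is the strobe for data entering the FIFO; the exact widths are placeholders.

```verilog
// Count the data in the FIFO that hasn't yet been committed to a burst
initial	data_available = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	data_available <= 0;
else case({ fifo_write, phantom_write })
2'b10: data_available <= data_available + 1;
2'b01: // A burst was just requested: that data is now spoken for, even
	// though it hasn't left the FIFO yet
	data_available <= data_available - (M_AXI_AWLEN + 1);
2'b11: data_available <= data_available + 1 - (M_AXI_AWLEN + 1);
default: begin end
endcase
```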
Data then enters the FIFO any time fifo_write is true. Once enough data has accumulated to form a burst write request, the data_available counter is dropped by the length of that request–even before the data is read out of the FIFO. That way we make certain we aren’t requesting writes based upon data that’s already been committed to a prior write. Indeed, I’ll often add an assertion to my design, just to make certain that I never request a data transfer for data that isn’t present. A sketch of such an assertion follows.
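In sketch form, with fifo_fill standing in for the FIFO’s actual fill level:

```verilog
// The amount of uncommitted data can never exceed what's actually
// sitting in the FIFO
always @(*)
	assert(data_available <= fifo_fill);
```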
You might think of this like a chemical production factory, as depicted in Fig. 7. As new product is created, it gets placed into a giant tank. As that product gets sold, the amount of product remaining in the tank that hasn’t yet been sold is the amount that’s available to be sold to the next customer. Hence, even if the tank is full, there might not be a full tank’s worth of product available to be sold. The amount available for sale, or in this case transfer, is what we keep track of.
A similar structure would work nicely for reads, as illustrated in Fig. 8. The
difference is that you’d be counting uncommitted empty space. That count
would start with the full FIFO’s size as space available, and would then be
decremented on any phantom_read
signal. Once the data was (later) read
out of the FIFO, you could return it to the count of uncommitted space
available.
This counter would be analogous to something like a coal bin at a power plant. One of the responsibilities of the staff at the power plant is to make certain that it never runs out of coal while in operation. They will therefore purchase train loads of coal to fill up the coal bin. It costs money, however, for the train to have to wait in order to unload. Therefore, you wouldn’t request a new trainload of coal until there’s room in the coal bin for not only the new trainload, but also for all other trains to empty in the bin that may have been previously ordered but not (yet) arrived. That’s the idea behind the “space available” calculation used with reads.
The phantom signals also allow us to hide the calculation of the amount of data (or space) available, similar to the way we handled our other calculations. That way, if AxVALID is ever stalled, even by one clock cycle, we’ll have this answer ready to request the next burst as soon as it’s no longer stalled.
With a little help, we can also register whether two or more full bursts of data are present in this counter as well.
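For example (a sketch; this assumes data_available is LGFIFO+1 bits wide):

```verilog
// True once two or more full bursts worth of uncommitted data remain.
// Checking only the bits above LGMAXBURST keeps the comparison narrow.
always @(posedge S_AXI_ACLK)
	multiple_bursts_available <= |data_available[LGFIFO:LGMAXBURST+1];
```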
While not all of my examples below use this multiple_bursts_available
signal, it can be useful if the FIFO “size” is much larger than the maximum
burst size. In that case, a 32-bit comparison might be reduced to 9-bits.
With those preliminaries out of the way, let’s take a look at several example designs to see how these problems might be either solved or at least mitigated.
Example: VFIFO
Our first example is that of a virtual FIFO. You might also call this a “memory backed FIFO”. The idea is that it implements all of the capability of a basic FIFO, but also that it uses an external memory in case the block RAM available in your FPGA isn’t sufficient for the task at hand. Indeed, if all you need from your SDRAM is a FIFO, and you don’t care (that much) about the latency, then this might be the perfect capability for your application.
The cool thing about the virtual FIFO is that the problem definition solves most of our AXI burst logic generation problems for us. For example, because there’s no limit to the amount of data you might wish to transfer, we don’t have to check for the maximum data amount anymore. Better yet, we can keep all bursts at the same (power-of-two) length, which then means that our addresses will always be aligned and we don’t need to check 4kB boundaries at all.
Let’s take a look at how this might work. We’ll examine the write path alone below, just for simplicity, although the read path is quite similar.
The first step is to determine when we are ready to write a burst of data to memory. This is the combinatorial flag, called start_write in this design, that takes place before the phantom signal–the same signal we called start_transaction above.
We’ll start any write as soon as we have enough data in our incoming FIFO to
fill up a burst. Note here that even if our FIFO can hold (1<<LGFIFO)
elements, this comparison only requires LGFIFO-LGMAXBURST
bits. It’s
a nifty trick you can often get away with, but you’ll have to be aware
of the difference between >
and >=
to do it. (Using >
would’ve created
an LGFIFO
bit comparison, not an LGFIFO-LGMAXBURST
bit comparison.)
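In sketch form, with ififo_fill as the incoming FIFO’s fill count (LGFIFO+1 bits wide):

```verilog
// True iff at least one full burst's worth of data is in the FIFO,
// i.e. ififo_fill >= (1<<LGMAXBURST).  Only the top bits matter.
assign	enough_data_for_a_burst = |ififo_fill[LGFIFO:LGMAXBURST];
```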
However, we can’t start writing if the entire FIFO–equivalent to the
size of the external SDRAM–is full. Yes, that would take a lot of data to
fill up most DRAMs–you’ll need
induction
to catch any problems here. Likewise, we don’t want to start writing if we
are still in the middle of a
soft reset.
(A soft reset
is where we reset the core without resetting the bus.) Similarly, we won’t
start any new burst on that same cycle where we’ve just issued a new
request–to give us a clock for the space_available
counter to adjust.
Further, if there’s no external/SDRAM memory space available for us to put this burst into, then we’ll just have to wait and start again once data is available.
How is this different from the vfifo_full flag above? As I’ve currently defined the virtual FIFO, it only has as much space available as there is memory space. Practically, it also has space in the two FIFOs as well, but since I’m counting them separately I need to check for them separately here as well.
Next, this mem_space_available_w flag captures whether or not there’s a full burst’s worth of space available in the FIFO’s backing memory. It’s not counting beats, but rather bursts. That way we can check 20 bits of the memory’s address space instead of 32–assuming a 4GB memory (32-bit address), 256-beat bursts (8 bits of address), and a 128-bit wide memory bus (4 bits of address).
Coming back to the problem at hand, we don’t want to issue a new write command while the last one is either still in progress or stalled while being issued.
I think I mentioned above that I like aligning my write address channel request with the first beat of write data. While the AXI bus protocol doesn’t require this, it simplifies the formal property check, and so I require it of my designs.
Finally, this particular core will stop all transactions on any downstream bus error. Errors like these should never happen and are usually an indication that you don’t (yet) have your memory space set up properly. Given that I’ve been burned before by writing to a peripheral when I thought I was writing to memory, I’m careful to avoid this possibility if possible. (My flash memory has never been the same since …) Hence, the FIFO comes to a hard stop following any bus errors.
Address adjustments are fairly easy as well. On a reset, the address gets set to zero. On a soft reset, it only gets set to zero if there’s no outstanding (stalled) request. In all other cases, we just add one burst length (times the bus address width) to the address every time a burst has been accepted.
Well, there is one trick here. Specifically, since I know that every burst must be aligned, I’m going to make certain that all of the lower address bits remain zero on every clock cycle.
This has two purposes. First, it simplifies the synthesis optimization pass by making it crystal clear what I want–these lower bits will always be zero. Second, it keeps me from needing to write an assertion that these bits will be zero–since they’ll be set back to zero on every clock cycle.
The final step is to set the write address length to the size of one burst.
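Put together, the write-request address and length for the virtual FIFO might reduce to something like the following sketch. (Placeholder names; the soft reset and error handling just described are omitted for brevity.)

```verilog
initial	axi_awaddr = 0;
always @(posedge S_AXI_ACLK)
begin
	if (i_reset)
		axi_awaddr <= 0;
	else if (M_AXI_AWVALID && M_AXI_AWREADY)
		// Advance by exactly one full burst each time a burst
		// request is accepted
		axi_awaddr <= axi_awaddr + (1<<(LGMAXBURST+ADDRLSB));

	// Keep the low bits glued to zero: every burst stays aligned
	axi_awaddr[LGMAXBURST+ADDRLSB-1:0] <= 0;
end

// Every burst is the same, full, power-of-two length
assign	M_AXI_AWLEN = (1<<LGMAXBURST)-1;
```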
To see how this works, let’s take a peek at a nominal trace, shown in Fig. 10. This is a shortened trace for demonstration purposes only. Specifically, the burst length has been shortened to 4-beats, whereas it would normally be 256-beats per burst for better performance.
Here in this figure you can see incoming data coming into the FIFO. Once a
full burst has arrived, the start_write
flag is raised. This starts the
cycle whereby this burst gets written to the external RAM. Once the
BVALID
acknowledgment is returned, there’s then memory available to be read
so the start_read
flag gets set and a read transaction begins. Further read
transactions are then triggered every time there’s sufficient data in memory
to trigger them. The data is read into an outgoing FIFO, and then delivered to
any follow on AXI stream component from there.
Example: WBSCOPE
One common debugging component used during FPGA development is an internal logic analyzer of some type. Such an analyzer records data until some number of clocks following a trigger (defined externally), and then stops. This allows you to see what led up to an event, or alternatively what happened after some event.
I like to use my own Wishbone Scope for this purpose. I’ve even got an AXI4-lite version of the same scope–just not (yet) an AXI4-lite version of the compressed scope. In this discussion, though, I’d like to discuss the idea of a similar scope, with a nearly identical user interface, but using AXI to be able to save the memory contents in an external (SD)RAM of some type.
I call this a MEMSCOPE–for lack of a better name. Fig. 11 above sort of shows the conceptual idea behind it.
I bring it up here because it’s just a little more complicated than the Virtual FIFO–not by much though. Much like the Virtual FIFO, everything about the MEMSCOPE is aligned–up until the last burst, which might need to terminate early. Unlike the Virtual FIFO, we don’t need to check if there’s space available in the memory–the MEMSCOPE just perpetually overwrites memory until it is told to stop.
Let’s take a peek at how this works.
We’ll start with the combinatorial start_transaction
signal, herein called
the w_phantom_start
signal.
As before, we only want to start if there’s data available in our local FIFO storage. Unlike before, we also have to check whether or not the scope has stopped recording and there is (potentially) a partial burst left to be written.
Of course, we can’t start a new burst if the write address channel is still stalled with the last burst.
Neither do we want to issue a new burst request if the data from the last burst hasn’t (yet) finished writing to memory.
During a soft reset, one where we reset this core without resetting the bus, or likewise after the scope has stopped collecting data, we’ll want to avoid writing anything more to the external RAM.
Finally, we insist on one clock between new burst requests to allow all of our registered counters to adjust to the new burst.
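Gathered together, that start condition might look something like this sketch (placeholder names throughout):

```verilog
// Combinatorial burst-start condition for the MEMSCOPE
assign	w_phantom_start =
	// Either a full burst is ready, or the scope has stopped and a
	// final (possibly partial) burst remains to be flushed
	((data_available >= (1<<LGMAXBURST))
			|| (scope_stopped && data_available != 0))
	&& (!M_AXI_AWVALID || M_AXI_AWREADY)	// Last request isn't stalled
	&& !write_burst_in_progress		// Last burst's data is written
	&& !soft_reset				// We aren't aborting
	&& !phantom_start;			// One idle clock between requests
```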
So far, this should look very similar to the Virtual FIFO.
The length handling, however, is subtly different. In particular, we’ll always send a full burst unless we’ve stopped collecting data and there remains a partial burst’s worth of data left.
The neat thing about this is that we don’t ever need to check against any 4kB boundaries–even though we permit an other-than-full-length burst.
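In sketch form, reusing the data_available counter from before:

```verilog
always @(posedge S_AXI_ACLK)
if (!M_AXI_AWVALID || M_AXI_AWREADY)
begin
	if (!scope_stopped || (data_available >= (1<<LGMAXBURST)))
		// While recording, every burst is a full burst
		axi_awlen <= (1<<LGMAXBURST)-1;
	else
		// Once stopped, flush whatever remains as one last
		// (possibly partial) burst
		axi_awlen <= data_available[7:0] - 1;
end
```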
The write address signal is only slightly more complex than the one for the
Virtual FIFO.
Here, we use the AWLEN value for the burst that has just been
completed to adjust the address–meaning that when we stop, the AWADDR
signal will point to the oldest address in memory.
Up until the scope stops, however, the lower bits will remain aligned.
Of course, we’ll start back at the beginning of RAM on any reset.
This core also issues word-aligned requests only.
So, that wasn’t so bad. As you saw, we were able to nicely simplify those
four AxLEN
criteria down to something usable, and so only lost one clock
between initiating one burst transaction and being able to initiate a
second–a clock cycle that wouldn’t be noticed anyway due to the fact that
we don’t issue write requests until the last one is complete.
Example: VDMA
Where things really start getting dicey is when the controller no longer has control over the start address or the length of any given burst. A classical example of this would be in the framebuffer reader capability found within my AXI video DMA. This core reads from a framebuffer and generates an outgoing video stream signal from it.
The AXI operations themselves are all line based reads, as shown in Fig. 13 below.
The reader starts reading from the first line in memory, located at a configurable frame address. Subsequent lines are all separated by a configurable “line step”, specifying the distance between lines. Then, within a given line, bursts start at the base address for the line and continue sequentially until the end of the line. Between one line and the next, there may be some unused space–it’s not required, but it is illustrated in Fig. 13 above.
To see how this works, let’s check out the line address generation logic below. It should match Fig. 13 above quite nicely.
- When the design is first activated, the VDMA always starts reading from the beginning of a frame.
- The internal addresses are only adjusted when a new read is issued.
This is separate and distinct from the
ARVALID && ARREADY
cycle where it is accepted by the bus.
- Internally, line and frame control are driven by two control signals: req_hlast (the last burst request in a line has been issued), and req_vlast (we are processing the last line in a frame). If both of these signals are true, then we need to re-start processing from the base address of the entire video frame. In this case, the cfg_ prefix denotes configuration values set by the user at run time.
- Otherwise, if this isn’t the last line of the frame, but we have already issued the last burst of the line, then we need to step our burst starting address, herein called req_addr, forward by one line. To keep track of where the new line begins, we’ll also step the “beginning of line” address forward by one line as well. Finally, we’ll recharge a counter containing the remaining number of words to be read in this line back to the full line width.
- In all other circumstances, we’ll step forward by the length of one burst–whatever that might currently be. You might also notice the ADDRLSB value in the request below. As you may recall from above, this is the log (base two) of the bus width in bytes. It’s equivalent to AXI’s M_AXI_ARSIZE, and it’s simply used here to adjust our burst address by the number of bytes within a single bus word. One final point I want to bring out here: whereas the first burst of any line might be misaligned, subsequent bursts will always be aligned with a maximum-burst-sized boundary. For this reason, you can always clear the bottom LGMAXBURST+ADDRLSB address bits, even though measuring the number of remaining words will take a bit more work.
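Here’s a sketch of that address handling, with placeholder names modeled on the description above. (The actual core’s details differ somewhat.)

```verilog
always @(posedge S_AXI_ACLK)
if (!cfg_active)
begin
	// When first activated, start from the top of the frame
	req_addr       <= cfg_frame_addr;
	req_line_addr  <= cfg_frame_addr;
	req_line_words <= cfg_line_words;
end else if (phantom_start)
begin	// Adjust only as each new read request is issued
	if (req_hlast && req_vlast)
	begin	// Last burst of the last line: wrap back to the frame start
		req_addr       <= cfg_frame_addr;
		req_line_addr  <= cfg_frame_addr;
		req_line_words <= cfg_line_words;
	end else if (req_hlast)
	begin	// Last burst of this line: step forward by one line
		req_addr       <= req_line_addr + cfg_line_step;
		req_line_addr  <= req_line_addr + cfg_line_step;
		req_line_words <= cfg_line_words;
	end else begin
		// Otherwise, step forward by the burst we just issued
		req_addr       <= req_addr + ((M_AXI_ARLEN + 1) << ADDRLSB);
		req_line_words <= req_line_words - (M_AXI_ARLEN + 1);
	end
end
```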
All of that is background information describing how a burst’s address is determined. It’s important if for no other reason than it points out that we really have no control over the initial address of a given burst, nor do we have any control over length of any lines. All of that information is determined by the user at run-time.
That means we’re going to need to suffer the two clock loss when calculating the next burst’s starting information. In this case, it’s not as bad as it sounds, and here’s why:
-
While you might want to use 100% of the memory bandwidth while reading a line of pixels, video timing typically gives you several free clocks between lines.
-
We could still hide our calculations in the throughput of the bus in all cases other than the first burst of any line.
If we wanted to, we could pre-calculate some line statistics while running through the previous line–sort of pipelining the calculation if you will–I just don’t think that’s required here.
As a result, we have three clocks of interest:

- There’s the first clock of the calculation. I’ll use the flag, lag_start, to identify this clock.
- There’s the combinatorial clock when start_burst is true. This clock cycle follows the lag_start cycle, and forms the second of the two clocks between burst starts.
- Finally, there’s the clock cycle where phantom_start is true. This is the clock cycle where the actual read is first issued on the bus.
- This is followed by another lag_start clock, and the cycle repeats.
Fig. 14 below shows how these signals all work together. This figure was built
under the assumption of a very large FIFO for the receive data. As such, more
requests are issued for memory than there are beats returned in this trace.
That’s actually a good thing–when the slave can’t handle any more requests,
it will drop ARREADY
. That’s a slave responsibility. It’s not the master’s
responsibility. The master should be holding ARVALID
high anytime it knows
it wants to request another packet. In the meantime, our requests will cross
the interconnect
while waiting for the slave to finish replying to our request. This means that
the read data, once produced, should be able to continue from one burst to the
next without hiccups–assuming the slave can handle that pace in the first
place.
Let’s walk through and discuss this figure, so you’ll follow what’s going on below.
-
The figure starts out on the far left with the user writing the number of words per line and the number of lines per frame to the core. On the next cycle, the user gives the core the frame’s base address. Finally, the user turns the interface on. From there on out,
cfg_active
is asserted indicating that the configuration is active. -
It takes us three clocks to get started once the configuration becomes active. On the last of these clocks, the
req_addr
is set. This will form the basics of the request that follows. -
A second clock is used to determine the maximum allowable burst length.
-
From this maximum allowable burst length we can now set
ARLEN
and issue a read command. -
Immediately after the read command, and possibly even before it is accepted by the bus, the
lag_start
signal goes high and we restart the process all over again.
Let’s now take a look into the design to see how all this was accomplished.
We’ll start by examining the start_transaction combinatorial flag, which in this core is called start_burst. In this design, burst requests are issued if there’s room in the FIFO to receive another burst’s worth of data.
What might not be apparent here is that I’m checking for whether or not there’s space enough for a full burst in the FIFO. This particular Video DMA core can’t (yet) handle single or even finite frame counts. It either outputs a stream of ongoing video data or nothing. For this reason, it doesn’t check whether or not there’s enough room in the FIFO for anything less than a full burst’s size.
That gets us started, unless … and there are a lot of “unless”es, just like the last core. In this case, we’ll start if we have room in our FIFO unless we are in one of our two burst data calculation clocks.
Similarly, we can’t request a new burst if we are still stalled waiting for the last burst request to be accepted.
Finally, if the user has turned off video production, i.e. if he has dropped the cfg_active flag, or if we are waiting for a soft_reset to complete, then we refuse to start a new burst until after we come to a complete halt.
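Assembled, that condition might look like the following sketch, where no_fifo_space_available checks the top bits of the FIFO’s space counter:

```verilog
// Combinatorial burst-start condition for the frame buffer reader
assign	start_burst = !no_fifo_space_available	// Room for a full burst,
	&& !lag_start && !phantom_start		// not mid-calculation,
	&& (!M_AXI_ARVALID || M_AXI_ARREADY)	// the last request isn't stalled,
	&& cfg_active && !soft_reset;		// and we're running, not aborting.
```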
How much logic did we
use? Well,
everything in this combinatorial path is one bit, save for the
no_fifo_space_available
signal which just checks the
top bits of how much space is available, so perhaps about 7-8 bits. This
is quite within reason.
Now we need to calculate the actual burst length, ARLEN, that we are going to request. This is the hard part, and so we’ll do it in two steps.
For the first step, we check the remaining number of words to be read in the current line. If it’s greater than our maximum burst amount, then the maximum burst length would be a full burst, otherwise it would be limited by however much requested data is available.
This happens on the lag_start
clock cycle.
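A sketch of that first step:

```verilog
// Step one, on the lag_start cycle: how big could this burst be,
// ignoring alignment?  (max_burst counts beats, not AxLEN.)
always @(posedge S_AXI_ACLK)
if (lag_start)
begin
	if (req_line_words >= (1<<LGMAXBURST))
		max_burst <= (1<<LGMAXBURST);	// A full burst
	else
		max_burst <= req_line_words;	// Whatever remains in this line
end
```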
The key piece of this check is that req_line_words, our signal holding the number of words remaining in a line, is set on the phantom_start clock cycle, and this check is taking place on the cycle following–as soon as a new req_line_words value is available. For reference, this is shown in Fig. 14 above as part of req_addr.
That’s only the first part of generating arlen. For the second step, we’ll now need to compare this value against how many beats lie between us and the nearest burst boundary. Again, we check against the nearest burst boundary because it’s at most an 8-bit check, rather than the 12-bit check required for a full 4kB check.
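One way that clamp might be written (a sketch; the actual core differs in its details):

```verilog
// Step two, on the start_burst cycle (just prior to phantom_start):
// clamp the burst so it never crosses a (1<<LGMAXBURST)-beat boundary.
// The complemented address bits equal "beats to the boundary, minus one".
always @(posedge S_AXI_ACLK)
if (start_burst)
begin
	if (max_burst > { 1'b0, ~req_addr[ADDRLSB +: LGMAXBURST] })
		// Stop exactly at the boundary
		axi_arlen <= ~req_addr[ADDRLSB +: LGMAXBURST];
	else
		axi_arlen <= max_burst[7:0] - 1;
end
```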
You might notice above that I played a little sleight of hand with the two’s complement check. I want to check whether or not req_addr + max_burst reaches the next (1 << LGMAXBURST) boundary, checking only the lowest ADDRLSB +: LGMAXBURST bits of req_addr. After subtracting those req_addr bits from both sides, this becomes a check of whether max_burst is at least (1 << LGMAXBURST) - req_addr. By two’s complement, that threshold is just ~req_addr + 1 over these bits. Swapping the >= comparison against ~req_addr + 1 for a > comparison against ~req_addr alone lets me drop the +1 and its carry chain–as you can see above.
This computation will take place on the clock period just prior to phantom_start, and so we have the time to do this.
The last step is to generate the AXI address. This was calculated a couple of clocks earlier, as req_addr above, so it’s little more than a copy.
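In sketch form:

```verilog
// The burst address was already computed as req_addr
assign	M_AXI_ARADDR = req_addr;
```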
The key point to remember from this algorithm is that, yes, we did take two clocks to calculate the burst size, ARLEN. However, most video streams would never notice this two-clock hiccup.
That said, I have thought about adjusting this so that we only take two clocks on the first burst of any line, and one clock on any subsequent burst. We’ll see what the future brings for this algorithm.
Example: S2MM
The most complicated of all my examples, however, is my own AXI stream to memory data mover. This one took quite a bit of time to get right, and then even longer to adjust it so that it would work in a minimum amount of time. Here are some key points to the problem to consider:
-
The user can select any address to start on, regardless of whether it is burst aligned or not. (My current version still requires word alignment–but that’s another topic.)
-
The user can also select any length to start with, with the only constraint being that it must be greater than zero.
-
Once the user selects the address he wants to read from and the number of words to read, he must still issue a start command to the controller. This extra clock cycle is a gift. We can hide one of our two bounds checking clock cycles inside this free clock cycle.
-
Unlike the frame buffer reader above, we only have to deal with this initial burst check once rather than at the beginning of every new line. Every burst following that initial burst will be burst aligned, and so won’t need to be checked against a 4kB boundary.
- We also offer the new feature of being able to write all of our data values to a fixed word in memory space–perhaps the input word of some other programmable logic controller, perhaps even an output controller of some type. This adds a new requirement to our list of things to check: we now have to check against one of two maximum burst sizes, depending upon whether the user wants us to use the fixed addressing mode (max burst size of 16) or not (max burst size of 256, and r_increment is asserted).
To see how this all works out, let’s examine a trace showing this timing in Fig. 16 below.
This figure shows a memory copy of nine words, where the first burst isn’t aligned on a burst boundary, and so it takes two clock cycles. Let’s walk through how it works piece by piece.
First, just like the VDMA example above, this one also starts with a user issuing a configuration command containing the memory address to copy to as well as the length of the memory copy. This configuration ends with the user issuing a command to start.
As the user enters a new address, we’ll set a few flags. One flag, aw_needs_alignment, records whether or not the first burst will be artificially limited by the first burst boundary.
Two other flags record whether or not the user has asked for multiple bursts. One flag checks for multiple bursts assuming that the addresses will be incrementing and bursts will be 256 beats in length, whereas the other flag checks whether the transfer would require multiple bursts of 16-beats in length.
Only once the user issues the start command do we know which of the two limits will constrain us. That means we’re going to need a flag to let us know that this initial burst computation isn’t (quite) done yet. That’s the purpose of the r_pre_start flag, also shown in Fig. 16 above.
This core is a bit different from the others above in that it keeps track of
an aw_multiple_bursts_remaining
flag. This flag is equivalent to whether
or not two or more bursts are remaining. To maintain this equivalence, it is
set and maintained together with the counter containing the amount of remaining
data left to transfer.
It’s initialized once the full size of the transfer is available.
On any abort, as part of the soft reset logic, the request lengths are cleared.
In all other cases this remaining length is updated any time a new burst request is issued. What’s not apparent from this description is that the number of items remaining is calculated combinatorially from the current number of items minus the number in AWLEN. That way I can use the one new remaining-length signal in three separate non-blocking assignments.
The real key to this algorithm lies in how the length of the first burst is calculated. In particular, because of the flags we calculated earlier, we can keep this burst length calculation short and based on a small number of 9-bit comparisons only.
The first step is to calculate (combinatorially) the distance to the next burst boundary. This is roughly equivalent to (1<<(LGMAXBURST+ADDRLSB)) - cmd_addr. I have to separate it out for special treatment here for two reasons. First, I want to make certain that the operation is limited to LGMAXBURST bits–not LGMAXBURST+1 bits–so we drop a bit here. The second reason is that I want to drop the lower sub-word address bits from this calculation.
Here’s the key you’ve been waiting for. In general, the initial burst length is the full burst size. If we are going to be using FIXED addressing, then we need to limit ourselves to a maximum of 16 beats, not 256; and if the transfer doesn’t require multiple bursts, the burst length drops further, to the length of the transfer itself. Did you catch how that part of the computation only requires two bits? That’s much better than the 32-bit comparisons we started with. What about normal incrementing addressing? That’s the next check. Remember how we were able to reduce the question of whether we needed alignment to a single flag above? That check can now be handled with that single flag: if we don’t need to realign the first burst, and if we aren’t issuing multiple full bursts, then we can again issue a smaller burst length. A sketch of the whole calculation follows.
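Here’s that sketch. I’ve made up the names aw_multiple_full_bursts, aw_multiple_fixed_bursts, cmd_length_beats, and distance_to_boundary for this illustration; the actual core’s names (and widths) differ.

```verilog
// initial_burstlen counts beats (not AxLEN), and is LGMAXBURST+1 bits wide.
// aw_needs_alignment is assumed to be set only if the transfer actually
// extends past the first burst boundary.
always @(posedge S_AXI_ACLK)
if (r_pre_start)
begin
	// By default, ask for a full burst
	initial_burstlen <= (1<<LGMAXBURST);
	if (!r_increment)
	begin
		// FIXED addressing: at most 16 beats per burst
		initial_burstlen <= 16;
		if (!aw_multiple_fixed_bursts)
			// The whole transfer fits within one short burst
			initial_burstlen <= cmd_length_beats[4:0];
	end else if (aw_needs_alignment)
		// The first burst stops at the next burst boundary
		initial_burstlen <= distance_to_boundary;
	else if (!aw_multiple_full_bursts)
		// The whole transfer fits within a single burst
		initial_burstlen <= cmd_length_beats[LGMAXBURST:0];
end
```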
So, from four flags and two lengths, we’ve been able to determine our initial
burst length. That’s one LUT per bit of
initial_burstlen
.
Not bad. We might’ve also been able to handle this initial value one
clock earlier if r_increment
were defined prior to the user’s start command.
From here, we get to define the maximum burst length we can process. This is initially set by the clock following the start command to the initial value we just calculated above.
We then update this maximum burst length on every phantom_start
clock cycle.
That also means that we’ll know the maximum size of our next burst immediately following any phantom_start cycle. That would be enough to start a second burst immediately, save only that we still need to check whether that much data is in our FIFO–so it will still take a minimum of two clocks between burst starts, as illustrated in Fig. 16 above.
There are four cases to consider when checking for the next burst size, divided here into two categories. The first category is the “normal” one, where addresses are incrementing throughout the transfer.
In this case, if we don’t have multiple bursts remaining, then we need to check the number of beats remaining against the maximum size of the burst. Otherwise, we stick with the maximum burst size.
We then repeat the calculation for fixed addressing, with the only difference being that the maximum burst size (16 beats) is smaller.
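As a sketch, with next_beats_remaining standing in for the combinatorial “beats left after this burst” value described above. (The actual core folds these comparisons into its aw_multiple_bursts_remaining style flags to keep them narrow.)

```verilog
always @(posedge S_AXI_ACLK)
if (phantom_start)
begin
	if (r_increment)
	begin	// Normal, incrementing addressing: up to a full burst
		if (next_beats_remaining >= (1<<LGMAXBURST))
			r_max_burst <= (1<<LGMAXBURST);
		else
			r_max_burst <= next_beats_remaining[LGMAXBURST:0];
	end else begin
		// FIXED addressing: the same idea, capped at 16 beats
		if (next_beats_remaining >= 16)
			r_max_burst <= 16;
		else
			r_max_burst <= next_beats_remaining[4:0];
	end
end
```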
At this point, we know the size of our next burst. That was the hard part.
The next piece is knowing when to start the next burst. In this case, I named the start_transaction combinatorial signal w_phantom_start. As before, we start a burst whenever there is sufficient data in the FIFO to complete the burst.
This is a combinatorial check, depending upon the value of sufficiently_filled, so we’ll need to remember that burden as we work through this logic.
We’re also going to insist on a minimum of one clock between bursts, and an extra clock to calculate the burst size initially.
As with our other checks, we can’t start any new bursts if we are still stalled waiting for the last burst request to be accepted, or likewise if we are still transmitting the data from the last burst.
Finally, if we aren’t busy doing a copy then we can’t start; neither should we start any new bursts following any request for a bus abort.
That leaves only two pieces of logic. First, we’ll need to set AWLEN on any new burst request. Thankfully, this has already been calculated above from r_max_burst, so it’s little more than a register copy here.
The outgoing address is updated as soon as any successful burst is completed.
Note the check for whether or not the r_increment
flag is set–since the core
supports both incrementing and fixed addressing.
We’ll also set the write address any time we are not (yet) busy so that it is set to the first word of any upcoming transfer.
Conclusion
Achieving high throughput in any design is dependent upon being able to sustain read or write data transfers on every single clock cycle, even across burst boundaries if necessary. This requires creating AXI burst requests as soon as the data (or space) is available to make them. It also requires making sure that your request has crossed the interconnect and made it to the downstream slave as early as possible. Ideally, you’ll want the slave to have any subsequent burst requests waiting, on its doorstep, as soon as it has finished processing the last transaction, so that you can minimize any hiccups in the transfer.
Sadly, the complexity associated with making such requests is non-trivial. In particular, there are four criteria that need to be checked of any transfer length before it can be used to generate the next burst cycle. Depending upon the speed of your FPGA, these four criteria can take several clocks to accomplish.
Today, we looked at four separate AXI masters which all generate AXI burst transaction requests, and in particular we looked over how they could handle generating requests faster than the data could be delivered. Some solutions were very basic, such as insisting that all requests were of the same length. Other solutions were more complicated and required multiple clock cycles to set up. One common theme across all of these solutions was the need to calculate the details of the next burst in a pipelined fashion, so that those details would be ready before it was time to request the next burst. As a result, each of these AXI masters is able to sustain a full 100% bus throughput with one beat of data transferred on every clock cycle even as the requests crossed the boundary from one burst to the next, to the next one after that, and so on.
100% AXI throughput is therefore very achievable. You should accept nothing less in your designs.
He hath shewed thee, O man, what is good; and what doth the LORD require of thee, but to do justly, and to love mercy, and to walk humbly with thy God? (Mic 6:8)