The hard part of building a bursting AXI Master
This article continues our series on building AXI based components. So far, we’ve discussed what it takes to verify and then build an AXI-lite slave, and then an AXI (full) slave. We’ve examined what it takes to calculate the next address within a burst, and looked at the most common AXI mistakes along the way. More recently, we discussed how to build a basic AXI master–one that could issue multiple outstanding singleton read or write requests.
What we haven’t discussed is how to build an AXI master that will issue burst requests.
It’s not easy.
Today, let’s dig into one of the harder challenges involved in building a bursting AXI master: how to handle setting the various AxVALID, AxADDR, and AxLEN bus signals. In particular, the AXI bus protocol imposes several constraints on these signals that all need to be met at the same time. The challenge is figuring out how to generate bursts, meeting all of these constraints, without slowing down any transfers.

How then should an AXI bus master be built so that it can fill the bus with as many requests as possible?
So let’s start by looking at what these constraints are. I’ll then share several examples of open source AXI masters that can generate burst requests meeting these constraints, progressing from simple to more complex examples along the way. My examples will include a virtual FIFO, a memory-backed logic analyzer, a video frame buffer reader, and a stream-to-memory DMA. Each of these designs solves the multiple constraint problem in a slightly different way, so they form a useful set of examples to learn from.
AxVALID, AxLEN, and what makes this difficult
Let’s start by taking a peek at the logic required when setting AxLEN, and then at what’s required to set AxVALID. Here, I’m using the Ax prefix to reference either the AW (write address) channel or the AR (read address) channel interchangeably. Specifically, there are four challenging requirements when driving AxLEN, and then some other requirements when driving AxVALID. These requirements also impact the addresses we might choose to send, and so also the AxADDR signal. Getting all of these requirements right, without impacting the maximum frequency of the design, tends to be one of the most challenging parts of generating AXI bursts.

So let’s start out by looking over the four requirements of AxLEN.
- AxLEN can’t be any larger than one less than your maximum burst size. For AXI4, the maximum burst length is 256 beats. Since AxLEN is one less than the requested burst size, that means AxLEN must be no greater than 255. AXI3 is similar, but with a maximum burst size of 16 beats, so the AxLEN signal in AXI3 can be no larger than 15. I’ve tried to capture this basic protocol difference using a parameter containing the log of this maximum burst size, LGMAXBURST, which will be set to either 8, for a 256-beat burst, or 4, for a 16-beat burst, or to any user configurable value less than the protocol maximum. This allows us to express the maximum burst size as (1<<LGMAXBURST), and the maximum AxLEN value as (1<<LGMAXBURST)-1. To illustrate these various constraints, let’s build a draft of the logic necessary to calculate this AxLEN value, updating it for each constraint; a consolidated sketch of that draft follows this list. To start out, we’d obviously like to move as much data as we can, so our first constraint is simply that we want to set the next AxLEN value to the size of a full burst.
- If you want to read or write a burst from or to a fixed address, then the size of the maximum length burst drops from 256 beats down to 16 beats for both AXI4 and AXI3. This also applies to bursts using WRAP addressing, although I haven’t (yet) found a good use for them–and that includes even after building a CPU cache. I suppose I could make all of my bursts 16 beats, but I’m also aware that several vendor AXI components have a per-burst overhead of a couple of clock cycles. My goal when building any bus component is throughput, and that requires minimizing any overhead. That then means that I’ll need to use 256-beat bursts when writing to incrementing (subsequent) addresses, and 16-beat bursts when writing to fixed (identical) addresses. This forces an if statement, choosing between the two limits, in the sketch following this list.
- It’s illegal in AXI to cross 4kB boundaries. This comes directly from the AXI4 specification. While I’m not quite certain why this is, my guess is that it’s to guarantee that bursts will not cross either MMU page or device boundaries. This certainly simplifies the design of any interconnect, since it prevents the interconnect from needing to check whether or not a particular burst needs to be split when crossing device boundaries. Either way, this requirement is going to mean that we’re going to need to limit our AxLEN field again. This time it will need to be limited such that AxADDR[11:ADDRLSB] + (AxLEN + 1) never steps past a 4kB boundary.
At this point, we might be done if all bursts were to be multiples of the maximum burst size.
Sorry, but no. That’s not good enough. While that might work for the virtual FIFO, that strategy won’t work for the more common (arbitrary) DMA case.
- You don’t want to request a transfer of more data than the total amount you have left to transfer. Yes, this sounds obvious. It is just about as obvious as it sounds, too. However obvious it might be, though, we’ll still need to pay for the logic required to check it. Basically, in many data moving applications–including two of our later examples–the amount of memory that needs to be transferred is chosen by the user at run-time. That means you are going to need a counter that counts down as each burst is requested, so you always know how much data you have left to send. Then, using this counter, you can ask whether or not the amount remaining is less than the maximum burst length, and if so you would only request what remains. That is, if you have LEN words remaining in your transfer, you don’t want to transfer more than those LEN words. We’ll use remaining_transfer_length as this value in the sketch below. Further, for a full featured data mover, checking this value requires a comparison across as many bits as you are using to represent LEN. Since I tend to be a perfectionist, that can be a 32-bit comparison.
That’s our four criteria.
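To make this concrete, here’s a rough, unoptimized sketch of an AxLEN calculation that folds all four constraints together. The names w_increment, beats_to_4kb_boundary, and remaining_transfer_length are placeholders of my own choosing, not necessarily the names any of the designs below will use.

```verilog
// A draft AxLEN calculation, applying all four constraints at once.
// As written, this implies a long combinatorial path, which is exactly
// the problem discussed next.
reg	[7:0]	next_axlen;

always @(*)
begin
	// 1 & 2. Ask for a maximum length burst, where the maximum depends
	// upon the burst type: 256 beats incrementing, 16 beats FIXED
	if (w_increment)
		next_axlen = (1<<LGMAXBURST)-1;
	else
		next_axlen = 4'hf;

	// 3. Never cross a 4kB boundary.  beats_to_4kb_boundary counts the
	// beats remaining before that boundary.
	if (next_axlen >= beats_to_4kb_boundary)
		next_axlen = beats_to_4kb_boundary - 1;

	// 4. Never request more beats than remain in the overall transfer
	if (next_axlen >= remaining_transfer_length)
		next_axlen = remaining_transfer_length - 1;
end
```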
Now let’s add to this mess my rule of thumb that any 32-bit operation takes
one clock cycle. That means that calculating the next AxLEN
value alone
is going to require two clock cycles.
This is still unacceptable.
We have a second and similar problem with AxVALID, although it’s not nearly as bad. The problem with AxVALID is that you don’t want to set AxVALID unless you have next_axlen+1 data items available to be sent, in the case of writing, or next_axlen+1 spaces available in your FIFO if you are reading. Sure, the AXI bus allows you to stall the bus in both directions if the data isn’t quite ready yet, but do you really want to slow the rest of your design down with this component? Indeed, the rule of thumb here should be that once a burst has been requested, then either WVALID, for writes, or RREADY, for reads, should remain asserted until all of the beats of the burst transfer are complete.
So let’s build up some generic logic for starting a burst. We’ll call this
start_transaction
here, and we’ll make it combinatorial–since many things
might depend upon it. So, in general, we’ll start a burst as soon as we
either have data or space available for the transfer.
We’re going to need to be careful not to start a new transaction while the last transaction request remains outstanding and stalled.
Moreover, we don’t want to start a transaction while a burst write is in process.
This will also align the write address and write data channels. While this isn’t specifically required by the AXI specification, it simplifies the masters: This way, you can go about generating the AXI length once, and use it for both channels without requiring a FIFO in between them to keep their lengths synchronized. Be aware, though, that AXI doesn’t require this in general, so while this is a nice place to start when generating bursts, it’s not something you can depend upon in a slave when processing them.
Let’s add in two other basic criteria as well. For example, you want to be able to abort any transfer on a soft reset, such as might be caused by an error or external user reset request.
You also want to be able to guarantee that you won’t start an operation until the user has requested it. We’ll use an r_busy signal here to indicate that the AXI master is in its “transfer data” state. This yields another start condition.
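Pulling those pieces together, a generic start condition might look something like the following sketch. The names data_or_space_available, write_burst_in_progress, and soft_reset are placeholders standing in for whatever a particular design actually uses.

```verilog
// A generic, combinatorial "start a new burst" condition (sketch only)
assign	start_transaction = r_busy		// The user has started us,
		&& !soft_reset			//   and hasn't aborted,
		&& data_or_space_available	// enough data (writes) or space (reads),
		&& (!AxVALID || AxREADY)	// the last request isn't still stalled,
		&& !write_burst_in_progress;	// and any current write burst is done.
```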
If we’re not careful, all of this logic will eat into our timing slack, slowing down our overall data transfer rate. The worst offender in this chain is the check for whether or not the data (or space) available is greater than the amount we want to transfer. Worse, it depends upon knowing the amount of data to be transferred, so this test can’t take place until we’ve finished the two clocks above. That might slow our burst-to-burst issue time down from two clocks to three clocks.
Again, unacceptable.
If the goal is to be able to achieve 100% bus throughput, then we’re going to have to figure out a way to do better.
Simplifying the problem
Looking at these criteria, I wasn’t ready to settle for a three clock delay when building my AXI masters. So, I looked around to see if the problem could be simplified first. Sure enough, there are plenty of simplifications available to you.
AXI-Lite bursts
The easy way to handle this whole problem would be to use a burst length of 1. In this case, the logic would get really simple.
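As a sketch (with placeholder register names), every request would then be a single beat:

```verilog
// AXI-lite style bursts: every request is exactly one beat, so none of
// the length or boundary checks above are needed.
always @(posedge S_AXI_ACLK)
if (!axi_awvalid || M_AXI_AWREADY)
begin
	axi_awvalid <= start_transaction;
	axi_awlen   <= 8'h00;	// Always a single beat
	// axi_awaddr would simply step forward one word at a time
end
```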
If you are going straight into a MIG generated memory controller, this would be good enough. If you have to go into an ARM, an AXI crossbar or Xilinx’s AXI block RAM controller, on the other hand, this might cripple any throughput you might’ve otherwise had. The AXI slave controller we built on this blog, as well as my own crossbar would be fine both ways.
I’ll admit, sometimes I wonder why the designers of AXI didn’t just leave everything that simple. It would’ve made the bus so much easier to work with. Indeed, under conditions like this I was able to generate and verify the stream to Wishbone master I mentioned above in just a half a day. It was that easy.
Alas, AXI is not so simple, so let’s look at some other ways to simplify this problem.
Burst Alignment
One suggestion I came across early on was to align every burst to the size of the maximum burst. If we did that, then the first burst would need an alignment check, but nothing following would need to check for crossing any boundaries.
Now, for example, the first burst’s length computation might look like the sketch below.
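This is a sketch only; initial_axlen, start_addr, and ADDRLSB are placeholder names.

```verilog
// First burst only: run up to the next burst-aligned address, so that
// every following burst starts on a (1<<LGMAXBURST) beat boundary.
// (This would still need to be clamped by the remaining transfer
// length, just as before.)
always @(*)
	initial_axlen = ~start_addr[ADDRLSB +: LGMAXBURST];
	// i.e. ((1<<LGMAXBURST)-1) - start_addr[ADDRLSB +: LGMAXBURST]
```

If the starting address happens to be aligned already, this produces a full-length burst.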
That gets us down to three constraints–down from four. Wait, though, it gets better. On subsequent bursts, we can now set AxLEN based upon the remaining transfer size alone.
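For example (again a sketch, with placeholder names):

```verilog
// Later bursts are aligned by construction, so only the remaining
// transfer length matters.
always @(*)
if (remaining_transfer_length >= (1<<LGMAXBURST))
	next_axlen = (1<<LGMAXBURST)-1;			// A full burst
else
	next_axlen = remaining_transfer_length[7:0] - 1; // The (short) final burst
```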
Even though the initial length will still take two cycles, the subsequent length calculation might now just fit within a single clock cycle.
Hiding Computations
Let’s take a step back for a moment, though. Our goal is to be able to maintain 100% throughput across our bus. That means that we should be able to transfer one beat of information, whether WVALID && WREADY or RVALID && RREADY, on every clock cycle. Our goal is also to issue bus requests as soon as we either have the data available for any write requests, or alternatively as soon as we have the space available for any read requests. If our burst length is going to be anything more than 2 beats, does it really matter if we take a clock or two to calculate these values, as long as they are calculated and issued early enough so as not to impact performance?
Let’s therefore take one clock cycle to start the transaction, and then a
second clock cycle between any two AxVALID
signals to do our work above.
Fig. 4 below shows what this might look like for a write process.
The burst would start by setting both AWVALID and WVALID at the same time. The core could then take one clock cycle to be able to regenerate the next AWVALID. However, we wouldn’t set it until after sending WLAST. Therefore, as long as the burst is longer than two beats, we won’t suffer any loss.
I’ve tried to show this by making the AWVALID
signal in Fig. 4 an unknown,
just to mark that it could go high early, but the fact is that it isn’t an
unknown: it’s zero until the last beat is sent. There’s also the less likely
possibility of needing to send a burst that’s just one beat away from
a 256-beat boundary, but that’s a rare case and even then we’d only lose one
clock cycle in this setup.
Reads are similar, with the primary difference being that read requests are not synchronized with the read data. If we can keep our processing down to every other clock cycle, then we should be able to issue multiple read requests before the first result is ever returned. Further, after issuing some (user design dependent) number of read requests, we’d have to pause anyway to wait for uncommitted space to become available for more read returns.
As with the writes, however, the slave at the far end can ultimately only process one beat at a time. Therefore, this will also only cost us a delay if the burst is less than two beats.
That could buy us some time.
What would happen, though, if the slave didn’t accept our AxVALID signal immediately? The answer is that it could impact our throughput if we waited for the burst to be accepted before calculating the next burst’s parameters.

This is why I started using what I call phantom signals. You’ll see them throughout all of my AXI bursting master designs–all named something like phantom_read, phantom_write, or perhaps even phantom_start. The idea is that all of our burst calculation logic can take place when the phantom signal is true. We can then hide this logic inside any potential stall signals.
Let’s walk through how this might work. (A sketch of the resulting logic follows the list.)
- First, we’d have our start_transaction signal–whatever it is. This is a combinatorial signal, and simply tells us that it’s time to start a new burst.
- Then, on the next cycle, AxVALID would be high–indicating a registered transaction start. On this same cycle, the phantom_write or phantom_read signal would also be high–but for one cycle only. This is a signal internal to the design indicating that registered values (not the actual AXI protocol signals) can be adjusted as though the burst request had actually taken place.
- On the third cycle, the phantom signal would be low again–even though AxVALID might stay high until the channel was no longer stalled. This is the clock we’ll take to recycle our addressing. It’s also the first cycle where the start_transaction combinatorial signal might be high again. This is partially what’s being shown in Fig. 6 above. The phantom_read signal is only high for one clock tick, but on the first clock of any new read request cycle. That way, if it takes a couple of cycles for the read to be acknowledged, we’ll be ready for the next request by that time.
- Once AxVALID && AxREADY indicates a request has been accepted, we can then set start_transaction on the next cycle, and the process repeats until the transfer is complete.
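Here’s a rough sketch of how the AxVALID and phantom signals might be registered together. I’m using generic axi_avalid and axi_aready names here; each of the designs below does this in its own way.

```verilog
initial	{ axi_avalid, phantom_start } = 2'b00;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	{ axi_avalid, phantom_start } <= 2'b00;
else if (!axi_avalid || axi_aready)
begin
	// Safe to change the request: nothing is pending, or the last
	// request was just accepted
	axi_avalid    <= start_transaction;
	phantom_start <= start_transaction;	// True for this one cycle only
end else
	// The request is stalled: AxVALID holds, but the phantom drops
	phantom_start <= 1'b0;
```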
I’ve found this phantom starting signal to be quite useful. It decouples the AxLEN constraint logic from needing to wait on AxREADY. Better yet, it allows us to issue burst requests as fast as one burst request every other clock cycle–even when we are requesting full length bursts.
That’s useful.
Space available
There’s also another criterion we haven’t discussed yet, and we’ll only touch on it below: you don’t want to start a transaction until you know you can finish it. For writes, this means you don’t want to start the transaction until you have the data available somewhere–likely in a local FIFO. For reads, you don’t want to issue the read request until there’s somewhere for the returned data to go.
In general, this means you need to keep track of a counter of either the data available to be written, or the space available to be read into.
Only … it’s not quite so simple. In particular, after issuing a burst request, even if nothing else in the FIFO changes, the amount of data (or space) available changes just due to the fact that we’ve requested the transfer.
That leads to something closer to the following (for writes).
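Here’s one possible shape for that counter. fifo_write is the strobe for data entering the FIFO; the exact widths are placeholders.

```verilog
// Count the data in the FIFO that hasn't yet been committed to a burst
initial	data_available = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	data_available <= 0;
else case({ fifo_write, phantom_write })
2'b10: data_available <= data_available + 1;
2'b01: // A burst was just requested: that data is now spoken for, even
	// though it hasn't left the FIFO yet
	data_available <= data_available - (M_AXI_AWLEN + 1);
2'b11: data_available <= data_available + 1 - (M_AXI_AWLEN + 1);
default: begin end
endcase
```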
Data then enters the FIFO any time fifo_write is true. Once enough data has accumulated to form a burst write request, the data_available counter is dropped by the length of that request–even before the data is read out of the FIFO. That way we make certain we aren’t requesting writes based upon data that’s already been committed to a prior write. Indeed, I’ll often add an assertion to my design, just to make certain that I never request a data transfer for data that isn’t present. A sketch of such an assertion follows.
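In sketch form, with fifo_fill standing in for the FIFO’s actual fill level:

```verilog
// The amount of uncommitted data can never exceed what's actually
// sitting in the FIFO
always @(*)
	assert(data_available <= fifo_fill);
```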
You might think of this like a chemical production factory, as depicted in Fig. 7. As new product is created, it gets placed into a giant tank. As that product gets sold, the amount of product remaining in the tank that hasn’t yet been sold is the amount that’s available to be sold to the next customer. Hence, even if the tank is full, there might not be a full tank’s worth of product available to be sold. The amount available for sale, or in this case transfer, is what we keep track of.
A similar structure would work nicely for reads, as illustrated in Fig. 8. The
difference is that you’d be counting uncommitted empty space. That count
would start with the full FIFO’s size as space available, and would then be
decremented on any phantom_read
signal. Once the data was (later) read
out of the FIFO, you could return it to the count of uncommitted space
available.
This counter would be analogous to something like a coal bin at a power plant. One of the responsibilities of the staff at the power plant is to make certain that it never runs out of coal while in operation. They will therefore purchase train loads of coal to fill up the coal bin. It costs money, however, for the train to have to wait in order to unload. Therefore, you wouldn’t request a new trainload of coal until there’s room in the coal bin for not only the new trainload, but also for all other trains to empty in the bin that may have been previously ordered but not (yet) arrived. That’s the idea behind the “space available” calculation used with reads.
The phantom signals also allow us to hide the calculation of the amount of data (or space) available, similar to the way we handled our other calculations. That way, if AxVALID is ever stalled, even by one clock cycle, we’ll have this answer ready to request the next burst as soon as it’s no longer stalled.
With a little help, we can also register whether two or more full bursts of data are present in this counter as well.
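For example (a sketch; this assumes data_available is LGFIFO+1 bits wide):

```verilog
// True once two or more full bursts worth of uncommitted data remain.
// Checking only the bits above LGMAXBURST keeps the comparison narrow.
always @(posedge S_AXI_ACLK)
	multiple_bursts_available <= |data_available[LGFIFO:LGMAXBURST+1];
```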
While not all of my examples below use this multiple_bursts_available
signal, it can be useful if the FIFO “size” is much larger than the maximum
burst size. In that case, a 32-bit comparison might be reduced to 9-bits.
With those preliminaries out of the way, let’s take a look at several example designs to see how these problems might be either solved or at least mitigated.
Example: VFIFO
Our first example is that of a virtual FIFO. You might also call this a “memory backed FIFO”. The idea is that it implements all of the capability of a basic FIFO, but also that it uses an external memory in case the block RAM available in your FPGA isn’t sufficient for the task at hand. Indeed, if all you need from your SDRAM is a FIFO, and you don’t care (that much) about the latency, then this might be the perfect capability for your application.
The cool thing about the virtual FIFO is that the problem definition solves most of our AXI burst logic generation problems for us. For example, because there’s no limit to the amount of data you might wish to transfer, we don’t have to check for the maximum data amount anymore. Better yet, we can keep all bursts at the same (power-of-two) length, which then means that our addresses will always be aligned and we don’t need to check 4kB boundaries at all.
Let’s take a look at how this might work. We’ll examine the write path alone below, just for simplicity, although the read path is quite similar.
The first step is to determine when we are ready to write a burst of data to memory. This is the combinatorial flag, called start_write in this design, that takes place before the phantom signal–the same signal we called start_transaction above.
We’ll start any write as soon as we have enough data in our incoming FIFO to
fill up a burst. Note here that even if our FIFO can hold (1<<LGFIFO)
elements, this comparison only requires LGFIFO-LGMAXBURST
bits. It’s
a nifty trick you can often get away with, but you’ll have to be aware
of the difference between >
and >=
to do it. (Using >
would’ve created
an LGFIFO
bit comparison, not an LGFIFO-LGMAXBURST
bit comparison.)
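In sketch form, with ififo_fill as the incoming FIFO’s fill count (LGFIFO+1 bits wide):

```verilog
// True iff at least one full burst's worth of data is in the FIFO,
// i.e. ififo_fill >= (1<<LGMAXBURST).  Only the top bits matter.
assign	enough_data_for_a_burst = |ififo_fill[LGFIFO:LGMAXBURST];
```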
However, we can’t start writing if the entire FIFO–equivalent to the
size of the external SDRAM–is full. Yes, that would take a lot of data to
fill up most DRAMs–you’ll need
induction
to catch any problems here. Likewise, we don’t want to start writing if we
are still in the middle of a
soft reset.
(A soft reset
is where we reset the core without resetting the bus.) Similarly, we won’t
start any new burst on that same cycle where we’ve just issued a new
request–to give us a clock for the space_available
counter to adjust.
Further, if there’s no external/SDRAM memory space available for us to put this burst into, then we’ll just have to wait and start again once data is available.
How is this different from the vfifo_full flag above? As I’ve currently defined the virtual FIFO, it only has as much space available as there is memory space. Practically, it also has space in the two FIFOs as well, but since I’m counting them separately I need to check for them separately here as well.
Next, this mem_space_available_w flag captures whether or not there’s a full burst’s worth of space available in the FIFO’s backing memory. It’s not counting beats, but rather bursts. That way we can check 20 bits of the memory’s address space instead of 32–assuming a 4GB memory (32-bit address), 256-beat bursts (8 bits of address), and a 128-bit wide memory bus (4 bits of address).
Coming back to the problem at hand, we don’t want to issue a new write command while the last one is either still in progress or stalled while being issued.
I think I mentioned above that I like aligning my write address channel request with the first beat of write data. While the AXI bus protocol doesn’t require this, it simplifies the formal property check, and so I require it of my designs.
Finally, this particular core will stop all transactions on any downstream bus error. Errors like these should never happen and are usually an indication that you don’t (yet) have your memory space set up properly. Given that I’ve been burned before by writing to a peripheral when I thought I was writing to memory, I’m careful to avoid this possibility if possible. (My flash memory has never been the same since …) Hence, the FIFO comes to a hard stop following any bus errors.
Address adjustments are fairly easy as well. On a reset, the address gets set to zero. On a soft reset, it only gets set to zero if there’s no outstanding (stalled) request. In all other cases, we just add one burst length (times the bus address width) to the address every time a burst has been accepted.
Well, there is one trick here. Specifically, since I know that every burst must be aligned, I’m going to make certain that all of the lower address bits remain zero on every clock cycle.
This has two purposes. First, it simplifies the synthesis optimization pass by making it crystal clear what I want–these lower bits will always be zero. Second, it keeps me from needing to write an assertion that these bits will be zero–since they’ll be set back to zero on every clock cycle.
The final step is to set the write address length to the size of one burst.
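Put together, the write-request address and length for the virtual FIFO might reduce to something like the following sketch. (Placeholder names; the soft reset and error handling just described are omitted for brevity.)

```verilog
initial	axi_awaddr = 0;
always @(posedge S_AXI_ACLK)
begin
	if (i_reset)
		axi_awaddr <= 0;
	else if (M_AXI_AWVALID && M_AXI_AWREADY)
		// Advance by exactly one full burst each time a burst
		// request is accepted
		axi_awaddr <= axi_awaddr + (1<<(LGMAXBURST+ADDRLSB));

	// Keep the low bits glued to zero: every burst stays aligned
	axi_awaddr[LGMAXBURST+ADDRLSB-1:0] <= 0;
end

// Every burst is the same, full, power-of-two length
assign	M_AXI_AWLEN = (1<<LGMAXBURST)-1;
```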
To see how this works, let’s take a peek at a nominal trace, shown in Fig. 10. This is a shortened trace for demonstration purposes only. Specifically, the burst length has been shortened to 4-beats, whereas it would normally be 256-beats per burst for better performance.
Here in this figure you can see incoming data coming into the FIFO. Once a
full burst has arrived, the start_write
flag is raised. This starts the
cycle whereby this burst gets written to the external RAM. Once the
BVALID
acknowledgment is returned, there’s then memory available to be read
so the start_read
flag gets set and a read transaction begins. Further read
transactions are then triggered every time there’s sufficient data in memory
to trigger them. The data is read into an outgoing FIFO, and then delivered to
any follow on AXI stream component from there.
Example: WBSCOPE
One common debugging component used during FPGA development is an internal logic analyzer of some type. Such an analyzer records data until some number of clocks following a trigger (defined externally), and then stops. This allows you to see what led up to an event, or alternatively what happened after some event.
I like to use my own Wishbone Scope for this purpose. I’ve even got an AXI4-lite version of the same scope–just not (yet) an AXI4-lite version of the compressed scope. In this discussion, though, I’d like to discuss the idea of a similar scope, with a nearly identical user interface, but using AXI to be able to save the memory contents in an external (SD)RAM of some type.
I call this a MEMSCOPE–for lack of a better name. Fig. 11 above sort of shows the conceptual idea behind it.
I bring it up here because it’s just a little more complicated than the Virtual FIFO–not by much though. Much like the Virtual FIFO, everything about the MEMSCOPE is aligned–up until the last burst, which might need to terminate early. Unlike the Virtual FIFO, we don’t need to check if there’s space available in the memory–the MEMSCOPE just perpetually overwrites memory until it is told to stop.
Let’s take a peek at how this works.
We’ll start with the combinatorial start_transaction
signal, herein called
the w_phantom_start
signal.
As before, we only want to start if there’s data available in our local FIFO storage. Unlike before, we also have to check whether or not the scope has stopped recording and there is (potentially) a partial burst left to be written.
Of course, we can’t start a new burst if the write address channel is still stalled with the last burst.
Neither do we want to issue a new burst request if the data from the last burst hasn’t (yet) finished writing to memory.
During a soft reset, one where we reset this core without resetting the bus, or likewise after the scope has stopped collecting data, we’ll want to avoid writing anything more to the external RAM.
Finally, we insist on one clock between new burst requests to allow all of our registered counters to adjust to the new burst.
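Gathered together, that start condition might look something like this sketch (placeholder names throughout):

```verilog
// Combinatorial burst-start condition for the MEMSCOPE
assign	w_phantom_start =
	// Either a full burst is ready, or the scope has stopped and a
	// final (possibly partial) burst remains to be flushed
	((data_available >= (1<<LGMAXBURST))
			|| (scope_stopped && data_available != 0))
	&& (!M_AXI_AWVALID || M_AXI_AWREADY)	// Last request isn't stalled
	&& !write_burst_in_progress		// Last burst's data is written
	&& !soft_reset				// We aren't aborting
	&& !phantom_start;			// One idle clock between requests
```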
So far, this should look very similar to the Virtual FIFO.
The length handling, however, is subtly different. In particular, we’ll always send a full burst unless we’ve stopped collecting data and there remains a partial burst’s worth of data left.
The neat thing about this is that we don’t ever need to check against any 4kB boundaries–even though we permit an other-than-full-length burst.
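In sketch form, reusing the data_available counter from before:

```verilog
always @(posedge S_AXI_ACLK)
if (!M_AXI_AWVALID || M_AXI_AWREADY)
begin
	if (!scope_stopped || (data_available >= (1<<LGMAXBURST)))
		// While recording, every burst is a full burst
		axi_awlen <= (1<<LGMAXBURST)-1;
	else
		// Once stopped, flush whatever remains as one last
		// (possibly partial) burst
		axi_awlen <= data_available[7:0] - 1;
end
```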
The write address signal is only slightly more complex than the one for the
Virtual FIFO.
Here, we use the AWLEN value for the burst that has just been
completed to adjust the address–meaning that when we stop, the AWADDR
signal will point to the oldest address in memory.
Up until the scope stops, however, the lower bits will remain aligned.
Of course, we’ll start back at the beginning of RAM on any reset.
This core also issues word-aligned requests only.
So, that wasn’t so bad. As you saw, we were able to nicely simplify those
four AxLEN
criteria down to something usable, and so only lost one clock
between initiating one burst transaction and being able to initiate a
second–a clock cycle that wouldn’t be noticed anyway due to the fact that
we don’t issue write requests until the last one is complete.
Example: VDMA
Where things really start getting dicey is when the controller no longer has control over the start address or the length of any given burst. A classical example of this would be in the framebuffer reader capability found within my AXI video DMA. This core reads from a framebuffer and generates an outgoing video stream signal from it.
The AXI operations themselves are all line based reads, as shown in Fig. 13 below.
The reader starts reading from the first line in memory, located at a configurable frame address. Subsequent lines are all separated by a configurable “line step”, specifying the distance between lines. Then, within a given line, bursts start at the base address for the line and continue sequentially until the end of the line. Between one line and the next, there may be some unused space–it’s not required, but it is illustrated in Fig. 13 above.
To see how this works, let’s check out the line address generation logic below. It should match Fig. 13 above quite nicely.
- When the design is first activated, the VDMA always starts reading from the beginning of a frame.
- The internal addresses are only adjusted when a new read is issued.
This is separate and distinct from the
ARVALID && ARREADY
cycle where it is accepted by the bus.
- Internally, line and frame control are driven by two control signals: req_hlast (the last burst request in a line has been issued), and req_vlast (we are processing the last line in a frame). If both of these signals are true, then we need to re-start processing from the base address of the entire video frame. In this case, the cfg_ prefix denotes configuration values set by the user at run time.
- Otherwise, if this isn’t the last line of the frame, but we have already issued the last burst of the line, then we need to step our burst starting address, herein called req_addr, forward by one line. To keep track of where the new line begins, we’ll also step the “beginning of line” address forward by one line as well. Finally, we’ll recharge a counter containing the remaining number of words to be read in this line back to the full line width.
- In all other circumstances, we’ll step forward by the length of one burst–whatever that might currently be. You might also notice the ADDRLSB value in the request below. As you may recall from above, this is the log (base two) of the bus width in bytes. It’s equivalent to AXI’s M_AXI_ARSIZE, and it’s simply used here to adjust our burst address by the number of bytes within a single bus word. One final point I want to bring out here: whereas the first burst of any line might be misaligned, subsequent bursts will always be aligned with a maximum-burst-sized boundary. For this reason, you can always clear the bottom LGMAXBURST+ADDRLSB address bits, even though measuring the number of remaining words will take a bit more work.
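Here’s a sketch of that address handling, with placeholder names modeled on the description above. (The actual core’s details differ somewhat.)

```verilog
always @(posedge S_AXI_ACLK)
if (!cfg_active)
begin
	// When first activated, start from the top of the frame
	req_addr       <= cfg_frame_addr;
	req_line_addr  <= cfg_frame_addr;
	req_line_words <= cfg_line_words;
end else if (phantom_start)
begin	// Adjust only as each new read request is issued
	if (req_hlast && req_vlast)
	begin	// Last burst of the last line: wrap back to the frame start
		req_addr       <= cfg_frame_addr;
		req_line_addr  <= cfg_frame_addr;
		req_line_words <= cfg_line_words;
	end else if (req_hlast)
	begin	// Last burst of this line: step forward by one line
		req_addr       <= req_line_addr + cfg_line_step;
		req_line_addr  <= req_line_addr + cfg_line_step;
		req_line_words <= cfg_line_words;
	end else begin
		// Otherwise, step forward by the burst we just issued
		req_addr       <= req_addr + ((M_AXI_ARLEN + 1) << ADDRLSB);
		req_line_words <= req_line_words - (M_AXI_ARLEN + 1);
	end
end
```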
All of that is background information describing how a burst’s address is determined. It’s important if for no other reason than it points out that we really have no control over the initial address of a given burst, nor do we have any control over length of any lines. All of that information is determined by the user at run-time.
That means we’re going to need to suffer the two clock loss when calculating the next burst’s starting information. In this case, it’s not as bad as it sounds, and here’s why:
-
While you might want to use 100% of the memory bandwidth while reading a line of pixels, video timing typically gives you several free clocks between lines.
-
We could still hide our calculations in the throughput of the bus in all cases other than the first burst of any line.
If we wanted to, we could pre-calculate some line statistics while running through the previous line–sort of pipelining the calculation if you will–I just don’t think that’s required here.
As a result, we have three clocks of interest:

- There’s the first clock of the calculation. I’ll use the flag, lag_start, to identify this clock.
- There’s the combinatorial clock when start_burst is true. This clock cycle follows the lag_start cycle, and forms the second of the two clocks between burst starts.
- Finally, there’s the clock cycle where phantom_start is true. This is the clock cycle where the actual read is first issued on the bus.
- This is followed by another lag_start clock, and the cycle repeats.
Fig. 14 below shows how these signals all work together. This figure was built
under the assumption of a very large FIFO for the receive data. As such, more
requests are issued for memory than there are beats returned in this trace.
That’s actually a good thing–when the slave can’t handle any more requests,
it will drop ARREADY
. That’s a slave responsibility. It’s not the master’s
responsibility. The master should be holding ARVALID
high anytime it knows
it wants to request another packet. In the meantime, our requests will cross
the interconnect
while waiting for the slave to finish replying to our request. This means that
the read data, once produced, should be able to continue from one burst to the
next without hiccups–assuming the slave can handle that pace in the first
place.
Let’s walk through and discuss this figure, so you’ll follow what’s going on below.
-
The figure starts out on the far left with the user writing the number of words per line and the number of lines per frame to the core. On the next cycle, the user gives the core the frame’s base address. Finally, the user turns the interface on. From there on out,
cfg_active
is asserted indicating that the configuration is active. -
It takes us three clocks to get started once the configuration becomes active. On the last of these clocks, the
req_addr
is set. This will form the basics of the request that follows. -
A second clock is used to determine the maximum allowable burst length.
-
From this maximum allowable burst length we can now set
ARLEN
and issue a read command. -
Immediately after the read command, and possibly even before it is accepted by the bus, the
lag_start
signal goes high and we restart the process all over again.
Let’s now take a look into the design to see how all this was accomplished.
We’ll start by examining the start_transaction combinatorial flag, which in this core is called start_burst. In this design, burst requests are issued if there’s room in the FIFO to receive another burst’s worth of data.
What might not be apparent here is that I’m checking for whether or not there’s space enough for a full burst in the FIFO. This particular Video DMA core can’t (yet) handle single or even finite frame counts. It either outputs a stream of ongoing video data or nothing. For this reason, it doesn’t check whether or not there’s enough room in the FIFO for anything less than a full burst’s size.
That gets us started, unless … and there are a lot of “unless”es, just like the last core. In this case, we’ll start if we have room in our FIFO unless we are in one of our two burst data calculation clocks.
Similarly, we can’t request a new burst if we are still stalled waiting for the last burst request to be accepted.
Finally, if the user has turned off video production, i.e. if he has dropped the cfg_active flag, or if we are waiting for a soft_reset to complete, then we refuse to start a new burst until after we come to a complete halt.
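Assembled, that condition might look like the following sketch, where no_fifo_space_available checks the top bits of the FIFO’s space counter:

```verilog
// Combinatorial burst-start condition for the frame buffer reader
assign	start_burst = !no_fifo_space_available	// Room for a full burst,
	&& !lag_start && !phantom_start		// not mid-calculation,
	&& (!M_AXI_ARVALID || M_AXI_ARREADY)	// the last request isn't stalled,
	&& cfg_active && !soft_reset;		// and we're running, not aborting.
```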
How much logic did we
use? Well,
everything in this combinatorial path is one bit, save for the
no_fifo_space_available
signal which just checks the
top bits of how much space is available, so perhaps about 7-8 bits. This
is quite within reason.
Now we need to calculate the actual burst length, ARLEN, that we are going to request. This is the hard part, and so we’ll do it in two steps.
For the first step, we check the remaining number of words to be read in the current line. If it’s greater than our maximum burst amount, then the maximum burst length would be a full burst, otherwise it would be limited by however much requested data is available.
This happens on the lag_start
clock cycle.
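A sketch of that first step:

```verilog
// Step one, on the lag_start cycle: how big could this burst be,
// ignoring alignment?  (max_burst counts beats, not AxLEN.)
always @(posedge S_AXI_ACLK)
if (lag_start)
begin
	if (req_line_words >= (1<<LGMAXBURST))
		max_burst <= (1<<LGMAXBURST);	// A full burst
	else
		max_burst <= req_line_words;	// Whatever remains in this line
end
```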
The key piece of this check is that req_line_words, our signal holding the number of words remaining in a line, is set on the phantom_start clock cycle, and this check is taking place on the cycle following–as soon as a new req_line_words value is available. For reference, this is shown in Fig. 14 above as part of req_addr.
That’s only the first part of generating arlen. For the second step, we’ll now need to compare this value against how many beats lie between us and the nearest burst boundary. Again, we check against the nearest burst boundary because it’s at most an 8-bit check, rather than the 12-bit check required for a full 4kB check.
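One way that clamp might be written (a sketch; the actual core differs in its details):

```verilog
// Step two, on the start_burst cycle (just prior to phantom_start):
// clamp the burst so it never crosses a (1<<LGMAXBURST)-beat boundary.
// The complemented address bits equal "beats to the boundary, minus one".
always @(posedge S_AXI_ACLK)
if (start_burst)
begin
	if (max_burst > { 1'b0, ~req_addr[ADDRLSB +: LGMAXBURST] })
		// Stop exactly at the boundary
		axi_arlen <= ~req_addr[ADDRLSB +: LGMAXBURST];
	else
		axi_arlen <= max_burst[7:0] - 1;
end
```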
You might notice above that I played a little sleight of hand with the two’s complement check. I want to check whether or not req_addr + max_burst reaches the next (1 << LGMAXBURST) boundary, checking only the lowest ADDRLSB +: LGMAXBURST bits of req_addr. After subtracting those req_addr bits from both sides, this becomes a check of whether max_burst is at least (1 << LGMAXBURST) - req_addr. By two’s complement, that threshold is just ~req_addr + 1 over these bits. Swapping the >= comparison against ~req_addr + 1 for a > comparison against ~req_addr alone lets me drop the +1 and its carry chain–as you can see above.
This computation will take place on the clock period just prior to phantom_start, and so we have the time to do this.
The last step is to generate the AXI address. This was calculated a couple of clocks earlier, as req_addr above, so it’s little more than a copy.
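In sketch form:

```verilog
// The burst address was already computed as req_addr
assign	M_AXI_ARADDR = req_addr;
```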
The key point to remember from this algorithm is that, yes, we did take two clocks to calculate the burst size, ARLEN. However, most video streams would never notice this two-clock hiccup.
That said, I have thought about adjusting this so that we only take two clocks on the first burst of any line, and one clock on any subsequent burst. We’ll see what the future brings for this algorithm.
Example: S2MM
The most complicated of all my examples, however, is my own AXI stream to memory data mover. This one took quite a bit of time to get right, and then even longer to adjust it so that it would work in a minimum amount of time. Here are some key points to the problem to consider:
-
The user can select any address to start on, regardless of whether it is burst aligned or not. (My current version still requires word alignment–but that’s another topic.)
-
The user can also select any length to start with, with the only constraint being that it must be greater than zero.
-
Once the user selects the address he wants to read from and the number of words to read, he must still issue a start command to the controller. This extra clock cycle is a gift. We can hide one of our two bounds checking clock cycles inside this free clock cycle.
-
Unlike the frame buffer reader above, we only have to deal with this initial burst check once rather than at the beginning of every new line. Every burst following that initial burst will be burst aligned, and so won’t need to be checked against a 4kB boundary.
- We also offer the new feature of being able to write all of our data values to a fixed word in memory space–perhaps the input word of some other programmable logic controller, perhaps even an output controller of some type. This adds a new requirement to our list of things to check: we now have to check against one of two maximum burst sizes, depending upon whether the user wants us to use the fixed addressing mode (max burst size of 16) or not (max burst size of 256, and r_increment is asserted).
To see how this all works out, let’s examine a trace showing this timing in Fig. 16 below.
This figure shows a memory copy of nine words, where the first burst isn’t aligned on a burst boundary, and so it takes two clock cycles. Let’s walk through how it works piece by piece.
First, just like the VDMA example above, this one also starts with a user issuing a configuration command containing the memory address to copy to as well as the length of the memory copy. This configuration ends with the user issuing a command to start.
As the user enters a new address, we’ll set a few flags. One flag, aw_needs_alignment, records whether or not the first burst will be artificially limited by the first burst boundary.
Two other flags record whether or not the user has asked for multiple bursts. One flag checks for multiple bursts assuming that the addresses will be incrementing and bursts will be 256 beats in length, whereas the other flag checks whether the transfer would require multiple bursts of 16-beats in length.
Only once the user issues the start command do we know which of the two limits will constrain us. That means we’re going to need a flag to let us know that this initial burst computation isn’t (quite) done yet. That’s the purpose of the r_pre_start flag, also shown in Fig. 16 above.
This core is a bit different from the others above in that it keeps track of
an aw_multiple_bursts_remaining
flag. This flag is equivalent to whether
or not two or more bursts are remaining. To maintain this equivalence, it is
set and maintained together with the counter containing the amount of remaining
data left to transfer.
It’s initialized once the full size of the transfer is available.
On any abort, as part of the soft reset logic, the request lengths are cleared.
In all other cases this remaining length is updated any time a new burst request is issued. What’s not apparent from this description is that the number of items remaining is calculated combinatorially from the current number of items minus the number in AWLEN. That way I can use the one new remaining-length signal in three separate non-blocking assignments.
The real key to this algorithm lies in how the length of the first burst is calculated. In particular, because of the flags we calculated earlier, we can keep this burst length calculation short and based on a small number of 9-bit comparisons only.
The first step is to calculate (combinatorially) the distance to the next burst boundary. This is roughly equivalent to (1<<(LGMAXBURST+ADDRLSB)) - cmd_addr. I have to separate it out for special treatment here for two reasons. First, I want to make certain that the operation is limited to LGMAXBURST bits–not LGMAXBURST+1 bits–so we drop a bit here. The second reason is that I want to drop the lower sub-word address bits from this calculation.
Here’s the key you’ve been waiting for. In general, the initial burst length is the full burst size. If we are going to be using FIXED addressing, then we need to limit ourselves to a maximum of 16 beats, not 256; and if the transfer doesn’t require multiple bursts, the burst length drops further, to the length of the transfer itself. Did you catch how that part of the computation only requires two bits? That’s much better than the 32-bit comparisons we started with. What about normal incrementing addressing? That’s the next check. Remember how we were able to reduce the question of whether we needed alignment to a single flag above? That check can now be handled with that single flag: if we don’t need to realign the first burst, and if we aren’t issuing multiple full bursts, then we can again issue a smaller burst length. A sketch of the whole calculation follows.
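Here’s that sketch. I’ve made up the names aw_multiple_full_bursts, aw_multiple_fixed_bursts, cmd_length_beats, and distance_to_boundary for this illustration; the actual core’s names (and widths) differ.

```verilog
// initial_burstlen counts beats (not AxLEN), and is LGMAXBURST+1 bits wide.
// aw_needs_alignment is assumed to be set only if the transfer actually
// extends past the first burst boundary.
always @(posedge S_AXI_ACLK)
if (r_pre_start)
begin
	// By default, ask for a full burst
	initial_burstlen <= (1<<LGMAXBURST);
	if (!r_increment)
	begin
		// FIXED addressing: at most 16 beats per burst
		initial_burstlen <= 16;
		if (!aw_multiple_fixed_bursts)
			// The whole transfer fits within one short burst
			initial_burstlen <= cmd_length_beats[4:0];
	end else if (aw_needs_alignment)
		// The first burst stops at the next burst boundary
		initial_burstlen <= distance_to_boundary;
	else if (!aw_multiple_full_bursts)
		// The whole transfer fits within a single burst
		initial_burstlen <= cmd_length_beats[LGMAXBURST:0];
end
```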
So, from four flags and two lengths, we’ve been able to determine our initial
burst length. That’s one LUT per bit of
initial_burstlen
.
Not bad. We might’ve also been able to handle this initial value one
clock earlier if r_increment
were defined prior to the user’s start command.
From here, we get to define the maximum burst length we can process. This is initially set by the clock following the start command to the initial value we just calculated above.
We then update this maximum burst length on every phantom_start
clock cycle.
That also means that we’ll know the maximum size of our next burst immediately following any phantom_start cycle. That would be enough to start a second burst immediately, save only that we still need to check whether that much data is in our FIFO–so it will still take a minimum of two clocks between burst starts, as illustrated in Fig. 16 above.
There are four cases to consider when checking for the next burst size, divided here into two categories. The first category is the “normal” one, where addresses are incrementing throughout the transfer.
In this case, if we don’t have multiple bursts remaining, then we need to check the number of beats remaining against the maximum size of the burst. Otherwise, we stick with the maximum burst size.
We then repeat the calculation for fixed addressing, with the only difference being that the maximum burst size (16 beats) is smaller.
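As a sketch, with next_beats_remaining standing in for the combinatorial “beats left after this burst” value described above. (The actual core folds these comparisons into its aw_multiple_bursts_remaining style flags to keep them narrow.)

```verilog
always @(posedge S_AXI_ACLK)
if (phantom_start)
begin
	if (r_increment)
	begin	// Normal, incrementing addressing: up to a full burst
		if (next_beats_remaining >= (1<<LGMAXBURST))
			r_max_burst <= (1<<LGMAXBURST);
		else
			r_max_burst <= next_beats_remaining[LGMAXBURST:0];
	end else begin
		// FIXED addressing: the same idea, capped at 16 beats
		if (next_beats_remaining >= 16)
			r_max_burst <= 16;
		else
			r_max_burst <= next_beats_remaining[4:0];
	end
end
```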
At this point, we know the size of our next burst. That was the hard part.
The next piece is knowing when to start the next burst. In this case, I named the start_transaction combinatorial signal w_phantom_start. As before, we start a burst whenever there is sufficient data in the FIFO to complete the burst.
This is a combinatorial check, depending upon the value of sufficiently_filled, so we’ll need to remember that burden as we work through this logic.
We’re also going to insist on a minimum of one clock between bursts, and an extra clock to calculate the burst size initially.
As with our other checks, we can’t start any new bursts if we are still stalled waiting for the last burst request to be accepted, or likewise if we are still transmitting the data from the last burst.
Finally, if we aren’t busy doing a copy then we can’t start; neither should we start any new bursts following any request for a bus abort.
That leaves only two pieces of logic. First, we’ll need to set AWLEN on any new burst request. Thankfully, this has already been calculated above from r_max_burst, so it’s little more than a register copy here.
The outgoing address is updated as soon as any successful burst is completed.
Note the check for whether or not the r_increment
flag is set–since the core
supports both incrementing and fixed addressing.
We’ll also set the write address any time we are not (yet) busy so that it is set to the first word of any upcoming transfer.
Conclusion
Achieving high throughput in any design is dependent upon being able to sustain read or write data transfers on every single clock cycle, even across burst boundaries if necessary. This requires creating AXI burst requests as soon as the data (or space) is available to make them. It also requires making sure that your request has crossed the interconnect and made it to the downstream slave as early as possible. Ideally, you’ll want the slave to have any subsequent burst requests waiting, on its doorstep, as soon as it has finished processing the last transaction, so that you can minimize any hiccups in the transfer.
Sadly, the complexity associated with making such requests is non-trivial. In particular, there are four criteria that need to be checked of any transfer length before it can be used to generate the next burst cycle. Depending upon the speed of your FPGA, these four criteria can take several clocks to accomplish.
Today, we looked at four separate AXI masters which all generate AXI burst transaction requests, and in particular we looked over how they could handle generating requests faster than the data could be delivered. Some solutions were very basic, such as insisting that all requests were of the same length. Other solutions were more complicated and required multiple clock cycles to set up. One common theme across all of these solutions was the need to calculate the details of the next burst in a pipelined fashion, so that those details would be ready before it was time to request the next burst. As a result, each of these AXI masters is able to sustain a full 100% bus throughput with one beat of data transferred on every clock cycle even as the requests crossed the boundary from one burst to the next, to the next one after that, and so on.
100% AXI throughput is therefore very achievable. You should accept nothing less in your designs.
He hath shewed thee, O man, what is good; and what doth the LORD require of thee, but to do justly, and to love mercy, and to walk humbly with thy God? (Mic 6:8)