Understanding AXI Addressing

For those who’ve been following me on twitter, you’ll know that I’ve been slowing working through understanding the AXI standard. My goal has been to create formal verification IP for either an AXI peripheral or an AXI master. I’ve made a lot of progress, and so I’d like to share some of that with you today.

Fig 1: A Signal Collection Application

My journey with AXI actually started some time ago, under a government contract. I needed to move data from the DSP code I had written within the FPGA side of an FPGA+ARM chip onto the Ethernet. I had software running on the ARM processor side that read from a FIFO within the FPGA logic, as shown in Fig. 1 above. Since it was easy to do, I connected to the low-speed interconnect of the Cyclone-V, and then converted that to an Avalon bus, then to a WB bus, and finally to a 64-bit WB bus in order to get to my peripheral. I was later surprised that my design wasn’t working nearly as fast as I wanted.

Fig 2: Too many bus transformations

Once I finally managed to put a wishbone scope into the design to probe the bus interaction, I realized that every time a transformation took place from one bus protocol to another, another clock cycle was consumed to do it. Reading from my FIFO therefore took a clock to convert from AXI to Avalon, another one to convert from Avalon to WB, and then four to convert from my 32-bit WB bus to WB with a 64-bit data width. The return path was similarly slow. I lost a clock within my AutoFPGA generated WB interconnect, another clock clock converting from WB back to Avalon, and then another clock converting from Avalon back to AXI. Or rather, I assume there was a clock lost between AXI and Avalon and back, but I never had access to that part of the design in order to probe it. Either way, my code couldn’t meet the real-time requirements I had given it and I was wasting precious clock cycles in useless bus transformations.

Normally I don’t pay much attention to bus latency. What’s a clock or two between bus components, right? Usually, this doesn’t hurt me. Since I use pipelining on the WB bus, I can send up to one request per clock cycle, and a two-clock latency on an N item burst brings the total time to N+2 clocks. This works nicely for most applications.

Not this one.

In this case, the ARM processor was only ever reading one item at a time. Nothing was pipelined. Every read therefore took the full latency of about ten clocks to process.

No wonder I wasn’t meeting my real time requirements.

The easy way to solve this problem would be to get rid of all the bus bridges, and to switch to the low speed to the high speed AXI bus connection.

The high speed bus on the Cyclone-V, however, uses AXI. It doesn’t use AXI-lite, it uses AXI. If I wanted to use AXI-lite, I’d need another bus translator.

Prior to that project, I had tried to create an AXI slave myself. I spent several weeks on it, ending up unsuccessful. So, with government dollars paying for my time, I tried again. I never even made it through the design process, much less verification and simulation. Eventually I gave up rather than to impact the critical path of the project. Several weeks after giving up in frustration, I realized my confusion, but by then it was too late.

In the end, I never met my real-time promises to my customer.

Ever had something like that that just burns you up? Where you made a promise to deliver something to someone and then didn’t deliver?

So now I’m re-examining AXI again. Perhaps if I had a formal property file this time, it would help me through the design process?

So that’s what I’m working on, and that’s what led me to today’s post.

To test whether the property file “works” or not, I’m also building a variety of AXI peripherals. Perhaps you’ve read my ramblings on twitter about all of the bugs found in Xilinx’s example AXI slave? Perhaps you’ve seen the (formally verified) AXI demonstration slave I’ve been able to put together? Or the (formally verified) bus fault isolator, which can be used with a (possibly) faulty slave to keep from needing breaking the bus so badly that only a power cycle can rescue a design? These have been my early successes.

I’m still working on an AXI to WB bridge, as well as an example (open-source, and formally verified) AXI crossbar interconnect.

In all of this work, I’ve managed to get far enough along in my understanding of the AXI protocol to be able to discuss a piece of it with you today.

Specifically, I’d like to discuss how addressing works in AXI burst transactions.

Addressing made simple

If you know nothing more about AXI addressing, you need to know this: The AXI address represents the address of the byte, not the word. This is unlike Wishbone, where the address represents the word and not the byte within it. To convert from Wishbone to AXI, add zero bits. To convert from AXI to Wishbone, drop the low order bits.

That’ll get you past any AXI-lite issues. It will also get you past any single-address transactions using AXI.

The rest of the full AXI protocol isn’t quite that easy.

Let’s start at the top. an AXI transaction begins on the address write/read channel.

Fig 3: An AXI Write transaction

A write transaction begins when the bus master describes the burst of information to be written on the write address channel. This includes the starting address of the transaction, the length of the transaction, and more. The master then sends the data associated with the transaction to the slave. Once accomplished, the slave will return a single acknowledgment. You can see an example of this, drawn from a cover() statement in the proof of this slave in Fig. 3.

Fig 4: An AXI Read transaction

Reads are similar, as shown in Fig. 4–also drawn from a cover() statement applied to the same core, in that they also begin on the read address channel. However, instead of being followed by a channel of data from the master to the slave, the slave responds instead by returning the data it has read. The last item of data from the returned by the slave is marked with an RLAST flag and it concludes the transaction.

Since both the write address and read address packets are very similar, I’ll describe both together and just use the Ax prefix to refer to an address signal that could be on one or the other of the two channels. When AxVALID is true, a transaction request has been placed on the channel. Several other values associated with this request will tell you the size and length of the requested transaction.

AxADDR. We just discussed this above. The AxADDR lines reference the byte within the burst.

Just for a fun reference here, I grew up thinking that the word “byte” meant 8-bits of data. It’s not. That’s the word “octet”. A “byte” is the smallest addressable unit of data on a bus. For POSIX compliant CPUs, that’s 8-bits. Other CPUs, to include the original ZipCPU, can have different sized bytes. In my case, I started out supported 32-bit bytes.

The AXI bus supports 8-bit bytes, and each byte can be read or written separately using the WSTRB signal–but we’re now getting ahead of ourselves. The key takeaway here is that bytes and octets aren’t necessarily the same thing, but we can use them interchangably when discussing the AXI protocol.
AxLEN. When I first examined AXI, the AxLEN field appeared to be the biggest reason to use it. With a single request on the address channel, you can request anywhere between one and 256 values by just setting AxLEN to 0-255 respectively.

But how should the address of each of those values be calculated? That’s the purpose of the AxBURST value.
AxBURST is a two-bit value. It describes whether the address is to be fixed (AxBURST == 2'b00), incremented (AxBurst == 2'b01), or wrapping (AxBurst == 2'b10).

In general, the address of each beat of the burst will increment by the number of bytes in a bus word. Only … it’s never that simple. Let’s come back to this in a moment.
AxSIZE is a three bit value referencing the size of the data transfer. The size can be anywhere between an octet, AxSIZE == 3'b000, two octets, AxSIZE == 3'b001, four octets, AxSIZE==3'b010, all the way up to 128 octets when AxSIZE == 3'b111.

The rule is that AxSIZE can only ever be less than or equal to your bus size. Since I tend to work with 32-bit busses, that means any AxSIZE must be less than or equal to 3'b010.

These are the four addressing signals we’ll look at today: AxADDR, AxLEN, AxBURST, and AxSIZE. Let’s walk through how to use these as a function of the burst type.

Types of Burst Addressing

As we mentioned above, there are three basic types of burst addressing: FIXED, INCREMENT, and WRAP. An AxBURST value of 2'b11 is reserved, and so illegal.

Let’s look at each of these in turn.

Fixed Addressing

This is perhaps the simplest of all of the addressing modes. The first address in the burst is given by AxADDR, and it never changes from there. This means that, within a burst, we can calculate the next address, we’ll call this o_next_address, from the last address, we’ll call that i_last_addr and the burst type, i_burst.

	if (i_burst == 2'b00)
		o_next_address = i_last_addr;

That’s pretty easy.

This would be the perfect addressing mode for reading from a FIFO as I described above. Since a FIFO can be represented with a single address, this is perfect.

If I ever get a chance to do a similar contract again, this is how I’d do it.

Today, though, I’m working on formal verification properties. That kind of forces me to support every mode, so I’m not yet done.

Increment Addresses

Increment addressing is the kind of addressing you may be more familiar with. It’s the type you’d use when doing a memcpy. In its simplest form, we might write:

	if (i_burst == 2'b01)
		o_next_address = i_last_addr + 1;

Only that’s not quite right. AXI allows you to transfer multiple bytes per transaction, and the AXI address references the first byte in each burst. Hence, if we have a 32-bit data bus, we’d want to increment our address by four bytes at a time.

	if (i_burst == 2'b01)
		o_next_address = i_last_addr + 4;

Even this isn’t quite right. While AXI doesn’t require AxADDR to be an aligned address, all of the subsequent addresses must be aligned.

Fig 5: Incremental burst addressing

Aligning the addresses isn’t all that hard for a 32-bit bus. Indeed, we might simply write,

	if (i_burst == 2'b01)
	begin
		o_next_address = i_last_addr + 4;

		// Force subsequent alignment
		o_next_address[1:0] = 0;
	end

While we’re making good progress, it’s still not this simple. Remember the AxSIZE parameter? AxSIZE determines our increment, not our bus size. When AxSIZE==0, the increment is one. When it’s 3'b001, we increment by two. When AxSIZE==3'b010 we increment by four and so on.

We’ll use i_size to represent AxSIZE in our pseudocode. We can then write:

	if (i_burst == 2'b01)
	begin
		increment = (1<<i_size);
		o_next_address = i_last_addr + increment;

That handles our increment. The problem now is that alignment is more painful, since our ultimate alignment depends upon the AxSIZE of the transaction.

		// Force subsequent alignment
		if(i_size == 1)
			// 16-bit alignment
			o_next_addr[0] = 0;
		else if (i_size == 2)
			// 32-bit alignment
			o_next_addr[1:0] = 0;
		else if (i_size == 3)
			// 64-bit alignment
			o_next_addr[2:0] = 0;
		else if (i_size == 4)
			// 128-bit alignment
			o_next_addr[3:0] = 0;
		else if (i_size == 5)
			// 256-bit alignment
			o_next_addr[4:0] = 0;
		else if (i_size == 6)
			// 512-bit alignment
			o_next_addr[5:0] = 0;
		else if (i_size == 7)
			// 1024-bit alignment
			o_next_addr[6:0] = 0;
	end

Voila! We’ve enforced the alignment we need, and so we can now calculate the next address during an increment.

I discovered the problem with this approach when I tried to verify the AXI example code from Vivado. I built a slave with only 5 bits of address. In this case, o_next_addr[6:5] weren’t defined. So I added in a test of the address width.

		// Force subsequent alignment
		if ((i_size == 1)&&(AW>0))
			// 16-bit alignment
			o_next_addr[0] = 0;
		else if (i_size == 2)&&(AW>1))
			// 32-bit alignment
			o_next_addr[1:0] = 0;
		else if (i_size == 3)&&(AW>2))
			// 64-bit alignment
			o_next_addr[2:0] = 0;
		else if (i_size == 4)&&(AW>3))
			// 128-bit alignment
			o_next_addr[3:0] = 0;
		else if (i_size == 5)&&(AW>4))
			// 256-bit alignment
			o_next_addr[4:0] = 0;
		else if (i_size == 6)&&(AW>5))
			// 512-bit alignment
			o_next_addr[5:0] = 0;
		else if (i_size == 7)&&(AW>6))
			// 1024-bit alignment
			o_next_addr[6:0] = 0;
	end

The code now no longer referenced any undefined addresses, since the AW>x check found and fixed that.

Only … one of my favorite (unnamed) simulation tools couldn’t handle this. Perhaps this is because I’m using a very old version of the tool. For whatever reason, I had to do something to keep this design from referencing bits I didn’t have. I chose the following solution, therefore. It’s not pretty, but it worked for AW = 6.

		// Force subsequent alignment
		if ((i_size == 1)&&(AW>0))
			// 16-bit alignment
			o_next_addr[0] = 0;
		else if (i_size == 2)&&(AW>1))
			// 32-bit alignment
			o_next_addr[((AW-1>0)?1:(AW-1)):0] = 0;
		else if (i_size == 3)&&(AW>2))
			// 64-bit alignment
			o_next_addr[((AW-2>0)?2:(AW-1)):0] = 0;
		else if (i_size == 4)&&(AW>3))
			// 128-bit alignment
			o_next_addr[((AW-3>0)?3:(AW-1)):0] = 0;
		else if (i_size == 5)&&(AW>4))
			// 256-bit alignment
			o_next_addr[((AW-4>0)?4:(AW-1)):0] = 0;
		else if (i_size == 6)&&(AW>5))
			// 512-bit alignment
			o_next_addr[((AW-5>0)?5:(AW-1)):0] = 0;
		else if (i_size == 7)&&(AW>6))
			// 1024-bit alignment
			o_next_addr[((AW-6>0)?6:(AW-1)):0] = 0;

It’s at least good enough that we can move on.

Wrapping Addresses

Many memories support address wrapping, and the AXI bus even has a burst type which will support wrapping addresses.

Fig 6: Wrap burst addressing

The idea behind address wrapping usually comes from cache accesses. An address request from the CPU might land anywhere within a given cache line. The CPU needs to invalidate and then fill the whole cache line. During this time, the CPU will be stalled until the value it needs is returned from memory. If the value lands towards the middle of the cache line, the CPU would have to be stalled during the entire read. On the other hand, if the memory can return the middle value of the burst first and then wrap around to return the first part of the burst, the CPU doesn’t need to be stalled nearly as long.

This is why address wrapping is so important. Indeed, many SDRAM chips will support address wrapping of some type.

Faster CPUs is a good thing, right?

Well … that may be true, but it certainly makes the bus more complicated.

The basic idea of wrapping is that the address bits act as though only some bits increment, and others don’t. We can capture this with a bit-wise mask I’m going to call wrap_mask. Using this mask, we can write our address wrapping code as,

	if (i_burst == WRAP)
	begin
		increment = (1<<i_size);
		o_next_addr = i_last_addr + increment;

		o_next_addr = (i_last_addr & ~wrap_mask)
				| (o_next_addr & wrap_mask);
	end

Now the only trick is to figure out how to generate wrap_mask.

There are also a couple of rules associated with address wrapping that play into this as well. For example, when using address wrapping the first burst must be aligned. Similarly, only burst lengths of 2, 4, 8, and 16 are supported, corresponding to AxLEN values of 1, 3, 7, and 15 respectively. We’ll write these rules out in a moment later.

For now, we can calculate our wrap address via,

	wrap_mask = 0;
	if (i_len == 1)
		wrap_mask = (1<<(i_size+1));
	else if (i_len == 3)
		wrap_mask = (1<<(i_size+2));
	else if (i_len == 7)
		wrap_mask = (1<<(i_size+3));
	else if (i_len == 15)
		wrap_mask = (1<<(i_size+4));

All put together

We’ve now learned how to calculate the next address of an AXI transaction from the last one. If you put all of this code together, you’ll get something looking like the following:

`default_nettype none
//
//
module	faxi_addr(i_last_addr,
		i_size,
		i_burst,
		i_len,
		o_incr,
		o_next_addr);
	parameter	AW=32;
	input	wire	[AW-1:0]	i_last_addr;
	input	wire	[2:0]		i_size; // 1b, 2b, 4b, 8b, etc
	input	wire	[1:0]		i_burst; // fixed, incr, wrap, reserved
	input	wire	[7:0]		i_len;
	output	reg	[7:0]		o_incr;
	output	reg	[AW-1:0]	o_next_addr;

	(* keep *) reg	[AW-1:0]	wrap_mask, increment;

	always @(*)
	begin
		increment = 0;
		if (i_burst != 0)
		begin
			// verilator lint_off WIDTH
			case(i_size)
			0: increment =  1;
			1: increment =  2;
			2: increment =  4;
			3: increment =  8;
			4: increment = 16;
			5: increment = 32;
			6: increment = 64;
			7: increment = 128;
			default: increment = 0;
			endcase
			// verilator lint_on WIDTH
		end
	end

	always @(*)
	begin
		wrap_mask = 0;
		if (i_burst == 2'b10)
		begin
			if (i_len == 1)
				wrap_mask = (1<<(i_size+1));
			else if (i_len == 3)
				wrap_mask = (1<<(i_size+2));
			else if (i_len == 7)
				wrap_mask = (1<<(i_size+3));
			else if (i_len == 15)
				wrap_mask = (1<<(i_size+4));
			wrap_mask = wrap_mask - 1;
		end
	end

	always @(*)
	begin
		o_next_addr = i_last_addr + increment;
		if (i_burst != 2'b00)
		begin
			// Align any subsequent address
			// verilator lint_off SELRANGE
			if(i_size == 1)
				o_next_addr[0] = 0;
			else if ((i_size == 2)&&(AW>=2))
				o_next_addr[1:0] = 0;
			else if ((i_size == 3)&&(AW>=3))
				o_next_addr[2:0] = 0;
			else if ((i_size == 4)&&(AW>=4))
				o_next_addr[3:0] = 0;
			else if ((i_size == 5)&&(AW>=5))
				o_next_addr[4:0] = 0;
			else if ((i_size == 6)&&(AW>=6))
				o_next_addr[((AW>6)?5:AW-1):0] = 0;
			else if ((i_size == 7)&&(AW>=7))
				o_next_addr[((AW>7)?6:AW-1):0] = 0;
			// verilator lint_on  SELRANGE
		end
		if (i_burst == 2'b10)
		begin
			// WRAP!
			o_next_addr[AW-1:0] = (i_last_addr & ~wrap_mask)
					| (o_next_addr & wrap_mask);;
		end
	end

	always @(*)
	begin
		o_incr = 0;
		o_incr[((AW>7)?7:AW-1):0] = increment[((AW>7)?7:AW-1):0];
	end
endmodule

The problem with this code, however, is that it takes up way too much logic. You can see this by running:

yosys -p 'read_verilog axi_addr.v; synth_xilinx; stat'

If you look at the tail end of the output, you’ll find that this design requires the following logic elements:

   Number of cells:                392
     LUT1                            5
     LUT2                           52
     LUT3                           15
     LUT4                           27
     LUT5                            3
     LUT6                           89
     MUXCY                          74
     MUXF7                          43
     MUXF8                           4
     XORCY                          80

We can do better. Indeed, we can do much better.

Formal properties

So, while preparing this article, I spent some time trying to optimize this address calculator to both make it simpler, and to lower the amount of logic it uses. Along the way, I made some amazing optimizations and managed to get this core down to fewer than 67 elements. After beating my chest for way to long, I went back back to try to formally verify some of my cores that used this address calculator.

This one failed.

Now I was stuck. Which optimization failed, and how do I roll back just the right ones?

It was time to turn to formal verification.

Fig 7: A basic miter circuit, to prove two cores are equivalent

Formal verification in this context was a bit different from many of the other proofs I’ve done, primarily because what I wanted to do was to make sure that when I optimized my work, the result doesn’t change. Ideally, what I’d like to be able to do is something like comparing two cores, one optimized and one not, to prove that the optimized one still does the same thing as the reference design.

Yes, SymbiYosys can do this too.

The basic setup is shown in Fig. 7 on the right. First, I created a miter circuit–one with both the reference address calculation code as well as the optimized code I was testing. We’ll call these ref for reference address calculator and uut for the unit under test.

module	axi_addr_miter(i_last_addr, i_size, i_burst, i_len);
	parameter	AW = 32,
			DW = 32;
	input	wire	[AW-1:0]	i_last_addr;
	input	wire	[2:0]		i_size; // 1b, 2b, 4b, 8b, etc
	input	wire	[1:0]		i_burst; // fixed, incr, wrap, reserved
	input	wire	[7:0]		i_len;

	localparam	DSZ = $clog2(DW)-3;

Let me pause here for a moment, since DSZ is going to become a very important part of our logic in a moment.

If you ever look through a piece of Vivado generated AXI code like this one, you’ll see things like C_S_AXI_ADDR_WIDTH and C_S_AXI_DATA_WIDTH. You might read these as the address width and data width respectively associated with the slave interface. I personally try to avoid these parameter names if I can, just because I try to fit my logic on an 80-column display and these long names make my task more difficult. So, instead of using very expressive terms like these, I regularly use AW for address width and DW for data width. In this case, the two are the same. (Normally I’d using AW to reference a Wishbone address width, which as we’ve discussed above doesn’t include the subword address bits.)

DSZ is a value derived from the data width. It’s designed so that a DSZ of 0 is equivalent to a DW==8, whereas DSZ==1 is equivalent to a DW==16 and so forth. It has the same representation as i_size, but rather describes the maximum i_size the bus can hold. Two DSZ values that will be of high interest to me are DSZ==2, corresponding to a bus width of 32-bits, and DSZ==3, corresponding to a bus width of 64-bits.

You’ll see DSZ often in the discussion that follows.

Now we can return to the miter circuit, and show the instantiations of our two designs, both reference,

	////////////////////////////////////////////////////////////////////////
	//
	// The reference address calculator
	//
	wire	[7:0]		ref_incr;
	wire	[AW-1:0]	ref_next_addr;

	faxi_addr #(.AW(AW))
		ref(i_last_addr, i_size, i_burst, i_len,ref_incr,ref_next_addr);

and our uut, or unit under test.

	////////////////////////////////////////////////////////////////////////
	//
	// The optimized address calculator, our "unit-under-test"
	//
	wire	[AW-1:0]	uut_next_addr;

	axi_addr #(.AW(AW), .DW(DW))
		uut(i_last_addr, i_size, i_burst, i_len, uut_next_addr);

At this point, I could’ve created my two assertions and been done.

	always @(*)
		assert(uut_incr == ref_incr);

	always @(*)
		assert(uut_next_addr == ref_next_addr);

endmodule

The only problem with this quick approach is that the inputs need to be constrained. Many of the possible input values are disallowed by the AXI specification. What we really want to do is to assert that the two circuits produce identical results for valid AXI burst transactions values.

Let’s work through some possible restrictions.

We already known that bursts may be fixed, incrementing or wrapped. An i_burst value of 2'b11, the third possibility, is illegal as per the spec.

	always @(*)
		assume(i_burst != 2'b11);

In a similar fashion, the i_size input can only ever specify a transaction width equal to or smaller than the current bus size.

	always @(*)
		assume(i_size <= DSZ);

We also know that, for wrapping bursts, the length of the burst may only ever be 2, 4, 8, or 16 beats. Since AxLEN is one less than the total number of transfers, this turns into a restriction that i_len must be 1, 3, 7, or 15.

	always @(*)
	if (i_burst == 2'b10)
		assume((i_len == 1)
			||(i_len == 3)
			||(i_len == 7)
			||(i_len == 15));

The specification also says that wrapped bursts must be aligned. So lets check for alignment on the incoming address. Remember the master rule of formal verification: assume inputs, assert any local state and outputs.

	// Determine if the incoming address is aligned
	always @(*)
	begin
		aligned = 0;
		case(DSZ)
		3'b000: aligned = 1;
		3'b001: aligned = (i_last_addr[0] == 0);
		3'b010: aligned = (i_last_addr[1:0] == 0);
		3'b011: aligned = (i_last_addr[2:0] == 0);
		3'b100: aligned = (i_last_addr[3:0] == 0);
		3'b101: aligned = (i_last_addr[4:0] == 0);
		3'b110: aligned = (i_last_addr[5:0] == 0);
		3'b111: aligned = (i_last_addr[6:0] == 0);
		endcase
	end

Now that we can tell whether or not the address is aligned, we can assume that any wrapped addresses are properly aligned on entry.

	always @(*)
	if (i_burst == 2'b10)
		assume(aligned);

There’s one final bit of tweaking required. AXI addresses are not allowed to wrap over 4kB boundaries. That means that only the bottom twelve bits will ever be allowed to change.

This suggests a quick and valuable simplification, one that would force any address bits above 12 to be constant across a burst:

		if (AW > 12)
			o_next_addr[AW-1:12] = i_last_addr[AW-1:12];

The problem with doing this is that my reference address calculator doesn’t have this code.

The reason why the reference doesn’t make certain that the top addresses are constant is that I’d like to use an assertion in my new AXI4 formal property file to make certain the address pointer never wraps across a 4kB. In other words, my reference requires that addresses be able to cross 4kB boundaries so I can detect a problem, whereas my optimized code won’t let things wrap.

A basic simplifying assumption will keep us from testing the 4kB wrap.

	always @(*)
	if (AW > 12)
		assume(i_last_addr[AW-1:12] == ref_next_addr[AW-1:12]);

As we’ve written it, our address calculator is highly parameterizable across a wide variety of bus widths, DW. To capture this, we’ll define a series of tasks within our SymbiYosys file.

[tasks]
prf8
prf16
prf32
prf64
prf128
prf256
prf512
prf1024

Each task corresponds to a different bus width, which we can set using the chparam command within the Yosys script section of our SymbiYosys configuration file.

[script]
read -formal axi_addr.v
read -formal faxi_addr.v
read -formal axi_addr_miter.v
prf8:    chparam -set DW    8 axi_addr_miter
prf16:   chparam -set DW   16 axi_addr_miter
prf32:   chparam -set DW   32 axi_addr_miter
prf64:   chparam -set DW   64 axi_addr_miter
prf128:  chparam -set DW  128 axi_addr_miter
prf256:  chparam -set DW  256 axi_addr_miter
prf512:  chparam -set DW  512 axi_addr_miter
prf1024: chparam -set DW 1024 axi_addr_miter

prep -top axi_addr_miter

We’re now ready to test any optimizations. So let’s take a new look at the code above to se if we can optimize the basic address calculator to use fewer than 392 logic cells, while maintaining the same logic.

Making Optimizations

Back at ORCONF 2017 in Hebden Bridge, I had the wonderful opportunity to meet Jan Gray, an individual who had managed to place many 600?-LUT RISC-V soft-cores inside a Xilinx FPGA. I asked him about how he managed to get his soft-cores so small. Among other pieces of advice he gave, one which we’ll work with today is to, “Make every LUT within your design justify its existence.”

So let’s apply that attitude to our code today, to see what or how we might optimize it.

Let’s start with the increment.

	always @(*)
	begin
		increment = 0;
		if (i_burst != 0)
			increment = (1<<i_size);
	end

What if we knew that, in a valid transaction, i_size would be less than 3'h2 for a 32-bit bus? Why examine all three bits of i_size, when you know the top bit will always be zero? Similarly, if the bus size is already 8-bits, then you already know the increment can only ever be one.

The following code captures some of these assumptions. While it takes more lines and appears more complex, the logic it generates may well be smaller.

Remember, DSZ is equal to $clog2(DW)-3–essentially our current bus size but in the units of i_size. Further, as per the AXI spec, we know that i_size <= DSZ.

	always @(*)
	begin
		increment = 0;
		if (i_burst != 0)
		begin
			if (DSZ == 0)
				increment = 1;
			else if (DSZ == 1)
				increment = (i_size[0]) ? 2 : 1;
			else if (DSZ == 2)
				increment = (i_size[1]) ? 4 : ((i_size[0]) ? 2 : 1);
			else if (DSZ == 3)
				case(i_size[1:0])
				2'b00: increment = 1;
				2'b01: increment = 2;
				2'b10: increment = 4;
				2'b11: increment = 8;
				endcase
			else
				increment = (1<<i_size);
		end
	end

You’ll notice that if DSZ[2] is high then we use the same logic we had before.

Now let’s turn our attention to the wrap mask.

Since we’re only going to use this value if i_burst == 2'b10, and ignore it otherwise, we can set it for all cases.

	always @(*)
	begin
		// Start with the default, minimum mask
		wrap_mask = 1;

My initial approach to set the lower mask bits, wrap_mask[i_size:0] = -1, didn’t work with all of the Verilog parsers I use, so I decided to try using a for loop.

I normally discourage the use of for loops within Verilog. Most beginning designers who use them treat them like they would a piece of C/C++ code, when in actuality the synthesis tool will unroll every loop creating more logic with each iteration. When used poorly, loops end up creating much more hardware logic than a good design would in practice.

In this case, I wanted to create a different piece of logic to describe each bit. Therefore, I tried using a for loop to define each of the bits in the wrap_mask.

		for(iB=0; iB<8; iB=iB+1)
		if ({1'b0,i_size} > iB[3:0])
			wrap_mask[iB+1] = 1;

After staring at this for a while, I realized there was no optimizations within it for the maximum possible i_size value. Since DSZ is a parameter, and since the loop index value, iB, is essentially a constant once the loop logic is implemented, I could add a check into the for loop.

		for(iB=0; iB<8; iB=iB+1)
		if ((DSZ > iB) && ({1'b0,i_size} > iB[3:0]))
			wrap_mask[iB+1] = 1;

This check, DSZ > iB, would allow the synthesis tool to remove any of the unused possibilities, leaving wrap_mask unchanged in this case.

We still need to add in the next three bits to the mask, those dependent upon the length of the burst. When I did this before, I used a series of ifs. If you look a bit closer, though, you’ll notice that i_len[3] will only be high if all of the other i_len[2:0] bits are already ones. Similarly, i_len[2] will only be high if i_len[1:0] is high, and so on down. That means we can calculate the next several bits of the wrap_mask by simply shifting i_len[3:1] up to the right location.

		wrap_mask = wrap_mask | ({29'h0,i_len[3:1]} << (i_size+1));

Why not check all of i_len[3:0]? Because we already know that i_len[0] will be true for all possible wrap bursts.

I also know that the addresses within a burst can never cross a 4kB boundary. These upper address bits must be constant.

		if (AW > 12)
			wrap_mask[AW-1:12] = 0;

Then I got to thinking … I already know that the initial address of any burst with wrap addressing must be aligned from start to finish, so it really doesn’t matter what the lower bits of this value are set to.

This allowed me to back up and remove the for loop entirely. I then replaced the shift line with,

		wrap_mask = wrap_mask | ({{(AW-4){1'b0}},i_len[3:0]} << i_size);

Notice throughout these optimizations that we’re primarily using logic based upon constant values. This helps to keep us from increasing the amount of necessary logic, rather than decreasing it.

The next block actually calculates the next address, as we’ve discussed above. The first step, shown below, has really already been optimized.

	always @(*)
	begin
		o_next_addr = i_last_addr + increment;

This brings us to our alignment code, where we force the next address to be word aligned based upon both DSZ and i_size.

My first attempt to align the next address was to switch our cascaded if statement to a case statement.

		if (i_burst != 2'b00)
		begin
			// Align any subsequent address
			case(i_size)
			3'b001:  o_next_addr[  0] = 0;
			3'b010:  o_next_addr[1:0] = 0;
			3'b011:  o_next_addr[2:0] = 0;
			3'b100:  o_next_addr[3:0] = 0;
			3'b101:  o_next_addr[4:0] = 0;
			3'b110:  o_next_addr[5:0] = 0;
			3'b111:  o_next_addr[6:0] = 0;
			default: o_next_addr = o_next_addr;
			endcase
		end

This helped, and performed okay, but I thought I might be able to do better.

Remembering that i_size must always be i_size <= DSZ, I optimized this for each of the possible ranges of bits in i_size. For example, if DSZ==0, the bus is then only DW=8-bits wide, and nothing needs to be done to align an address.

		if (i_burst != 2'b00)
		begin
			// Align any subsequent address
			if (DSZ < 1)
			begin end

Similarly, if DSZ is less than two, then our bus size must be either DW=8 bits or DW=16 bits. In that case, alignment requires only checking i_size[0] and clearing o_next_addr[0] or leaving it alone.

			else if (DSZ < 2)
			begin
				if (i_size[0])
					o_next_addr[0] = 0;

While we’ll get to a point of diminishing returns soon, we can still apply this optimization to both DSZ < 4 and then just leave things as they were otherwise.

			end else if (DSZ < 4)
			begin
				case(i_size[1:0])
				// ...
				endcase
			end else begin
				case(i_size)
				3'b001:  o_next_addr[  0] = 0;
				3'b010:  o_next_addr[1:0] = 0;
				3'b011:  o_next_addr[2:0] = 0;
				3'b100:  o_next_addr[3:0] = 0;
				3'b101:  o_next_addr[4:0] = 0;
				3'b110:  o_next_addr[5:0] = 0;
				3'b111:  o_next_addr[6:0] = 0;
				default: o_next_addr = o_next_addr;
				endcase
			end
		end

The final couple of lines then are nearly unchanged. The only difference is that instead of checking for whether or not i_burst == 2'b10, we can simplify this into just checking for i_burst[1], since we know that i_burst==2'b11 is already an invalid input.

		if (i_burst[1])
		begin
			// WRAP!
			o_next_addr[AW-1:0] = (i_last_addr & ~wrap_mask)
					| (o_next_addr & wrap_mask);;
		end
	end
endmodule

With each change, we can now run both Yosys, to see if the change reduced the amount of logic, as well as SymbiYosys to see if our change broke our design or not.

What about the logic count? After all of this work tweaking and adjusting things, how did we do?

   Number of cells:                 49
     LUT2                            1
     LUT3                           11
     LUT4                            5
     LUT5                            1
     LUT6                            7
     MUXCY                          11
     MUXF7                           1
     XORCY                          12

Given that we started with 392 cells, 49 is a tremendous improvement.

But how many LUTs does this logic consume? A 7-series Xilinx FPGA has 6 input LUTs, not LUT1s, LUT2s, LUT3s, and so on. To make scoring logic usage a touch more difficult, two LUT5s can be combined into a LUT6–provided that they share the same five bit inputs. Therefore, I’ll often summarize a logic count by adding the number of LUT6s to the number of LUT5s, to the maximum of the number of LUT4s and LUT1s, to the maximum of the number of LUT3s and LUT2s. While this most certainly overestimates the score, it does manage to capture some of the effects of packing into our metric.

Here’s the final logic score, therefore: before optimization we needed 171 Xilinx 6-LUTs. After optimization, our new design can perform an AXI addresses calculation using only 24 LUTs on a bus having 32-data bits and 32-address bits. That’s a reduction of over 7x, so it was definitely worth our time.

Conclusion

AXI is certainly one of the most complicated buses I’ve ever worked with. It’s so complicated that I’m only discussing AXI addressing today. However, in order to properly respond to a bus request from an AXI master, a slave core will need to properly generate the addresses of any transaction it is responding to.

One fascinating discovery I made along the way is that Xilinx’s example AXI slave doesn’t even implement all of this logic. Instead, it only implements a 32-bit bus width and i_size == 3'b010 addressing. You’d think for the cost of only 14 LUTs, the cost of implementing all of the i_size options, they might have included the rest.

Of course, since AXI is so complicated, there’s still plenty more to discuss. If the Lord wills, I’d like to return and present the bugs I’ve found in Xilinx’s example AXI slave core, the formal property file’s I’ve managed to create in order to formally verify a (full) AXI4 slave core, as well as some of the other cores that became easy to build once I had the property files to work with.

These topics, however, will all need to wait for another day.