Connecting lots of slaves to a bus without using a lot of logic

Fig 1. A Minimal Bus Implementation

I’m not quite sure why, but most of the time when I examine a design on-line that someone has posted to a forum, there are very few bus components. There’s typically a CPU (Microblaze, Nios2, or ARM), some kind of SDRAM memory, perhaps a flash device, and then one or two other peripherals. Perhaps these would be an SD-card controller and an ethernet controller, as shown in Fig. 1.

I’ve never quite understood this. Many of my own designs will have those same peripherals, but then perhaps another 25 more. Why not create more peripherals than just a few?

Fig 2. Adding more peripherals

Perhaps I’m adding in the kitchen sink at this point, but why not? If you can, and if you have the peripheral and the space, why not add it into your design? Maybe I’m just becoming a logic hoarder–I’ll add logic from every peripheral I’ve ever worked on into a design and then more. I’ll then even add lots of wishbone scopes just to debug the whole.

Large numbers of items on the bus have yet to become a crippling problem for anything I’ve wanted to do.

So why don’t I see block designs with even half as many components when browsing Xilinx’s forums?

My guess is that it costs most folks too much logic.

To understand the issue, let’s just say that we want to connect four masters (CPU instructions, CPU data, DMA, and debugging bus) to a bus with 32 slave peripherals on it. Just the crossbar interconnect alone, before adding any peripherals, would require 5,571 LUTs for a WB interconnect, and (gasp!) 10,341 LUTs for an AXI interconnect! It doesn’t help that the size of the crossbar goes up at a rate faster than the product of the number of masters and the number of slaves. Worse, these numbers say nothing of the difficulty of getting such a massive design to pass timing on every path within such a behemoth of a crossbar interconnect.

Perhaps this is why I’ve never seen more than a couple of slaves in any particular design: the interconnect alone might take nearly half the part, if not more! (Depending upon your FPGA size, of course.)

This of course leads to the interesting question, how is it that I haven’t suffered from this problem when adding 20+ peripherals to a design?

The Two Simple Slaves

AutoFPGA simplifies this complex bus interconnect logic via the creation of two simpler Wishbone slave protocols. I’ll call these sub-protocols, since for each of the simpler protocols the slave can still be Wishbone compliant, it just has a couple of extra features. The first sub-protocol is appropriate for a slave that consists of just a single register. This register may always be read immediately. The second peripheral class takes a single clock cycle to return the data of interest. AutoFPGA uses a slave type tag to describe these two sub-protocol classes. The first class would have a @SLAVE.TYPE=SINGLE tag, and the second would be @SLAVE.TYPE=DOUBLE. AutoFPGA would then use this information to simplify how such a slave might connect to the automatically generated bus structure.

Let’s take a look at each of these simplified protocol classes from a Wishbone standpoint, and then see how we might use this in an AXI-lite or even from an AXI (full) standpoint.

In Wishbone, the SINGLE slave type must have only a single register assigned to it. It must never stall the bus, and the register must always be available to be read. It’s as though all the internal logic were summarized as,

always @(*)
begin
	o_wb_ack   = i_wb_stb;
	o_wb_stall = 1'b0;
	o_wb_data  = internal_register;
	o_wb_err   = 1'b0;
end

always @(posedge i_clk)
if (i_wb_stb && i_wb_we)
begin
	// Clocked logic: use non-blocking assignments
	if (i_wb_sel[3])
		internal_register[31:24] <= i_wb_data[31:24];
	if (i_wb_sel[2])
		internal_register[23:16] <= i_wb_data[23:16];
	if (i_wb_sel[1])
		internal_register[15: 8] <= i_wb_data[15: 8];
	if (i_wb_sel[0])
		internal_register[ 7: 0] <= i_wb_data[ 7: 0];
end

It’s really simple. Now, what if the interconnect could just ignore o_wb_stall (always zero), and o_wb_ack (always i_wb_stb), set the STB line (i_wb_stb) dependent upon which slave it was talking to, and then use a big case statement based upon the current address to determine the return value?
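As a rough illustration of that idea, here is a minimal sketch of what such a grouping might look like for four SINGLE slaves. The module name, port names, and the choice to register the result (so the group acknowledges one clock after the strobe) are all assumptions made for this example, not a copy of what AutoFPGA generates. CYC handling is left out for brevity.

```verilog
// Hypothetical sketch: module and port names are assumptions, not AutoFPGA output
module wbsinglegroup (
	input	wire		i_clk,
	input	wire		i_wb_stb,
	input	wire	[1:0]	i_wb_addr,	// Word address: one word per slave
	// Read data from four SINGLE slaves, valid on every clock
	input	wire	[31:0]	i_s0_data, i_s1_data, i_s2_data, i_s3_data,
	output	wire	[3:0]	o_slave_stb,	// One strobe per slave
	output	reg		o_wb_ack,
	output	wire		o_wb_stall,
	output	reg	[31:0]	o_wb_data
);
	// i_wb_we, i_wb_data, and i_wb_sel would be wired to every slave unchanged

	// Only the addressed slave sees the strobe
	assign	o_slave_stb[0] = i_wb_stb && (i_wb_addr == 2'd0);
	assign	o_slave_stb[1] = i_wb_stb && (i_wb_addr == 2'd1);
	assign	o_slave_stb[2] = i_wb_stb && (i_wb_addr == 2'd2);
	assign	o_slave_stb[3] = i_wb_stb && (i_wb_addr == 2'd3);

	// The group never stalls, since none of its slaves can
	assign	o_wb_stall = 1'b0;

	// No slave ACK is examined: every request is acknowledged one clock later
	initial	o_wb_ack = 1'b0;
	always @(posedge i_clk)
		o_wb_ack <= i_wb_stb;

	// The "big case statement": pick the return value by address alone
	always @(posedge i_clk)
	case(i_wb_addr)
	2'd0: o_wb_data <= i_s0_data;
	2'd1: o_wb_data <= i_s1_data;
	2'd2: o_wb_data <= i_s2_data;
	2'd3: o_wb_data <= i_s3_data;
	endcase
endmodule
```

From the crossbar's point of view, the whole group then looks like one ordinary Wishbone slave, so adding a fifth SINGLE peripheral costs one more strobe decode and one more case entry rather than another crossbar port.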

The DOUBLE type is very similar. In this case, though, the ACK line takes another cycle to return.

// Delay the acknowledgement by one cycle, so we can accomplish our logic
initial o_wb_ack = 1'b0;
always @(posedge i_clk)
	o_wb_ack   <= i_wb_stb;

This extra cycle makes it possible for the slave to select from among several internal registers before returning the result.

always @(posedge i_clk)
case(i_wb_addr)
0: o_wb_data <= internal_register[0][31:0];
1: o_wb_data <= internal_register[1][31:0];
2: o_wb_data <= internal_register[2][31:0];
3: o_wb_data <= internal_register[3][31:0];
// ...
endcase

This would again simplify the interconnect, since it would no longer need to wait for !STALL, nor would it need to check ACK to know if the resulting data was valid.
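A sketch of the matching DOUBLE grouping might look like the following. Again, the names and the address split (upper two bits select the slave, lower two bits select the register within it) are assumptions for illustration only. The one real difference from a SINGLE grouping is that the group has to remember which slave was addressed while that slave spends its clock selecting a register.

```verilog
// Hypothetical sketch: module and port names are assumptions, not AutoFPGA output
module wbdoublegroup (
	input	wire		i_clk,
	input	wire		i_wb_stb,
	input	wire	[3:0]	i_wb_addr,	// [3:2]: slave, [1:0]: register
	// Read data from four DOUBLE slaves, valid one clock after their strobe
	input	wire	[31:0]	i_s0_data, i_s1_data, i_s2_data, i_s3_data,
	output	wire	[3:0]	o_slave_stb,
	output	reg		o_wb_ack,
	output	wire		o_wb_stall,
	output	reg	[31:0]	o_wb_data
);
	reg		pre_ack;
	reg	[1:0]	pre_sel;

	assign	o_slave_stb[0] = i_wb_stb && (i_wb_addr[3:2] == 2'd0);
	assign	o_slave_stb[1] = i_wb_stb && (i_wb_addr[3:2] == 2'd1);
	assign	o_slave_stb[2] = i_wb_stb && (i_wb_addr[3:2] == 2'd2);
	assign	o_slave_stb[3] = i_wb_stb && (i_wb_addr[3:2] == 2'd3);

	assign	o_wb_stall = 1'b0;

	// Stage one: while the slave selects its register, remember who was asked
	initial	pre_ack = 1'b0;
	always @(posedge i_clk)
	begin
		pre_ack <= i_wb_stb;
		pre_sel <= i_wb_addr[3:2];
	end

	// Stage two: the slave's data is now valid, so select and return it
	initial	o_wb_ack = 1'b0;
	always @(posedge i_clk)
		o_wb_ack <= pre_ack;

	always @(posedge i_clk)
	case(pre_sel)
	2'd0: o_wb_data <= i_s0_data;
	2'd1: o_wb_data <= i_s1_data;
	2'd2: o_wb_data <= i_s2_data;
	2'd3: o_wb_data <= i_s3_data;
	endcase
endmodule
```

Because every slave in the group has the same fixed latency, the acknowledgement can be generated from the strobe alone, and requests can still be accepted on every clock.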

Because the logic for both SINGLE and DOUBLE slaves is a subset of the full Wishbone protocol, the slave can still be a valid Wishbone slave in its own right, while also allowing for the interconnect to optimize its bus access. This means that the slave should still work in a non-optimized Wishbone context as well as the optimized one, so you lose nothing there.

I’ve now used this approach within AutoFPGA for some time with great success–but only for Wishbone peripherals so far, and only with a very simplified interconnect structure.

The reason this has come to light is that I’m now in the process of upgrading AutoFPGA to use a full crossbar interconnect. As part of this upgrade, I came across this little optimization and wondered if I should keep it or throw it out. I almost threw it out, but then got to thinking some more about it.

To see the impact, consider the design shown above in Fig. 2. Had I collected SINGLE and DOUBLE slaves together for interconnect purposes, the design might’ve been simplified to the one in Fig. 3 below.

Fig 3. Creating slave groups by type, SINGLE and DOUBLE

My current thought is, can or should this be done with AXI peripherals, and if so how?

Simplifying AXI

So, if we were to totally simplify AXI to create simpler slaves, and to gather any common bus logic between them together, how would we do it?

Here’s my current working proposal:

The interconnect guarantees that the core receives no back-pressure, leaving BREADY and RREADY both high.

This may require one (or more) skid buffers, or perhaps even small FIFOs within the interconnect, but this should still be quite doable.
The slave can then guarantee that it will keep all of the slave generated *READY signals high as well: AWREADY, WREADY, and ARREADY.

Sorry, but this property will keep you from using many of Xilinx’s peripherals, since they tend to idle with their *READY signals low.
The interconnect guarantees that AWVALID == WVALID. This will save the slave from the hassle of needing to implement incoming skid buffers just to synchronize these two signals.

Even better, if the slave logic is done right, the synthesis tool should be able to remove the skid buffer logic from an otherwise fully AXI compliant core.
The slave can then guarantee that BVALID == $past(AWVALID) and RVALID == $past(ARVALID) for SINGLE peripherals. For DOUBLE peripherals, the slave would guarantee that BVALID == $past(AWVALID,2) and RVALID == $past(ARVALID,2).

The neat thing about all of this is that these rules would work for AXI-lite as well as for AXI.

With just a little more work, we could guarantee the ability to connect an AXI-lite slave to a simplified AXI interconnect without the need for any further simplification logic.

To do this using a fully AXI capable slave, we’d need a couple other bus simplifying rules as well.

The interconnect must guarantee that AxLEN == 0 any time AxVALID is true for both channels.

This means that the interconnect will need to break apart any bursts into individual beats before they ever reach the slave.

This doesn’t mean that the interconnect will no longer support bursting at a rate of one beat of the transfer per clock cycle, but rather that each individual beat will be given its own address from the interconnect.
The interconnect would then also guarantee that WLAST == 1 any time WVALID is true.

This just follows from guaranteeing that AxLEN == 0.
The interconnect guarantees that AxID = 0, and then ignores xID on the return channel.

Yes, I understand that there are reasons for using the ID field–just not in this simplified version of AXI.

Also, having a known response time from the slave makes the conversion from AXI to AXI-lite a lot easier–without requiring any loss in burst speed. To see how difficult the conversion can be, consider this full speed bridge and notice the challenge of matching up the return ID with the requested burst, as well as generating BVALID or even RVALID & RLAST signals with the end of the burst. It wasn’t easy to do, certainly not while maintaining a high throughput, and it was even harder to verify.
The interconnect guarantees that AxSIZE = $clog2(C_AXI_DATA_WIDTH)-3, and then leaves it constant.

This also follows from setting AxLEN == 0.
The slave ignores AxBURST, AxCACHE, AxPROT, AxQOS, and so on. (The master guarantees these values are zero, in case the slave doesn’t quite want to ignore them.)
The slave might still support AxLOCK if desired, or ignore it if not. I haven’t decided if that would be useful or not.

Finally, if the interconnect does its job right, you wouldn’t lose any burst support, but still be able to retire beats at a rate of one per clock.
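One way to pin these rules down is to write them as a formal contract, seen from the slave's side. The following is only a sketch for the SINGLE case: the module name is made up for this example, it assumes a tool that accepts immediate assume/assert statements and $past (such as SymbiYosys), and it leaves out the AxID, AxSIZE, and other sideband rules for brevity.

```verilog
// Hypothetical property module: the name and port list are assumptions
module faxi_simple_single (
	input	wire		S_AXI_ACLK, S_AXI_ARESETN,
	input	wire		S_AXI_AWVALID, S_AXI_AWREADY,
	input	wire	[7:0]	S_AXI_AWLEN,
	input	wire		S_AXI_WVALID, S_AXI_WREADY, S_AXI_WLAST,
	input	wire		S_AXI_BVALID, S_AXI_BREADY,
	input	wire		S_AXI_ARVALID, S_AXI_ARREADY,
	input	wire	[7:0]	S_AXI_ARLEN,
	input	wire		S_AXI_RVALID, S_AXI_RREADY
);
`ifdef	FORMAL
	reg	f_past_valid;
	initial	f_past_valid = 1'b0;
	always @(posedge S_AXI_ACLK)
		f_past_valid <= 1'b1;

	// What the interconnect promises the slave
	always @(*)
	begin
		assume(S_AXI_BREADY && S_AXI_RREADY);	// No back-pressure
		assume(S_AXI_AWVALID == S_AXI_WVALID);	// Address and data together
		if (S_AXI_AWVALID)
			assume(S_AXI_AWLEN == 0);	// Bursts already broken apart
		if (S_AXI_ARVALID)
			assume(S_AXI_ARLEN == 0);
		if (S_AXI_WVALID)
			assume(S_AXI_WLAST);
	end

	// What the slave promises in return: it never stalls ...
	always @(*)
		assert(S_AXI_AWREADY && S_AXI_WREADY && S_AXI_ARREADY);

	// ... and it responds exactly one clock after each request
	always @(posedge S_AXI_ACLK)
	if (f_past_valid && $past(S_AXI_ARESETN) && S_AXI_ARESETN)
	begin
		assert(S_AXI_BVALID == $past(S_AXI_AWVALID));
		assert(S_AXI_RVALID == $past(S_AXI_ARVALID));
	end
`endif
endmodule
```

A DOUBLE version would replace the last two assertions with their $past(...,2) equivalents, and would need to wait for two valid past clocks before checking them.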

Indeed, the AXI slave logic might easily be simplified to something like the following for the SINGLE type peripheral:

//
// Simplified AXI (SINGLE) write logic
//
always @(posedge S_AXI_ACLK)
if (S_AXI_AWVALID)
begin
	// AWVALID == WVALID and both READYs are high, so this is a write
	if (S_AXI_WSTRB[3])
		internal_register[31:24] <= S_AXI_WDATA[31:24];
	if (S_AXI_WSTRB[2])
		internal_register[23:16] <= S_AXI_WDATA[23:16];
	if (S_AXI_WSTRB[1])
		internal_register[15: 8] <= S_AXI_WDATA[15: 8];
	if (S_AXI_WSTRB[0])
		internal_register[ 7: 0] <= S_AXI_WDATA[ 7: 0];
end

always @(*)
	S_AXI_BRESP = 2'b00;

//
// Simplified AXI (SINGLE) read logic
//
always @(*)
	S_AXI_RDATA = internal_register;

always @(*)
	S_AXI_RRESP = 2'b00;
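The handshaking signals are not shown above. Under the rules of this proposal they might reduce to something as small as the following sketch, where the module name is hypothetical and only the handshake ports are included. BREADY and RREADY don't appear at all, since the interconnect has promised to hold them high.

```verilog
// Hypothetical module: handshake signals only, for a SINGLE peripheral
module axisingle_handshake (
	input	wire	S_AXI_ACLK,
	input	wire	S_AXI_ARESETN,
	input	wire	S_AXI_AWVALID,	// Interconnect guarantees == S_AXI_WVALID
	input	wire	S_AXI_ARVALID,
	output	wire	S_AXI_AWREADY,
	output	wire	S_AXI_WREADY,
	output	wire	S_AXI_ARREADY,
	output	reg	S_AXI_BVALID,
	output	reg	S_AXI_RVALID
);
	// The slave never stalls any request channel
	assign	S_AXI_AWREADY = 1'b1;
	assign	S_AXI_WREADY  = 1'b1;
	assign	S_AXI_ARREADY = 1'b1;

	// BVALID == $past(AWVALID), outside of reset
	initial	S_AXI_BVALID = 1'b0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		S_AXI_BVALID <= 1'b0;
	else
		S_AXI_BVALID <= S_AXI_AWVALID;

	// RVALID == $past(ARVALID), outside of reset
	initial	S_AXI_RVALID = 1'b0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		S_AXI_RVALID <= 1'b0;
	else
		S_AXI_RVALID <= S_AXI_ARVALID;
endmodule
```

Since these two response flops would be identical in every SINGLE peripheral, they are exactly the kind of logic that could be moved out of the slaves and into the one shared module that fronts the group.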

The DOUBLE type peripheral logic would also be similarly simplified.

//
// Simplified AXI (DOUBLE) write logic
//
assign	wreg = S_AXI_AWADDR[C_AXI_ADDR_WIDTH-1:$clog2(C_AXI_DATA_WIDTH)-3];

always @(posedge S_AXI_ACLK)
if (S_AXI_AWVALID)
begin
	// wreg selects which internal register receives the write
	if (S_AXI_WSTRB[3])
		internal_register[wreg][31:24] <= S_AXI_WDATA[31:24];
	if (S_AXI_WSTRB[2])
		internal_register[wreg][23:16] <= S_AXI_WDATA[23:16];
	if (S_AXI_WSTRB[1])
		internal_register[wreg][15: 8] <= S_AXI_WDATA[15: 8];
	if (S_AXI_WSTRB[0])
		internal_register[wreg][ 7: 0] <= S_AXI_WDATA[ 7: 0];
end

//
// Simplified AXI (DOUBLE) read logic
//
assign	rreg = S_AXI_ARADDR[C_AXI_ADDR_WIDTH-1:$clog2(C_AXI_DATA_WIDTH)-3];

always @(posedge S_AXI_ACLK)
case(rreg)
0: S_AXI_RDATA <= internal_register[0][31:0];
1: S_AXI_RDATA <= internal_register[1][31:0];
2: S_AXI_RDATA <= internal_register[2][31:0];
3: S_AXI_RDATA <= internal_register[3][31:0];
// ...
endcase

Yes, this eliminates a lot of the logic necessary to deal with the AXI protocol. All of that ugly logic would be aggregated into one axisingle or one axidouble module that would then handle all of the full AXI protocol interaction in order to create this simplified protocol. You can see an example of such an AXI-lite axilsingle peripheral on github, should you be interested in how this might work.

This approach allows the bus interconnect to simplify its logic drastically. Instead of a 10k LUT crossbar, it should now be possible to connect the design together using a 3.4k LUT crossbar. Such a crossbar might support 4 masters and 8 slaves, where one of those slaves controls the SINGLE peripherals and one controls the DOUBLE peripherals. The logic in the slaves might even be as low as 600 LUTs (based upon a SINGLE peripheral drawn from an AXI-lite example).

Yes, there would be some distinct differences in this approach. For example, only one master could ever command a read (or write) port of a SINGLE or DOUBLE peripheral at a time–rather than allowing a separate master to connect to every simplified peripheral. This isn’t really that much of a problem, since if you anticipated contention, you might split the SINGLE (or DOUBLE) peripherals into two groups–and so avoid the contention. You might also place any high demand peripherals into their own peripheral slot in the interconnect and just ignore the potential optimizations–indeed, how you group slaves into SINGLE or DOUBLE peripheral locations is completely application dependent.

This is also a very different approach from the more common approach of using an AHB slave as a “lite” slave. First, AHB has no support for simultaneous reads and writes. That would force the read and write channels to be synchronous prior to handling an AHB slave. Second, because AHB permits arbitrary stall amounts, the master/interconnect can’t simplify the returns among a group of peripherals, but instead is required to check for the return from each individual peripheral. Similarly, while it is possible to generate an AHB interconnect, and so group peripheral returns, such groups of multiple slaves under the same AHB port would just slow everything down–since AHB is primarily a combinatorial logic bus.

Unlike that AHB approach, this approach maintains the high clock speed and multiple inflight transactions that AXI is known for already. It also maintains the separate read and write channels, as well as full/burst speed–unlike many of the AXI-lite implementations I’ve seen.

Conclusions

As you may remember, I’m in the process of upgrading AutoFPGA so that it can handle multiple bus types. My current upgrade plans include both full WB support as well as AXI-lite support, although once I get that far AXI shouldn’t be much harder. Indeed, most of the AXI work has already been done in either the AXI-lite bus logic generator, or the AXI crossbar.

Of course, my current work to this end is still quite preliminary, but this at least outlines how I intend to get the bus to be able to handle large numbers of slaves without breaking the piggy bank to get there. My goal is also to make the generated logic usable for all, rather than encumbered by copyrights, so that I could then use it on a vendor-independent basis for an intermediate digital design tutorial.

If the Lord wills, I’d love to have the opportunity to come back and blog about the success of this work. We’ll see what future the Lord brings.