I’m not quite sure why, but most of the time when I examine a design on-line that someone has posted to a forum, there are very few bus components. There’s typically a CPU (Microblaze, Nios2, or ARM), some kind of SDRAM memory, perhaps a flash device, and then one or two other peripherals. Perhaps these would be an SD-card controller and an ethernet controller, as shown in Fig. 1.
I’ve never quite understood this. Many of my own designs will have those same peripherals, but then perhaps another 25 more. Why not create more peripherals than just a few?
Perhaps I’m adding in the kitchen sink at this point, but why not? If you can, and if you have the peripheral and the space, why not add it into your design? Maybe I’m just becoming a logic hoarder–I’ll add logic from every peripheral I’ve ever worked on into a design and then more. I’ll then even add lots of wishbone scopes just to debug the whole.
Having large numbers of items on the bus has yet to become a crippling problem for anything I’ve wanted to do.
So why don’t I see block designs with even half as many components when browsing Xilinx’s forums?
My guess is that it costs most folks too much logic.
To understand the issue, let’s just say that we want to connect four masters (CPU instructions, CPU data, DMA, and debugging bus) to a bus with 32 slave peripherals on it. Just the crossbar interconnect alone, before adding any peripherals, would require 5,571 LUTs for a WB interconnect, and (gasp!) 10,341 LUTs for an AXI interconnect! It doesn’t help that the size of the crossbar goes up at a rate faster than the product of the number of masters times the number of slaves. Worse, these numbers say nothing of the difficulty associated with getting such a massive design to pass timing for all the paths within such a behemoth of a crossbar interconnect.
Perhaps this is why I’ve never seen more than a couple of slaves in any particular design: the interconnect alone might take nearly half the part, if not more! (Depending upon your FPGA size, of course.)
This of course leads to the interesting question, how is it that I haven’t suffered from this problem when adding 20+ peripherals to a design?
The Two Simple Slaves
AutoFPGA simplifies this complex bus interconnect logic via the creation of two simpler bus protocols. I’ll call these sub-protocols, since for each of the simpler protocols the slave can still be Wishbone compliant; it just has a couple of extra features.
The first sub-protocol is appropriate for a peripheral that consists of just a single register. This register may always be read immediately. The second peripheral class takes a single clock cycle to return the data of interest.
AutoFPGA uses a slave type tag to describe these two sub-protocol classes. The first class would have a @SLAVE.TYPE=SINGLE tag, and the second would be @SLAVE.TYPE=DOUBLE. AutoFPGA can then use this information to simplify how such a slave might connect to the automatically generated bus interconnect.
A SINGLE type slave must have only a single register assigned to it. It must never stall the bus, and the register must always be available to be read. It’s as though all the internal logic were summarized as: the stall line is held at zero, and the data output is simply the register’s value.
It’s really simple. Now, what if the interconnect could just ignore o_wb_stall (always zero) and o_wb_ack (which simply follows i_wb_stb), set the STB line (i_wb_stb) dependent upon which slave it was talking to, and then use a big case statement based upon the current address to determine the return value?
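To make the idea concrete, here is a minimal behavioural sketch in Python (not HDL, and not AutoFPGA’s generated logic). The class name SingleGroup, the step method, and the three example registers are all hypothetical. The model also assumes the interconnect produces the acknowledgment itself, directly from the strobe, without ever consulting a slave.

```python
class SingleGroup:
    """Cycle-level model of one interconnect port serving many SINGLE slaves.

    Each slave exposes exactly one always-readable register, so this port
    never examines a per-slave stall or acknowledgment line.
    """

    def __init__(self, registers):
        # registers: dict mapping a word address to a zero-argument callable
        # returning that slave's register value (hypothetical layout).
        self.registers = registers
        self.ack = False   # acknowledgment presented to the master
        self.data = 0      # read data presented to the master

    def step(self, stb, addr):
        """Advance one clock; return (ack, data) as seen after the edge."""
        # Assumption: ACK is derived from the strobe alone.
        self.ack = bool(stb)
        if stb:
            # The "big case statement": pick the return value by address.
            reader = self.registers.get(addr)
            self.data = reader() if reader is not None else 0
        return self.ack, self.data


# Three hypothetical single-register peripherals.
group = SingleGroup({
    0x0: lambda: 0x11111111,
    0x1: lambda: 0x0000002A,
    0x2: lambda: 0x000000FF,
})

requests = [(True, 0x1), (False, 0x0), (True, 0x2), (True, 0x0)]
trace = [group.step(stb, addr) for stb, addr in requests]
```

Note that the only per-slave cost left in this model is one entry in the address lookup. Everything else is shared across the whole group.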
The DOUBLE type is very similar. In this case, though, the ACK line takes another cycle to return.
This extra cycle makes it possible for the slave to select from among several possible internal registers you might wish to read before returning the result.
This would again simplify the interconnect, since it would no longer need to wait for !STALL, nor would it need ACK to know if the resulting data was valid.
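The extra cycle of a DOUBLE slave can be modelled in the same behavioural style. This is only a sketch under stated assumptions: the names DoubleGroup and step are hypothetical, every slave is assumed to hold exactly 2**addr_bits registers, and the master is assumed to issue only valid addresses.

```python
class DoubleGroup:
    """Cycle-level model of one interconnect port serving DOUBLE slaves.

    First clock (inside each slave): select one of its internal registers.
    Second clock (inside the interconnect): select which slave to return.
    """

    def __init__(self, slaves, addr_bits):
        # slaves: list of register lists, each 2**addr_bits entries long.
        self.slaves = slaves
        self.addr_bits = addr_bits
        # Captured request: (strobe, slave index, each slave's chosen value)
        self.pending = (False, 0, [0] * len(slaves))
        self.ack = False
        self.data = 0

    def step(self, stb, addr):
        """Advance one clock; return (ack, data) as seen after the edge."""
        # Second stage: ACK is just the strobe, delayed. No slave ACK is read.
        prev_stb, prev_sel, prev_values = self.pending
        self.ack = prev_stb
        if prev_stb:
            self.data = prev_values[prev_sel]
        # First stage: every slave decodes the low address bits on its own.
        low = addr & ((1 << self.addr_bits) - 1)
        values = [regs[low] for regs in self.slaves]
        self.pending = (bool(stb), addr >> self.addr_bits, values)
        return self.ack, self.data


# Two hypothetical slaves with four registers each.
bus = DoubleGroup([[10, 11, 12, 13], [20, 21, 22, 23]], addr_bits=2)
requests = [(True, 0b001), (True, 0b110), (False, 0), (False, 0)]
trace = [bus.step(stb, addr) for stb, addr in requests]
```

The two back-to-back requests still complete on consecutive clocks, so the added latency costs nothing in throughput.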
Because the logic for both slave types is a subset of the Wishbone protocol, each slave can still be a valid Wishbone slave in its own right, while also allowing the interconnect to optimize its bus access. This means that the slave should still work in a non-optimized context as well as the optimized one, so you lose nothing there.
The reason this has come to light is that I’m now in the process of upgrading AutoFPGA to use a full crossbar interconnect. As part of this upgrade, I came across this little optimization and wondered if I should keep it or throw it out. I almost threw it out, but then got to thinking some more about it.
To see the impact, consider the design shown above in Fig. 2. Had I collected the SINGLE and DOUBLE slaves together for interconnect purposes, the design might’ve been simplified to the one in Fig. 3 below.
My current thought is, can or should this be done with AXI peripherals, and if so how?
So, if we were to totally simplify AXI to create simpler slaves, and to gather together and so eliminate any common bus logic between them, how would we do it?
Here’s my current working proposal:
The interconnect guarantees that the core receives no back-pressure, leaving BREADY and RREADY high. This may require one (or more) skid buffers, or perhaps even small FIFOs within the interconnect, but this should still be quite doable.
The slave can then guarantee that it will keep all of the slave generated *READY signals (AWREADY, WREADY, and ARREADY) high as well.
Sorry, but this property will keep you from using many of Xilinx’s peripherals, since they tend to idle with their *READY signals low.
The interconnect guarantees that AWVALID == WVALID. This will save the slave from the hassle of needing to implement incoming skid buffers just to synchronize these two signals.
The slave can then guarantee that BVALID == $past(AWVALID) and RVALID == $past(ARVALID) for SINGLE peripherals. For DOUBLE peripherals, the slave would guarantee that BVALID == $past(AWVALID,2) and RVALID == $past(ARVALID,2).
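These handshake rules are simple enough to check mechanically against a recorded trace. The sketch below is a plain Python stand-in for what would more naturally be formal properties. The function names and the trace layout (one dict of 0/1 values per clock) are hypothetical, and it assumes that "no back-pressure" means BREADY and RREADY are held high.

```python
def check_simplified_axi(trace, delay=1):
    """Check a per-cycle trace against the proposed handshake rules.

    trace: list of dicts, one per clock, mapping signal names to 0 or 1.
    delay: 1 for a SINGLE peripheral, 2 for a DOUBLE peripheral.
    Returns a list of (cycle, message); an empty list means no violations.
    """
    ready_signals = ("BREADY", "RREADY", "AWREADY", "WREADY", "ARREADY")
    errors = []
    for n, now in enumerate(trace):
        for name in ready_signals:
            if not now[name]:
                errors.append((n, f"{name} must stay high"))
        if now["AWVALID"] != now["WVALID"]:
            errors.append((n, "AWVALID must equal WVALID"))
        if n >= delay:
            past = trace[n - delay]
            if now["BVALID"] != past["AWVALID"]:
                errors.append((n, f"BVALID != AWVALID from {delay} clock(s) ago"))
            if now["RVALID"] != past["ARVALID"]:
                errors.append((n, f"RVALID != ARVALID from {delay} clock(s) ago"))
    return errors


def cycle(**overrides):
    """Build one idle clock cycle, then apply any signal overrides."""
    signals = dict(BREADY=1, RREADY=1, AWREADY=1, WREADY=1, ARREADY=1,
                   AWVALID=0, WVALID=0, ARVALID=0, BVALID=0, RVALID=0)
    signals.update(overrides)
    return signals


# A write, then a read, each answered exactly one clock later.
good = [cycle(AWVALID=1, WVALID=1), cycle(BVALID=1, ARVALID=1), cycle(RVALID=1)]
# AWVALID without WVALID, and then no BVALID in response.
bad = [cycle(AWVALID=1), cycle()]

good_errors = check_simplified_axi(good)
bad_errors = check_simplified_axi(bad)
```

Passing delay=2 checks the DOUBLE timing instead. Nothing else in the rules changes between the two peripheral classes.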
To do this using a fully AXI capable slave, we’d need a couple other bus simplifying rules as well.
The interconnect must guarantee that AxLEN == 0 any time AxVALID is true for both channels.
This means that the interconnect will need to break apart any bursts into individual beats before they ever reach the slave.
This doesn’t mean that the interconnect will no longer support bursting at a rate of one beat of the transfer per clock cycle, but rather that each individual beat will be given its own address from the interconnect.
The interconnect would also then guarantee that WLAST == 1 any time WVALID is true. This just follows from guaranteeing that AxLEN == 0.
The interconnect guarantees that AxID == 0, and then ignores xID on the return channel.
Yes, I understand that there are reasons for using the ID field–just not in this simplified version of AXI.
Also, having a known response time from the slave makes the conversion from AXI to AXI-lite a lot easier, without requiring any loss in burst speed. To see how difficult the conversion can be, consider this full speed bridge and notice the challenge of matching up the return ID with the requested burst, as well as generating the RVALID & RLAST signals at the end of the burst. It wasn’t easy to do, certainly not while maintaining a high throughput, and it was even harder to verify.
The interconnect guarantees that AxSIZE == $clog2(C_AXI_DATA_WIDTH)-3, and then leaves it constant. This also follows from setting AxLEN == 0.
The slave ignores AxQOS, and so on. (The master guarantees these values are zero, in case the slave doesn’t quite want to ignore them.)
The slave might still support AxLOCK if desired, or ignore it if not. I haven’t decided if that would be useful or not.
Finally, if the interconnect does its job right, you wouldn’t lose any burst support, but still be able to retire beats at a rate of one per clock.
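Breaking a burst into individually addressed beats is mostly address arithmetic. Here is a minimal sketch, assuming an incrementing (INCR) burst, full-width beats, a power-of-two bus width, and an aligned starting address. The function name split_burst and the dict keys are hypothetical.

```python
def split_burst(addr, axlen, data_width_bits=32):
    """Break one INCR burst into single-beat requests.

    addr: starting byte address (assumed aligned to the bus width).
    axlen: AxLEN as AXI defines it, i.e. the number of beats minus one.
    Returns one dict per beat, each carrying its own address.
    """
    bytes_per_beat = data_width_bits // 8
    # log2(bytes per beat), the same value as $clog2(data width) - 3
    axsize = bytes_per_beat.bit_length() - 1
    return [
        {
            "AxADDR": addr + beat * bytes_per_beat,
            "AxLEN": 0,        # every request seen by the slave is one beat
            "AxSIZE": axsize,  # constant for a given bus width
            "AxID": 0,         # IDs are tracked in the interconnect, not here
            "LAST": 1,         # a one-beat burst is always its own last beat
        }
        for beat in range(axlen + 1)
    ]


# A four-beat burst (AxLEN == 3) on a 32-bit bus.
beats = split_burst(0x1000, axlen=3)
addresses = [b["AxADDR"] for b in beats]
```

Since one single-beat request can be issued per clock, the slave still sees one beat per clock, just as it would within a burst.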
Indeed, the AXI slave logic for a SINGLE type peripheral might easily be simplified to just a few lines: write the register whenever AWVALID is high, and return the register’s value on every read.
The DOUBLE type peripheral logic would be similarly simplified.
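As a rough illustration of how little is left for a SINGLE slave to do, here is a behavioural model in Python rather than HDL. The class AxiSingleSlave and its step method are hypothetical, and the model assumes full-word writes (WSTRB is ignored) along with all of the interconnect guarantees proposed above.

```python
class AxiSingleSlave:
    """Behavioural model of a one-register slave on the simplified AXI bus.

    No READY signals, IDs, or burst counters appear: under the proposed
    rules the slave never stalls and every request is a single beat.
    """

    def __init__(self, reset_value=0):
        self.reg = reset_value
        self.bvalid = 0
        self.rvalid = 0
        self.rdata = 0

    def step(self, awvalid=0, wdata=0, arvalid=0):
        """Advance one clock; return (BVALID, RVALID, RDATA) after the edge."""
        # Read path: sample the register as it stood before this edge.
        self.rvalid = arvalid
        if arvalid:
            self.rdata = self.reg
        # Write path: AWVALID == WVALID, so one strobe covers both channels.
        self.bvalid = awvalid
        if awvalid:
            self.reg = wdata
        return self.bvalid, self.rvalid, self.rdata


slave = AxiSingleSlave()
trace = [
    slave.step(awvalid=1, wdata=5),             # write 5
    slave.step(arvalid=1),                      # read it back
    slave.step(awvalid=1, wdata=9, arvalid=1),  # simultaneous write and read
    slave.step(),                               # idle
]
```

The third step exercises both channels on the same clock, which is the separate read and write capability the simplified protocol is meant to keep.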
Yes, this eliminates a lot of the logic necessary to deal with the AXI protocol. All of that ugly logic would be aggregated into one module, such as axidouble, that would then handle all of the full AXI protocol interaction in order to create this simplified protocol. You can see an example of such an axilsingle peripheral online, should you be interested in how this might work.
This approach allows the bus interconnect to simplify its logic drastically. Instead of a 10k LUT crossbar, it should now be possible to connect the design together using a 3.4k LUT crossbar. Such a crossbar might support 4 masters and 8 slaves, where one of those slaves controls the SINGLE peripherals and one controls the DOUBLE peripherals. The logic in the slaves might even be as low as 600 LUTs (an estimate based upon a peripheral drawn from an AXI-lite design).
Yes, there would be some distinct differences in this approach. For example, only one master could ever command a read (or write) port of a SINGLE or DOUBLE peripheral group at a time, rather than allowing a separate master to connect to every simplified peripheral. This isn’t really that much of a problem, since if you anticipated contention, you might split the SINGLE (or DOUBLE) peripherals into two groups, and so avoid the contention. You might also place any high demand peripherals into their own peripheral slot in the interconnect and just ignore the potential optimizations. Indeed, how you group slaves into SINGLE or DOUBLE peripheral locations is completely application dependent.
This is also a very different approach from the more common approach of using an AHB slave as a “lite” slave. First, AHB has no support for simultaneous reads and writes. That would force the read and write channels to be synchronous prior to handling an AHB slave. Second, because AHB permits arbitrary stall amounts, the master/interconnect can’t simplify the returns among a group of peripherals, but instead is required to check for the return from each individual peripheral. Similarly, while it is possible to generate an AHB interconnect, and so group peripheral returns, such groups of multiple slaves under the same AHB port would just slow everything down–since AHB is primarily a combinatorial logic bus.
Unlike that AHB approach, this approach maintains the high clock speed and multiple inflight transactions that AXI is known for already. It also maintains the separate read and write channels, as well as full/burst speed–unlike many of the AXI-lite implementations I’ve seen.
As you may remember, I’m in the process of upgrading AutoFPGA so that it can handle multiple bus types. My current upgrade plans include both full WB support as well as AXI-lite support, although once I get that far AXI shouldn’t be much harder. Indeed, most of the AXI work has already been done in either the AXI-lite bus logic generator, or the AXI crossbar.
Of course, my current work to this end is still quite preliminary, but this at least outlines how I intend to get the bus to be able to handle large numbers of slaves without breaking the piggy bank to get there. My goal is also to make the generated logic usable for all, rather than encumbered by copyrights, so that I could then use it on a vendor-independent basis for an intermediate digital design tutorial.
And whosoever shall fall on this stone shall be broken: but on whomsoever it shall fall, it will grind him to powder. (Matt 21:44)