What is a Virtual Packet FIFO?

I first came across virtual packet FIFOs in a SONAR project by necessity. The SONAR device’s only means of communicating with the outside world was via Gb Ethernet. There was no UART and no JTAG. Everything went over Ethernet. Collected data went over Ethernet. Device control was over Ethernet. Debugging had to be done over Ethernet. FPGA reconfiguration and all software updates had to go over Ethernet. Last of all, the CPU needed to talk to the outside world over Ethernet. This was where I first came up with the idea of a virtual packet FIFO.

Fig 1. A Virtual Packet FIFO

The idea came from necessity, given how my previous network controller operated. In that controller, packets would be received into a small block RAM connected to the controller. That block RAM could hold only one packet at a time. Once a packet was received, therefore, the network controller would be deaf until the CPU processed the packet and then notified the controller it could use its memory for another packet. Likewise, when the CPU wished to transmit a packet, it would write a single packet to the controller’s memory, notify it that a packet was present, and then wait for the controller to finish transmitting it before writing the next packet to memory.

This works great–on a low bandwidth interface. But what happens if two packets arrive in short succession? Or, similarly, what happens if packets arrive that are larger than the controller’s internal buffer? What about “Jumbo packets”?

All of these problems necessitated a new solution, and the solution I chose was a virtual packet FIFO. This solution has two big upgrades to the previous one. The first is a size upgrade. A virtual packet FIFO can be much larger than its block RAM counterpart. The second upgrade is the number of packets that can be held. Frankly, it doesn’t make much sense if you can hold lots of data, if you can’t also fill that with either lots of packets or a small number of jumbo packets.

Since this is a neat idea, let’s take a moment and discuss it.

Packet streams

Some time ago, I discussed the problems with the AXI stream protocol. At the time, I based my discussion on three specific applications: video, data capture, and network packet handling. In each of these applications, data would arrive at the incoming interface independent of whether or not there was space available to handle it. Backpressure, a key feature of the AXI stream protocol, could not be supported properly without risking data corruption.

At that time, I suggested a new AXI stream field: ABORT. If the ABORT signal was ever asserted from an upstream source, the rest of any data packet would need to be dropped, and data handling would need to start over with the first beat of the next packet. This new ABORT signal has worked nicely in network packet handling constructs. Indeed, it has worked very well.
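As a rough illustration of what ABORT means to a downstream consumer, here’s a minimal behavioral sketch in Python. It is not the RTL, and the `Beat` tuple and `collect_packets` function are names invented for this example. It also assumes, for simplicity, that an ABORT arrives as its own beat and that the beat following it is the first beat of the next packet.

```python
from typing import Iterable, List, NamedTuple


class Beat(NamedTuple):
    """One stream beat: a data word plus LAST and ABORT flags (hypothetical model)."""
    data: int
    last: bool = False
    abort: bool = False


def collect_packets(beats: Iterable[Beat]) -> List[List[int]]:
    """Return only those packets that reached LAST without being aborted."""
    packets: List[List[int]] = []
    current: List[int] = []
    for beat in beats:
        if beat.abort:
            # Drop whatever has been gathered so far. The next beat is
            # treated as the first beat of a new packet.
            current = []
            continue
        current.append(beat.data)
        if beat.last:
            packets.append(current)
            current = []
    return packets


stream = [
    Beat(1), Beat(2, last=True),      # a complete packet
    Beat(3), Beat(0, abort=True),     # a packet abandoned part way through
    Beat(4), Beat(5, last=True),      # the next complete packet
]
print(collect_packets(stream))
```

The aborted packet never reaches the output, and nothing downstream of this point needs to know it ever existed.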

Yes, it’s a bit harder to work with and harder to verify than straight AXI streams. This is to be expected.
However, it was a joy to watch the network design “just work” with this protocol. In particular, I watched network data get captured, formed into packets, and then dropped as the design started up–because either the network interface hadn’t finished its negotiation into 1Gb mode (it could never keep up at less than 1Gb/s), or because the data hadn’t been told where to go yet. (Yes, it still needed destination addresses for the SONAR data, both IP and Ethernet, before it could send it out.)

Once configuration completed, the protocol started blasting captured packets without a hitch.

I loved it!

Others, however, have argued that my proposed ABORT field was unnecessary. Why create a new protocol, they argued, vs. just using straight AXI stream? The answer to this is twofold:

Jumbo Frames

In order to use straight AXI stream, you have to first convert the incoming network packet to AXI stream in the first place. This follows simply because that incoming network interface doesn’t know anything about backpressure. To do this conversion, incoming packets need to first go into a buffer. If there’s not sufficient space in the buffer, the packet is simply dropped. If there’s sufficient space, the packet is “committed” and can then be read out of the buffer via standard AXI stream.

The size of this buffer forces a limit on the maximum packet size that can be handled. Packets larger than the buffer size will need to be dropped.

While I was designing the original SONAR Ethernet controller, my customer asked about jumbo frames–packets much larger than the (otherwise) maximum Ethernet packet size of 1500 Bytes. How much larger? They didn’t say. All of a sudden, I could no longer size my buffer prior to hardware layout (place and route).

The Virtual Packet FIFO we’ll discuss today can solve this problem of converting an (otherwise) unsized packet to AXI stream proper.

Vendor Infrastructure

If I used Xilinx (or any other vendor’s) AXI stream infrastructure, I might be tied to that protocol. The choice of whether or not to use AXI stream is really a business decision: either rebuild the AXI stream infrastructure from scratch to support a modified protocol, or stick to the AXI stream protocol as is.

Fig 2. Advantages to using your own IP

If I rebuild the infrastructure from scratch, I incur additional costs above and beyond what I might have incurred had I used someone else’s (free) infrastructure. I can release any IP I build under my chosen user license. I can also formally verify anything I build. I will also gain the ability (and responsibility, and cost associated with) debugging and maintaining it. The good news, though, is that I can guarantee the quality of any IP I control.

Fig 3. Advantages to using vendor IP

If I use a vendor’s infrastructure then I might save some money–while risking my project’s success on the vendor’s responsiveness to bugs found in their infrastructure. Given that I’m aware of bugs that’ve lived in Xilinx IP for nearly 10 years, and given that I’m a small one-man nobody shop, I don’t have a strong confidence that they’ll fix anything that’s broken.

Yes, I suppose this is a business decision.

Frankly, I don’t use vendor infrastructure unless I have to. It’s just the nature of how I’ve structured my own business at Gisselquist Technology. I’ve now built my own CPU, my own GNU compiler and assembler back ends, my own bus interconnects, my own DSP filters, CORDICs, FFTs, etc. So it should come as no surprise that I’d have no problems building an AXI stream infrastructure based around a new “ABORT” field.

Yes, there are risks with this approach. One common risk is that I might need to interface with a vendor protocol, so I often have conversion routines available to move back and forth between one protocol and another when necessary. For example, it’s not enough to use Wishbone if you need to interact with Xilinx’s MIG–so I use a bridge from one protocol to the other. I also have sufficient infrastructure to use AXI without bridges if necessary.

Still, AXI stream is a really simple protocol, and this modified AXI network stream protocol, while more complex, isn’t really that much more difficult to deal with.

Since writing that article, I’ve had great success with this new ABORT field. Indeed, I’ve had so much success, that I’m now rebuilding all of my network data handling components to use it.

However, there is one (more) problem this new protocol needs to address: stream widths.

When working with 1Gb Ethernet, I could operate at 8b/clock at 125MHz, and stream widths weren’t really a problem–every beat contained exactly one byte. Well, not quite. Stream widths became a bit of a problem when crossing clock domains, since I would need to guarantee sufficient handling width. To handle the CDC case, I first converted to a wider (32b) AXI stream, and then prepended a 32b packet length to the packet. This kept me from supporting jumbo frames, so when rebuilding for a 10Gb interface, I needed a new solution.

Standard AXI stream solves this problem with its TSTRB and TKEEP fields. Each field has one bit per byte per beat within it, and allows the stream processor to handle less than a full beat of information. For example, when dealing with a 32-bit interface, a beat carrying a 16-bit value might contain two NULL bytes, where a NULL byte is defined as one where TKEEP and TSTRB are both low.

This seemed insufficient for me for a variety of reasons. In general, to use an AXI stream of this type, you’d first want to pack it and remove all NULL bytes. This would force any unused bytes into the last beat, while also requiring that the last beat had at least one valid byte. The last beat would also need to be packed, so that all used bytes would be on the low end–when using little endian semantics, or the high end otherwise. Further, I never saw a reason for keeping “position” bytes (TKEEP && !TSTRB) around. The result was that TKEEP and TSTRB contained too many bits for my purpose.

So I created a new field: BYTES. At first, the BYTES field had $clog2(DW/8+1) bits to it, where DW is the number of bits in the DATA field–sometimes called C_AXIS_DATA_WIDTH. This BYTES field would then be equal to DW/8 for every beat prior to the last one, and between one and DW/8 inclusive for the last beat. (0 < BYTES <= DW/8) Then, on second thought, I realized the top bit of BYTES was irrelevant: Since BYTES was never zero, and never more than DW/8, I could map the DW/8 value to zero and drop a bit. So, now, BYTES has $clog2(DW/8) bits and a value of zero (representing DW/8 bytes) for all but the LAST beat where it might represent fewer bytes per beat.
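The BYTES mapping is easy to get off by one, so here’s a small Python sketch of it. The function names are made up for illustration, and `clog2` is a stand-in for Verilog’s `$clog2`. The only rule taken from the text is that a full beat of DW/8 bytes is encoded as zero, while a partial (last) beat carries its byte count directly.

```python
def clog2(n: int) -> int:
    """Ceiling of log2(n) for n >= 1, mirroring Verilog's $clog2."""
    return (n - 1).bit_length()


def bytes_field_width(dw: int) -> int:
    """Number of bits in the BYTES field for a DW-bit data bus."""
    return clog2(dw // 8)


def encode_bytes(valid: int, dw: int) -> int:
    """Map a count of valid bytes (1..DW/8) onto the BYTES field."""
    full = dw // 8
    if not 1 <= valid <= full:
        raise ValueError("a beat must carry between 1 and DW/8 bytes")
    return 0 if valid == full else valid


def decode_bytes(field: int, dw: int) -> int:
    """Recover the count of valid bytes from the BYTES field."""
    full = dw // 8
    return full if field == 0 else field


for dw in (32, 64, 512):
    print(dw, bytes_field_width(dw), encode_bytes(dw // 8, dw), decode_bytes(0, dw))
```

For a 512-bit stream this gives a six bit field, where the earlier scheme would have required seven.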

So, in summary, to support packet data I made the following changes to the AXI stream protocol:

ABORT: A new field, indicating that the upstream processor needed to drop the packet for any reason. Possible reasons I’ve come across include: 1) CRC errors, 2) protocol errors, 3) hardware errors, or even 4) insufficient memory for handling backpressure, from downstream.
TKEEP/TSTRB: I dropped both of these fields.
BYTES: A new field to replace the TSTRB/TKEEP fields, while still indicating how many bytes are active in a given beat.
And, of course, all beats are fully packed. Hence, all but the LAST beat will have DW/8 valid bytes in it.

I’ve named this (new) protocol the AXI-networking, or AXIN, protocol, for lack of a better name. As a result, if you look through the designs I’ve built to use this protocol, you’ll find “AXIN” in a lot of the names.

I also have a lot of infrastructure for this new protocol, and that infrastructure is growing on a daily basis. For example, I have AXIN skidbuffers, asynchronous and synchronous FIFOs, broadcasters, arbiters, a width converter, and more. (A CRC checker is still being verified, but will likely be posted soon.)

This brings us to the topic of virtual FIFOs.

What is a virtual FIFO?

A virtual FIFO is simply a FIFO that uses external instead of internal memory. That external memory is typically accessed via a bus, shared among many potential users, and commonly exists off-chip. A classic example would be a DDR3 SDRAM memory accessed via an AXI (or Wishbone) bus. You can see the difference between a traditional and a virtual FIFO in Fig. 4.

Fig 4. Difference between a FIFO and a Virtual FIFO

Some time ago, I built a Virtual FIFO for the AXI protocol. The flow went as follows:

The first step was to buffer a burst of data into a synchronous FIFO. To work smoothly, the synchronous FIFO needed space for at least two AXI bursts.
Once a full burst’s worth of data was available in the local FIFO, that data would be burst to the AXI bus.

When using Wishbone, I’d do the same thing, save that the burst would only end after the incoming FIFO was completely drained.
Once BVALID was then received, we would know that a full AXI burst’s worth of memory was now available in the external RAM to be read back.

In the case of Wishbone, I’d count data words, not burst sizes, but it’s the same principle.
There was again another (local, block RAM) synchronous FIFO on the read memory side. Like the first FIFO, this one also required enough room to contain at least two AXI bursts.
Once a burst’s worth of data has been placed into the external RAM, and there is sufficient (uncommitted) data in the second FIFO to hold it, a burst read request would be issued.

Again, when using Wishbone, I’d make requests until the entire FIFO’s size was committed–not just the initial burst size. Hence, as reads might be made from the FIFO while requesting data from the bus, additional reads would be made.
Data read back from memory would then get sent straight into this second buffer once it returned from the bus.
The final, outgoing AXI stream, would then be fed straight from this second buffer.
Only when the incoming FIFO is full would backpressure attempt to slow down the upstream source.

The incoming FIFO would be “full” if it wasn’t getting emptied. This would happen if either 1) the memory was full, or 2) the FIFO couldn’t write to memory fast enough.

Success, when using this technique, required that the stream bandwidth be less than 50% of the memory bandwidth. This will often require that any stream necessitating a high throughput might first need to be resized to a wider width–just to reduce the throughput to something the memory can handle. Remember, when sizing memory bandwidth, there are lots of things that can use up your bandwidth:

Fig 5. Calculating memory bandwidth

You’ll need bandwidth for the data to come in and get written to memory
You’ll need that much again to read the data back out
You’ll need to allocate time for bus latency.

This can be worse for any bus that needs to stop in order to switch directions.
A memory can only read or write on any given clock cycle, and also needs a couple cycles to switch from reading to writing and back again.
Don’t forget, you’ll also need to allocate some number of memory clock cycles for refresh. How many cycles will be required here depends upon your memory, your bus structure, and whether or not your memory allows pulling refresh cycles or whether such pulled cycles are supported in your controller.
Xilinx’s MIG controller also uses a couple of clock cycles per refresh to keep its IO PLL locked on the memory’s DQS strobe signal(s).
You’ll also need memory bandwidth for everything else that might use the memory.

In short, it depends. The best way to size memory bandwidth requirements is to calculate how many beats per second you will need, and then make sure your memory can support perhaps four times that amount.
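That rule of thumb can be written down as a back-of-the-envelope calculation. The sketch below is only that, and makes two assumptions: the function names are invented for this example, and the memory’s peak rate is taken to be one bus-width beat per clock. The example figures (a 10Gb/s stream, a 512-bit bus, a 200MHz memory clock) are borrowed from the switch discussed later in this article.

```python
def beats_per_second(stream_bits_per_s: float, bus_width_bits: int) -> float:
    """Beats per second needed to carry the stream once across the bus."""
    return stream_bits_per_s / bus_width_bits


def memory_is_sufficient(stream_bits_per_s: float, bus_width_bits: int,
                         mem_beats_per_s: float, margin: float = 4.0) -> bool:
    """Apply the 'perhaps four times' rule of thumb to a peak memory rate."""
    needed = margin * beats_per_second(stream_bits_per_s, bus_width_bits)
    return mem_beats_per_s >= needed


stream = 10e9        # one 10Gb/s stream
width = 512          # bits per memory beat
mem_rate = 200e6     # assumed peak: one beat per 200MHz clock

print(beats_per_second(stream, width))            # beats/s for one pass
print(memory_is_sufficient(stream, width, mem_rate))
```

The 4x margin is there to cover the write, the read back, the bus turnarounds and the refresh cycles listed above. Anything else sharing the memory has to be added to the stream rate before applying it.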

A key problem with the standard virtual FIFO, described above, is that there’s no (good) way to store non-data information such as packet boundaries in memory. Either you increase the memory storage requirement to hold a LAST bit (often by 2x!), or it just gets dropped. Indeed, my basic AXI virtual FIFO implementation simply drops this data. As a result, it works well on a proper stream interface, but not very well on a packet interface.

I have a separate virtual FIFO that I’ve built for my scope, sometimes called an internal logic analyzer. (This one’s been formally verified, but never used in any practical context. It was fun to build, and a good learning exercise, it just hasn’t fit into any important usage scenarios … yet.) In this case, if the scope ever gets overrun and can’t keep up, all the data will be dropped and it will start collecting all over again with new data.

Again, the problem with the stream protocol is backpressure, and what to do if you overrun the FIFO, and where/when in your stream processing will that information be known. When dealing with packets, the rule is data needs to be dropped at packet boundaries. That information needs to be communicated to the place where the decision can be made.

So how do we mix the packet concept with the concept of a virtual FIFO? The answer is a virtual packet FIFO.

What is a virtual packet FIFO?

A virtual packet FIFO is simply a virtual FIFO that maintains packet boundaries.

Fig 6. Virtual FIFO definitions

That means two things.

First, it means we have to preserve packet boundaries. That LAST signal is important when working with packets. Moreover, packet boundaries need octet level precision.

Second, it means that packets are written to the FIFO before being committed to the FIFO. Only after a full packet has been written to the FIFO can it ever get committed.

To handle both of these requirements, I rearranged how I stored packets in memory. Instead of storing packet data alone, or packet data plus a LAST bit, or packet data plus some number of ancillary bits (TSTRB, TKEEP, TUSER, and TLAST), I store the length of the packet before the packet.

Fig. 7 shows this pictorially.

Fig 7. Virtual packets in memory

Specifically:

All packet length fields precede the packet they describe.
All packet lengths are 32 bits. Yes, this is an arbitrary length. However, 1) this seems to be the smallest bus size I’ve ever needed to work with. 2) I rarely have more than 4GB of memory, so this seems sufficient. 3) It allows for jumbo packet sizes up to whatever memory size I have on hand.
I also force a minimum 32-bit alignment on all accesses. So, for a 128-bit wide bus, this word will be aligned on a 32-bit boundary.
Before a packet is committed, its packet length shall be NULL. (i.e. 32’h0) You can think of this like the NULL pointer at the end of a linked list.
The packet data is written to memory using the full width of the bus.

In the context of the 10Gb Ethernet switch I’m working on, maintaining memory throughput is important. As a result, I need to use the full memory width (512 bits) as often as possible. Anything less would reduce my memory bandwidth.
If ever a packet gets dropped, the packet writer just goes back to the beginning of the packet data area and starts over following the NULL packet length word when the next packet starts.
- Packets can be dropped for any upstream reason.
- Packets are also dropped if the virtual packet FIFO runs out of room.
  
  This is a necessary criterion to prevent the deadlock that would result if the upstream source never needed to abort, and yet there was no room in memory to hold the rest of the packet.
  
  In order to support packet length pointers, a packet may not be completed unless there’s room for both the packet length before and the packet length of the packet to follow.
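To make the memory layout concrete, here’s a small Python model of it. This is a sketch under several assumptions that the text doesn’t specify: length words are stored little endian, they count bytes, the data is padded up to the next 32-bit boundary before the following length word, and the memory doesn’t wrap. The function names are invented for the example.

```python
import struct
from typing import List


def align4(addr: int) -> int:
    """Round a byte address up to the next 32-bit boundary."""
    return (addr + 3) & ~3


def append_packet(mem: bytearray, offset: int, payload: bytes) -> int:
    """Store payload behind the length word at `offset`, then commit it.

    Returns the offset of the next (NULL) length word.
    """
    data_start = offset + 4
    next_len = align4(data_start + len(payload))
    if next_len + 4 > len(mem):
        raise ValueError("no room for the packet and the following length word")
    mem[data_start:data_start + len(payload)] = payload
    struct.pack_into("<I", mem, next_len, 0)           # NULL the next length first
    struct.pack_into("<I", mem, offset, len(payload))  # then commit this packet
    return next_len


def walk_packets(mem: bytearray, offset: int = 0) -> List[bytes]:
    """Follow the length words until the NULL length that ends the list."""
    packets = []
    while offset + 4 <= len(mem):
        (length,) = struct.unpack_from("<I", mem, offset)
        if length == 0:
            break
        start = offset + 4
        packets.append(bytes(mem[start:start + length]))
        offset = align4(start + length)
    return packets


mem = bytearray(64)
nxt = append_packet(mem, 0, b"hello")
nxt = append_packet(mem, nxt, b"abcdefgh")
print(walk_packets(mem), nxt)
```

Writing the NULL length before the committing length is what keeps a reader from ever walking off the end of the list.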

As I mentioned, I’m now working on building a Virtual Packet FIFO. It hasn’t yet been verified, or I’d present the entire FIFO here. For now, let me point out the three major components and discuss how they work together:

There’s the Controller,
The writer, and
The reader.

You may expect these components to change as they eventually get verified, and then tested and proven in hardware. (As of today, only the writer has passed a formal check, and even that check didn’t properly include the contract.)

Let’s discuss each of these components briefly.

The Virtual Packet FIFO Controller

The controller is responsible for setting the base address and memory size allocated to the virtual FIFO. These two values are then propagated down to both writer and reader. It’s also responsible for resetting the FIFO, and (depending on the configuration) releasing it from reset.

Even though this is only my second virtual packet FIFO, I’ve already had several diverse needs for this controller. In one design, the controller was given a fixed memory allocation. This is appropriate if the controller is required to start up and operate without any CPU intervention. In another design, the CPU could allocate memory for the FIFO and then enable or disable the FIFO. When I can’t decide which of the two I want, sometimes I will generate a combination of the two, so that the FIFO may start with a default allocation that the CPU can come back to and adjust later if necessary.

What happens if the CPU gives it a bad allocation? One of the challenges of controller design is determining how the virtual packet FIFO should handle bus errors. In general, a bus error indicates that the FIFO has a bad memory allocation. This might be the case if the CPU has allocated memory to the FIFO that isn’t present in the system. In this case, it makes the most sense for the FIFO to shut down with some type of error condition, and to then wait for the CPU to correct its memory allocation. On the other hand, if the CPU will get its instructions for “fixing” any faults from the network, then the network must be able to heal itself without any CPU intervention.

A similar problem might be generated by the reader if it ever comes across an invalid packet length word. Such a packet length might be equal to zero, greater than the total size of the allocated memory, or perhaps just big enough to pass the write pointer.

In both of these cases, either a bus error or an invalid packet length, there should be an appropriate way to fix the situation. In a hands-off implementation, the FIFO will need to just reset itself–hopefully in that case memory allocation issues will be handled before implementation. In another case, the FIFO will wait for the CPU to issue a new address before releasing itself from reset. The same could be done with the read packet length, or alternatively the packet reader might just skip to where the write pointer is at–skipping any packets in the way.

Which method of resolving faults is appropriate depends upon the particular design requirements.

The Write State Machine

Once the base address and memory size are known, and once the FIFO has been released from reset, incoming packets may be written to memory.

This takes place in several discrete steps.

First, prior to any packet, the packet’s length word must be written as zero.
Then, as packet data enters the FIFO, its data gets written directly to memory.

Unfortunately, the 32-bit packet length word means that further writes to memory cannot be guaranteed to have any particular alignment. Incoming data must then be realigned as it is written to memory. This also means that there may need to be N+1 memory writes for every N memory words.

A second problem here is associated with the data pointer. Specifically, pointers wrap. Hence, any calculation of the next memory address must include a check against the last memory address, and a wrap back to the first address if it passes the end of memory.

Finally, this is the only place where committed memory space is checked. If a packet uses all of the available memory space, not just the remaining memory space, then it must be aborted locally.
On any packet ABORTs, the write pointer is set to follow the prior NULL length. On a local packet ABORT, such as might take place if the packet overflowed memory, then we need to resync to the beginning of the next packet.
Once a packet is complete, the next word becomes the length field of the next packet. It is set to NULL.
After this next word has been set to NULL, the FIFO writer can then go back and write the length of the current (just written) packet into memory.

This is what actually commits the packet to memory. We can know the packet has been committed once all bus requests have been completed.
Once the bus becomes idle, we tell the reader our new start-of-packet pointer and go back to step #2 above to handle the next packet.

All of this is handled via a (monster) state machine that can master the bus.
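The ordering of those steps is the part worth getting right, so here’s a behavioral Python sketch of them. It is not the state machine itself: the `PacketWriter` class and its fields are invented for illustration, memory is modeled as a circular list of 32-bit words, and realignment and bus handshaking are left out entirely. The free-space accounting is one assumed way of doing the check, with `release()` standing in for the reader handing words back.

```python
from typing import List, Optional


class PacketWriter:
    """Toy model of the write sequence over a circular word memory."""

    def __init__(self, size_words: int) -> None:
        self.size = size_words
        self.mem: List[int] = [0] * size_words  # mem[0] is the first NULL length
        self.free = size_words    # words not owned by committed packets
        self.len_ptr = 0          # address of the current packet's length word
        self.wr_ptr = 1 % size_words
        self.ndata = 0            # data words written for the current packet
        self.nbytes = 0
        self.dropping = False     # set after a local (out of room) abort

    def _next(self, addr: int) -> int:
        return (addr + 1) % self.size         # pointers wrap

    def abort(self) -> None:
        """Return to the word following the NULL length word."""
        self.wr_ptr = self._next(self.len_ptr)
        self.ndata = 0
        self.nbytes = 0
        self.dropping = False

    def write(self, word: int, nbytes: int = 4, last: bool = False) -> Optional[int]:
        """Accept one beat. Returns the new length pointer on a commit."""
        if not self.dropping:
            # Room is needed for: this packet's length word, its data so far,
            # this word, and the NULL length word that must follow.
            if self.ndata + 3 > self.free:
                self.abort()
                self.dropping = True          # discard up to the packet boundary
            else:
                self.mem[self.wr_ptr] = word
                self.wr_ptr = self._next(self.wr_ptr)
                self.ndata += 1
                self.nbytes += nbytes
        if not last:
            return None
        if self.dropping:
            self.dropping = False             # resynchronized for the next packet
            return None
        self.mem[self.wr_ptr] = 0             # NULL the next length word first
        self.mem[self.len_ptr] = self.nbytes  # this write commits the packet
        self.free -= 1 + self.ndata
        self.len_ptr = self.wr_ptr
        self.wr_ptr = self._next(self.wr_ptr)
        self.ndata = 0
        self.nbytes = 0
        return self.len_ptr                   # what the reader would be told

    def release(self, words: int) -> None:
        """The reader has finished with this many words."""
        self.free += words


w = PacketWriter(8)
w.write(0x11)
print(w.write(0x22, nbytes=2, last=True), w.mem)
```

Note that an aborted packet costs nothing but the time spent writing it: the NULL length word was never overwritten, so the reader never sees any of it.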

The Read State Machine

Once a packet is committed, a second state machine can then read the packet back from the bus. (This one still needs a lot of verification work …)

First, the reader reads the length word from memory.

Knowing when to read this length word is a bit of a problem. Were this a piece of CPU software, we might poll this memory word. If the memory word was ever non-zero, we’d know a packet was present. However, this design is intended for a hardware implementation. Hardware can poll memory on every clock cycle, so much so that the writer wouldn’t have any cycles left to write the packet to memory. To prevent this, we’d need to specify a polling interval, which would then increase our latency. Supporting minimum latency requires a different solution.

My solution to this problem has been to use an out-of-band communication scheme through the controller. In this scheme, the writer tells the reader a pointer to the length word of the last packet committed. If this address doesn’t match the reader’s last memory address, then a packet is present that may be read. In another version of this FIFO, one with the CPU in the middle, the CPU provides the reader with the same pointer. Again, this tells the reader when it’s safe to go and read the packet length counter for the next packet.
Once returned, the reader then verifies the packet length word.
- It’s not allowed to be zero.
- The packet length may not pass the write pointer. This would indicate a memory overrun condition.
- The packet length must be less than the size of memory.
On any failure, we can either reset the entire FIFO, or (alternatively) just drop all packets between our current location and the write pointer.
If the length pointer is good, we start reading from memory.

There are two challenges with this task. The first challenge is that the memory will (in general) be misaligned. The second challenge is that Wishbone has no concept of backpressure, and our outgoing stream interface may require it.

To handle misalignment, we need to keep track of the previously read memory word. That, plus the current memory word, both shifted appropriately, will yield an outgoing stream word. The trick is that we may need to read an additional word to get all of the outgoing stream data associated with this packet.
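Here’s a minimal Python sketch of that realignment, working on bytes rather than on shifted bit vectors. The `realign` function is hypothetical, and it assumes the memory words are handed over in order with no pointer wrap in the middle of the packet.

```python
from typing import List


def realign(words: List[bytes], offset: int, length: int) -> List[bytes]:
    """Rebuild stream beats from misaligned memory words.

    words:  successive bus words as read from memory, all the same width
    offset: byte position of the first packet byte within words[0]
    length: packet length in bytes
    """
    width = len(words[0])
    needed = -(-(offset + length) // width)   # memory words holding the packet
    out: List[bytes] = []
    remaining = length
    carry = words[0][offset:]                 # usable tail of the previous word
    for word in words[1:needed]:
        joined = carry + word                 # previous remainder, then new word
        take = min(width, remaining)
        out.append(joined[:take])
        remaining -= take
        carry = joined[take:]
    if remaining > 0:
        out.append(carry[:remaining])         # final, possibly partial, beat
    return out


memory = bytes(range(32))
words = [memory[i:i + 8] for i in range(0, 32, 8)]   # four 64-bit words
for beat in realign(words, 4, 16):
    print(beat.hex())
```

In this example, two outgoing 64-bit beats require three memory reads, which is the extra read mentioned above.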

The way to handle backpressure when using Wishbone is to guarantee that we don’t issue a read request in the first place unless there’s space available in the following outgoing FIFO for a packet word.

One of the nice things about the reader is that we don’t need to generate any ABORTs. That’s a pleasant simplification. Indeed, at this point in our return processing, we could finally handle (infinite) backpressure if need be.
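Returning to the length word checks listed earlier, here’s how they might look as a single predicate in Python. This is a sketch, not the reader’s actual logic: the function and argument names are made up, pointers are modeled as byte offsets into the allocated region, and it assumes (as one possible convention) that packet data is padded to a 32-bit boundary before the next length word.

```python
def length_word_is_valid(length: int, rd_ptr: int, wr_ptr: int, mem_size: int) -> bool:
    """Check a length word read from memory before trusting it.

    rd_ptr:   byte offset of the length word just read
    wr_ptr:   byte offset of the writer's current (NULL) length word
    mem_size: bytes allocated to the FIFO
    """
    if rd_ptr == wr_ptr:
        return False                 # nothing has been committed here yet
    if length == 0:
        return False                 # zero is reserved for "not committed"
    if length >= mem_size:
        return False                 # cannot possibly fit in memory
    # Bytes between the first data byte and the write pointer, allowing
    # for the pointers having wrapped around the end of memory.
    available = (wr_ptr - (rd_ptr + 4)) % mem_size
    padded = (length + 3) & ~3       # assumed padding to a 32-bit boundary
    return padded <= available


print(length_word_is_valid(5, 0, 12, 64))    # fits before the write pointer
print(length_word_is_valid(9, 0, 12, 64))    # would pass the write pointer
```

What to do when this check fails, whether a full reset or a skip to the write pointer, is the policy decision discussed under the controller above.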

A New Interconnect

One piece I wasn’t expecting in this new architecture was an updated/better Wishbone interconnect.

As perspective, this virtual packet FIFO is designed to support a 10Gb, 4-way Ethernet switch. That means I want to be able to support 10Gb arriving (and departing) on each of the 4 interfaces at the same time. When using our planned hardware, the memory will run on a 200MHz clock, reading (or writing) 512-bits (64-bytes) of data per clock cycle. However, a 10Gb Ethernet switch will generate one 512-bit word every 51.2 ns, or (roughly) once every 11 clocks at 200MHz. Hence, when the interface is running full speed, we’ll be getting requests from rotating controllers. The first controller might want a beat, but then not need anything for another 10 beats while the second controller wants a beat, etc.

Typically, I run Wishbone in a fashion where I burst data to the bus (i.e. to memory) and then wait for the response before shutting the interface down. When using Xilinx’s MIG, this can take up 20 clock cycles of latency. If I did that here, I’d never have enough memory bandwidth to keep up.

My solution to this problem is to use a special type of interconnect–one I first developed for an AXI project. When using this interconnect, N masters may request bus accesses of a single slave. In this case, as each bus master makes its request, the master’s ID is placed in a FIFO. Since Wishbone requests are always returned in the order they are received, I can then use this FIFO to route responses back to the appropriate master. This will allow me to interleave requests from multiple masters together on their way to memory.
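The routing idea fits in a few lines of Python. This is a toy model only, with `InOrderRouter` and its methods invented for illustration; it relies on the one property stated above, that the slave returns responses in the order the requests were received.

```python
from collections import deque
from typing import Deque, Tuple


class InOrderRouter:
    """Route in-order slave responses back to the masters that asked."""

    def __init__(self) -> None:
        self.pending: Deque[int] = deque()    # master IDs, oldest request first

    def request(self, master_id: int, addr: int) -> int:
        """Record who asked, then forward the address to the slave."""
        self.pending.append(master_id)
        return addr

    def response(self, data: int) -> Tuple[int, int]:
        """Pair the next response with the oldest outstanding request."""
        return self.pending.popleft(), data


router = InOrderRouter()
memory = {0x100: 0xAA, 0x200: 0xBB, 0x300: 0xCC}

# Three masters interleave their requests on the way to memory ...
issued = [router.request(m, a) for m, a in ((0, 0x100), (2, 0x200), (1, 0x300))]

# ... and the responses come back in the same order.
for addr in issued:
    print(router.response(memory[addr]))
```

No master has to wait for another master’s response before issuing its own request, which is where the extra bandwidth comes from.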

That’s the good news–more bandwidth. The bad news is that this N:1 arbiter will break Wishbone in two ways. First, since there’s no guaranteed concept of the end of a transaction, there’s no way to know when to lock the bus. Second, as I implement Wishbone, a bus error terminates any ongoing transaction. This means that if N masters are active and only one of those masters receives a bus error–in response to some erroneous transaction, then all Wishbone masters will receive a bus error in return to their ongoing operations. For now, this will work: 1) these virtual packet FIFOs will not be locking the bus, and 2) any bus errors should be rare or even non-existent. Still, it’s a risk, and I’ll need to make sure it’s well documented throughout the project.

Conclusions

This is now the second virtual packet FIFO I’ve created. If any design becomes so useful that you need to build it more than once, then it’s going to become useful again.

In this case, this virtual packet FIFO will play an important part in the 10Gb Ethernet switch I’m working on. As packets arrive from the PHY, their CRCs will be validated, their stream width expanded, they’ll then cross clock domains, their source MACs will be recorded in the router, and they will enter this virtual packet FIFO. Once these packets come out of the FIFO, they’ll go into a separate synchronous FIFO, have their destination MACs checked, get routed to an outgoing interface, cross clock domains (again), have their widths adjusted back to the interface width, and finally get blasted out the network. Feel free to check out this picture to see an overview of this entire operation, as well as the status of the various components required of this project.

For now, however, the project is still draft. The hardware, while drafted, isn’t yet built and I’m still working on the RTL components within it. Lord willing, I’ll have the RTL done by the time the hardware is available.

Still, the overall concept of a Virtual Packet FIFO was one I felt would be worth sharing.