Lessons in Hardware Reuse

Fig 1. Can hardware designs be recycled?

When I first started doing digital design, I had a strong software background. One of the first lessons you learn in software is to reuse your own code as much as possible. The sooner you learn this lesson, the faster you’ll be able to finish your next homework problem.

The lesson goes well beyond school and into industry as well. Consider the various operating systems and how often they are reused. Are you reading this article from a device running Linux, MacOS, Windows, or something else? Just being able to count the major operating systems on one hand is itself a testament to software reuse.

The same lesson applies to compilers and system libraries. How is it, for example, that Vivado, Quartus, Yosys, Verilator or any other EDA tool can run on so many platforms? Software reuse. It’s real. It works.

But what about hardware? Specifically, what about reusing digital design components?

Here, in this field, reuse becomes a bit more of a challenge.

The first and biggest challenge is hardware licensing. The licenses that worked so well for software don’t apply as well to hardware. While I personally love the GPLv3 license, conveying a hardware design that uses a GPLv3 component to someone else requires also conveying to them the ability to rebuild the rest of the entire design. This isn’t so easy, since many of the popular major design components (ARM cores, SERDES cores, I/O components, and so forth) are still very proprietary.

Within a company, however, design reuse shouldn’t be a problem. The company owns all of their own designs, so they should be able to use them freely from one product to another, right?

This is the case here, within Gisselquist Technology, LLC, and yet even in this optimal reuse environment hardware design reuse is still a long way from achieving the goals that have been achieved by software reuse.

Let’s take some time today to look at several experiences I’ve had with design reuse since I started with digital design over a decade ago. (Wow, has it actually been that long?) We’ll start by looking over standardization problems I’ve had across tools, and then work our way from the bottom of a design all the way up through some components, through bus slaves, and on to bus master interoperability.

Tools aren’t standard

The first problem with design reuse is that the various tools tend to be vendor and often even platform centric. This makes it a challenge to reuse designs from one platform to the next. For example, design constraint files (XDC, UCF, SDC, PCF, etc.) differ in format and content from one vendor to the next. This means that I/O timing constraints and false path constraints all need to be rewritten when attempting to reuse a design across different vendors.

Well, at least the HDL languages are standard among vendors and tools, right? How about just the subset of Verilog that I like to use?

Well, no. Not even Verilog is standard across vendor tools.

One of the first things I teach anyone who will listen is to place a default_nettype none declaration at the top of every Verilog source file. Doing this prevents the synthesis tool from turning a spelling mistake into a new signal within your design. It has helped me catch a lot of mistakes over the years.

The problem is that placing this line in a Quartus DE-10 Nano design will cause Quartus to fail to build the design. Why? Because the default_nettype setting isn’t applied to a single file, but rather to every design file that follows, if not to the entire design. Worse, it seems as though Altera’s engineers used this language “feature” to avoid declaring signals within their designs. Hence, what makes my design better breaks their design components.

The problem isn’t limited to Quartus. Yosys handles the default_nettype statement on a file by file basis. This means that if I change default_nettype back to its original wire setting at the end of the file, the design will now work with Quartus but it will no longer get the default_nettype benefit from Yosys.

There is one annoying detail associated with this command: input ports need to be declared as input wire rather than just input once you set default_nettype to none. The Verilog standard requires this, yet neither Yosys nor Verilator requires it. This means that designs that pass a verilator -Wall -cc topmodule.v check might still fail to build under another tool.
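
The pattern described above can be sketched as follows. This is a minimal illustration only: the module name, the signal names and the counter width are all made up for the example. The closing `default_nettype wire` line is the Quartus-friendly trade-off discussed above, with the cost under Yosys that was just described.

```verilog
`default_nettype none

// Hypothetical example module, not taken from any particular design
module blinky_sketch (
		input	wire	i_clk,	// "input wire", not just "input"
		output	wire	o_led
	);

	reg	[24:0]	counter;
	wire		led_value;

	initial	counter = 0;
	always @(posedge i_clk)
		counter <= counter + 1;

	assign	led_value = counter[24];

	// With default_nettype none, misspelling the name below (for
	// example, "led_vlaue") is an error rather than a new, undriven,
	// one-bit wire quietly added to the design.
	assign	o_led = led_value;
endmodule

// Put the default back for any files that follow this one
`default_nettype wire
```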

HDL designs don’t build without warnings

If there’s one thing that frustrates me, it’s the inconsistency of warnings across tools. Coming from the software world, I’m used to a program that can be compiled without warnings. Here in the hardware world, this is a challenge.

Consider, for example, the following code.

	parameter	W = 5;
	reg [W-1:0]	A;

	always @(posedge clk)
		A <= A + 1;

This will generate a warning that a 32-bit number, the 1, is being added to a 5-bit number, and so there might be a loss of precision. While I might rewrite this to get rid of the warning,

	always @(posedge clk)
		A <= A + 5'h1;

the warning will then return again whenever I change the width, W, to something other than five.

If I then try to change the design to

	always @(posedge clk)
		A <= A + 1'b1;

I then get rid of the warning when using Verific based front ends, only for it to return with Verilator.

My solution has been to build my designs so that they have no warnings when using verilator -Wall, and then to ignore any of the warnings generated by Verific, the parser used by Vivado, ISE, and Quartus.
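
One more variant is to build the constant so that it is always exactly W bits wide. The sketch below assumes W is at least 2, and the module and localparam names are made up for illustration. Whether any particular tool stays quiet about this form is something to check rather than assume.

```verilog
`default_nettype none

// Hypothetical wrapper around the counter from the example above
module counter_sketch #(
		parameter	W = 5	// assumed to be 2 or greater
	) (
		input	wire		clk,
		output	reg	[W-1:0]	A
	);

	// A W-bit wide "one", made by zero-extending a single bit, so both
	// operands of the addition have the same width for any W
	localparam [W-1:0]	ONE = { {(W-1){1'b0}}, 1'b1 };

	initial	A = 0;
	always @(posedge clk)
		A <= A + ONE;
endmodule
```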

Still, it’s annoying to have a design build without warnings in one environment, but not in another.

Unused values

Many interfaces have signals that aren’t used by all cores. In order to make the cores generic, I pass those signals that aren’t used along with the interface anyway. Verilator generates a warning when I do this. Verific (i.e. the front-end language parser used by Vivado, ISE, and Quartus) also generates a warning. However, I can turn the Verilator warnings off on a case by case basis by simply using,

	// Verilator lint_off UNUSED
	wire	unused;
	assign	unused = &{ 1'b0, unused_signal_one, unused_signal_two, etc };
	// Verilator lint_on UNUSED

While this doesn’t get rid of the warnings when using the commercial vendor tools, at least those warnings are now about the wire named unused being unused, and so they’re now easy to work through.

Of course, the problem with ignoring synthesis warnings like this is what happens when a design mysteriously stops working. In that case, I find myself digging through all of the useless warnings generated in the logs of the various tools and looking for any evidence of what might’ve happened.

Generate loop block names

Much to my surprise, a design that worked in Yosys, Verilator, Vivado, and ISE failed to synthesize under Quartus for the simple reason that the for loops within my generate blocks weren’t named.

	generate genvar k;
	for(k=0; k < NADC; k=k+1)
	begin : BLOCK_NAME_NEEDED_HERE
		assign adc_data[k]
			= raw_adc_data[k*ADCBITS +: ADCBITS];
	end endgenerate

My point here is simply that seemingly useless differences between vendor tools can become quite annoying in practice and a hindrance to design reuse. All of a sudden, you find that a design component that worked under one vendor’s tools mysteriously causes build failures under another vendor’s tools.

This problem was solved in software by an open source compiler, gcc. Verilog has an open source synthesizer, Yosys, which can come close. It can synthesize designs for ASICs, iCE40, ECP5, Xilinx 7-series, and some Intel designs. In many ways this is halfway to nirvana. Unfortunately, there’s no open source synthesis tool for VHDL, nor is there any open source tool for SystemVerilog, although there is a Yosys plugin, called ghdl-synth, that I’m told is getting close to offering VHDL support in Yosys.

Why not reuse FIFOs?

Fig 2. Surely common components can be reused?

Once you get past the tool issues, the next biggest question is why can’t I reuse some of my most common components? The most obvious of these common components is a FIFO. FIFOs are perhaps the most common core used across designs. I use FIFOs in my bus bridges, my ADC cores, a microphone core I’ve built, my UART cores, and even in my debugging bus. Surely one simple FIFO design can be used across all architectures?

Fig 3. Common FIFO ports

The good news, at least for me, is that after writing many (dissimilar) FIFO implementations, I’m now starting to coalesce around a single synchronous FIFO implementation. Even with this implementation, there are a lot of per-design configuration differences that need to be made.

- The data width changes from one application to the next, as does the necessary FIFO depth (RAM size).
  
  Thankfully, these changes are easily parameterized, making the FIFO (mostly) generic.
- Should the empty/full flags be registered? Do they need to be? It costs extra LUTs to calculate these values one clock earlier, but doing so can also keep any FIFO users off the critical timing path.
- Some FPGAs have distributed RAM, others don’t, something I discuss in my tutorial lesson on FIFOs. On an iCE40, all RAM reads must go directly into a register before the data can be used, whereas Xilinx architectures support “distributed RAM” reads on the same clock cycle they are used.
- Handshake signaling differs from one implementation to another. My current FIFO implementation uses a READY/VALID type of handshake for reading from (i_rd and !o_empty) and writing to (i_wr and !o_full) the FIFO.
  
  The problem is that this interface isn’t necessarily appropriate for all applications. In some data centric applications, such as coming from an A/D or a video source where the data comes in at a fixed speed, the source will write to the FIFO regardless of whether or not the FIFO is ready. Doing this properly really requires generating an error signal, which my one-size-fits-most FIFO implementation doesn’t (yet) have.
- Some applications, such as a UART, require being able to know how much data is in the FIFO. They want to read the FIFO’s fill level back out. This can be useful for waking up a processor only when the FIFO is half full or half empty, for example, or reading until it is empty following an interrupt. Other applications don’t care about the fill. Leaving a port unused and dangling, however, is likely to cause a tool warning and get in the way of building a warning-less design.
- Other applications, such as stream to memory bridges, might want a trigger threshold implemented within the FIFO. Such a trigger, in the case of a stream to memory component, might cause the FIFO to empty into memory like a flushing toilet empties the tank into the bowl.
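
To make the configuration points in the list above concrete, here is a minimal sketch of a parameterized synchronous FIFO. It is not my actual implementation. Only i_wr, i_rd, o_full and o_empty come from the description above; every other name (BW, LGFLEN, i_data, o_data, o_fill, and so on) is a placeholder chosen for this example. The sketch also assumes unregistered flags and a same-cycle (distributed RAM style) read, so per the discussion above it would need a registered read before it could map onto an iCE40 block RAM.

```verilog
`default_nettype none

module sfifo_sketch #(
		parameter	BW = 8,		// data width (placeholder name)
		parameter	LGFLEN = 4	// log2 of the depth (placeholder name)
	) (
		input	wire			i_clk, i_reset,
		// Write side: a write takes place when (i_wr && !o_full)
		input	wire			i_wr,
		input	wire	[BW-1:0]	i_data,
		output	wire			o_full,
		// Read side: a read takes place when (i_rd && !o_empty)
		input	wire			i_rd,
		output	wire	[BW-1:0]	o_data,
		output	wire			o_empty,
		// Number of items currently held, for users that want it
		output	wire	[LGFLEN:0]	o_fill
	);

	reg	[BW-1:0]	mem	[0:(1<<LGFLEN)-1];
	// One extra pointer bit tells a full FIFO apart from an empty one
	reg	[LGFLEN:0]	wr_addr, rd_addr;
	wire			w_wr, w_rd;

	assign	w_wr = i_wr && !o_full;
	assign	w_rd = i_rd && !o_empty;

	initial	wr_addr = 0;
	always @(posedge i_clk)
	if (i_reset)
		wr_addr <= 0;
	else if (w_wr)
		wr_addr <= wr_addr + 1;

	always @(posedge i_clk)
	if (w_wr)
		mem[wr_addr[LGFLEN-1:0]] <= i_data;

	initial	rd_addr = 0;
	always @(posedge i_clk)
	if (i_reset)
		rd_addr <= 0;
	else if (w_rd)
		rd_addr <= rd_addr + 1;

	// Unregistered flags: cheaper in logic, but they place the pointer
	// subtraction on the timing path of whatever uses them
	assign	o_fill  = wr_addr - rd_addr;
	assign	o_empty = (o_fill == 0);
	assign	o_full  = o_fill[LGFLEN];

	// Same-cycle read, as distributed RAM allows
	assign	o_data = mem[rd_addr[LGFLEN-1:0]];
endmodule
```

Even in this small form the trade-offs show up: a user that never looks at o_fill leaves a dangling port, and a fixed-rate source that ignores o_full silently drops data because there is no error output.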

Can one FIFO work in all applications? I haven’t managed to do it (yet). In addition to reuse, there is something to be said for keeping things simple. Of course, the problem then comes about when I fix a bug in one FIFO that then still remains in one of my other implementations.

Xilinx’s solution appears to be to use a FIFO generator that will then generate the logic for a FIFO that can be used across many Xilinx hardware platforms. At the same time, this (proprietary) FIFO generator has given me no end of hassles when trying to formally verify what little they have published about their interconnect. Tell me, for example, why does a FIFO require nearly 100 parameters and just as many ports? Hence, while configurability in the name of reuse is a good thing, this generator appears to be taking configuration to an extreme.

Can we reuse serial ports?

Fig 4. A common serial port interface

So let’s move up the ladder, from FIFOs to full level design components. How about serial ports? What can we learn about hardware reuse from serial ports?

A fellow open source designer, Olof Kindgren, is known for his strong opinion that we should stop building new serial ports. Surely among all design components serial ports should be prime candidates for reuse! The communications standard hasn’t changed in years, so why ever build a new serial port?

To put it in his own words,

I use the UART as a pathological example because it’s a function so simple that many people feel it’s easier to write a new rather than reuse an existing one. But in practice this leads to another implementation with bugs but without proper docs, tests, and drivers.

(Twitter)

There are a lot of things you can learn from serial ports.

Building a serial port is a good beginner’s design exercise.

If you’ve never built a serial port before, go ahead and build one. It’s a fun design to learn from, especially since you can typically “see” your design working when you are done. Indeed, serial ports are one of the many designs I work through in my beginners tutorial.

The UART16550 interface has long since outlived its time.

The classic serial port interface goes back to the UART16550 chip built by National Semiconductor. It seems that much of the industry has standardized around its software interface. It’s not hard to find software drivers that can communicate with this interface, so why not just reuse it?

Sadly, this chip appears to have been built back in the days of 8-bit buses. In order to set the baud rate of this chip, you need to set two different registers, and you’ll need to adjust a paging register in the meantime just to get access to those other registers.

Worse, the UART16550 only supports a 16-element FIFO. Why not increase the size of the FIFO? That should be easy, right? Well, yes, it is fairly easy to do. It’s just that you now need to adjust all of the software that depends on the size of this FIFO.

From my own perspective, I only came across the UART16550 after building my own serial port core. Using my own serial port, I can completely configure baud rate, number of stop bits, number of bits per byte, the parity bit, and even whether or not flow control will be used by just writing one 32-bit value to a 32-bit bus-based interface.

Pretty cool, huh?

Sure, you could reuse the older core, but it’d be easier to configure, reconfigure, and use with a more modern interface. (Such as my own …)

Of course, it doesn’t help that the open source UART16550 core has a (formal-verification discovered) bug within it that might cause it to send arbitrary data across the channel ….
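
To illustrate what configuring a serial port with one 32-bit write might look like, here is a sketch of a decoder for such a setup word. The bit positions below are an assumption invented for this example, not a statement of how my core (or any other) lays out its register, and all of the port names are placeholders.

```verilog
`default_nettype none

// Hypothetical field layout, chosen so that writing only the baud
// divisor (all upper bits zero) selects 8N1 without flow control:
//   [23:0]  clocks per baud interval
//   [24]    parity is odd (when enabled and not fixed)
//   [25]    parity bit is fixed (mark/space)
//   [26]    parity enabled
//   [27]    two stop bits instead of one
//   [29:28] eight minus the number of data bits
//   [30]    use hardware flow control
//   [31]    unused
module uart_setup_sketch (
		input	wire	[31:0]	i_setup,
		output	wire	[23:0]	o_clocks_per_baud,
		output	wire	[3:0]	o_data_bits,	// 5, 6, 7, or 8
		output	wire		o_two_stop_bits,
		output	wire		o_parity_enabled,
		output	wire		o_parity_fixed,
		output	wire		o_parity_odd,
		output	wire		o_flow_control
	);

	assign	o_clocks_per_baud = i_setup[23:0];
	assign	o_parity_odd      = i_setup[24];
	assign	o_parity_fixed    = i_setup[25];
	assign	o_parity_enabled  = i_setup[26];
	assign	o_two_stop_bits   = i_setup[27];
	assign	o_data_bits       = 4'd8 - { 2'b00, i_setup[29:28] };
	assign	o_flow_control    = i_setup[30];

	// Same unused-signal idiom as shown earlier in this article
	// Verilator lint_off UNUSED
	wire	unused;
	assign	unused = &{ 1'b0, i_setup[31] };
	// Verilator lint_on UNUSED
endmodule
```

Compare this with setting two divisor registers through a paging bit over an 8-bit bus, and the appeal of a wider, flatter interface should be clear.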

Fig 5. How much does a serial port require?

My first UART cores (TX, and RX)

As I said above, every digital designer should take the opportunity to build their own serial port. It’s a fun project. It’s also one of the first projects I ever did.

As with many projects, I started with all the material I could find online about serial ports. I discovered all the things a serial port could or should support: 5, 6, 7, or 8-bit bytes, 1 or 2 stop bits, odd, even, mark, space, or no-parity, and baud rates from 300 Baud all the way up to 25MBaud or higher.

Did I implement all that? Yep. You guessed it. My first serial port was such an awesome design, it could do anything! It even supported sending or detecting BREAK conditions.

It just wouldn’t fit in a Spartan 6/LX4 next to my CPU. Neither did I ever use the parity, the BREAK conditions, the 5 or 6 bit bytes, changing the baud rate, etc.

This awesome design wasn’t very reuse friendly. It “cost” too much.

Building a UART-Lite core

As a result, I now support UART-Lite cores: a transmitter and receiver pair that will only ever use 8 data bits, no parity, and one stop bit, or 8N1 as it is commonly called. These lite cores no longer handle BREAK conditions. They only support a fixed baud rate, predetermined at design build time.
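
As a rough idea of how little logic an 8N1, fixed-baud transmitter needs, here is a minimal sketch. It is not my UART-Lite core: the module name, port names and the CLOCKS_PER_BAUD parameter (with its default of 868, roughly a 100MHz clock divided by 115200 baud) are all assumptions made for the example.

```verilog
`default_nettype none

module txuartlite_sketch #(
		// Clocks per baud interval, fixed when the design is built
		parameter [23:0]	CLOCKS_PER_BAUD = 24'd868
	) (
		input	wire		i_clk,
		input	wire		i_wr,		// request to send i_data
		input	wire	[7:0]	i_data,
		output	reg		o_uart_tx,	// serial line, idles high
		output	wire		o_busy		// i_wr is ignored while set
	);

	reg	[23:0]	baud_counter;
	reg	[3:0]	bits_left;	// baud intervals left in this frame
	reg	[8:0]	shift;		// { stop bit, data }, sent LSB first

	assign	o_busy = (bits_left != 0);

	initial	o_uart_tx    = 1'b1;
	initial	bits_left    = 0;
	initial	baud_counter = 0;
	initial	shift        = 9'h1ff;
	always @(posedge i_clk)
	if (!o_busy)
	begin
		if (i_wr)
		begin
			// One start bit, eight data bits, one stop bit
			o_uart_tx    <= 1'b0;
			shift        <= { 1'b1, i_data };
			bits_left    <= 4'd10;
			baud_counter <= CLOCKS_PER_BAUD - 24'd1;
		end
	end else if (baud_counter != 0)
		baud_counter <= baud_counter - 24'd1;
	else begin
		// End of a baud interval: move on to the next bit. Ones
		// are shifted in behind the data, so the line returns to
		// its idle (high) state once the frame is complete.
		o_uart_tx    <= shift[0];
		shift        <= { 1'b1, shift[8:1] };
		bits_left    <= bits_left - 4'd1;
		baud_counter <= CLOCKS_PER_BAUD - 24'd1;
	end
endmodule
```

There is no parity generator, no BREAK logic, no bit-count selection and no run-time baud register, which is exactly where the savings come from.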

So why rebuild the wheel when it comes to serial ports? Because 1) the UART16550 interface hasn’t aged well, and 2) my “ultimate” serial port cost me too much to use.

If you compare these problems to software, aren’t these also problems where one might learn lessons from software reuse? Not really. Unlike hardware bloat, software bloat doesn’t cost nearly as much. Just a kB here, and a kB there, and no one will notice that a piece of software has a lot of unnecessary functionality in it. The fact that Internet Explorer was declared to be an integral part of the Windows operating system should prove my point about software bloat.

What about Olof’s advice? In hindsight, he has a strong point. Several latent bugs existed in the core prior to formal verification. Despite the fact that the full service core had so much functionality, barely any of it was properly verified prior to that time. Further, the software driver had to be rewritten multiple times over. Still, the core components have been used over and over again in many projects with great success.

Reusing an SD-Card component

Fig 6. Reusing an SD-Card Controller

What about other components? For example, what about SD-cards? Why can’t we reuse SD card controllers from one design to another? Can reuse finally be achieved here?

Let’s start by looking over this SD Card controller found on github, but which traces its roots back to OpenCores. This is a nice, full featured driver–but also one that I’ve never used.

Why not?
- While this core implements the full SDIO interface, it also contains DMA’s and other items I don’t need in a resource-constrained FPGA design. If my design already has a DMA within it, why add another one (or two, or three) that I won’t be using regularly?
- I like to use the Wishbone pipelined bus standard, not the Wishbone classic standard. Compared with Wishbone classic, Wishbone pipelined is faster, and it scales better up to higher clock rates.
- This full SDIO controller requires 2.8k Xilinx 6-LUTs, nearly 10% of a basic Artix-7/35T design. On an iCE40 with 4-LUTs, this would use 4k of the 8k LUTs available, a full 50% of the area, versus the simpler 601 LUT design discussed below.
  
  As I mentioned earlier, when discussing my serial port, people don’t necessarily notice software bloat whereas hardware bloat costs money.

How about building and reusing my own SD card controller then? Surely I might be able to use this across multiple designs?

This, again, is a good and bad reuse story.

Yes, I built an SD card controller based upon the SPI interface supported by most cards. You can see it pictured above in Fig. 6. I also built an SD card emulator to go with it, one that worked with a backing file in a Verilator based simulation context. The controller worked nicely in simulation. I was even able to read and write sectors when commanding it from my debugging bus in hardware.

What more could one want for reuse?

In this case, when I finally went to use the core as part of a design for a contract, I discovered several bugs. First, I missed part of the spec, and then needed to retrofit the design in order to provide the SD-card with a startup clock, even before lowering CS# for its first command. My emulator didn’t (initially) require this. Next, the low-level clock generator wasn’t generating a clock output of a constant width. Worse, the core had a bug where it just couldn’t handle high speed bus transactions, such as transactions from or with the DMA, even though these were transactions it was specifically designed to handle.

What happened?

Under the hood, what happened was that it took two clocks to read a value: one to generate the address, and a second clock to read from that (new) address. High speed reads only provided one clock per read transaction, and so they read the first word of any sector twice. Anything slower wouldn’t notice the problem.

Because I had only tested this design with constant valued sectors, or perhaps because I’d never stared hard enough at the results of reading or writing sectors, I’d never noticed this bug. (I’m still scratching my head, to be honest, wondering how I could’ve missed this one …)

While the core has since been fixed, this story does a good job of illustrating some of the problems with reuse: Verification is the expensive part of any design process, and how shall you know that a design has been properly verified for your usage environment?

Even better, the core (now) uses only about 601 4-LUTs on an iCE40, so it’s much cheaper to use than the OpenCores-based 4k LUT core.

This illustrates another big problem with reuse: Just because a design “works” in one bus/interconnect environment, doesn’t mean it will work in practice in your environment. This leaves the individual reusing the core with the unenviable task of needing to debug his own design enough to convince the author of any subcore within it that a bug remains within the component, rather than within the context in which it was written.

Doesn’t software also have the same problem? I suppose you might argue that it does. The difference, however, is the difficulty associated with debugging “broken” hardware components. Debugging software is fairly easy. Debugging hardware, that’s much harder.

The good news is that by using a formal property file, you can verify that a core will function in all bus interconnect and usage environments, something you don’t get from either a bench test or an integrated simulation environment.
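
To give a flavor of what such a property file contains, here is a heavily simplified sketch for a Wishbone pipelined slave, written in the immediate assert/assume style accepted by Yosys-based formal tools. It is not one of my property files. The signal names are assumptions, the four-bit counter is arbitrary, and the sketch assumes a slave that needs at least one clock to respond.

```verilog
`default_nettype none

module fwb_slave_sketch (
		input	wire	i_clk, i_reset,
		input	wire	i_wb_cyc, i_wb_stb,	// from the bus master
		input	wire	o_wb_stall, o_wb_ack	// from the slave
	);

`ifdef	FORMAL
	reg		f_past_valid;
	reg	[3:0]	f_outstanding;	// requests accepted, not yet acked

	initial	f_past_valid = 1'b0;
	always @(posedge i_clk)
		f_past_valid <= 1'b1;

	initial	f_outstanding = 0;
	always @(posedge i_clk)
	if (i_reset || !i_wb_cyc)
		f_outstanding <= 0;
	else case({ (i_wb_stb && !o_wb_stall), o_wb_ack })
	2'b10: f_outstanding <= f_outstanding + 1;
	2'b01: f_outstanding <= f_outstanding - 1;
	default: f_outstanding <= f_outstanding;
	endcase

	// Assumption: a request is only ever made within a bus cycle
	always @(*)
	if (i_wb_stb)
		assume(i_wb_cyc);

	// Assumption: a stalled request is held for as long as the
	// bus cycle continues
	always @(posedge i_clk)
	if (f_past_valid && !$past(i_reset)
			&& $past(i_wb_stb && o_wb_stall) && i_wb_cyc)
		assume(i_wb_stb);

	// Assumption: the master won't overflow the counter above
	always @(*)
	if (f_outstanding == 4'hf)
		assume(!i_wb_stb);

	// Assertion: the slave never acknowledges a request that it
	// was never given
	always @(*)
	if (i_wb_cyc && o_wb_ack)
		assert(f_outstanding > 0);
`endif
endmodule
```

The assumptions describe every legal master at once, rather than the one master a test bench happens to model, which is why a proof against them says something about environments the core has never been simulated in.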

Reusing an I2C controller

Fig 7. Reusing an I2C Controller

Some time ago, I built an I2C master and separate slave controller. They were built to support an HDMI based pass-through design, and so one controller would read the EDID information from a downstream monitor, and then that information would be used to populate the EDID information used by the upstream HDMI source–in this case a Raspberry Pi.

Did the design work? Beautifully. No, it wasn’t automatic, but it was still quite general purpose. (It required a ZipCPU program to forward the information from the EDID master to the slave.)

Recently, however, someone gave me an I2C chip to work with that doesn’t follow the single byte address, multi-byte data protocol. Try as I might, I can’t seem to figure out any way to control this new device with my older I2C controller.

Why not reuse? Because even though the lower level protocol remained the same, the upper level protocol changed and the cores that I might’ve used combined the two protocol layers.

Reuse at the interconnect level

Connecting components like a serial port and/or an I2C controller together within a design tends to require some sort of glueware–an interconnect–that holds the components together while allowing them to talk to each other. Many modern designs are composed of some kind of system level bus, or even a hierarchical bus structure, that connects many components together. Components to be connected include bus masters–those that want to drive an interaction, bus bridges, and bus slaves–those that actually perform some resulting action.

Fig 8. Can the interconnect be reused?

This may be the one level at which I have seen the least reuse between designs crossing multiple vendors. There just aren’t that many well-known interconnect generators that will work cross platform.

What keeps interconnects from being reused?

First and foremost, an interconnect is often tailor-made on a design by design basis.

One design might have one master and twenty slaves. Another design might have two masters and four slaves. Slave address regions might change, bus data widths change from design to design, etc. Worse, it takes a lot of work to connect all these masters and slaves to an interconnect that will allow each of the masters to talk to each of the slaves.

It is for this reason that the solutions I have seen typically involve code generators–something that can “automatically” connect various components together. I’m going to call such a core generator in this context an interconnect generator.

AutoFPGA was designed from its inception to be such an interconnect generator.

Connectivity is further complicated by the number of bus standards. Perhaps you are familiar with some of these: AXI, AXI-Lite, Wishbone, Avalon, AHB, APB, or even TileLink?

A design integrator, that is someone composing their design from multiple cores they intend to reuse, just wants to be able to “plug” things together and immediately “play” with them.

Moving from one protocol to another requires bus bridges. Will the interconnect generator insert these automatically? If so, what will the performance cost be? In one example I worked with, I ended up inserting so many bus bridges that I could no longer maintain the I/O speed I had promised my customer.

The worst part of bridging between bus standards is that not all functionality can be bridged. For example, although I have an AXI to Wishbone bus bridge, consisting of both read and write component bridges, the AXI AxLOCK functionality doesn’t bridge from one side to the other very well. Neither do the AxCACHE, AxPROT, AxSIZE, etc, have clear analogs in Wishbone.

What about bus widths? DDR3 SDRAM works best with a bus that is multiples of 64-bits wide. The UART16550 core I mentioned above wants an 8-bit bus width. Modern CPU’s naturally want to interact at the width of their register size–that’s 32-bits for the ZipCPU. Reuse requires being able to bridge across these multiple bus sizes, something that doesn’t always work across all peripherals and bus standards. In particular, what happens to a peripheral that performs a side-effect on reads, as for example a serial port might, when you try to read it from a wider bus standard?

Of course, I haven’t mentioned the problem of getting the bus to cross clock domains as part of the interconnect–something which leads to a whole new can of worms. As an example, you are very likely going to need to be able to cross clock domains to support both video and memory at the same time–as you would with any sort of framebuffer implementation.

Neither have I mentioned how the interconnect should handle “optional” bus capabilities. For example, AXI offers USER interface wires: AWUSER, WUSER, ARUSER, BUSER, and RUSER. These tend to have context-defined meanings, which can change from one design to the next. How then should the interconnect handle connecting slaves together that have multiple, dissimilar, definitions of these USER wires? The Wishbone classic tag signals have similar problems.

Performance and verification factor in here as well

To illustrate this, let me point out that Xilinx offers several options when configuring their AXI interconnect. One option turns an N:M crossbar into an N:1:M crossbar: N masters get arbitrated to a single read and write channel (never both), which then gets fed to one of M slaves. Only one of those slaves will ever be addressed at any time, so they all share the same read and write addresses. (Yes, even the read and write addresses are shared across channels–defeating much of the purpose of having separate read and write channels in the first place–but that’s another story.)

Fig 9. Xilinx's Area Optimized N:1:M AXI Crossbar

This is all fine and good until you switch a design component from the N:1:M crossbar interconnect to the full N:M crossbar. Chances are, if you do that, that you’ll discover that your design no longer works. Both Xilinx’s demonstration IP cores and their AXI Ethernet-Lite core would break–if not other Xilinx cores as well.

Fig 10. Whose core do you blame when something goes wrong?

Tell me, what would you do? If you reconfigured your crossbar and suddenly your design stopped working, where would you look for the bug? Would you try to find a bug in Xilinx’s interconnect? That’s where I would look! Worse, I’d get all frustrated that their crossbar was closed source, and then likely blame them for the bug–even if it was in one of my own design components! This is what you’ll suffer from when your own core can’t handle backpressure properly.

Did I mention changing standards? Some of the earlier ARM based SOCs, Zynqs included, supported only AXI3–even though most FPGA designs today use AXI4.

It doesn’t help that vendor based interconnects can’t be simulated with 3rd party tools like Verilator, or verified with things like SymbiYosys, simply because the designs are proprietary.

As I mentioned above, this proprietary nature of most interconnect generators just hides the bugs within them, and obscures any bugs hidden elsewhere in the design.

This may be perhaps the biggest place where good open source based reuse might improve designs.

Reusing the ZipCPU

Let’s now turn our attention to a place where all the stars should align to make reuse easy: within IP cores generated by a single company, owned by a single entity, and all using the same bus standard.

In this ideal environment, reuse should be easy. Right?

The ZipCPU (now) supports many hardware architectures. It has been built on iCE40s, ECP5s, Intel MAX-10s, Xilinx Spartan 6s and Artix 7s.
The automated bus interconnect generator that I now use has also demonstrated its ability to compose designs across architectures.

So have I managed to achieve reuse nirvana then? Let’s take a look at several ZipCPU designs and see what might be learned from reusing the ZipCPU across multiple designs.

Designs require resources. The ZipCPU is no different. It requires both LUTs for logic and one (or more) on-chip RAMs for the register file. Not all FPGAs have the LUTs required to build a full-featured design.

No, the ZipCPU will not fit on a 1k iCE40 FPGA. Sorry.

Perhaps a better example of this is the ZipCPU design I built for the CMod S6. This design was based around a Spartan 6/LX4, and demonstrated an ability to run a multitasking “O/S”–or at least that’s what I’m going to call it. (It did do multitasking, but whether it was an actual “O/S” that fit into less than 16kB of RAM is really another question for another day.)

Just getting the ZipCPU to fit on the LX4 meant that I needed to adjust the CPU–removing “features” that cost too much. I removed the cache, pipelining, the LOCK instruction, and I used a really cheap load/store memory solution. I even scoured all my peripherals for lower logic alternatives and dumped my debugging bus.

In the end, the design fit with less than 10 LUTs to spare–depending upon the build. Sometimes it fit with no LUTs to spare.

This has since left me with a problem: ISE is a pain to use. I don’t often open it up. This means that the CPU, once ported to this platform, doesn’t often get the updates also ported to the platform. This particular design is then well out of date. Sure, I’ve reused the CPU. However, I’m not convinced that the current version of the ZipCPU will still fit on this device, and I’m fairly certain that the version that does fit has bugs that have since been “fixed” in the master branch.

My XuLA2 LX25 design is another interesting story. Years ago, I bought one of these boards from Xess.com. I really liked it too, although it’s now a bit dated. I even built a ZipCPU based SoC for this board.

Then someone wanted to reuse this design on their own XuLA2 LX board, only they purchased a XuLA2 LX9 board and not the LX25 board.

Did the design fit on their board? Nope!

Did this user understand why not? Not at all.

Size matters.

Then there’s the reuse challenge associated with clock speed. The ZipCPU was originally built on and for a Basys3 containing an Artix 7. On that device, it runs comfortably at 100MHz.

It doesn’t run at 100MHz on other devices. It only gets about 80MHz or so on a Spartan 6, 50MHz or so on an iCE40 HX, and 25MHz or so on an iCE40 LX. The only good news here is that one user reported running it on a Kintex 7 at over 140MHz if I recall correctly.

This, however, oversimplifies reality.

The ZipCPU on an Artix-7 with instruction and data caches can run at nearly one instruction per clock period, and so the 100MHz number is fairly accurate.

The ZipCPU performance on the Spartan 6 LX4 within the CMod S6? That didn’t run nearly as fast. Sure, it runs at an 80MHz clock speed. The problem is that it might take 20+ clocks to read a single instruction from flash memory. Since there’s barely any block RAM, and certainly no caches, the same CPU can run at best at about 4MIPS: 80MHz divided by 20 clocks per instruction.

As a result, a CPU that manages to meet a real-time requirement on an Artix-7 35T might get nowhere close on a lesser architecture.
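
The arithmetic behind those two numbers is simple enough to write down. The following is only a rough model: it assumes throughput is the clock rate divided by the average number of clocks per instruction, and the function name `mips` is made up for this illustration.

```python
def mips(clock_hz, clocks_per_instruction):
    """Rough throughput estimate, in millions of instructions per second."""
    return clock_hz / clocks_per_instruction / 1e6


# Artix-7 with caches: close to one instruction per clock at 100MHz
artix7 = mips(100e6, 1)

# CMod S6: 80MHz clock, but roughly 20 clocks to fetch each instruction from flash
cmod_s6 = mips(80e6, 20)

print(f"Artix-7: {artix7:.0f} MIPS, CMod S6: {cmod_s6:.0f} MIPS")
```

The same CPU, at a clock only 20% slower, ends up more than an order of magnitude slower once the memory system is taken into account.
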
The ZipCPU requires RAM. Even if you turn the caches off, the ZipCPU still uses RAM for its register file.

On most Xilinx devices, this isn’t a problem.

Where this becomes a problem is on an iCE40, since the iCE40 requires that all RAM outputs be registered. Reuse? Sure, but I had to rearrange internal details of how the ZipCPU operated just to support this platform.
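
To see why a registered RAM output forces changes elsewhere, here is a small behavioral model. It is a sketch under assumptions, not the ZipCPU's register file: the class name, the 32-entry size, and the read-before-write ordering are all chosen for illustration.

```python
class SyncReadRegfile:
    """Register file whose read data is registered: it shows up one clock late."""

    def __init__(self, nregs=32):       # size is an assumption for this sketch
        self.mem = [0] * nregs
        self.rdata = 0                  # the registered RAM output

    def clock(self, raddr, we=False, waddr=0, wdata=0):
        """Advance one clock edge: latch the read, then commit any write."""
        self.rdata = self.mem[raddr]
        if we:
            self.mem[waddr] = wdata & 0xFFFFFFFF


regs = SyncReadRegfile()
regs.clock(raddr=0, we=True, waddr=3, wdata=0xCAFE)  # write register 3
stale = regs.rdata      # still the old contents of register 0
regs.clock(raddr=3)     # present the read address for register 3 ...
fresh = regs.rdata      # ... and the value is only visible after the edge
print(hex(stale), hex(fresh))
```

Any logic that expected the register value in the same clock as the address now has to wait a cycle, which is the kind of internal rearrangement described above.
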
What about those multiplies? The ZipCPU handles multiplication by inference, as in OUT <= IN1 * IN2. What then happens if the device it is on has no DSP support?

This was first a problem in my Spartan 6 designs. I was unable to get decent timing on a Spartan 6 with the inferred multiply. So I created a parameterized multiplication approach that did polynomial multiplication based around 16x16 bit hard multipliers.
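
The idea of building a wide multiply from 16x16 pieces can be sketched in a few lines. This is a software model of the general technique only, assuming unsigned 32-bit operands; it is not the ZipCPU's implementation, and in hardware the partial products and their sum would typically be spread across pipeline stages.

```python
MASK16 = 0xFFFF


def mul32_from_16x16(a, b):
    """Unsigned 32x32 -> 64-bit product built from four 16x16 multiplies."""
    ah, al = (a >> 16) & MASK16, a & MASK16
    bh, bl = (b >> 16) & MASK16, b & MASK16

    # Each of these products would map onto one 16x16 hard multiplier
    ll = al * bl
    lh = al * bh
    hl = ah * bl
    hh = ah * bh

    # (ah*2^16 + al) * (bh*2^16 + bl), collected by power of 2^16
    return (hh << 32) + ((lh + hl) << 16) + ll


print(hex(mul32_from_16x16(0x12345678, 0x9ABCDEF0)))
```
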

Even this didn’t work on an iCE40 with no built-in DSP support. In that case, I needed to build a shift and add based multiplier. Even this took several rounds of design until I had an implementation that not only worked, but also left enough room behind for non-CPU design components.
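
For reference, the basic shift and add technique looks like the following when modeled in software. This is the textbook unsigned form, assuming one bit of the multiplier is consumed per iteration (one clock in hardware); the several rounds of area optimization mentioned above are not represented here.

```python
def shift_add_multiply(a, b, width=32):
    """Unsigned multiply using only shifts and adds, one bit of b per step."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    for _ in range(width):      # one iteration per clock in a hardware version
        if b & 1:
            product += a        # add the shifted multiplicand when the bit is set
        a <<= 1                 # multiplicand moves up one bit position
        b >>= 1                 # next multiplier bit moves into position
    return product


print(shift_add_multiply(7, 6))
```

The trade is plain: no multiplier hardware at all, in exchange for roughly `width` clocks per multiply.
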
I should also mention my debugging bus here. One of the key components to any design I build is my debugging bus. This allows me to load the ZipCPU externally, to run a ZipCPU debugger, and to interact with peripherals even when the ZipCPU is halted.

This debugging bus didn’t fit on the Spartan 6/LX4 of the CMod S6, and so this is the one design where I wasn’t able to make (much) use of it–although I still used it as part of a special design to load software onto the flash.

My supersized debugging bus implementation, with FIFOs and compression, didn’t fit on the iCE40. The hexbus implementation that we built together on the blog did fit. I had never intended to use this implementation in any of my designs, yet now having built it I’m glad that I did.

With all that said, the debugging bus implementations are really reuse success stories, since they’ve been used across so many of my designs.

Clocks. Yes, clocks. It should come as no surprise that I’ve had to redesign the clocking logic from one architecture to the next. Every vendor provides their own PLLs, they all need to be configured differently, etc.

The good news is that AutoFPGA can handle this nicely.

No, AutoFPGA doesn’t do clock configuration. However, you can use AutoFPGA to propagate clock-based configuration constants across the design. This makes it easy, for example, to adjust the clock rate and have all of the UART or SDRAM constant design parameters adjust themselves automatically.

Isn’t this just as easy as setting a parameter in the top level Verilog file? Not at all! Don’t forget that the emulated serial port needs to be adjusted for the new baud rate, as does the host software. One of the nice things AutoFPGA handles is propagating information across multiple files of dissimilar languages.
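
The underlying idea, one clock rate feeding constants into files of different languages, can be illustrated with a toy generator. To be clear, this is not AutoFPGA's syntax or output format: the function names and the `CLOCKS_PER_BAUD` and `BAUDRATE` identifiers are hypothetical, chosen only to show a single source of truth producing both a Verilog and a C fragment.

```python
def uart_divisor(clock_hz, baud):
    """Clocks per baud interval, rounded to the nearest integer."""
    return (clock_hz + baud // 2) // baud


def emit_constants(clock_hz, baud):
    """Return (verilog, c_header) text derived from the same two numbers."""
    div = uart_divisor(clock_hz, baud)
    verilog = f"localparam CLOCKS_PER_BAUD = {div};"
    c_header = f"#define CLOCKS_PER_BAUD {div}u\n#define BAUDRATE {baud}u"
    return verilog, c_header


verilog, c_header = emit_constants(100_000_000, 115200)
print(verilog)
print(c_header)
```

Change the clock to 80MHz and both fragments change together, which is the property that a hand-edited top level parameter doesn't give you.
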
AutoFPGA has worked wonders for building cross platform interconnects, and I’ve been very pleased with it.

My newer version (still in the dev branch at this time) has support for multiple bus masters and multiple bus types. (The original version required manually placing bus arbiters into a design to whittle the design down to one master, and only ever supported the Wishbone pipeline bus type.) Even better, all of the components are open source–so any design so composed should work nicely with Verilator.

My problem is that this new version of AutoFPGA is somewhat incompatible with the last one. Why? Because the interfaces have changed.

Creating a design with multiple bus masters meant that I needed to rename the bus, giving all of the bus slaves their own individual bus signal wires to work with. This alone is an incompatible change to all of my designs that assumed all bus wires would be prefixed with wb_: wb_cyc, wb_stb, wb_we, etc.

I’ve also struggled with designs where I haven’t set up the bus logic properly. In earlier versions of AutoFPGA, I’d pass wb_stb && (slave_sel) to the design’s Wishbone strobe pin, and then suffered any time I forgot to properly place the slave_sel in that logic. The new version allows you to just say @$(SLAVE.PORTLIST) and it then fills in all the necessary bus connections. This has forced all of my slaves to use the same bus portlist order and format–another incompatible change. (I could also use the ANSI ‘dot’ notation for connecting ports, with just @$(SLAVE.ANSIPORTLIST).)

My point is simply this: after making a simple interface change like this, I’m going to need to go back and re-build and then re-verify all of my designs. This isn’t all that unlike the configuration hell vendors have found themselves within. (But … my design worked with Vivado 2016.1, why doesn’t it work any more with Vivado 2018.2?)

Still, the fact that the ZipCPU has been successfully used across so many architectures is by itself a reuse success story. Nirvana? Perhaps not, but still quite valuable.

Reusing the design across CPUs

Okay, so I’ve now got a design framework I like using. Can it be reused?

Specifically, one customer wanted me to reuse my framework to build a platform containing a RISC-V CPU instead of the ZipCPU. Surely reuse would work here, right?

Let’s see: I owned all the submodule and component designs except for the PicoRV32 CPU I chose to use, so licensing wasn’t a problem. I used AutoFPGA to compose the component cores together, so there was no problem with building the interconnect. I could reuse my CPU loader to load the (flash) memory into a design, so that wouldn’t be a problem. The PicoRV32, like the ZipCPU, was highly configurable so it could be configured to start from the memory address provided by AutoFPGA, it could be configured for the number of interrupts AutoFPGA assigned to it, the design had support for 32x32 bit multiplies, … what could possibly go wrong?

Since it looked so easy, I made a big mistake: I mis-estimated the amount of time the project would take. Since it was all reuse, it should’ve all been easy. Again, what could’ve gone wrong?

Fig 11. Endianness: Which byte of a word is byte zero?

Endianness.

That’s right. The ZipCPU is big-endian, and RISC-V machines are little-endian. While AXI specifically uses a little-endian byte order, Wishbone can be either big- or little-endian.

This left me with the age-old question: how do you fit a square peg in a round hole?

My first thought was that I should add to the PicoRV32 wrapper the logic necessary to swap all the words on the bus.

Then I got to thinking–all of my peripherals depended upon the data bits of the bus, from bits 31 down to bit 0, being in MSB down to LSB order. If I swapped the bytes in any of these control words, then writes to these peripherals would’ve all been provided in { data[7:0], data[15:8], data[23:16], data[31:24] } order.

That would never work.

In the end, I created a new concept: bus endian. Every peripheral that was word addressable would use bus endian order and stay the same. Every peripheral that was byte addressable, rather than word addressable, would get its byte order swapped. This now limited my pain to the flash and the network controllers. A simple parameter, set by the CPU’s AutoFPGA configuration, could then control whether or not the flash or network controller needed to have their endianness swapped.
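
A small model may make the distinction concrete. It is a sketch of the idea as described above, with hypothetical names (`bswap32`, `to_peripheral`, `swap_endian`); the actual parameter and wrapper logic in the design are not shown in this article.

```python
def bswap32(word):
    """Reverse the four bytes of a 32-bit word."""
    return int.from_bytes((word & 0xFFFFFFFF).to_bytes(4, "big"), "little")


def to_peripheral(word, byte_addressable, swap_endian):
    """Bus word as the peripheral sees it under the 'bus endian' rule."""
    if byte_addressable and swap_endian:
        return bswap32(word)    # flash, network: byte order must match the CPU
    return word                 # word-addressable control registers: untouched


# Swapping a control word would move bit 0 up into bit 24
naive = bswap32(0x00000001)

# With a little-endian CPU selected, only byte-addressable slaves get swapped
ctrl = to_peripheral(0x00000001, byte_addressable=False, swap_endian=True)
flash = to_peripheral(0x11223344, byte_addressable=True, swap_endian=True)
print(hex(naive), hex(ctrl), hex(flash))
```
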
There was a second problem as well, and that had to do with the PicoRV32’s bus interface. It was custom, not Wishbone. No, it wasn’t all that hard to convert from the PicoRV32’s custom interface to a Wishbone interface, but the PicoRV32 took a big performance hit in the process.

What? Why?

Well, the PicoRV32 only ever issues one bus request at a time. That request waits on the PicoRV32’s I/O lines until it is fulfilled. Without knowing the next request location, every I/O request becomes independent. Every I/O request then requires a number of clocks defined by the bus latency plus the peripheral’s latency.

At best, the PicoRV32 CPU, capable of 250MHz on a Xilinx device, then ran slower than 40 clocks per instruction. No, this design didn’t support a DDR based flash, and I didn’t manage to get the DDR3 SDRAM support working–hence it took 40 clocks to read each instruction from the flash. Further, because the rest of the design slowed down the system clock, the PicoRV32 could only ever run at 50MHz. Then, because the PicoRV32 didn’t support a pipelined bus by nature, the best speed it could ever achieve was limited to (roughly) 1.25MIPS: 50MHz divided by 40 clocks per instruction.
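
The cost of issuing one request at a time can be seen in an idealized comparison. The model below assumes a fixed latency per request and, for the pipelined case, a slave that accepts a new request on every clock without stalling. Neither assumption necessarily holds for the flash in this design, so treat the pipelined number as a bound rather than a measurement.

```python
def clocks_unpipelined(requests, latency):
    """Each request waits for the previous one to complete."""
    return requests * latency


def clocks_pipelined(requests, latency):
    """Requests are issued back to back; only the first pays the full latency."""
    if requests == 0:
        return 0
    return latency + (requests - 1)


fetches, latency = 8, 40
slow = clocks_unpipelined(fetches, latency)
fast = clocks_pipelined(fetches, latency)

# One instruction per 40 clocks at a 50MHz system clock
mips = 50e6 / 40 / 1e6

print(f"{slow} clocks vs {fast} clocks for {fetches} fetches; {mips} MIPS")
```
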
Finally, I needed to adjust the software ecosystem.

This is both a win and a fail for software reuse. It’s a win, since most of what I did was to copy the example boot loader that came with the PicoRV32, and the example newlib system calls from the ZipCPU. It’s a fail because I still had to edit the two, but in general by the time I got this far things “just worked”.

What’s the lesson here? Did reuse work? Well, yes and no. I did manage to reuse most of the design across both CPUs. I did manage to reuse the bus interconnect framework across both CPUs. No, I wasn’t able to reuse the ZipCPU’s debugger with the PicoRV32–but then again I wasn’t expecting to. That said, it wasn’t all that hard to issue halt or reset commands to the PicoRV32 from the debugging bus interface over the Wishbone bus like I would’ve done with the ZipCPU.

Conclusion

Hardware is not software.

Let me say that again: hardware is not software. What’s easy to do in software can be ten times harder in hardware, where it’s that much harder to “see” your bugs.

What else might we conclude?

There’s a large portion of digital design that isn’t covered by any HDL standard, but that is rather vendor and even device dependent. This includes clocks, PLLs, I/O primitives, sometimes RAM structures, and definitely hardware multiplies.

To be reusable across platforms, you’ll need to take these differences into account.

If that’s not enough, the differences between the “standard” languages the tools accept can also be really annoying.

Don’t expect a design that hasn’t been used across vendor tools before to immediately work when switching tools.

Software bloat costs more memory than anything else, whereas hardware bloat costs actual dollars in terms of scarce hardware resources on an FPGA or area on an ASIC. As a result, hardware designs take more work in order to become reusable across a large variety of needs.

Bus standards are awesome–when they are truly standard. AXI USER or Wishbone tag signals aren’t really standard. Similarly, the bus bridges necessary to cross standards have a cost in both area and performance that can’t always be ignored during reuse.

Making sure bus standards are standard is one of those reasons why I maintain a series of formal bus protocol checkers in my Wishbone to AXI (pipelined) repository: AXI-Lite, Wishbone (pipeline), Wishbone (classic), Avalon, and even APB.

The worst reuse stories, not necessarily those captured above, are reserved for trying to reuse a core that was never formally verified in the first place. It’s in these cases that I most often find myself mis-estimating the time and energy required to get a design “working”, leaving me burning the midnight oil to get a design done by the deadline.

Can reuse happen? Yes, it can.

Do be prepared for all kinds of unexpected issues along the way.