Ever since I started working with FPGAs, I’ve always wanted to do an ASIC design. If nothing else, I wanted to understand from experience what it was like.

Last year, I got my chance. I’m now on my second design, and the team I’m working with has just sent a piece of this second design to be taped out.

The process, however, was different in many ways I wasn't expecting. Below are some of those differences.

Late from the beginning

The first thing that surprised me was the schedule pressure.

Fig 1. Memory Controller IP

The design as a whole is a memory controller, as shown in Fig. 1: one that can be sold independently to customers for placement into a larger SoC design.

This first ASIC component of the design, the piece I needed to build, was actually fairly simple. There's not a lot of logic in it. I personally like to think of this portion of the design as a glorified serializer and deserializer. It takes 8x samples of a signal and serializes them to an output, and then does the same in reverse. Except, there are some subtle differences.
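In very rough terms, and with invented signal names since this isn't the actual design, the serializing half of that idea looks something like this:

reg	[7:0]	shift_reg;
reg	[2:0]	bit_count;
reg		o_serial;

always @(posedge fast_clk)
begin
	if (bit_count == 3'd7)
		shift_reg <= next_word;	// load the next eight samples
	else
		shift_reg <= { shift_reg[6:0], 1'b0 };

	bit_count <= bit_count + 1'b1;
	o_serial  <= shift_reg[7];	// shift out one bit per fast clock, MSB first
end

The real front end is more subtle than this, not least because the lines are bidirectional and run at speeds an FPGA can't reach, but that's the basic shape of it.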

Fig 2. High speed design steps

More specifically, the design is the first half of a two-stage process to build a memory controller. The first stage involves building the high speed portions of the design, as shown in Fig. 2. This is the high risk stage, since it involves building components that cannot be implemented in an FPGA. In some respects, this is a throw-away portion of the design, and so it is important to minimize its cost.

The second stage will handle the protocol and logic associated with the design. This is the more complicated part, and its gate count will be much higher. By verifying this portion of the design within an FPGA, development cost and time can be kept down. Even better, if necessary, the protocol can be debugged and fixed there, something not easily done in an ASIC.

The first stage, therefore, is basically a speed translator. It communicates with slower logic on one side of the chip, and faster logic on the other. Since this is for a memory controller, the high speed lines are bidirectional, and they come accompanied by a clock whenever data is present. Shifting this clock by 90 degrees was a bit of a challenge, as was delaying the clock in order to sample in the middle of a received data bit, but the design was still pretty basic.

I estimated it would take me about 2-4 weeks to build the design.

Much to my surprise, I was late the moment the project started.

Not all of this was my fault. As I understand things, the engineer who was working on this project before me had left mid-project. My job was to pick up where he left off. Unfortunately, that meant there was a bit of management confusion between when the last engineer left and when my contract was signed. It didn't help that I wasn't certain I'd even have the bandwidth for this project when I started, and so I delayed bidding on the project by a couple of months.

Once the contract was signed, however, the project was late.

The project was also late because, after building my portion of it (in 2-4 weeks), I discovered that wasn't nearly enough. Sure, I had formally verified (nearly) every portion of the design, but I couldn't simulate the entire design end to end. The low-speed logic simulation was handed to me in a non-functional state, and I had just changed the interface on top of that. Taken as a whole, this was a non-starter. How was I to know if my new interface was sufficient if I couldn't verify the whole?

So, let’s back up a bit to understand how this started. I was given a design consisting of a protocol portion and a physical layer portion, together with several simulation components–much like Fig. 3 below.

Fig 3. Design as received

I was told a lot of things about it. For example, this was what was left from a previous working design that had been delivered to customers. Since then, that design had been adjusted and modified by a previous engineer, but those modifications had yet to make it into a delivery. I was also told that less than half of the test scripts were passing. Further, the 8b internal interface was too fast for an FPGA to handle, and so I would need to slow it down by parallelizing the data path. So my initial task was simply to upgrade from an 8-bit data path in the digital front end to a 64-bit data path. That much of the task was quite straightforward, and that was the portion that was to go into this ASIC design.

Straightforward? Yes. However, it did send me crawling all over the original design, and I did have to make changes in more places than I was expecting–even in the low-speed logic that was destined to be scrapped as part of my upgrades.

For example, the AXI bus of the design given to me was naturally 32 bits wide. This works nicely with an AXI4-lite controller. A 64-bit datapath, however, meant that either the bus would also need to be made 64 bits wide, or the bus width would become a performance bottleneck.

The next problem was that the simulation I was given didn’t work. Yes, the simulation supported nearly a hundred separate tests, but I was told at the outset that most of them didn’t work.

On top of that, the free Verilog simulator I had access to on my desktop, Icarus Verilog, didn't support the SystemVerilog dot notation for calling tasks. Just getting the simulation to build in Icarus Verilog therefore took some days.

To make matters worse, once I got to the point where I could try one or two of the "verified" tests, that is, once I could run the first of the simulations that were supposed to "just work", they didn't work. Instead, the "verified" tests did the worst thing possible: they hung the simulation. That left me wondering if, or when, I should kill the simulation, or whether it was actually doing something useful while it wasn't printing anything to the screen.

Needless to say, my 2-4 week task took much longer to accomplish than I was anticipating. It is now five months later, and the design is only now taping out.

So much for my time estimates.

Seriously, I'd be a much better engineer, and certainly a much more profitable one, if I could estimate the time to complete a project, from the very beginning, better than I can today.

Gates are cheap, verification is not

Nothing is static in this world, not even design requirements.

It wasn't until after I had built my design that I got a chance to read the original contract between my client and his customer. That contract called for a built-in self-test (BIST) capability. Oops. I hadn't built that in. No problem, I thought: I could add a simple BIST capability with just a couple of extra registers.

First, there was the control wire. The slave access port(s) needed to be adjusted so they could turn on the BIST checker. That part was easy, and I know how to formally verify that a bus register can be properly controlled, so I was good there. The second part was to capture the internal state on a given signal. This was almost as simple as,

always @(posedge clk)
if (reset)
	capture <= 0;
else if (!triggered)
	// Track the internal value until the trigger arrives, then hold
	// whatever was last captured
	capture <= internal_value;

Cool! I was done.

But did it work in simulation? No. I hadn’t built a simulation for it.

Building the simulation took another day, since I needed to check all of the various bits that could be captured above. (The capture signal was wider than a single bit.) It then took another day (or two) to get it all to work.

Fig 4. "Of course" is not a good verification practice

Was the task done? No. Now, every time I change the design, I have to go back and re-verify it against this simulation. Worse, because of the clock games taking place within this design, there were all kinds of timing errors generated by this logic. In the end, I split the capture signal into two, each captured on a separate clock. Even that wasn’t enough, because I only later thought through the fact that eight of the internal values were captured on their own special clocks–but that’s part of the longer story.

That happened to be only part of the BIST story.

For the other half, I suggested monitoring the output of the high speed device through the input channel. The data would just reflect back within the high speed portion of the design onto the input, as shown in Fig. 5 below.

Fig 5. Reflection checking

This, I reckoned, could be implemented with a simple no-logic change in the front end. That is, it was a no-logic change until I actually took the time to simulate it instead of telling everyone it would "just work". Only after I built a simulation for this check did I realize I had turned reflections off to optimize power. Although I fixed up the design to get this to work, there was a second problem I wasn't expecting. Because the design would receive and return its own transmitted data as it was transmitting it, the I/O lines couldn't be shared between transmit and receive. This nearly doubled the number of I/O pins on the ASIC, to the point where the size of the I/O pin pads dominated the size of the ASIC and therefore its manufacturing cost.

Both adjustments required only minimal changes to our high speed ASIC design. The design changes may have taken only 15 minutes each. Building the simulation necessary to prove these changes may have taken closer to a day for each of them. Running all the various simulations now at tape out still takes several days, assuming everything works.

Even these early simulations weren’t the end of the verification task. Once the design was laid out and the internal timing values from within the design were known, it then needed to be verified again, and then again as a second portion of the design was laid out, and again as a third portion, etc. The number of times my “working” logic has had to go through a simulator has been somewhat of a shock to me. More on that to come.

Design for Test (DFT) signals

One of the first things I did with the design I received was to strip out any and all unused logic. This meant both the logic I wasn't using and any logic I couldn't explain. As a result, I quickly removed the TEST_MODE input, as well as the several SCAN_* inputs associated with it, such as the SCAN_CLK.

That worked great until I sent my design to the layout engineer. He told me I had to put these values back into the design to support DFT scan chain insertion.

Our DFT implementation worked off of a basic scan chain. That meant that every flip-flop in the design needed to be connected to a massive shift register running through the whole design. This allows you to test the internal circuitry of the design after it has been manufactured, to verify that it was manufactured correctly.
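Conceptually, scan insertion replaces the input of every flip-flop with a small multiplexer, so that in scan mode the flops form one long shift register. The tools do this automatically after synthesis; a behavioral sketch (with invented names, not anything from this design) of one such flop would be:

always @(posedge clk)
if (scan_enable)
	state <= scan_in;	// shift data along the scan chain
else
	state <= next_state;	// normal functional behavior

assign	scan_out = state;	// feeds the scan_in of the next flop in the chain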

Getting this support working also required some changes to the design itself.

  1. The digital design module needed the DFT inputs listed as part of its port list

    That meant I had to go back and put this logic back into the design in spite of having removed it.

    This felt a bit strange to do. The DFT signals weren’t connected to anything within the RTL, and they generated Verilator lint errors, but they were apparently still necessary.

  2. Every incoming clock must be multiplexed with the SCAN_CLK

    This was the biggest change. I needed to add a clock switch to every incoming clock within my design. If TEST_MODE was high, then the design was required to use the SCAN_CLK. If TEST_MODE was low, then the normal design clock would be used. (There's a sketch of this multiplexer following this list.)

    Unlike the clock switch we studied earlier on the blog, however, this switch was nothing more than a simple multiplexer selecting which of two clocks would be produced at the output.

  3. Every internal clock needed to be multiplexed with the SCAN_CLK. This was in addition to the incoming clocks mentioned above.

    Basically, if you are going to create a logic generated clock, anything that will subsequently drive the clock input of a flip-flop, then the DFT logic needs to be able to toggle that downstream logic with the SCAN_CLK. Every clock within the design, therefore, and not just the ones coming from the PLL, needed a multiplexer on it before it could be used.

    This also means that there will only ever be a single clock throughout the design when in the DFT test mode. This will naturally limit the things the DFT test mode can actually test. In other words, any further testing and verification that might need to be done in silicon was my responsibility.

  4. The rule for clocks applies to asynchronous resets as well

    This one also surprised me. Because my design logic might toggle on an asynchronous reset, it also needed a multiplexer to bypass the reset synchronizer when the test mode was active.
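The clock multiplexer itself, mentioned in items 2 and 3 above, really is as simple as it sounds. A sketch, with illustrative signal names rather than the ones from the actual design, is little more than:

assign	o_clk = (TEST_MODE) ? SCAN_CLK : i_design_clk;

One of these went onto every clock, whether it came from an input pin, the PLL, or my own logic.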

In many ways, this DFT logic looked and acted very much like JTAG logic might, but without the JTAG state machine. There was a long shift register, or even a series of shift registers (i.e. the scan chain), within the design, controlled by this DFT circuitry. I just didn't realize that I, as the digital designer prior to layout, had a role to play in the process.

In other words, the next time I’m given a design like this, I’m not going to immediately rip out the DFT logic as my first step.

Cost is not always measured in gate counts

Manufacturing cost for ASIC production is generally proportional to the area used by the design. More specifically, the cost is dominated by the cost to produce the masks necessary to manufacture a wafer. It’s also possible to place multiple dissimilar chips on a given wafer in order to help spread the cost of manufacturing a wafer across multiple users. That still leaves wafer area as the dominant measure of cost.

Prior to this design, I had always thought that meant that the logic area of the design, as measured in gate counts, would then be roughly proportional to the overall area and would therefore dominate the overall manufacturing cost.

Much to my surprise, I discovered that each I/O pad connecting the design to the outside world required a minimum amount of area. In my case, the design required so many I/O pads that the size of these pads proved to dominate the size of the design. The actual gate area was much smaller.

Even apart from the I/O pad sizes, there was a very large analog section to this high-speed chip. This included things like the PLL and several DLLs, in addition to the circuitry necessary to handle unwanted electrostatic discharge (ESD) or the circuitry required to “clean” (i.e. filter) the power for the analog logic. I had never thought of a design as needing these components before, and to my surprise the digital logic was very small in size in comparison to them.

X Propagation Matters

Ok, I'll be honest here: I've never used 'x propagation in any of my designs prior to these ASIC projects. My favorite simulator, Verilator, doesn't support 'x values at all, as a deliberate design decision. SymbiYosys, the formal verification tool I use, always assigns a '1 or a '0 to every value in a broken trace, and checks all possible values for anything that isn't given an initial value, so I haven't needed 'x support.

Then a customer complained that my first ASIC design didn't work in their simulation. I traced the complaint down to a few problems.

  1. The always @(*) block.

    I've enjoyed using always @(*) blocks for any combinational logic. In particular, I've found them convenient when dealing with generate blocks like the one below

reg	[N-1:0]	VAL;

generate if (OPT_DESIGN_OPTION)
begin

	always @(posedge clk)
	begin
		// Some complex block setting VAL
	end

end else begin

	always @(*)
		VAL = CONSTANT;

end endgenerate

I like this approach because I don't need to create a second wire to hold the value in VAL.

The problem with this approach is that nothing ever triggers the always @(*) block. It's not a problem in hardware: VAL is given CONSTANT as its definition. It's only a problem in simulation. In simulation, VAL is not given an initial value, and the always @(*) block is never triggered because there's nothing in its inferred sensitivity list to change. As a result, VAL remains 'x (undefined) in the simulation.

The SystemVerilog specification fixes this issue somewhat in its definition of always_comb. However, other than localparams, I’ve tried to avoid SystemVerilog features so that I can maintain compatibility with the older parsers that are out there.

Fixing this forced me to adjust my personal design standards so that VAL would be defined as a wire (a.k.a. a net) in these constructs. This also meant that I would now need to define a separate register, let’s call it r_VAL, which the former logic sets. In the end, the wire is then assigned the resulting value either way.

wire	[N-1:0]	VAL;

generate if (OPT_DESIGN_OPTION)
begin
	reg	[N-1:0]	r_VAL;

	always @(posedge clk)
	begin
		// Some complex block setting r_VAL
	end

	assign	VAL = r_VAL;

end else begin

	assign	VAL = CONSTANT;

end endgenerate

Personally I find this cumbersome. However, it’s now going to become part of my personal coding standard lest I come across this bug again. Indeed, there’s now a version of the ZipCPU following this new coding guideline as well.

  2. Recursive definitions

    This one burns me up. Imagine you have a clock divider, such as

	always @(posedge clk)
		div_clk <= !div_clk;

In this example, div_clk isn't given an initial value because, well, initial values aren't allowed in ASICs. As long as the hardware can fix the value at either 0 or 1, this clock divider will do the right thing. Even better, I can use formal tools to verify that this simple circuit will do the right thing either way it's set. The problem is not that the hardware won't work; the problem is that the simulator won't work with something like this. div_clk will be given an initial assignment of 'x, and anything that depends upon it will then get an 'x value.

The result of all of this is that I’ve found myself forcing signals to be reset that don’t really need to be reset at all.
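For example, the divider above ends up looking something like the sketch below. The asynchronous reset isn't needed for the hardware to do the right thing; it's there purely so that div_clk never starts out as 'x in simulation. (areset is an invented name here.)

always @(posedge clk or posedge areset)
if (areset)
	div_clk <= 1'b0;
else
	div_clk <= !div_clk;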

  3. The asynchronous reset applies to everything

    Up until now, I’ve used initial statements in my designs quite heavily. They work great in FPGA designs. They just don’t work at all in ASIC designs.

    Worse, because of 'x propagation issues, any bit that isn't set by the asynchronous reset gets flagged as an 'x and sticks out like a sore thumb on any simulation trace.

    I’ve also avoided asynchronous resets in the past, based upon a comment in some Xilinx documentation suggesting that RF interference might trigger an accidental asynchronous reset. (I’ve since been asked by a Xilinx designer to find the document, and … I can’t remember where I found it initially. They claim asynchronous resets should work just fine.)

    Not so with this design. In this case, every flop was initialized with an asynchronous reset. In some cases, the asynchronous reset would be active long before the clock ever was.

    This also affected my formal proofs. My first attempt at proofs without initial statements involved simply not evaluating any assertions on the first clock cycle of the proof. Now I'm getting into the habit of gating all of my formal assertions with a reset check, to make sure the logic works once the reset has completed.

    This also means that my AXI bus property sets now have options for asynchronous reset checking. If this option is turned on, the AXI property sets will now insist all VALID flags go to zero on the same clock as the reset, in addition to the clock following.
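A minimal sketch of what such a check might look like, with invented signal names, is shown below. f_past_valid is just a formal-only helper marking when $past() becomes meaningful.

reg	f_past_valid = 1'b0;
always @(posedge ACLK)
	f_past_valid <= 1'b1;

always @(posedge ACLK)
if (!ARESETN || (f_past_valid && !$past(ARESETN)))
	assert(!AWVALID && !WVALID && !ARVALID && !BVALID && !RVALID);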

Perhaps I shouldn't complain. As I mentioned above, logic is cheap. Once I knew what was going on, these fixes took only minutes to make. It's not the logic that costs so much; it's the verification that's much harder.

ASIC Clocks are … different

In FPGAs, there are rules for clocks. One of those rules is that logic generated clocks are bad. In general, FPGA tools can't handle the timing analysis of logic generated clocks, the logic generated clock doesn't have the same timing relationship as the clock it came from, and it isn't placed automatically on the clock routing network. These are all reasons why logic generated clocks are to be avoided in an FPGA.

These rules don’t necessarily apply to ASIC designs.

  1. ASICs use logic generated clocks

    ASIC designs are different. Indeed, once you dig into the weeds of an ASIC, you might start to believe that all clocks are logic generated. You would be right to some extent, because even the PLLs have some amount of logic within them.

    Unlike FPGAs, ASICs don’t come with a set of dedicated clock routing networks. Instead, the clock trees used within ASIC designs have to be engineered and inserted into the design for each clock that uses them.

  2. ASIC designs are known for gating their clocks

    This technique is primarily used for reducing power within a design.

    I grew up in Minnesota, not far from Lock and Dam Number One on the Mississippi River. That lock and dam has since become the imagery I use to understand power usage within an electronic circuit. Imagine, if you will, that every wire within an electronic design is a lock on a river that can either hold water (i.e. charge) or not. Energy is used every time the lock is filled, and it is measured by the amount of water necessary to fill the lock.

Fig 6. A Lock and Dam analogy to electrical power usage

The higher the water level is, that is the higher the core voltage is within a design, the more water that will be necessary to fill the lock. Similarly, the longer the lock is whose water level (a.k.a voltage) needs to be adjusted, the more water it will take to adjust it.

Clock trees are equivalent to very long lock chambers throughout the design that all need to be filled. It takes a lot of current to switch the tree from one voltage level to another, and the more times the clock toggles the more power is used by the clock tree.

This leads to three ways of reducing power in a design. You can lower the core voltage. This is equivalent to lowering the height of the water in the lock. You can also lower the frequency. This is equivalent to raising and lowering the lock fewer times, and so using less water over time. You could also limit the number of flip-flops that toggle based upon the clock, although that doesn't fit our analogy nearly as well. Finally, it doesn't make sense to adjust the water level in a lock that nothing depends upon.
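In circuit terms, these are the knobs in the familiar dynamic power relationship, P ≈ α·C·V²·f: the activity factor α (how often nodes actually toggle), the switched capacitance C, the core voltage V (squared, which is why voltage is the biggest lever), and the clock frequency f.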

This is how gating a clock can reduce the power usage of a design. Because clock trees tend to have a large amount of circuitry dependent upon them, and because they use a lot of area within the chip, toggling the clock costs a lot of energy.

Although you can do this within an FPGA, the technique isn't commonly used there. FPGAs offer alternatives to clock gating instead, such as the clock enables built into their flip-flops.

One of the things I learned in this design, which I really should have already known, is that gating a clock with a simple AND gate is insufficient when working with digital logic. While it might be appropriate when implementing a DFT scan chain, it's highly inappropriate in general. A clock gate requires a proper gating circuit, lest the clock and everything dependent upon it become 'x in simulation, or fail setup and hold timing requirements in actual hardware, leading to a mismatch between simulation and implementation.

This fact came into play because we were running simulations at multiple clock rates. The simulations would start out with a slower clock, and then gradually increase the clock frequency to the maximum frequency required by the device. Further, while the clock frequency was changing in the PLL, the PLL would gate the downstream clocks with an AND gate. Sure enough, in one particular run, this AND gate clipped the clock at something less than a full pulse width. One flip-flop dependent upon this clock switch, a flip-flop used to generate a logic clock for downstream processing, then turned into an 'x. The entire simulation failed from that point forward.
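The "proper gating circuit" I mentioned above is the classic latch-based clock gate, which in an ASIC flow would normally come from the standard cell library as a dedicated clock-gating cell. A behavioral sketch of the idea, with invented names, is:

reg	enable_latch;

always @(i_clk or i_enable)
if (!i_clk)
	enable_latch = i_enable;	// the enable may only change while the clock is low

assign	o_gated_clk = i_clk && enable_latch;

Because the enable can only change while the clock is low, the gated clock can never be clipped into a partial pulse the way the bare AND gate's output was.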

  3. Clocks can be switched

    One of the tasks required of this design was to subtly adjust the timing of particular signals within a clock’s width. Indeed, we were shooting for an 80ps time delay adjustment capability similar to Xilinx’s ODELAY or IDELAY hardware blocks. We achieved some of what we were looking for by moving data signals from one phase of the primary clock to another.

    A giant clock multiplexer was used for selecting from among the many clock phases necessary for this operation.

All of these operations were fairly easy to design and implement in Verilog. For several months, the design with these wonderful blocks in it was awesome.

Then the design was implemented, placed, and routed. All of a sudden, the consequences of these various clock choices started to make themselves clear. The balanced clock multiplexer took nearly a fifth of a clock cycle to select the right clock. The output DDR element had one path delay on one leg and a different delay on the other. Indeed, uncontrolled layout timing delays on the order of 256ps made it very difficult to control the output delays to better than 80ps accuracy. These realities forced some level of last-minute redesign that I wasn't expecting.

Specify Blocks

I remember once being in a design meeting with the engineers who built SymbiYosys, discussing specify blocks. One of their customers had requested that SymbiYosys support them, and the discussion centered on whether or not SymbiYosys should support them, and how they should be supported if at all. At the time, I had no idea what a specify block was. As a result, it was no loss to any project I was working on if we didn't support them.

Then I started working on this project.

The model I was given for the device we were to interact with had multiple specify blocks within it. These specified such things as the setup time required before a clock tick, the time a value needed to hold constant following a clock edge, or even the allowable skew between particular signals. Much to my pleasant surprise, these timing specifications read like a formalization of many data sheets I'd read before. Indeed, all of these timing requirements could be read and translated directly from the specification we were working with.
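For those who, like me, had never seen one, a specify block reads something like the following. (The numbers and signal names here are invented for illustration, not taken from the device model.)

specify
	// The data must be stable for at least 250 time units before the strobe ...
	$setup(DQ, posedge DQS, 250);
	// ... and must then hold for at least 150 time units after it
	$hold(posedge DQS, DQ, 150);
	// Two related strobes may not be skewed by more than 300 time units
	$skew(posedge DQS_t, posedge DQS_c, 300);
endspecify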

This was awesome! I’d never used this capability before. I liked it!

Then I started running into trouble.

My first problem was that Icarus Verilog didn't support them. I didn't realize this at first. Then I switched from Icarus Verilog to a commercial simulator. I first tried ncverilog. When ncverilog didn't support localparam statements, I switched again, this time to Xcelium. Then, Xcelium started generating errors whenever my design wasn't holding to the various timing requirements found within the protocol.

After wrestling with the Verilog simulation for some time, I now wish in hindsight that I had been more supportive of specify block support within SymbiYosys. Now that I’ve read through the SystemVerilog specification discussing specify blocks, I’m also convinced that such support wouldn’t be that hard to build. The hardest part would be the parsing, but in general that’s already a solved problem.

It’s not over when the digital design is done

One of the things I’ve already alluded to above is that the project was far from over once the digital design was complete. While I wasn’t a part of many of the steps that followed, I was part of enough of them.

  1. I’ve already mentioned DFT scan chain insertion

  2. The design needed to pass a lint check. The default linter wanted to complain about a bunch of highly irrelevant “problems”. I convinced the team we could use Verilator for linting instead. As a result, I was able to produce a “clean” design with no lint errors.

  3. I was asked to run an automated coverage analysis check on the design. Basically, I re-ran all of my simulations and recorded which lines were getting executed, which bits were toggled and so on. Signals that didn’t toggle, or logic that didn’t get exercised were both flagged for discussion and possible adjustment to the simulations.

  4. Another team member ran a clock domain crossing analysis on my design. This analysis looked at every clock domain crossing, and caused us to look really hard at each of them: did they need proper timing constraints, or false path insertion?

  5. Power and ground pins needed to be assigned to the design, and the I/O pins needed to be apportioned to different locations on the chip interspersed with a sufficient number of power and ground pins to support them.

  6. I then needed to build an "I/O ring" for the design. This was new for me, as I'd never done one of these before. Basically, I needed to build a Verilog design that connected all of the external pins, whether inputs or outputs, to the rest of my Verilog design. In a Xilinx world, this would be like connecting all of your I/O pins to an IBUF, OBUF, or an IOBUF rather than relying on the synthesis tool to do this for you. It was a touch different, however, in that I also needed to place multiple power and ground pins for the design. (There's a rough sketch of one such connection following this list.)

    During this process, the analog engineer I was working with laughed at me for doing this. Why, he asked? Why do you need to model this in Verilog at all? As it is, the design needs to be turned into gates and components laid out in a three dimensional grid. Verilog is only an intermediate step. Once you have the three dimensional layout, why do you need the Verilog describing it anymore?

    At the same time, even though this I/O ring was quite simple to build, I still messed it up. A coworker noticed late in the design process that one signal had bypassed the I/O pads to go straight into the design.

    Sadly, we didn’t have any good verification tools to support this portion of the design. Sure, I verified that the design worked as desired with the I/O pads in place, but what would happen if I connected a particular I/O to the wrong pin? It would still work in simulation. What if I used a signal that bypassed the I/O ring? That would also still work in simulation.

  7. Since the design consisted of both digital and analog components, I was given a Verilog model of the analog components to simulate with. I initially ignored this model, something that turned out to be quite a mistake. Why? Because I already “knew” my design worked against the analog model I’d been working with, so why did I need this updated model? As long as the analog designer had built his design according to the specification we had agreed upon, what difference would it make if I used my model or his?

    Unlike mine, however, his model was based upon the hardware “as-built”, not on my ideas of how it was going to be built.

    Unfortunately again, once I finally replaced my own model with this “as-built” model the design no longer passed simulation. In one case, the problem had to do with the fidelity of the analog model. The analog designer hadn’t truly modeled one of the circuits. In another case, however, the analog designer had misunderstood my specification and built the wrong component. Had I not done simulations with the analog model, I would not have found this mistake.

    I found this mistake late in our design cycle–right as we were finalizing the design for tape out. Although we managed to fix the problem and update the design, a lot of work needed to be re-done due to where it fell in the timeline.

  8. I was also given a model of the digital design with post place and route timing annotations within it, and asked to re-run the simulations again.

    Subtle bugs in the device simulation model made this task take much longer than I expected. Indeed, this process took nearly two weeks to debug both the simulation and any design problems. This, of course, is two weeks in what I thought was a 2-4 week problem in the first place.

  9. Other things were taking place as well that I was only peripherally aware of. There were ESD simulations being run, DRC checks being applied to the design, packaging options being examined and chosen, and solder balls being designed for the pins.
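Coming back to the I/O ring from item 6, one connection within it looks roughly like the sketch below. The pad cell name and its ports are invented stand-ins for the foundry's actual I/O library cells.

wire	dq0_to_core, dq0_from_core, dq0_oe;

GENERIC_BIDIR_PAD u_pad_dq0 (
	.PAD(PAD_DQ0),		// the physical bond pad, i.e. the chip pin
	.I(dq0_from_core),	// the value to drive out, when enabled
	.OE(dq0_oe),		// output enable, from the core logic
	.O(dq0_to_core)		// the received value, headed back into the core
);
// ... repeated for every pin, together with dedicated power and ground pads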

All put together, this 2-4 week RTL design took many more months to accomplish. Worse, much to my surprise and dismay, RTL issues were still being discovered late in the design process. These were things I was responsible for. They're also things I'll be treating as lessons learned, so that I can keep them from happening again in the future.

Is Simulation Verilog Software or is it Hardware?

I have a deep software background. Indeed, I’ve been building software since grade school.

My first experience with digital design was in college, back in 1992-1993. Other than two courses, my college work was focused on either my Computer Science or Mathematics degrees.

Since that time, most of my work from 1993-2009 was in digital signal processing (DSP). From my standpoint, DSP was nothing but software applied to mathematical constructs. It was often accomplished on embedded platforms, although not always. Sure, I have a Ph.D. in EE, but the focus was more on how to process radio frequency signals than on circuit design or simulation.

Fig 7. PI should never be a magic number!

My point is simply this: there are rules to good software design. One rule, for example, is that all constants should be declared in a common area separate from your algorithm’s implementation. Failing to follow this rule often leads to what are known as “magic numbers”. Another rule is that you should never write the same algorithm multiple times. You should instead create functions and function calls to implement such algorithms once, and then to reference those implementations. This will keep you from copying a broken algorithm and then needing to find and fix all the places where it is broken.

Fig 8. The problem with magic numbers

One of the problems I struggled with throughout the project was that the device simulation model violated the rule of three extensively. There were three implementations of every operation that the model understood, one for each of the three protocol sections. Each implementation had its own means of reading from the interface. For example, there was one implementation for reading from the normal memory, and another implementation for reading from a special area of the memory. These two (there were more) implementations didn't reuse any logic between them, in spite of the fact that the interface protocol was the same for both.

As a result, when I went to debug the read-ID feature, whereby the simulated device could be queried for its ID, I was forced to fix (again) the nearly identical read logic used for reading from the memory, only this time returning a different result. This left me debugging the design again and again and again for what were often the same bugs. It didn't help that the simulation took hours to either complete or halt on a bug, nor that I would often run the design first without generating a trace, just to know if or when a trace would need to be generated.

That was one problem where good software engineering practices would have helped. The simulation model really needed to be rewritten from scratch to fix these problems throughout. The problem was only compounded by the fact that the project as a whole was late from the start. We therefore committed to patching, and re-patching, and re-patching the simulation model again and again, all while promising ourselves that we would rebuild it once the fast portion of the design had taped out.

The second place this design looked like software was on the CPU side of the design.

Fig 9. Is test bench Verilog "Software"?

As shown in Fig. 9, the entire design I was working with had two interfaces. On one end, it interfaced with the memory device itself. On the other end, it interacted with an AXI3 bus that would likely be controlled by a CPU. A large section of the test bench consisted of the definitions of 114 software functions that would call tasks within an AXI3 Verification IP model in order to communicate with the design. The test bench script itself consisted primarily of a series of Verilog references to these tasks, what I would call function calls in software, to interact with the design. This portion of the test bench read and operated just like software.
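To give a flavor of what that looked like (all of the names below are invented, not the actual test bench), each of these "functions" was, as best I can illustrate it, a Verilog task wrapping a call into the AXI3 VIP, and the test script was little more than a list of calls to those tasks:

task	write_register(input [31:0] addr, input [31:0] data);
begin
	// u_axi_vip and its task are placeholders for the actual VIP
	u_axi_vip.single_write(addr, data);
end endtask

initial begin
	write_register(ADDR_CONTROL, 32'h1);	// ADDR_* are illustrative constants
	write_register(ADDR_TIMING,  32'h40);
end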

So, why wasn’t it software?

If the goal was to interact with the design as a CPU might, then why not use a CPU to control the interaction? Even better, if you do so, you can then deliver to your eventual customer an example of a software device driver that is known to work with your design.

I didn't place a CPU into our test bench for several reasons, most of them dominated by time. Remember, the goal was to do this quickly, and I was late the moment the design process began. On the other hand, if I were to place a ZipCPU in place of the external test bench, then it should be possible to do exactly that: run software, instead of Verilog, to exercise the design by issuing appropriate bus commands. Even better, the ZipCPU might issue commands more aggressively than the AXI VIP I was using.

This is now my goal for the second phase of this project. In order to make this happen, the ZipCPU has been re-made to be bus-independent. The new version now has an AXI-lite prefetch, and an AXI-lite memory module. There's even an AXI4 (not lite) instruction cache. These interfaces should all handle bus widths of 32 bits or greater; they're nicely configurable in that fashion. Moreover, the debugging register interface is being redesigned, and the ZipCPU wrapper is getting formally verified for the first time.

What about the bus? I normally work with Wishbone, although I've done a lot of recent work with AXI4. Better yet, I now have a converter from AXI3 to AXI4. This particular design will require an AXI bus that's at least 64 bits wide to avoid slowing the interface down. That means my debugging bus will need to be converted from 32 bits wide to 64. That converter is now complete. (It's not the high-speed AXI4 converter I wanted to build and even started working on, but rather a basic AXI4-lite bus width converter.) Indeed, the debugging bus itself now has an AXI4 back end as well. (I used that for my last project.) I actually have two such back ends, one that supports burst interactions and another that supports AXI4-lite alone, but that's a bit off topic.

On top of all of that, AutoFPGA was more than happy to build and connect an AXI4 bus model for me. I can easily connect one (or more) AXI4 DMAs to this bus for simulation purposes, which will likely come in very handy soon enough. I can also cheaply connect an arbitrarily sized AXI4-based memory for the CPU to run from as well. That makes that portion of the design easy.

There's still work to be done, however. In particular, I'm missing two critical components. The first is an AXI4 downsizer that will take a request from AXI4 and convert it to an AXI4-lite request at a smaller bus width. This component has now been drafted, although it's not yet passing a formal check. (i.e. there are known, serious, and significant bugs still within it, which is why I haven't yet posted my draft of the logic.) The second big item to handle is the big endian versus little endian issue. As you may recall, the AXI4 bus is by nature little endian, whereas the ZipCPU is naturally a big-endian machine. I haven't (yet) decided how exactly I'm going to handle the difference. The new AXI4 bus interfaces do have a byte-swapping hack that might be sufficient. Alternatively, I might just create a little endian version of the tools: GCC and binutils. Time will tell what solution I eventually come up with.

Conclusions

When moving from FPGA to ASIC design, a lot of things changed. Sure, a lot of things stayed the same: I was still designing with Verilog, and I was still using formal methods and simulations. It's just that, well, so much of the rest was quite different. This was one of the things I was hoping to experience.

One of the reasons why I was so interested in learning ASIC design was to see what kind of impact formal methods might have on the ASIC design process. Those who know me know that I am a strong proponent of formal methods. I have been ever since I started finding bugs in my designs that weren’t getting found in simulation. So when I started this project, I wanted to know if formal methods would help or not, or to what extent they might help.

Now that I've gotten this far into the project, I can safely say that none of my formally verified logic contributed to any of the faults discovered late in the design process. Well, that's not quite true: the change log indicates two late changes in pieces that had been formally verified. In one case, I had a timeout counter to check for the presence of a clock, and I never wrote any properties to make sure that counter worked. The rest of that module was formally verified, even though that register wasn't. In another example, I had built a calibration logic controller to the wrong specification. Yes, it was verified, but it was verified to do the wrong thing. As for the other faults, let's see … one was caused by an "obvious" 34-line design that was made in haste to alleviate a timing problem with my original implementation. This had a consequence that wasn't quite thought through. In particular, it required two clock edges from a discontinuous clock …. Another fault was caused by incorrectly setting a reset value; that fault caused a startup glitch, but was otherwise innocuous.

Another place the formal tools really helped me was within the slow logic side of the design. There, I used formal tools extensively as I first studied and then rewrote several critical design components. Indeed, I found it very valuable to know, for example, that I would only ever request the number of bytes to be transferred that were appropriate, or that various subtle timing delays were implemented as desired.

I can also safely say that I vastly underestimated the cost of this work. This has left me considering Psalm 15, and the man who "sweareth to his own hurt and changeth not." Having agreed to one price for the project, I have not adjusted it, even though the project has taken far longer than I was expecting. Perhaps, having now completed this project, I'll do a better job estimating the number of hours next time. In the meantime, let's just say that I've covered this cost overrun with an internal research and development (IRAD) fund.