This is the second post regarding my new SDIO/eMMC controller. The SDIO protocol is commonly used on SD cards, and the eMMC protocol for eMMC chips. The two protocols are so similar that, when using this controller, they will differ in software only. Today’s bottom line is that, although the controller is still quite new and only barely silicon proven, this week I had the chance to formally verify the receive portion of the controller, so I thought I might write about what took place. My goal will be to answer the question of whether or not this extra step of formal verification was worth it.

This SDIO controller is being written as part of the KlusterLab project. I’ve been calling this same project the ETH10G project, because at its core it is a 10Gb Ethernet switch. The project team, however, has named it KlusterLab because of all of the various hardware and IO interfaces that have been integrated into it. This has given me plenty of opportunities for testing hardware components, and for writing blog articles about them. As a result, I’ve now written about the development of this project a couple of times. First, I wrote about virtual packet FIFOs, then about using Verilog tasks to script the SDIO transmitter, and now today I want to discuss formally verifying the SDIO receiver that will soon be tested on this board. I’ll even go so far as to discuss the remaining bugs that were found during hardware verification.

Fig 1. SDIO progress

You might argue this SDIO receiver had already been verified. Indeed, you might argue that the entire controller had been verified. Let’s walk through the various development steps. At this point in its development, the entire SDIO controller has been written. It passes a Verilator lint check. I’ve also generated simulation test scripts to verify the divided clock generator and the transmitter in isolation. A final simulation environment drives the entire controller through its paces: starting up the SDIO controller in a simulated environment, from sending CMD0 (GO IDLE) all the way through reading and writing a page of data (CMD17 and CMD24). It passes these simulation steps nicely. I’ve integrated the controller into the larger design, and it passes both Vivado synthesis and timing. Indeed, it’s been implemented in hardware. Most recently, it failed in hardware testing for PCB reasons, not logic reasons–but that still counts as a failure in hardware, so I’ve got more work to do before I can call this silicon proven. At the end of this article, I’ll share the results of my next round of silicon testing–taken after I had finished verifying the receiver.

The question before us today is whether or not I’ve skipped any necessary tests.

Fig 2. The rule of gold

One might argue at this point that this is all the controller needs to do in practice, and so I should stop here. In the past, I might’ve done so. However, the controller can do a lot more than I’ve tested so far. It’s designed to operate with either 1 data bit, 4 data bits (SDIO/eMMC), or 8 data bits (eMMC only), in either SDR or DDR mode, and both with (eMMC only) and without (SDIO/eMMC) a data strobe. The clock divider is designed to allow the IOs to be driven at less than one edge per clock cycle, at two edges per clock cycle, or even at four edges per clock cycle. That’s a lot of features, and due to the way the current board is designed, I won’t be able to test all of them. (Specifically, the PCB design connected the clock line to the CCLK pin, and it doesn’t allow a card voltage change from 3.3V to 1.8V, so I can’t use either ODDR or OSERDES controllers to drive it any faster than half the clock rate.)

Here’s my problem with stopping here: I want to place this SDIO/eMMC controller in my “library” of working components, and I’ve had too many experiences in the past of pulling something out of my library only to end up debugging it when I place it onto hardware. In practice, it’s worse than that–because by the time I pull it out of my library, I typically won’t remember that I only tested some of the modes the IP supports, or whether or not it’s been updated since I last used it on hardware. I just remember that it has “worked” in the past, so I consider it a piece of “working” IP from my library. What that means is that, when things don’t work in hardware, I won’t be suspecting this piece of IP. Hence, I’ll find myself looking all over some large SOC design for a bug, instrumenting everything and its brother, before I finally realize that a “working” IP component from my library had been left with a bug in it.

This is unsatisfactory.

Debugging a large design is a painful process. It takes a lot of time–often time that’s been allocated for other purposes–you know, like the new capabilities the design is supposed to have–capabilities the sponsor is paying for. It delays product delivery with unscheduled debugging. Frankly, I don’t like spending my time on unplanned projects like that. As a result, I want a solid assurance that every IP component in my “library” works before I add it to a larger design. It’s not enough that it worked in silicon the last time it was used. I want to know if any updates made since that time still work. I want to know that all the features work, to include features that haven’t yet been tested in silicon.

This requires a more rigorous approach to IP verification than just demonstrating the IP once in silicon.

For me, that more rigorous approach involves formally verifying each leaf component, and then simulating the library component as a whole. When it comes to this SDIO/eMMC controller, I have formal proofs of most of the major components. I have a proof of the clock generator, the Wishbone controller, the command wire handler, and the transmitter. What I didn’t have, which we’ll be discussing today, is a formal proof of the receiver.

Yes, I now have a formal proof of the SDIO/eMMC receiver.

So, before we get started, let me ask: how many bugs do you think I found going through this process?

Fitting the Receiver into the Design

Let me take a moment, though, to introduce you to this subcomponent and discuss how it is supposed to work, prior to discussing the problems it had.

Fig 3. The SDIO/eMMC receive component

The SDIO/eMMC receive framer, as I call it, is responsible for receiving a block of data, checking the CRC(s), and writing that block of data to an external ping-pong buffer. A separate Wishbone component acts as its controller in two ways. First, it tells the receiver what IO mode is in operation. The i_cfg_width setting tells us if we are using 1, 4 (SDIO/eMMC), or 8 (eMMC only) IO pins. The i_cfg_ddr setting controls whether or not we’ll need to check separate CRCs for each clock edge. Similarly, if we are using the data strobe pin, as indicated by i_cfg_ds, then we’ll be accepting data via the asynchronous data port from the PHY, rather than the simpler synchronous port.

When the user issues a CMD17 to read a block, the Wishbone controller will raise the i_rx_en line to indicate a block of data is on the way. It will also set i_length to the length of the block to be expected. The i_crc_en pin also allows us to receive things that may, or may not–counter to protocol–have CRCs attached. Since the protocol requires CRCs, I may remove this (unused) configuration bit in the future.
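
To make this interface a little more concrete, here’s a rough sketch of what the receiver’s control and configuration inputs might look like. The signal names come from the discussion above; the widths, the LGLEN parameter, and the module name are assumptions of mine for illustration only.

    // A hypothetical sketch of the receive framer's control inputs.
    // Signal names follow the text above; widths and LGLEN are assumed.
    module sdrxframe_sketch #(
        parameter LGLEN = 10        // Log, base 2, of the maximum block length
    ) (
        input wire              i_clk, i_reset,
        // Configuration, from the Wishbone controller
        input wire  [1:0]       i_cfg_width, // 1, 4 (SDIO/eMMC), or 8 (eMMC) pins
        input wire              i_cfg_ddr,   // Separate CRCs for each clock edge
        input wire              i_cfg_ds,    // Use the asynchronous data strobe path
        // Per-block control
        input wire              i_rx_en,     // A block of data is on its way
        input wire              i_crc_en,    // Check CRCs (the protocol requires them)
        input wire  [LGLEN-1:0] i_length     // Expected length of the block
        // ... the data inputs, o_done/o_err, and the memory port would follow
    );
    endmodule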

The front end provides two sets of inputs for us, from which we will use only one at a time. The first set is the synchronous path. This is the path used in all SDIO modes and most of the eMMC modes–the path that doesn’t depend upon the data strobe return from the eMMC device. The second set is the asynchronous path from the front end.

Fig 4. The PHY supports three operating modes

Both of these paths come to us from a front end component that I’m going to call the PHY. The PHY can be built in one of three ways, as shown in Fig. 4. First, it can be built in “standard” mode, where the IO buffers are driven directly from logic. It can also be built where the IO buffers are all driven via ODDR components, and the returns come back via IDDR sampling. Finally, there’s a front end mode which will drive the IOs via 8:1 OSERDES elements, and read the results back via a 1:8 ISERDES. Which IO mode is used controls the maximum clock speed. Likewise, only the SERDES IO mode supports the data strobe.

All data messages in the SDIO/eMMC protocol start with a zero start bit. This is used as a synchronization point. In our case one of the key features of the front end is that it strips off the start bit. It also samples our data for us–either by sampling the outgoing clock edges to discover a sample point, or by sampling data when the return data strobe is present (eMMC only).

A second key feature is specific to the synchronous path. In this case, the PHY measures the outgoing clock signal (before it gets to the pins), and sets a sample point some programmable delay afterwards. In this way, for high speed IO, we allow ourselves to sample the incoming data at a programmable fraction of a clock cycle later than the outgoing clock itself, accounting for the propagation time from our controller, through the PCB to the SDIO/eMMC chip, and then back from the SDIO/eMMC chip through the PCB to our FPGA.

All this is to say that by the time we get any data, all the hard work of discovering when to sample the various IO bits has been taken care of for us.

On the synchronous interface, the i_rx_strb signal will indicate whether we have new data available. It will either indicate no sampling clock edges (i_rx_strb==0), one edge (i_rx_strb==2'b10), or two edges (i_rx_strb==2'b11) of data. If i_rx_strb[1] is true, then either 1, 4, or 8 bits of data will be available on the i_rx_data[15:8] ports. If i_rx_strb[0] is also true, then 1, 4, or 8 bits of data will also be present in the i_rx_data[7:0] inputs.

The asynchronous interface is even simpler to use. If S_ASYNC_VALID is ever true, then we’ll have 32-bits of incoming data available to us. There will never be less. This is due in part to how the front end IOs are set up, and in part to how the data strobe line is used. Specifically, it is never used in one or four bit modes–always 8 bit. Likewise, the data strobe is only ever used in DDR mode, when data is transmitted on both clock edges.
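
Putting the two paths together, the intake question on any given clock cycle is simply: how many new bits just arrived? Here’s an illustrative decode. The i_cfg_width encoding and the bits_per_sample and new_bits names are my own assumptions; the remaining names come from the interface description above.

    // Illustrative only: how many new data bits arrive this clock cycle?
    wire    [3:0]   bits_per_sample;
    reg     [5:0]   new_bits;

    // Decode i_cfg_width into 1, 4 (SDIO/eMMC), or 8 (eMMC only) bits
    // per sample.  (The 2'b00/2'b01 encoding here is an assumption.)
    assign  bits_per_sample = (i_cfg_width == 2'b00) ? 4'd1
                            : (i_cfg_width == 2'b01) ? 4'd4 : 4'd8;

    always @(*)
    if (i_cfg_ds)
        // Asynchronous (data strobe) path: always a full 32-bit word
        new_bits = (S_ASYNC_VALID) ? 6'd32 : 6'd0;
    else begin
        // Synchronous path: zero, one, or two samples per system clock
        new_bits = 0;
        if (i_rx_strb[1])   // First (or only) sample, on i_rx_data[15:8]
            new_bits = new_bits + bits_per_sample;
        if (i_rx_strb[0])   // Second sample, on i_rx_data[7:0]
            new_bits = new_bits + bits_per_sample;
    end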

That describes what comes into this receive component.

There are two interfaces on the output. The first is the control interface. For every request that is made, that is for every time i_rx_en is enabled, the controller will process a received packet. Once the packet is complete, this receiver will raise the o_done flag. At that time, it will also raise the o_err flag if there were any errors associated with the packet. Such errors could either be 1) a CRC mismatch, or 2) a watchdog timeout error.

Let me pause here for a moment to point out, whenever you use a return data strobe for sampling data coming back to a chip, you always need to add a watchdog timer. This is to keep your controller from hanging in the event you make a mistake and either 1) don’t properly wire up the data strobe, or 2) make a mistake in your protocol handling so that the downstream chip doesn’t return the number of data strobes you are expecting. In our case, the watchdog timer will also generate a timeout if the start bit isn’t received within its timeout window–something that will come back to haunt us when we get to hardware testing.
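
Such a watchdog doesn’t need to be complicated. Here’s a minimal sketch, assuming a hypothetical data_received strobe and an arbitrary counter width–the receiver’s actual counter, width, and reset conditions differ in their details.

    // A minimal watchdog sketch.  "data_received" stands in for whatever
    // indicates that the start bit, or more data, has just arrived.
    parameter   LGTIMEOUT = 23;
    reg [LGTIMEOUT-1:0] watchdog_counter;
    reg                 watchdog_timeout;

    always @(posedge i_clk)
    if (i_reset || !i_rx_en || data_received)
    begin
        // (Re)arm the timer whenever we're idle or data arrives
        watchdog_counter <= {(LGTIMEOUT){1'b1}};
        watchdog_timeout <= 1'b0;
    end else if (watchdog_counter != 0)
        watchdog_counter <= watchdog_counter - 1;
    else
        // Nothing arrived in time: flag the error, so the rest of the
        // design can shut this reception down and report o_err
        watchdog_timeout <= 1'b1;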

The last interface, coming out of this controller, is the memory interface. This is designed to feed one of two ping-pong buffers. My vision is that these ping-pong buffers will be as wide as the bus, so they can be used in high speed DMA operations on wider buses if necessary–although to date I’ve only tried them at 32 bits each. (Yes, the ETH10G project uses a 512-bit bus, but I’m initially only going to connect this to the 32-bit control bus portion of that design.)

The memory interface has valid, strobe, and data lines. If the valid line is high, then the strobe lines will tell you which bytes within the data lines to write.
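
Put another way, a downstream ping-pong buffer only needs to treat the strobes as byte enables. Here’s a quick sketch of that consumer side, where o_mem_valid, o_mem_addr, o_mem_strb, and o_mem_data are placeholder names of mine rather than the design’s actual port names.

    // How a 32-bit ping-pong buffer might consume this interface.
    // The o_mem_* names are placeholders, not the actual port names.
    reg [31:0]  buffer  [0:255];
    integer     k;

    always @(posedge i_clk)
    if (o_mem_valid)
    begin
        for(k=0; k<4; k=k+1)
        if (o_mem_strb[k])
            // Each strobe bit enables one byte lane of the write
            buffer[o_mem_addr][8*k +: 8] <= o_mem_data[8*k +: 8];
    end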

If all goes well, once i_rx_en is set, data will flow from the PHY, get collected into bytes and/or words, and then be sent out the memory interface. Once complete, the o_done signal will be raised and the controller will then drop the i_rx_en line, only raising it again following another command from the user–or perhaps the to-be-written DMA.

At least, that’s how this portion of the design is supposed to work.

Outlining the Formal Proof

One of the reasons why this component took so long to verify was that I had a sort of writer’s block when I first looked at it. I didn’t really know where to start. The design, I said to myself, was so simple–what could possibly be done to verify it?

Yeah.

As a result, this component sat on the shelf for a week or two while I worked on other things.

Fig 5. When do I get to the real stuff?

The minimum requirement of any formal proof is something I call “The Contract.” The contract describes how the IP is supposed to operate if everything is working. In this case, the contract is fairly easy to express in words: given an arbitrary byte, arriving at an arbitrary position in the received data stream, formally prove that this arbitrary byte gets processed properly and sent to the output.
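
In practice, I tend to express a contract like this by letting the solver pick an arbitrary, constant byte position and byte value, assuming that value arrives at that position on the input, and then asserting it shows up at the output. Here’s a rough sketch of the shape of such a contract, reusing the placeholder o_mem_* names from the sketch above. The input-side assumption is elided, since it leans on the bit counter we’ll get to in a moment, and the byte-lane and strobe-lane arithmetic is only my assumption of how the bytes are packed.

    `ifdef  FORMAL
        // The contract, in sketch form.  fc_posn and fc_byte are arbitrary
        // constants chosen by the solver.
        (* anyconst *)  wire    [LGLEN-1:0] fc_posn;    // Arbitrary byte position
        (* anyconst *)  wire    [7:0]       fc_byte;    // Arbitrary byte value

        // 1. ASSUME the incoming bit stream carries fc_byte at byte
        //    position fc_posn.  (Elided: depends on the bit counter.)

        // 2. ASSERT that whenever the word containing fc_posn is written,
        //    the matching byte lane carries fc_byte.  This assumes word
        //    addressing, that byte zero lands in bits [31:24], and that
        //    strobe bit 3 covers that lane.
        always @(*)
        if (o_mem_valid && o_mem_addr == fc_posn[LGLEN-1:2]
                && o_mem_strb[3 - fc_posn[1:0]])
            assert(o_mem_data[8*(3 - fc_posn[1:0]) +: 8] == fc_byte);
    `endif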

As with most things in life, however, you need some sort of structure to hang all of this verification logic off of. You can think of it like a skeleton. Just like a skeleton holds all your joints, ligaments, and muscles in place, a good formal verification structure can be used to hold all of the formal verification logic in place.

I chose two pieces for my skeleton. The first was a bit counter. Starting from the beginning of the operation, I would count the number of bits arriving on our interface. If we were in one bit mode, that would be the number of ones arriving on i_rx_strb. For the four or eight bit interface, it would be four or eight times that much. It simply counts how many bits of valid data we’ve received. The second key component was a memory counter. This memory counter would count the number of bytes written to the ping-pong buffer control outputs.

I then needed some assertions to tie these two together.
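
Here’s roughly what that skeleton looks like. The f_-prefixed names are mine, new_bits comes from the intake sketch earlier, o_mem_valid and o_mem_strb are the placeholder memory-port names from the buffer sketch, and the single tying assertion shown is only one example–the real proof needs tighter, phase-dependent bounds that account for internal buffering and for the CRC bits that never reach memory.

    `ifdef  FORMAL
        // Skeleton counter #1: bits accepted from the PHY
        reg [LGLEN+3:0] f_rx_bits;
        // Skeleton counter #2: bytes written to the ping-pong buffer
        reg [LGLEN-1:0] f_mem_bytes;

        initial f_rx_bits = 0;
        always @(posedge i_clk)
        if (i_reset || !i_rx_en)
            f_rx_bits <= 0;
        else
            f_rx_bits <= f_rx_bits + new_bits;

        initial f_mem_bytes = 0;
        always @(posedge i_clk)
        if (i_reset || !i_rx_en)
            f_mem_bytes <= 0;
        else if (o_mem_valid)
            f_mem_bytes <= f_mem_bytes + o_mem_strb[0] + o_mem_strb[1]
                            + o_mem_strb[2] + o_mem_strb[3];

        // One example of tying the two together: the design should never
        // write more bytes to memory than it has accepted from the PHY
        always @(*)
            assert({ f_mem_bytes, 3'b000 } <= f_rx_bits);
    `endif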

Those two counters alone were enough to find the first several bugs.

They were also enough to allow me to build and express the contract.

A third component of the skeleton, one that I added at a later time, was a 1-bit state machine. This 1-bit state machine would go high upon a request for operation–one cycle after i_rx_en goes high–and then go low again once we completed our task. I needed this to prove that the design wouldn’t hang–especially since I was seeing it hang at the time.
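
The state bit itself is nearly trivial. Here’s a sketch, where f_busy is my own name and w_done is the design’s internal “about to be done” signal (we’ll run into it again in the bug list below). The cover() statement at the end is the cheap way of checking that a completed reception remains reachable–that is, that the design can’t simply wedge itself.

    `ifdef  FORMAL
        // A 1-bit state machine: high while a reception is in progress
        reg f_busy;

        initial f_busy = 1'b0;
        always @(posedge i_clk)
        if (i_reset)
            f_busy <= 1'b0;
        else if (i_rx_en && !f_busy)
            f_busy <= 1'b1;     // One cycle after the request arrives
        else if (w_done)
            f_busy <= 1'b0;     // The operation has completed

        // Completion must remain reachable--the design can't just hang.
        // (Proving it always completes also requires the watchdog.)
        always @(*)
            cover(f_busy && w_done);
    `endif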

At this point, we can come back to our original question, and ask: how many bugs did I find?

Bugs discovered via formal verification

Let’s count the bugs I found. Since I’m using git, it’s not all that hard. I’m just doing a git diff–or, to be more specific, running meld on sdrxframe–and counting all of the differences between the commit made before verifying the receiver and the version now that the verification passes. Let’s walk through the differences, shall we?

  1. The first difference doesn’t really count. I discovered, via simulation testing, that I had stripped off the start bit in two locations: first in the PHY, and second in the receive framer. The result was both lost data and a failing CRC, since the receive framer would remove one or more clock cycles of data from the beginning of any packet, while looking for that start bit. This change just hadn’t made it into my baseline commit.

  2. The next big change is sort of borderline as to whether it should count or not. Since I had let the design sit for a couple of weeks before coming back to it to verify it, I came at it with fresh eyes and noticed a big bug while simply desk checking: I had never implemented the CRCs for the negative clock edge.

    When using the SDIO/eMMC protocol, each data wire used in transmission gets its own CRC at the end of the data block. These CRCs are each 16 bits in length, and they protect the entire data block. That’s in single data rate (SDR) mode. When operating in dual data rate (DDR) mode and sending data on each edge of the clock, there’s one CRC for each data wire protecting the data sent on the positive edge of the clock, and a separate CRC for each data wire protecting the data sent on the negative edge of the clock. Both CRCs are 16 bits, and they are interleaved–so the positive edge CRC will alternate transmission with the negative edge CRC at the end of the packet. (I’ll sketch this CRC calculation following the list below.)

    When building the receive controller, however, I had only implemented the positive edge CRCs.

    Oops.

    I’m not sure I’d call that a formal verification bug, though, since the tools didn’t really find it. I found it via a desk check. That is, I found it via a desk check that I was only doing because I was adding formal properties to the design in order to verify it.

  3. The next bug was associated with the logic for o_done. This bug didn’t really show up as an assertion failure, rather it showed up as I was trying to formally describe how the logic was supposed to operate.

    The first problem here was that I had two done signals. One was used internally, and the other was my external signal. Further, I couldn’t really make out (from my own design even!) what the real difference was between these two signals. How were they supposed to relate? Were they supposed to be identical?

    Let me back up and explain this a bit more. I want to formally verify the entire operation of the receive framer. That means I want to verify, formally, that it can properly receive 512 bytes of data in all modes, to include the mode where it only receives one bit at a time for a minimum total of 4096 clock cycles. This is a minimum, however, because when I operate the design at 100kHz (the slowest potential clock speed), there will be 1,000 clock cycles between every bit. Hence, a full operation will take more than 4M clock cycles. Most formal proofs will die on anything over about 20 clock cycles, with the longest proof I have running at about 350 clock cycles. There would be no way I’d verify 4M clock cycles of operation, therefore, without using Induction.

    Induction, however, requires assertions to both verify and then guarantee all of the relationships between registers. That means I need assertions to describe the differences between these two done registers.

    Yeah, the second done register was quickly dropped when I couldn’t decide what its real purpose was.

    Even that wasn’t enough, since there were several registers that needed to act on the clock prior to o_done. Therefore, I ended up creating a signal I called w_done to indicate that o_done was about to be set, and everything remaining should clean itself up.

  4. The next bug was that CRC errors (not tested by my simulation), wouldn’t show up coincident with the done signal. Yes, my done logic was really messed up. At first I was declaring the design done once all the data (not CRCs) had been received. Then I tried setting done once all the CRCs had been received, but not allowing for the last bit(s) to impact the CRCs, nor for a test of whether or not the last CRC successfully received its data.

    In many ways this didn’t surprise me: I rarely test fault conditions in simulation. I should. Indeed, I need to make it a habit of doing so, but my simulation setup for this design was still somewhat new, so I hadn’t yet verified failed CRC handling.

  5. At full speed, data would get written to the wrong memory address.

    Remember how I said that the skeleton of the proof would help? Well, it turns out my address counting was messed up. I would calculate the next memory address on the cycle I wrote to memory. That wasn’t a problem. However, I’d then use that memory address to shift the next memory strobe and data into position, and so my logic required a dead cycle between memory writes in order to be successful. That would be fine when operating on four data bits (SDIO/eMMC) in SDR mode, when not using the OSERDES (i.e. when using the CCLK pin). In other words, it would work fine the way both my simulation and my hardware were setup. However, this approach would fail quickly when/if I ever transitioned to one of the high speed modes this IP was supposed to support.
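
Before moving on, here’s the CRC sketch promised back in bug number two. SD/eMMC data lines use a CRC-16 with polynomial x^16 + x^12 + x^5 + 1, one CRC per data wire. The bit-serial update function below is standard; the crc16_step, crc_pos, and crc_neg names are mine, and the usage comment only illustrates how DDR mode doubles the number of CRCs per wire.

    // Bit-serial CRC-16 update for an SD/eMMC data line
    // (polynomial x^16 + x^12 + x^5 + 1, MSB first)
    function [15:0] crc16_step;
        input   [15:0]  prior;
        input           din;
        begin
            if (prior[15] ^ din)
                crc16_step = { prior[14:0], 1'b0 } ^ 16'h1021;
            else
                crc16_step = { prior[14:0], 1'b0 };
        end
    endfunction

    // In SDR mode, each data wire keeps one running CRC:
    //   crc[w]     <= crc16_step(crc[w],     data_bit[w]);
    // In DDR mode, each wire keeps two, one per clock edge:
    //   crc_pos[w] <= crc16_step(crc_pos[w], posedge_bit[w]);
    //   crc_neg[w] <= crc16_step(crc_neg[w], negedge_bit[w]);
    // with the two 16-bit results interleaved at the end of the block.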

When pride cometh, then cometh shame: but with the lowly is wisdom. (Prov 11:2)

Remember how I commented earlier on my frustrations when taking a design out of my library to add to a larger SOC-based design? This would’ve led directly to one of those problems. Had I not verified this IP, I would’ve run it on hardware and been really proud of it. I’d have put it, in my pride, into my library and declared it to be “working”. Then I’d come back later and configure it for a (supposedly supported) high frequency mode, only to discover that mode didn’t work.

This is why I like formal methods.

  6. I also came across a bug whereby the receiver might ingest one too many clock cycles. What happens, for example, if you want to receive a five byte packet, the data width is set to 8-bits, and i_rx_strb==2'b11 on every clock cycle? The answer is that, on the last clock cycle, the data associated with the second clock edge would need to be discarded.

    In this case, I needed to generate a new signal, one I called last_strb, to keep the IP from ingesting more than one clock edge with the last data set.

  7. The next bug was associated with disabling CRC checking. When I built this receiver, I built it with a mode for receiving something that doesn’t have a CRC. This was to support reading particular registers that weren’t CRC protected. In hindsight, however, I’m not really sure I need this mode–since 1) all of those registers transfer their data over the CMD wire, and 2) even those unprotected registers still have CRCs–they just can’t be trusted. Regardless of whether it is needed or not, however, the formal tool decided to test it and found it broken.

    Of course, my simulation didn’t check this mode. There was no reason to. All data transactions require CRCs. However, design lock ups are bad, and that was what the formal tool found. If ever the CRC checking was disabled, the design might accept its packet (but not the CRC) and then hang waiting on the remaining CRC that would never come.

  8. The last bug was more serious. It involved those cases where I might receive data on two separate clock edges within a single clock cycle. This might be the case when using either the ODDR component in DDR mode, or the SERDES component in a multiple IO clock per system clock mode. The bug would only be triggered if I received data on one clock edge at first (i_rx_strb==2'b10), and then ever after received data on both clock edges (i_rx_strb==2'b11). Not only that, it’s only triggered in 8-bit mode.

    Here’s how the bug works. When the first 8 bits of data arrive, those bits get written to bits [31:24] of the memory bus–assuming it’s 32 bits wide, which it is for these runs. On all subsequent clock cycles, 16 bits arrive and get forwarded to the memory. Hence you’d write to bits [23:8] of memory on the second write, and then you’d want to write to bits [7:0] of the current memory word and (oops) bits [31:24] of the subsequent word. This is called an unaligned data access, and herein lies the bug. I didn’t account for writing unaligned data to memory. (You can find a sketch of a fix following this list.)

    Fixing this bug wasn’t hard, but it did require logic to handle the unaligned memory write request.
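
For the curious, here’s the sketch promised above of how the unaligned case can be handled in 8-bit mode: buffer any leftover bytes, and only release full 32-bit words to the memory port. Every name here (byte_buf, byte_fill, the o_mem_* ports) is illustrative, and real logic must also cover the 1- and 4-bit widths, the final partial word, address generation, and stripping the CRCs.

    // Illustrative unaligned-write handling, 8-bit mode only.  Zero, one,
    // or two bytes arrive per cycle; the oldest bytes sit at the top of
    // byte_buf, and a word is emitted once four bytes are available.
    reg     [39:0]  byte_buf;   // Up to five bytes, oldest byte on top
    reg     [2:0]   byte_fill;  // Number of valid bytes in byte_buf
    reg             o_mem_valid;
    reg     [3:0]   o_mem_strb;
    reg     [31:0]  o_mem_data;

    wire    [1:0]   new_bytes = i_rx_strb[1] + i_rx_strb[0];
    wire    [39:0]  incoming  = { (i_rx_strb[1] ? i_rx_data[15:8] : 8'h00),
                                  (i_rx_strb[0] ? i_rx_data[7:0]  : 8'h00),
                                  24'h00 };
    wire    [39:0]  merged    = byte_buf | (incoming >> (8*byte_fill));
    wire    [2:0]   new_fill  = byte_fill + new_bytes;

    always @(posedge i_clk)
    if (i_reset || !i_rx_en)
    begin
        byte_buf    <= 0;
        byte_fill   <= 0;
        o_mem_valid <= 1'b0;
    end else if (new_fill >= 4)
    begin   // A full (possibly unaligned) word is now ready
        o_mem_valid <= 1'b1;
        o_mem_strb  <= 4'hf;
        o_mem_data  <= merged[39:8];            // The oldest four bytes
        byte_buf    <= { merged[7:0], 32'h0 };  // Keep any straggler byte
        byte_fill   <= new_fill - 4;
    end else begin
        o_mem_valid <= 1'b0;
        byte_buf    <= merged;
        byte_fill   <= new_fill;
    end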

One of the tricks I often use when formally verifying components is to assume difficult things won’t happen. It helps the proof along, and can often help me get through the simpler logic. Of course, difficult things do happen in real life, and so these assumptions can easily render a proof invalid. For this reason, I make sure to place all such assumptions in a specially marked block at the end of the file–a block I like to call “Careless assumptions,” because of the likelihood that they will void a proof. Over time, as I get the opportunity, I’ll slowly work off these “careless assumptions” until none remain.

In this case, my “Careless assumptions” section held two assumptions for a while that I needed to come back to. The first was that i_rx_strb would only ever be either 2'b11 or 2'b10. This allowed me to get the proof to pass first, and then come back later to handle the unaligned memory requests. My second assumption was that the watchdog timeout would never fire. In both cases, I had to come back later and work through removing these assumptions before the proof could really be declared complete.
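
For illustration, here’s roughly what that section looked like while these two assumptions were still in place–both have since been removed, as I said. The watchdog_timeout name is from my watchdog sketch above; the design’s actual signal may differ, and the exact form of the i_rx_strb assumption is my own paraphrase of the one described.

    `ifdef  FORMAL
        ////////////////////////////////////////////////////////////////
        //
        // Careless assumptions
        //
        // (Both of these have since been removed.)
        ////////////////////////////////////////////////////////////////

        // Careless: data never arrives on only the second clock edge,
        // so the proof needn't (yet) handle unaligned memory writes
        always @(*)
        if (i_rx_strb != 2'b00)
            assume(i_rx_strb == 2'b10 || i_rx_strb == 2'b11);

        // Careless: pretend the watchdog never fires
        always @(*)
            assume(!watchdog_timeout);
    `endif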

Today, I can now say with confidence that this design and its proof no longer contain any “careless” assumptions.

Hardware bringup

Yes, but … does it work? Alternatively, I might ask, did all that formal verification work actually make a dent when it came to how long it took to bring up the controller in silicon to talk to its first device?

To answer that question, let’s go over the bugs found during hardware bringup.

First, as background, the design did need a hardware change before starting. The FPGA was driving the SD Card at 1.8V, through a TI TXB0108 voltage translator that converted its signals to 3.3V, and that translator couldn’t handle the open-drain signaling required during startup.

Second, I was quite pleased to see the card respond to the very first command I gave it, the SEND_IF_COND command. Not only did it respond, but it also returned a valid response. This helped to add momentum to the subsequent testing, knowing that at least the interaction via the command wire worked.

Now let’s go over the bugs I found.

  1. The next command in the bringup of an SD card, following the SEND_IF_COND command, is to ask the card to send its Operating Conditions Register (OCR). This is part of a voltage negotiation that takes place between the controller and the card. It’s supposed to be a handshake. The bug: in my first software drafts, I never told the card what voltages I could provide. Hence, from the card’s perspective, we had never come to an agreement on the required voltage and so the card never booted up.

    Reading through the specification helped here.

    Conclusion: This was a software bug.

  2. The next problem was that I couldn’t get the card to respond to the next command, ALL_SEND_CID. This is where every card sharing the bus sends its identification via an open-drain setup, and whoever sends a ‘0’ wins the bus for that bit and following. It’s a part of the protocol designed to allow multiple cards to share a bus–although I’ve never actually seen this used in practice. In this case, I just couldn’t get the card to respond at all to this request. The card had responded fine to the previous command, just not this one.

    The problem here turned out to have nothing to do with the ALL_SEND_CID command at all–it was how I handled the response to reading the OCR. Bit [31] of the OCR is listed in the specification as “Card power up status bit (busy)”. So, I figured that once the bit was clear, the card was no longer busy. The bug: Re-reading the specification revealed I had the sense wrong–the bit needed to become a one before moving on. Because I wasn’t waiting, the card hadn’t finished powering up when I gave it its next command, hence it wasn’t responding.

    I found this bug via simulation, once I tried increasing the power up time in simulation to the point where it would have an impact. Then, when the simulation didn’t match my software, I knew I was on to something.

    Conclusion: This was a second software bug.

Fig 6. Voodoo computing

  3. During this time, I should point out, I did a lot of return code debugging. Specifically, my command wire processor got a lot of scrubbing to make sure I was getting the right return code for any error I encountered. I’m not sure I really found anything here, but I did change a bunch of stuff in the process.

    Conclusion: I’m not sure there was a bug here at all. I think the bottom line issue here was that I had forgotten, between when I wrote this module and when I came back to it, exactly how the interface was supposed to work. So I ended up rewriting how errors should be reported, even though they may not have been reported incorrectly in the first place.

    Sadly, I also discovered that I had left a “Careless assumption” in my command wire processor, an assumption that kept the formal proof of this processor from ever examining a timeout situation. So, I had to pause here to remove this last assumption–especially since I was getting timeout errors, and I had no confidence that these errors were correct.

  4. I then managed to get far enough to read several registers from the SD card. Two in particular, the Card Identification (CID) register and the Card Specific Data (CSD) register, deserve some extra mention. These are each 128-bit registers (including the CRC). They follow what would normally be an echo of the 8-bit command, and end with a 7-bit CRC followed by a stop bit, for a total of 136 bits. In testing, I could read these registers just fine. The problem was that the command wire processor was indicating a CRC error every time it read from these registers.

    The bug: Digging further, I discovered I had calculated the CRC over the 8-bit prefix to the 120-bit data register, not just the 120 data bits themselves. In this case, both my SDIO model and my controller were in error. This is a classic example of building the wrong thing right. (A brief CRC-7 sketch follows at the end of this list.)

    I’m not sure I would’ve found this apart from hardware testing.

    Conclusion: This one was a hardware bug. The hardware did everything I had designed it to do, and it did it all properly; I had just designed it to do the wrong thing.

  5. I also came across a second problem with these 128-bit registers, and that was that I could read them once, and once only. Ever after that first success, the register would always read zero.

    To understand this bug, we have to look a bit deeper into the design.

    The SDIO controller is designed to handle data transfer via two internal FIFOs–the ping-pong buffers. Normally, those buffers are only used for data transfer: Either software writes a sector into them that is then forwarded to the SD Card, or the controller reads a sector from the SD Card and places the results into the buffer for software to come back and read once the operation is complete. The exception to this rule is that these 128-bit registers are also written to the ping-pong buffers, not by the receive framer, but by the command wire handler.

    The problem in this case had to do with the pointers to the FIFO. The bug: I wasn’t resetting the read pointer when I issued a command to read these registers. As a result, the first time through I read the register properly from addresses 0, 1, 2, and 3. When I issued the command again, the pointers weren’t reset and so I was attempting to read the 128-bit register value from addresses 4, 5, 6, and 7–after it had been stored in addresses 0, 1, 2, and 3.

    Conclusion: While you might argue this was bad user interface design, it required a hardware fix. Therefore this falls into the category of a hardware bug.

  6. Once the card identifies itself, it then picks its relative address and the protocol clock can speed up from 400kHz to 25MHz. Later, if I want to restart things, I might wish to slow the clock back down to 400kHz. Design bug: Along the way, I discovered that my register design provided me no way of knowing what the current clock speed was. Hence, I might change the clock speed, but never know how long to wait until that new speed was active.

    I solved this by adjusting the Wishbone controller and clock divider so that the controller would return, upon a read request, the current clock divider setting, not necessarily the most recently commanded one. Once the two matched, I could then know the clock rate had properly changed and so I could move on.

    This still creates a sudden clock change. Were the card to try to lock a PLL to this clock, it wouldn’t have time to lock it before I was sending the next command. On the other hand, the specification does say that the clock can be stopped or paused at any time if need be, a criterion that would probably preclude such an implementation.

    Conclusion: This was a flaw in my user interface design.

  7. SD Cards have two command sets. There are regular commands, called CMDs and followed by a decimal number, such as CMD0 (GO IDLE) or CMD17 (READ SECTOR). There are also application specific commands, or ACMDs. To send an ACMD, you first send a CMD55, and then the following command is interpreted as an ACMD. I now needed to issue an ACMD6 to set the bus width to four bits. However, much as I tried, I couldn’t get the card to respond to my CMD55 at all. The bug: It was only after much frustration that I looked up the CMD55’s argument, only to discover I was supposed to address the card in the CMD55 via the card’s relative address–and I was just setting the address field to zero. This was appropriate earlier in the setup, before the card had assigned itself a non-zero relative address, but not once the address had been assigned. No wonder it wasn’t responding–I wasn’t addressing it.

    Conclusion: This makes for a third software bug.

  8. I had finally reached the point in the sequence where I could issue a command to read a sector from the SD Card and … I got stuck here again. I kept issuing read commands, only to have them end in failure with a CRC error code. In the end, this turned out to be a couple of bugs.

    The bug: The first problem was that I couldn’t tell the difference between a read failure and a command response failure. Both might return the same CRC failure code, since both shared the same three bits.

    Conclusion: I really need to adjust the user interface here, so I can tell the difference between failures on the command wire, and read failures on the data lines–whether they be timeout errors or actual CRC errors.

    Now for the other problems …

  9. I also made the mistake at one point of not enabling the FIFOs. Sure enough, by design, the read didn’t proceed because the FIFOs hadn’t been enabled.

    Conclusion: This was a software bug, caused by my thrashing around trying to determine whether I had a read error or a command wire error. I had turned off the FIFOs to get the command to end early enough that I might trigger the internal logic analyzer on something useful, and then later forgot that I had turned them off.

  10. At this point, I still wasn’t able to read a sector from the device, and it took a bit longer to figure out why. Not only that, I had to dig into the trace from my internal logic analyzer to discover the next bug. Remember how I said, when discussing the receiver design, that the PHY would remove the start bit? Well, in order to do that, the PHY needs to be told when to expect a packet so it can reset its start-bit search algorithm. Nothing in my internal interfaces allowed for this communication–I just hadn’t foreseen the need.

    Conclusion: This was definitely a hardware bug.

  11. Then I got lucky: I just managed to (by chance) adjust the scope enough that I could see there was a packet (i.e. the sector) coming back across the interface, but the design just wasn’t seeing it. This was key, because it told me I wasn’t somehow messing up the command. I had the command sequence right, and the card was returning data, I just wasn’t seeing it.

    Was it a problem in the receiver?

    No. The formally verified receiver worked nicely as designed.

    The problem was in my watchdog timer. The timer was set, by a parameter in the receiver, to wait a maximum of 8M clock cycles for the first data bit. That timer was overridden at the top level, so that it would only wait 64 clock cycles for the first data bit. Needless to say, the card didn’t respond that fast.

    Conclusion: Yeah, this was another hardware bug. This time, it was in the design’s configuration.
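
Since CRCs came up more than once above, here’s the brief CRC-7 sketch I promised when discussing those 136-bit register responses. The command wire–and these register reads–use a CRC-7 with polynomial x^7 + x^3 + 1, and the fix was to run it over the 120 register bits only, not over the 8-bit prefix. (The crc7_step name is mine.)

    // Bit-serial CRC-7 update, as used on the SD/eMMC command wire
    // (polynomial x^7 + x^3 + 1, MSB first)
    function [6:0] crc7_step;
        input   [6:0]   prior;
        input           din;
        begin
            if (prior[6] ^ din)
                crc7_step = { prior[5:0], 1'b0 } ^ 7'h09;
            else
                crc7_step = { prior[5:0], 1'b0 };
        end
    endfunction

    // For the 136-bit (R2) register responses, this CRC should cover the
    // 120 register bits only--not the 8-bit prefix, which is where both
    // my controller and my simulation model had gone wrong.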

At this point, I’d like to write that all my formally verified modules worked as intended. Was this really the case? Let’s work through the issues identified above. Issue three wasn’t clearly a bug. Issues one, two, seven, and nine were all software issues, and issues six and eight were user interface design issues.

That leaves four hardware issues that were revealed during bringup. The first was the 128-bit register CRC issue. This flaw passed both formal and simulation based verification–the design did what I told it to, I had just told it to do the wrong thing. The second issue, that of the FIFO pointers, should’ve been caught when I verified the user interface, or at least when I ran the whole design in simulation. The last two issues, that of when to start looking for the start bit and how long the watchdog timeout should be, were both issues rooted in the PHY. They were missed simply because 1) I didn’t formally verify the PHY, 2) my simulation never checked more than one packet, and 3) my simulation never waited a significantly long period of time before returning a packet. (Given how expensive simulation can be, I dislike waiting if I don’t have to. Unfortunately, this led to missing two bugs in simulation that had to be caught later in hardware.)

One item I haven’t yet mentioned is how long this hardware bringup session took. Once everything was formally verified, I managed to run through all the hardware bringup over the course of two days. The work often took place when another project was running simulations, and it had to be paused for nearly a whole day in the middle due to internet connectivity issues. So, let’s say hardware bringup–from synthesis to reading a sector–took no more than a day of work in total. This is in contrast to my first attempts to bring up this controller, made before either simulation or formal verification had been accomplished, where I just embarrassed myself when not only did the controller not work but I had no idea why not. At least this time I had more confidence in what was going on.

I should also caveat this list by pointing out that I haven’t (yet) verified either my software driver, or the hardware’s ability to write a sector. So far, I’ve only tested a piece of well instrumented test software–nothing that can really be used beyond initial hardware verification. My work, therefore, isn’t complete yet.

Conclusions

After going through all this pain, I like to ask myself: was it really worth it? Was it worth the effort of going through a formal verification process? Were the bugs I found ones that justified the extra work?

In this case, let’s think of the alternative. The alternative is that I would run this design on silicon, convince myself over the next week or two that it worked, and then put this design away, in my library, containing bugs associated with features never tested in silicon. I’d be proud of my work, and pat myself on the back. Then, sometime later–perhaps a year or more–I’d come back to this design, remember how proud I was of it, pull it off the shelf, and place it into a new design using an IO mode that had never been properly tested, only to discover things not working. I’d then re-run the simulation, get a “success” result, and convince myself that some other part of the design must be in error. Then, after a painful week of debugging–perhaps even two–I’d be forced back to this portion of the design, only to kick myself for allowing such bugs to be left in my “library”.

This is definitely one of those cases where an ounce of prevention is worth a pound of cure. It’s certainly easier to debug a design shortly after writing it than it is to come back to it years later wondering what’s wrong with it.