Flash is an important component in any FPGA design, and a vital component in any soft-core CPU design. First, many FPGAs load their configurations on power up from flash. Thus, if you want your configuration to load from non-volatile memory and without using the JTAG, placing your design into the flash memory is often a requirement. The good news is that this means most FPGA development boards already include a flash memory for configuration. The even better news is that flash chips are cheap enough that there’s usually plenty of space available for user applications in addition to the configuration memory.
Just think through those possibilities: what would you do in your design if you had non-volatile memory available to you? Even better, what if you had 8-16MB of non-volatile memory available?
Now, before you get too excited, be aware: there’s usually a beginner out there who thinks that flash can be treated like normal memory. No, I’m sorry, it can’t. While you can read from flash fairly quickly, writing to flash is more problematic. If you want to change something in flash, you have two operations to work with: erasing the flash, which turns bits to ones and hence bytes to 8'hff, and programming the flash, which turns bits to zeros. Of these two, the erase is the more problematic. Depending on the flash, you might only be able to erase 64kB blocks at a time. (Yes, some flash chips allow 2kB sub-sector erase operations.) Worse, a sector erase command will take anywhere between half a second and two seconds. It is slow.
For these reasons, flash memory makes a good ROM addition to your design.
Today’s story, though, starts with the Arty board, now sold by Digilent under the name Arty A7. This is a wonderful starter board for anyone who wants to try building their own embedded CPU: it has a decent sized FPGA, DDR3 SDRAM, a 100Mbps ethernet port, a couple of switches, buttons, LEDs, some color LEDs, four PMod ports and … 16MB of flash memory.
When I first built my own design for the Arty, Digilent shipped it with a flash chip built by Micron. Sadly, my original flash controller couldn’t handle this Micron flash device. Why? Because for a common standard such as Quad-SPI (QSPI), the Spansion and Micron chips were just too different for my controller. Well, that and I originally wanted to build a 200MHz design, but that’s a different story for a different day.
Micron’s flash implementation had the problem that it was difficult to reset. There were modes the flash could get into where, if you reloaded your FPGA design, the flash might no longer respond the way you thought it should. Worse, Micron’s design offered settings under which the flash might power up into a state unknown to the design. These extra modes were “features”, designed to help you achieve high speed operation immediately on start up. To me, however, they were liabilities, since it became that much harder to know if my controller would work. I complained about this, and sometime later Digilent modified the board to use a different flash chip.
It was now time to build a new flash controller. Again. The question before me, though, was whether it might be possible to build a single Quad-SPI controller that I could re-use with any flash device I came across.
This blog article is about the design and verification of that new Quad-SPI flash controller.
No, I do not believe in top-down, requirements driven development. As we discussed in the last article, this flash controller is not the first flash controller I’ve ever built. Reality seems to dictate that spiral development, or other iterative development approaches work better. Indeed, I’m slowly becoming a believer in incremental design approaches.
Still, it makes sense to start the story off with a discussion of what a “better” flash controller would look like. What should it do, and what functions should it support?
When we last built a SPI flash controller, it could read one word every 64 (8+24+32) clock cycles, as shown in Fig. 2 below.
On the other hand, if you want to build a flash controller that stands out when compared to other controllers, a one-size-fits-most controller, or even, as I’ve started to call this, a Universal QSPI flash controller, then you need to do more than just read values from the flash: you need to read them fast. How fast? As fast as the device will support a read using logic synchronous to the rest of the design.
This will force us not only into QSPI flash territory, where four data lines are strapped together, but it’s also going to have us looking at whether or not we can keep the SCK clock running at the same rate as the system clock. For my OpenArty design, this means I am going to want to run my flash at a 100MHz clock rate, twice as fast as most controllers. Since most flash devices support 108MHz, I figured I should be good here. Hence, this was my second criterion: running the flash in QSPI mode, with SCK running at the system clock rate.
This gets us down to 28 (8+6+6+8)
SCK clock cycles per read.
If you are working to achieve speed, however, this still isn’t fast enough.
Most flash devices offer a mode where, after one flash command you can leave the flash in some sort of eXecute-In-Place (XIP) mode. In this mode, the next flash command starts immediately by sending a 24-bit address, then after some amount of wait, you can read your data. This will save us 8 flash clock cycles by not needing to send a new flash read command.
We’re now at 20 (6+6+8)
SCK cycles per read.
This still isn’t fast enough for me: I wanted to build a flash controller that can handle burst reads.
By “burst reads”, I mean I wanted my flash controller to be able to read multiple words in the same transaction. The first word will require sending an address and several dummy cycles, before using up eight clocks for the data. If we want to keep reading, we can then arrange for the second and subsequent words to take no more than 8 additional SPI clocks each.
This brings us to 12+8N (6+6+8N)
SCK cycles per N reads,
asymptoting at 8-cycles per read.
Now that’s a fast QSPI flash controller!
But what about programming the flash? Sure, I could use the vendor tools to program my flash, but … I like to have as much control over my design as I can get. Therefore, I want an option whereby I can erase and program my flash device via my own controller.
In addition, modern Flash devices support many features beyond just erasing and programming their memory region. Many of them also support an identification code, whereby you can determine the make and size of your device. They might also support “One-Time-Programmable” memory regions–allowing designers to place special, often build-specific cryptographic data into the devices they then send to customers.
Supporting all of these features would be nice, but only if they didn’t cumber the basic read capability of the controller. So let’s make this capability an option, and then work to make it a cheap option that doesn’t expand our controller by all that much.
As we’ll see later, the read-manufacturer ID command support didn’t turn out to be an optional feature. Indeed, I needed to use it to get the design working in the first place–but more on that when we get there.
I thought so at first. So, after building it, I was quite pleased with my work.
Then my requirements started falling apart.
Most Xilinx designs, you see, require that the Xilinx startup sequence be able to control the flash I/O pins independent of the design. The unfortunate result is that you can only access the SCK (serial clock) pin through a special STARTUPE2 primitive. Should you need to use this primitive, you’ll lose your access to the ODDR primitive necessary to control the clock.
The Arty is an unusual development board, in a good way, because it doesn’t have this problem. Digilent created a second I/O pin which they also tied to the flash’s SCK pin. Hence, we can still get system clock rate I/O (100MHz) from our flash.
My Nexys Video board wasn’t so lucky. Neither was a second board of mine. Both of these boards require that the CCK (configuration clock) line going to the SCK pin go through the STARTUPE2 primitive. Hence, these designs will need to run SCK from a divided clock.
Strangely, that wasn’t my first problem.
ASICs!!?! That changes things a lot! ASICs tend to run at higher clock rates, whereas most flash devices max out at around 108MHz. Worse, an ASIC chip may (or may not) have an ODDR I/O controller in the first place. To even dream of ASIC device support, I really needed an arbitrary clock divider.
Since I don’t normally build ASIC designs, I’ll be up front: this new design doesn’t (officially) support ASICs, although I think it could easily be modified to do so. In particular, I’ve discovered several flash devices have different numbers of “dummy” cycles. Were I to rebuild this design for an ASIC part, I’d want to support a varying number of dummy cycles. I might also want to support a run-time adjustable QSPI clock speed.
That wasn’t my last problem either.
As it turns out, if you want to operate using DDR I/O modes, you may need to register your outputs and then your inputs for better performance. This places a delay between when the logic is valid within your design, and a later time when the value comes back from the pin. This delay is non-zero. On a Xilinx chip, there’s a rough 3-clock delay. (I’m still investigating whether or not I can drop it to two clocks.) Intel chips can do this with a 2-clock delay. In other words, the delay needs to be parameterizable.
This was when I started wondering if my requirements had diverged so much that I was now building a “FrankenIP”. Nevertheless, I pressed on, being certain that somewhere within this Universal IP core there was a QSPI flash controller trying to break free.
Did I mention that, on top of all these other requirements I wanted a controller
that only had a minimum amount of logic? This is going to dictate, below,
that all of these options we are working with will need to be created using
parameters and generate blocks, but that’s still easy enough to handle.
The difficult part is going to be verifying that all of these various configurations work, while only having the hardware to test a couple of them.
Before leaving this section, let’s summarize our choices in terms of latency and throughput in Fig. 10 below.
The top of this chart shows the number of clocks required for each part of a QSPI interaction. The basic transaction costs 8 clocks for the command word, followed by another 6 clocks to send 24-bits of address 4-bits at a time. After this point, the flash chip might require between 1 and 10 “dummy cycles”. This is an annoying requirement necessary to support Micron flash chips. Winbond and Spansion flash chips have a fixed number of dummy cycles: six–so we’ll use that going forward for our calculations. Following the dummy cycles, it takes 8 clocks to transfer one 32-bit word of information.
These are the numbers we’re dealing with.
Now for the controller options. Our last controller, operating in SPI mode alone, took 64 clock cycles to transfer a word. Switching to QSPI alone brings us down to 28 cycles. Using the eXecute In Place mode allows us to then skip the eight clocks of the Quad I/O command, bringing our access time down another 8 cycles to 20 cycles. The next request, however, will require another 20 cycles. On the other hand, if we can string multiple requests for sequential addresses together into the same request, using the pipelined mode of the Wishbone bus, we can drop our access time from 20N clocks for N values down to 12+8N clocks. This is about as fast as a QSPI controller will get.
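The arithmetic behind these numbers is easy to check in a simulator. What follows is a minimal sketch and not part of the controller: the module name `qspi_read_cost` and the function `read_clocks` are hypothetical, the six dummy cycles are the Winbond/Spansion figure from above, and I/O register delays are ignored.

```verilog
// Simulation-only sketch: SCK cycles needed to read nwords 32-bit words
// over QSPI.  All names here are made up for illustration.
module qspi_read_cost;
    function integer read_clocks;
        input integer nwords;  // 32-bit words read in one transaction
        input integer ncmd;    // 8 if the command byte is sent, 0 in XIP mode
        input integer ndummy;  // dummy cycles, mode bits included
        begin
            // 6 clocks carry the 24-bit address, 4 bits at a time,
            // and every 32-bit word costs 8 clocks
            read_clocks = ncmd + 6 + ndummy + 8 * nwords;
        end
    endfunction

    initial begin
        $display("QSPI with command: %0d", read_clocks(1, 8, 6)); // 28
        $display("XIP, one word:     %0d", read_clocks(1, 0, 6)); // 20
        $display("XIP, four words:   %0d", read_clocks(4, 0, 6)); // 44
        $finish;
    end
endmodule
```

The last line is the 12+8N case with N=4: four separate XIP reads would have cost 80 clocks instead of 44.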
If you have to use the CCK port of a Xilinx FPGA, your clock rate will be slowed down by at least 2x. I placed another clock in the chart above, to allow SCK to go low after the CS_n line becomes active. Then, if you are using the registered Xilinx I/O primitives, you’ll be required to slow down another three clocks.
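The “Universal” QSPI flash controller can therefore provide performance somewhere between 12+8N clocks and 28+16N clocks for N words, depending upon how it is set up, how your board is designed, and the flash chip on that board. (The slow end is twice 12+8N, plus one clock to lower SCK, plus three clocks of I/O delay.)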
Timing control loop
In our highest speed configuration, we’re going to want to handle an SCK signal equal to our controller’s clock rate. Such a signal might generate an output looking like Fig. 11 on the right. In this figure, the w_qspi_sck control signal is being used to control the final o_qspi_sck output. (I’ll also confess, these are top-level signal names. Within the controller itself, I use o_qspi_sck to reflect the signal shown at the right labeled w_qspi_sck. In the non-ODDR modes, there’s no difference between these two signals, only in the ODDR mode.)
Of course, we’ll want to be able to slow this clock down as well, so let’s create several signals from a basic clock divider circuit that we can use to control our logic below in the presence of a slower clock. These extra signals are shown in Fig. 12 below.
The CS_n and SCK signals are part of the basic SPI protocol. If CS_n is inactive (high), then the other signals, primarily clock and data, are allowed to be anything so that they can be multiplexed together in order to control several chips. The SCK signal controls the basic data transfer, and so we’ll focus on making sure data values only ever change when SCK falls.
The other clock control signals are:
ckstb: True when it’s time to move to the next set of output values. In a DDR output mode, this will be true on every cycle during a transaction. If we are dividing the clock by two, this will be true every other cycle.
ckneg: True when it’s time to set SCK low. Since we’ll only set SCK low at the beginning of a cycle, this signal is just a pseudonym for the ckstb signal.
ckpos: True when it’s time to set SCK high. This will take place mid-cycle. Of course, if we are in a DDR output mode, that is with SCK toggling at the system clock rate, then this doesn’t have nearly as much meaning, so we’ll leave it high.
ckpre: Some of the logic below will require an extra clock cycle to prepare for the next transition. This is the purpose of ckpre. It is designed so as to be true on the clock cycle prior to ckstb.
The code within this section is parameterized by several pieces. The first is the clock division parameter, OPT_CLKDIV. We’ll use this to control a clock divider in a moment. The second parameter, really a localparam, is the OPT_ODDR parameter. We’ll set this any time OPT_CLKDIV == 0, and use it to indicate that we are driving the SCK line at our full system clock rate, using an ODDR output primitive.
In the case where
OPT_ODDR is true, there’s only ever one clock per
SCK cycle. Hence, we’ll set all of these values true on every clock cycle.
If we are dividing our clock by two, such as in order to use a flash through a STARTUPE2 primitive, then we’ll set OPT_CLKDIV to 1. We’ll also need to toggle these signals, but only while the port is active. That way we can respond to a request no matter what phase of the counter we are in.
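As a rough illustration of these first two cases, here’s a minimal sketch of the strobe generation. It is not the original listing: the module name `sck_strobes`, the `busy` input (assumed high while CS_n is low), and the `first_half` register are all hypothetical, and the case for larger dividers is left out.

```verilog
// Sketch of the SCK timing strobes for OPT_CLKDIV == 0 (ODDR) and == 1.
// "busy" and "first_half" are illustrative names, not the author's.
module sck_strobes #(
    parameter OPT_CLKDIV = 0
) (
    input  wire i_clk, i_reset,
    input  wire bus_request,  // a new transaction starts on this clock
    input  wire busy,         // assumed: high while CS_n is active (low)
    output wire ckstb, ckneg, ckpos, ckpre
);
    localparam [0:0] OPT_ODDR = (OPT_CLKDIV == 0);

    generate if (OPT_ODDR)
    begin : CKSTB_ODDR
        // One system clock per SCK cycle: every strobe is always true
        assign ckstb = 1'b1;
        assign ckneg = 1'b1;
        assign ckpos = 1'b1;
        assign ckpre = 1'b1;
    end else if (OPT_CLKDIV == 1)
    begin : CKSTB_DIV2
        // first_half is high during the first of the two system clocks
        // that make up one SCK interval.  It rests low while idle, so a
        // request can be accepted on any clock.
        reg first_half;

        initial first_half = 1'b0;
        always @(posedge i_clk)
        if (i_reset)
            first_half <= 1'b0;
        else if (bus_request)
            first_half <= 1'b1;
        else if (first_half)
            first_half <= 1'b0;
        else
            first_half <= busy;

        assign ckpre = first_half;   // one clock ahead of ckstb
        assign ckpos = first_half;   // raise SCK mid-interval
        assign ckstb = !first_half;  // last clock of the interval
        assign ckneg = !first_half;  // lower SCK as the next one begins
    end endgenerate
    // OPT_CLKDIV > 1 is deliberately not sketched here
endmodule
```

Using continuous assignments for the constant case sidesteps the simulator quirk where an `always @(*)` block with no inputs never runs.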
For the sake of brevity, I’m going to skip the discussion of what it takes to divide the clock down slower than a factor of two. Fig. 12 above should show you what these signals need to look like in that case.
Before leaving this section, I’d like to draw your attention to the presence of the three generate blocks for this timing: one for OPT_ODDR when the clock divider is set to zero, one for when the clock division is set to one, and one more for all other cases. That means that, when we get to formal verification, we’ll have to make certain that our formal work gets applied to each of these three blocks separately.
Reading a word
Now that we have the clock divider out of the way, we can turn our attention to the fun part: building the actual flash controller logic.
If you ever find yourself needing to build your own controller, whether it be a SPI, flash, SDRAM or whatever protocol, the way to do it is usually straightforward: Find the specification sheet for the device you wish to interact with, search through the sheet for the timing diagram illustrating the interaction or interactions you wish to implement, and then build a state machine whose trace matches that diagram.
Building a flash controller is no different.
In this case, we’ll be implementing the QUAD I/O READ function. If you look this function up in the data sheet for your device, you’ll find two sets of protocols. The first describes how to get into the QUAD I/O XIP mode. The second I/O function shows a timing diagram describing the QUAD I/O XIP mode we’ll be using.
For example, here’s what the timing diagram looks like for a Spansion device.
Here’s another one describing how our operation needs to work for a Winbond device.
The data sheet for the Micron flash that I have doesn’t show the Quad I/O read from XIP mode, but it does show the Quad I/O read starting with the 8-bit command.
We’ll need to use this mode to get into the XIP mode, but more on that in a bit. For now, let’s just assume we are in the XIP mode where we can start immediately by sending the address to the flash device.
In all cases, we’ll need to go through several steps, and we’ll need to control the chip select (negative logic), o_qspi_cs_n, the SPI clock signal, o_qspi_sck, and the four outgoing data wires, o_qspi_dat. Since these wires will eventually be bi-directional at the top level, we’ll use a third signal, o_qspi_mod, to control the final I/O driver. We’ll also need to read the data lines from the flash, i_qspi_dat, once it starts returning information to us.
That means that we’ll need to support three I/O modes using o_qspi_mod:
NORMAL_SPI = 2'b00: DAT[0] is an output, DAT[1] is an input, and DAT[3:2] are both set high.
QUAD_WRITE = 2'b10: All data wires, DAT[3:0], are outputs of our FPGA.
QUAD_READ = 2'b11: All data wires, DAT[3:0], are inputs into our FPGA.
How the design interacts with the device’s I/O controllers is typically beyond the scope of any of my QSPI flash designs, although it is required to actually implement them within any hardware. At one time, I would specify specific I/O connections in the toplevel:
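Such top-level logic might look something like the following. This is only a sketch of the idea under the three modes listed above, not the original listing: the module name `qspi_datpins` and its port names are hypothetical, and the unused 2'b01 encoding is simply treated like NORMAL_SPI.

```verilog
// Hypothetical top-level helper: tri-state control of the four QSPI data
// pins, selected by the controller's o_qspi_mod output.
module qspi_datpins (
    input  wire [1:0] i_qspi_mod,  // the controller's o_qspi_mod
    input  wire [3:0] i_ctrl_dat,  // the controller's o_qspi_dat
    output wire [3:0] o_ctrl_dat,  // feeds the controller's i_qspi_dat
    inout  wire [3:0] io_qspi_dat  // the physical flash data pins
);
    localparam [1:0] QUAD_WRITE = 2'b10,
                     QUAD_READ  = 2'b11;

    // DAT[0] is driven in every mode except QUAD_READ
    assign io_qspi_dat[0] = (i_qspi_mod == QUAD_READ)
                                ? 1'bz : i_ctrl_dat[0];
    // DAT[1] is only ever driven in QUAD_WRITE
    assign io_qspi_dat[1] = (i_qspi_mod == QUAD_WRITE)
                                ? i_ctrl_dat[1] : 1'bz;
    // DAT[3:2]: driven in QUAD_WRITE, released in QUAD_READ,
    // and held high otherwise
    assign io_qspi_dat[3:2] = (i_qspi_mod == QUAD_WRITE) ? i_ctrl_dat[3:2]
                            : (i_qspi_mod == QUAD_READ)  ? 2'bzz
                            : 2'b11;

    // The controller always sees whatever is on the pins
    assign o_ctrl_dat = io_qspi_dat;
endmodule
```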
I’ve stopped setting my I/O pins in this manner, however.
The primary reason for this is that Arachne-PNR would never guarantee that the placement of this final piece of combinatorial logic would be anywhere near the pin. As a result, I’ve now counseled several individuals who have declared Yosys broken when their design gets sufficiently large that the I/O logic no longer gets placed adjacent to their pins.
While I’m told that this is fixed in NextPNR, I haven’t (yet) taken the time to go back and verify this. Instead, I’ve gotten used to configuring the vendor specific I/O buffers to handle this. Further, since I’m already using one for the ODDR SCK pin, it makes sense to use the same thing for all of the pins, if for no other reason than to keep the timing matching throughout the interface.
Hence our design will need to control o_qspi_cs_n, o_qspi_sck, and o_qspi_dat[3:0]. We’ll also control an intermediate value, o_qspi_mod, to tell an external I/O controller how we want the I/O handled. Similarly, we’ll be reading from i_qspi_dat[3:0], the data lines that come from that external I/O controller.
So let’s go back to how this controller will need to control these various wires.
The basic logic is that upon any bus request, we will need to work our way through a sequence of steps.
Just like with the state machine examples in my tutorial, I often find that using a counter to control the steps in the timing diagram feels the most natural–especially in a particularly long sequence such as this one. In this case, the counter idles at zero, and starts counting down immediately following a bus request. Once the counter reaches zero, the interface will return to idle and we should be producing our Wishbone (WB) acknowledgment.
You can see this counter, clk_ctr, and how it relates to our design in Fig. 16.
This one counter controls everything, so let’s walk through the steps of how it works.
Our logic starts with a bus request, where we set our counter to 14 plus the number of dummy cycles, NDUMMY. The 14 covers the six address intervals and the eight data intervals. This number of dummy cycles also includes two cycles for the mode bits.
Note that if we are not running in OPT_ODDR mode, the mode that runs SCK at the system clock speed, then we take an extra step to lower the clock line after activating the chip select. This will cost us one extra clock, and so this first value of clk_ctr depends upon OPT_ODDR in addition to NDUMMY.
On the other hand, if we are running in ODDR mode, then it feels like a waste to spend a whole cycle to lower SCK, so both CS_n and SCK will drop together, as shown in Fig. 18 on the right.
Once set, then on any following step during this operation, we’ll decrement our counter until it reaches zero.
Once it reaches zero, we are idle.
Now that we have this counter, we can hang all of the rest of our logic upon it.
For example, here’s the outgoing chip select bit. Remember, this is an active low bit. On any bus request, we’ll clear this bit.
Then at the end of every clock interval, we’ll check clk_ctr to know if this operation is over. Once the counter gets to one, we’ll set o_qspi_cs_n again to indicate the end of the operation on the next cycle.
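Put together, the counter and the chip select might be sketched as follows. This is a minimal illustration under the description above, not the original listing: the module name `readctr_sketch` and the `START` constant are hypothetical, `bus_request` is assumed to be true only when the controller accepts a new read, and `ckstb` is the end-of-interval strobe from the timing section.

```verilog
// Sketch: the step counter and the active-low chip select it drives.
module readctr_sketch #(
    parameter [0:0] OPT_ODDR = 1'b1,
    parameter       NDUMMY   = 6    // dummy cycles, mode bits included
) (
    input  wire       i_clk, i_reset,
    input  wire       bus_request,  // a new read has been accepted
    input  wire       ckstb,        // end of an SCK interval
    output reg  [4:0] clk_ctr,
    output reg        o_qspi_cs_n
);
    // 6 address + NDUMMY + 8 data intervals, plus one more to lower SCK
    // after CS_n whenever we are not in ODDR mode
    localparam [4:0] START = 14 + NDUMMY + (OPT_ODDR ? 0 : 1);

    initial clk_ctr = 0;
    always @(posedge i_clk)
    if (i_reset)
        clk_ctr <= 0;
    else if (bus_request)
        clk_ctr <= START;
    else if (ckstb && clk_ctr != 0)
        clk_ctr <= clk_ctr - 1;     // zero means idle

    initial o_qspi_cs_n = 1'b1;
    always @(posedge i_clk)
    if (i_reset)
        o_qspi_cs_n <= 1'b1;
    else if (bus_request)
        o_qspi_cs_n <= 1'b0;        // active low: select the flash
    else if (ckstb)
        // Release the chip as the counter leaves its last step
        o_qspi_cs_n <= (clk_ctr <= 1);
endmodule
```

With NDUMMY between 1 and 10, START never exceeds 25, so five bits are enough for the counter.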
The SCK clock is a bit more difficult, particularly because of our requirements creep. If we are running in OPT_ODDR mode, where the outgoing SCK is determined by an ODDR I/O primitive, then we’ll output a 1'b1 any time we want the clock to toggle.
If you look closely, you might argue that this o_qspi_sck signal carries the same information as the o_qspi_cs_n signal. In this most basic mode, and only if OPT_ODDR is true, these two signals could share the same logic.
I should point out that I ended up using parameters quite extensively in this design. OPT_ODDR isn’t the only one. I did this for reasons of code optimization. By using OPT_ODDR, the synthesis tool can quickly recognize that the condition of an if (OPT_ODDR) statement is constant, and that the else following will never get used. Hence, the synthesis tool will remove the rest of this nested if. Likewise, if OPT_ODDR isn’t true, this part of the if will get removed and not count against the logic used by this core.
If OPT_ODDR isn’t true then things get just a little more interesting. As per the SPI protocol we are following, the clock idles at 1'b1 over the wire, and so it idles at 1'b1 in non-ODDR mode where we are directly controlling the over-the-wire interface. On the other hand, if we are running in OPT_ODDR mode, we are only controlling whether the clock toggles. Hence in OPT_ODDR mode, the clock pin idles at 1'b0. (See Figs. 11 and 12 for clarification.)
Now, if the clock is low, and ckpos tells us that it is time to raise it, then set it high. This will occur in the middle of our SCK interval, and only if OPT_ODDR isn’t set.
Finally, if our clock divider tells us it is time to lower the clock, that is if ckneg is true, then lower the o_qspi_sck output, but only if our SPI cycle isn’t finished. Notice the check here, as above, for whether we are or are not still within any I/O operation.
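The clock logic described over the last several paragraphs could be sketched like this. It is only an illustration, not the original listing: the module name `sck_sketch` is hypothetical, and `clk_ctr`, `ckpos`, `ckneg` and `bus_request` are assumed to behave as described above (with `clk_ctr` counting down to zero).

```verilog
// Sketch: the outgoing SCK control.  In ODDR mode the output means
// "toggle SCK this cycle"; otherwise it is the SCK level itself.
module sck_sketch #(
    parameter [0:0] OPT_ODDR = 1'b0
) (
    input  wire       i_clk, i_reset,
    input  wire       bus_request,
    input  wire       ckpos, ckneg,
    input  wire [4:0] clk_ctr,
    output reg        o_qspi_sck
);
    // Idles low (no toggling) in ODDR mode, high otherwise
    initial o_qspi_sck = !OPT_ODDR;
    always @(posedge i_clk)
    if (i_reset)
        o_qspi_sck <= !OPT_ODDR;
    else if (OPT_ODDR)
    begin
        if (bus_request)
            o_qspi_sck <= 1'b1;            // start toggling with CS_n
        else
            o_qspi_sck <= (clk_ctr > 1);   // stop after the last step
    end else if (bus_request)
        o_qspi_sck <= 1'b1;   // first step: CS_n drops, SCK stays high
    else if (ckpos && !o_qspi_sck)
        o_qspi_sck <= 1'b1;   // mid-interval: raise the clock
    else if (ckneg && o_qspi_sck && clk_ctr > 1)
        o_qspi_sck <= 1'b0;   // new interval, and we are not yet done
endmodule
```

Because OPT_ODDR is a constant, a synthesis tool only ever keeps one of the two branches.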
Now that we’ve set the chip select and the clock, we can turn our attention to the data bits. These get set on any bus request, and then shifted at the end of every clock interval. Ideally, that would mean we’d set this anytime i_wb_stb && !o_wb_stall. However, I’ve become somewhat of a stickler for low-logic solutions, and the reality is that these bits are don’t cares if !i_wb_stb && !o_wb_stall, so I just check for !o_wb_stall.
Perhaps if I were interested in building a lower power design, I’d want to eliminate any extraneously toggling data.
But low-power isn’t my current goal.
For now, notice how the LGFLASHSZ address bits get set, the lower two address bits get cleared (since we are responding to a 32-bit data bus request), and the mode bits get set. These mode bits will help to guarantee that we don’t leave XIP mode once we’ve gotten into it.
Sure, the entire QSPI read operation is longer than this, but there’s never a time when we’ll need to output more valid bits than this. Indeed, after these bits get sent, the controller will switch the I/O lines from output to input modes, so again there’s no reason to care about these bits after the output duration of this operation ends.
The last QSPI I/O interface item that needs to be controlled is the I/O mode, to be used in determining which bits are set to outputs and which bits to inputs in the external I/O controller.
We’ll start in NORMAL_SPI mode, and then transition on a bus request to QUAD_WRITE mode in order to send the address of the data we wish to read. Once we get past the address and mode bits, we can then go into QUAD_READ mode to read our data.
That’s the logic necessary to control a read.
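To tie the outgoing data and the I/O mode together, here’s a minimal sketch for the OPT_ODDR case only. It is not the original listing. The module name `qspi_out_sketch`, the `data_pipe` register and the `wb_stall` port (standing in for the controller’s own o_wb_stall) are hypothetical; it assumes LGFLASHSZ is no more than 24, that `clk_ctr` starts at 14+NDUMMY, and that leaving the pins in QUAD_READ once a read finishes is acceptable.

```verilog
// Sketch (ODDR timing only): outgoing address/mode nibbles and I/O mode.
module qspi_out_sketch #(
    parameter LGFLASHSZ = 24,   // log2 of the flash size in bytes, <= 24
    parameter NDUMMY    = 6     // dummy cycles, mode bits included
) (
    input  wire                 i_clk, i_reset,
    input  wire                 i_wb_stb,
    input  wire [LGFLASHSZ-3:0] i_wb_addr,  // 32-bit word address
    input  wire                 wb_stall,   // the controller's o_wb_stall
    input  wire                 ckstb,
    input  wire [4:0]           clk_ctr,
    output wire [3:0]           o_qspi_dat,
    output reg  [1:0]           o_qspi_mod
);
    localparam [1:0] NORMAL_SPI = 2'b00,
                     QUAD_WRITE = 2'b10,
                     QUAD_READ  = 2'b11;

    reg [31:0] data_pipe;

    initial data_pipe = 0;
    always @(posedge i_clk)
    if (!wb_stall)
    begin
        // These bits are don't cares unless a request is being
        // accepted, so i_wb_stb is not checked here
        data_pipe <= 0;
        data_pipe[8 +: LGFLASHSZ] <= { i_wb_addr, 2'b00 }; // byte address
        data_pipe[7:0] <= 8'ha0;    // mode bits: stay in XIP mode
    end else if (ckstb)
        data_pipe <= { data_pipe[27:0], 4'h0 };  // next nibble, MSB first

    assign o_qspi_dat = data_pipe[31:28];

    initial o_qspi_mod = NORMAL_SPI;
    always @(posedge i_clk)
    if (i_reset)
        o_qspi_mod <= NORMAL_SPI;
    else if (i_wb_stb && !wb_stall)
        o_qspi_mod <= QUAD_WRITE;   // send the address and mode bits
    else if (ckstb && clk_ctr != 0 && clk_ctr <= NDUMMY + 7)
        // Six address and two mode nibbles are out: turn the bus around
        o_qspi_mod <= QUAD_READ;
endmodule
```

The threshold follows from the counter: after the eight outgoing nibbles, NDUMMY-2 dummy intervals and 8 data intervals remain, or NDUMMY+6 in total. The divided-clock modes would additionally have to hold the first nibble back through the extra CS_n step.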
But what about the rest of our bus logic? While all of this is going on, we need to be doing a couple of things. First, the bus must be stalled. Second, we need to be collecting data from the QSPI data lines to return to the bus. Finally, once the operation completes, we need to acknowledge the bus request, signaling that the data we’ve collected is now valid.
Let’s start with the stall signal. On any bus request, we’ll set the stall signal high since it will be many cycles before we can respond to another bus request.
Then, at the end of every clock interval, we’ll adjust the stall signal so that
it remains high until our operation is done. Once
clk_ctr == 0, we’ll both
(potentially) acknowledge the request, and drop our stall signal.
We’ll come back to this in a bit and discuss how to handle the register delays on our input wires, since that will force us to keep the stall line high even after our transaction has finished.
The Wishbone acknowledgement signal looks simple enough. Following the
clock cycle where
clk_ctr==1, we’ll acknowledge this request.
Only … this is where we start to get in trouble with reality.
First, a formal proof of this logic fails if the master drops the i_wb_cyc line before we have the chance to set this acknowledgment. We can’t interrupt our flash I/O cycle when this happens, lest we fail to output the 0xa0 mode bits and the flash chip gets placed into a state other than the XIP one. Therefore, we’ll need to keep track of whether the bus master has dropped the i_wb_cyc line and then suppress any acknowledgments if it has.
The pre_ack logic keeps track of whether or not we are still within the original bus cycle.
That way we can use it, within our calculation of the bus acknowledgment, to return a proper value.
This isn’t quite the last of our problems either. What if a user wants to write to our read-only memory? Such an operation is undefined, but we can’t allow the bus to stall waiting for the result of an unsupported operation. If I believed in bus errors, I might raise one here, and there are a lot of good reasons to do so. I chose instead to quietly acknowledge any write request without doing anything.
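The stall, pre_ack and acknowledgment logic from the last several paragraphs might be sketched as below. This is an illustration of the description rather than the original listing: the module name `wb_handshake_sketch` and the `read_request`/`write_request` wires are hypothetical, and the input register delays mentioned above are not handled.

```verilog
// Sketch: Wishbone stall/ack handling around a single flash read.
module wb_handshake_sketch (
    input  wire       i_clk, i_reset,
    input  wire       i_wb_cyc, i_wb_stb, i_wb_we,
    input  wire       ckstb,       // end of an SCK interval
    input  wire [4:0] clk_ctr,     // counts down to zero (idle)
    output reg        o_wb_stall, o_wb_ack
);
    wire read_request, write_request;
    reg  pre_ack;

    assign read_request  = i_wb_stb && !o_wb_stall && !i_wb_we;
    assign write_request = i_wb_stb && !o_wb_stall &&  i_wb_we;

    // Stall from the request until the last interval completes
    initial o_wb_stall = 1'b0;
    always @(posedge i_clk)
    if (i_reset)
        o_wb_stall <= 1'b0;
    else if (read_request)
        o_wb_stall <= 1'b1;
    else if (ckstb)
        o_wb_stall <= (clk_ctr > 1);

    // pre_ack: are we still within the bus cycle that asked for this read?
    initial pre_ack = 1'b0;
    always @(posedge i_clk)
    if (i_reset || !i_wb_cyc)
        pre_ack <= 1'b0;
    else if (read_request)
        pre_ack <= 1'b1;

    initial o_wb_ack = 1'b0;
    always @(posedge i_clk)
    if (i_reset || !i_wb_cyc)
        o_wb_ack <= 1'b0;
    else if (write_request)
        o_wb_ack <= 1'b1;   // quietly acknowledge writes, doing nothing
    else
        // Acknowledge on the clock after clk_ctr == 1, unless the
        // master abandoned the cycle in the meantime
        o_wb_ack <= ckstb && (clk_ctr == 1) && pre_ack;
endmodule
```

Note that the flash transaction itself still runs to completion when i_wb_cyc drops. Only the acknowledgment is suppressed.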
Our last step is to set and return our data value to the WB bus.
That one’s easy, right? Anytime there’s a value to be read, shift it into our data register.
But, when is there data to be shifted in? Here, I use a separate signal, read_sck, to capture this logic. While I could have used the outgoing clock signal itself, and certainly did initially, I had to adjust this approach later to make certain that o_wb_data never changes unless we are mid-operation with clk_ctr > 0.
If we are in
OPT_ODDR mode, the mode where
SCK can toggle at the system
clock speed, then anytime the output clock is active, we should be reading
into our shift register.
You can see the resulting waveform trace in Fig. 19 below.
This would be catastrophic, though, if we only wanted to shift the data in on every other clock. Hence, if we are dividing the clock by two, then we want to read on the last clock of every SCK cycle. One clock before that, o_qspi_sck will be low.
This extra logic is shown in the last line of the trace shown below in Fig. 20.
Finally, if we are dividing by anything more than two, then we’ll register the
read_sck signal, and use the
ckpre signal as our indication that we
need to sample on the next clock.
Again, this is shown in Fig. 21 below. Notice how ckpre is true one clock before read_sck, as required to make this work. Notice also how the SCK clock goes through its negative half cycle first, leaving the positive half cycle for last.
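For the first two cases, the read strobe and the incoming shift register might be sketched as follows. This is my illustration of the description above, not the original listing: the module name `rdata_sketch` is hypothetical, the `qspi_cs_n` and `qspi_sck` inputs stand in for the controller’s own o_qspi_cs_n and o_qspi_sck outputs, any I/O register delay is ignored, and the ckpre-based case for larger dividers is left out.

```verilog
// Sketch: when to sample i_qspi_dat, for ODDR and divide-by-two clocking.
module rdata_sketch #(
    parameter [0:0] OPT_ODDR = 1'b1   // 0 here means divide-by-two
) (
    input  wire        i_clk,
    input  wire        qspi_cs_n,     // the controller's o_qspi_cs_n
    input  wire        qspi_sck,      // the controller's o_qspi_sck
    input  wire [3:0]  i_qspi_dat,
    output reg  [31:0] o_wb_data
);
    reg read_sck;

    generate if (OPT_ODDR)
    begin : RD_ODDR
        // SCK toggles on every clock it is enabled: sample every time
        always @(*)
            read_sck = qspi_sck;
    end else begin : RD_DIV2
        // SCK is low one clock before the last clock of each SCK
        // cycle, so register that to sample on the last clock
        initial read_sck = 1'b0;
        always @(posedge i_clk)
            read_sck <= !qspi_cs_n && !qspi_sck;
    end endgenerate

    // Shift four bits in at a time, most significant nibble first.
    // Earlier nibbles are shifted out again before the read completes.
    always @(posedge i_clk)
    if (read_sck)
        o_wb_data <= { o_wb_data[27:0], i_qspi_dat };
endmodule
```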
To verify that this logic works, I used one of those “poor man’s sequences” that I discussed earlier.
Why not use a regular SystemVerilog sequence? Well, I started out using the more traditional SVA sequences. However, ultimately it was the variable clock rate that made using SVA sequences impossible, and so I had to switch to the poor man’s sequence approach.
By a poor man’s sequence, I mean something like the following:
First, I define how long this operation will take in logical steps, not clock steps. This includes the first step, found only when OPT_ODDR is low, where CS_n is low and SCK remains high, followed by the six clock intervals of the address. This is then followed by a parameterizable number of dummy cycles, and then our eight data read cycles. We can capture this total length in a single local parameter.
We can then define a sequence vector of this many states, plus one more for the acknowledgment cycle, with the meaning that if any of the bits in this vector is a one, then we are in that state.
The logic to control this sequence is actually fairly simple. On a reset, the sequence is cleared.
Otherwise, we advance the sequence at the end of every
SCK clock period.
There’s one problem with only stepping the sequence at the end of every
SCK clock interval: what happens to the acknowledgment?
o_wb_ack can only
be high for one clock cycle, not for as many cycles as there are in an
SCK clock interval. Therefore, we’ll need to clear the upper bit
if our clocking is extended at all.
The last step is to start the sequence. We’ll start it on any bus
request. Well, almost. For reasons we’ll get into later we’ll start this
only on a bus request where the
CS_n line is idle (high).
From here, we’ll shift this left one step per every state transition. Fig. 35 below shows an example of this, but only after adding in several more features, so let’s work our way up to that point.
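As a rough illustration of such a sequence register, consider the following sketch. It is not the original listing: the module name `fseq_sketch` and the length parameter `F_MEMDONE` are hypothetical names, and the `qspi_cs_n` input stands in for the controller’s o_qspi_cs_n output. Here NDUMMY again includes the two mode-bit intervals.

```verilog
// Sketch: a "poor man's sequence" tracking the steps of one read.
// Bit k set means "we are in step k"; the top bit is the ack cycle.
module fseq_sketch #(
    parameter [0:0] OPT_ODDR = 1'b1,
    parameter       NDUMMY   = 6
) (
    input wire i_clk, i_reset,
    input wire bus_request,   // a new read has been accepted
    input wire ckstb,         // end of an SCK interval
    input wire qspi_cs_n      // the controller's o_qspi_cs_n
);
`ifdef FORMAL
    // Steps: [lower SCK, non-ODDR only] + 6 address + NDUMMY + 8 data
    localparam F_MEMDONE = (OPT_ODDR ? 0 : 1) + 6 + NDUMMY + 8;

    reg [F_MEMDONE:0] f_memread;

    initial f_memread = 0;
    always @(posedge i_clk)
    if (i_reset)
        f_memread <= 0;
    else begin
        if (ckstb)
            // Advance one step per SCK interval
            f_memread <= { f_memread[F_MEMDONE-1:0], 1'b0 };
        else if (!OPT_ODDR)
            // The ack step may only last a single system clock
            f_memread[F_MEMDONE] <= 1'b0;

        // Start only from idle, with CS_n high.  Being the last
        // assignment, this overrides the shift above.
        if (bus_request && qspi_cs_n)
            f_memread <= { {(F_MEMDONE){1'b0}}, 1'b1 };
    end
`endif
endmodule
```

Assertions can then be written against individual bits, for instance checking the outputs whenever f_memread[0] or f_memread[F_MEMDONE] is set.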
Now we can make assertions about what’s supposed to happen in each step.
For example, we want to make certain
o_qspi_sck is high during any
We might also wish to assert that we start out this sequence in QUAD_WRITE mode, and end it in QUAD_READ mode. There are a couple of steps in the middle where our I/O mode doesn’t matter, but otherwise this fully constrains our o_qspi_mod output.
In a moment, we’ll need a copy of what was read from the
i_qspi_dat set of
input pins in order to verify that we received the right values. So let’s
create a second copy of the incoming data for that check.
Next, let’s examine those first six clocks. These are the ones where we need to be outputting the address we were given from the bus. The first step to this check is making sure we have a copy of the last requested address to check against.
We can now use this to compare against what we are presenting across the port.
For example, during our first time interval, we’ll want to output the top four address bits.
Don’t let the OPT_ODDR scare you in this expression. This just references the extra clock cycle used in the slow clock mode before lowering the SCK line. During that cycle, output data values are don’t cares, so we don’t check them here. However, if we are running in the faster mode, then we don’t take an extra cycle, hence the reason for checking whether or not OPT_ODDR is set here.
Here are the rest of the checks for the rest of the address intervals.
Don’t forget that, because we are addressing the flash with 32-bit word addresses, that the bottom two of twenty-four bits are necessarily zero.
Or, likewise, if we want to stay in XIP mode (and we do), that we have to send 4'ha following the address.
Now let’s turn our attention to the returned result. In OPT_ODDR mode, we get a result every clock tick. In this case, the $past() function is ideal for checking if we are returning the right value.
On the other hand, if we haven’t yet reached the end of the sequence, then both the stall signal should be high and the acknowledgment signal should be low.
If we are running in a slower clock mode, then
$past() won’t work for us.
Instead, we can use the copy we just made of the incoming data to prove
that we received the right value.
The rest of this logic should match the logic above.
We also want to make certain that, on the very last clock tick, the counter has properly returned to zero.
We’ll use one final assertion to double check that
f_memread only ever has
one value active at any given time.
Finally, just to get some assurance that this actually works, we’ll add a
cover() statement to check that, yes, we truly can perform this operation.
We now have a basic, functioning, QSPI flash controller. Or do we? So far, I’ve only presented how to handle requests once we’ve already gotten into this special XIP mode. We’ll still have to come back to the question of how to get into this mode in the first place. Similarly, we haven’t discussed how to send or receive arbitrary commands yet, or how to handle I/O delays. Let’s push those topics off for a bit longer, and look at how to read a second word without needing to go through the address cycle again.
Reading another word
With the logic above, we can now read a word from our flash chip. We can do this at the system clock rate, or any arbitrary division of it. In this section, let’s instead focus on what it takes to read data from the flash using the pipelined features of the Wishbone bus.
While you might wish to call this a burst bus mode, unlike other burst modes that I’ve worked with (WB, AXI, etc), this one doesn’t carry a burst length parameter, burst size, or even address increment information. For this reason, I often call this a pipelined mode rather than a burst mode, even though there are some obvious similarities between the two. As a result, you’ll find I often describe these as “pipe” or “piped” requests.
Within my design, this pipelined mode is controlled by the OPT_PIPE parameter.
Further, unlike many bus burst
modes, these piped requests are controlled on a beat by beat basis in the
master. There’s no pre-announcement of the number of values to be read, such
as in the AXI master specification
or in the Wishbone burst modes from the B3
specification that I’ve
carefully chosen not to
implement. Instead, we’ll
need to determine on a beat by beat basis if the next read request continues the
burst, or if we need to raise
o_qspi_cs_n and start over with a new address.
Here you can see the definition of the
OPT_PIPE parameter controlling whether
or not we support this mode in the first place.
If this parameter is set, the controller will respond to requests for subsequent
addresses. Hence, if you request a read from address
A, and then while the
controller is busy making that happen you request a second read from address
A+1 (i.e. one word, or 32-bits later), then the controller should recognize
and honor this request before closing up the interface.
Sadly, that means we’re going to need to go back over a lot of our logic above and adjust it to make these subsequent reads possible.
The first step, though, is a bit of complicated logic determining if a subsequent read is even pending that would extend our burst access in the first place.
An important part of this check is to know if a bus request is pending for
the next address. The first step of that logic is to calculate what that
next address, or
next_addr, will be. In particular, this address is defined
as one more than the last address accepted. Hence, anytime !o_wb_stall is true, we
can create a copy of the incoming address plus one. (Notice we dropped the
i_wb_stb again.) Following requests for this address
will then be honored without closing the interface.
This will capture the
next_addr from not only the beginning of our first
request, but will also update it at the beginning of any subsequent address
as well, since the logic above, based upon the
!o_wb_stall signal alone,
doesn’t care which of the two it is responding to.
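Here's a minimal sketch of what that next address logic might look like. It is not the core's actual source: the module wrapper, the AW parameter, and the i_stall port (standing in for o_wb_stall) are my own assumptions so that the fragment stands alone.

```verilog
// Illustrative sketch only. The module wrapper, the AW parameter, and
// the i_stall port name are assumptions; in the controller itself this
// logic would simply use o_wb_stall.
module next_addr_sketch #(
		parameter AW = 22	// assumed: 16MB flash, word addressed
	) (
		input	wire		i_clk,
		input	wire		i_stall,	// stands in for o_wb_stall
		input	wire [AW-1:0]	i_wb_addr,
		output	reg  [AW-1:0]	next_addr
	);

	initial	next_addr = 0;
	always @(posedge i_clk)
	if (!i_stall)
		// Whatever might be accepted on this clock, remember the
		// address of the word that would follow it
		next_addr <= i_wb_addr + 1'b1;
endmodule
```

Leaving i_wb_stb out of the condition costs nothing: if no request was actually made, the copy is simply never used.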
A pipe request requires several things that all need to be true.
First, this has to be part of the last transaction. Remember how we used
pre_ack to keep track of whether the last transaction was aborted?
pre_ack must be true–indicating that the last request was never
aborted. Second, there must be an outstanding request, so i_wb_stb must
be true as well. The new request must also be a read request, so the write enable line, i_wb_we, must be low.
Further, it must be a request while we are already busy, and so the chip select
must be active so
!o_qspi_cs_n. The clock counter must be greater than
zero, and the outstanding request must be for the next address.
This all makes sure that we are not only receiving a next address read request, but also that we are getting that request while we are still reading from the last address.
Since that’s a lot of logic, we’ll register it to keep it from slowing down the rest of the core.
Of course, if we aren’t supporting burst reads, then this value needs to be kept at zero–so the synthesizer can optimize away any unused logic.
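As a rough sketch of how those conditions might be registered, consider the fragment below. The signal names follow the text, while the module wrapper, the reset input, the 5-bit counter width, and the AW default are assumptions made so the example is self-contained. The real core may also gate this logic on other conditions not discussed yet.

```verilog
// Illustrative sketch only, not the core's actual source.
module pipe_req_sketch #(
		parameter	AW = 22,		// assumed address width
		parameter [0:0]	OPT_PIPE = 1'b1
	) (
		input	wire		i_clk, i_reset,
		input	wire		pre_ack,	// last request not aborted
		input	wire		i_wb_stb, i_wb_we,
		input	wire		o_qspi_cs_n,	// current chip select state
		input	wire [4:0]	clk_ctr,	// width is an assumption
		input	wire [AW-1:0]	i_wb_addr, next_addr,
		output	reg		pipe_req
	);

	generate if (OPT_PIPE)
	begin : GEN_PIPE
		initial	pipe_req = 1'b0;
		always @(posedge i_clk)
		if (i_reset)
			pipe_req <= 1'b0;
		else
			// A read of the next address, arriving while we
			// are still busy with the current one
			pipe_req <= pre_ack && i_wb_stb && !i_wb_we
				&& !o_qspi_cs_n && (clk_ctr > 0)
				&& (i_wb_addr == next_addr);
	end else begin : NO_PIPE
		// No burst support: hold at zero so the synthesizer can
		// remove anything depending upon it
		always @(*)
			pipe_req = 1'b0;
	end endgenerate
endmodule
```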
Registering all this logic is going to change our timing diagram somewhat, as shown in Fig. 23 below.
Notice from the figure that the logic recognizing a pipelined request needs
to first notice the request when
clk_ctr == 3. Then
pipe_req gets set
one clock later, when
clk_ctr == 2, and so the
o_wb_stall line gets lowered when
clk_ctr == 1. This is all set up so that
clk_ctr can then
jump back from clk_ctr == 1 to
clk_ctr == 8 to start the second read.
The formal tools,
however, discovered an error in this basic set up. If
I ever take more than one clock cycle per
SCK, then it might be that
clk_ctr == 1 for multiple cycles before
o_wb_stall needs to be lowered.
Thanks to the formal tools,
I think I found all of the missing logic tests.
All that’s left then is to patch this into our prior logic. The biggest
changes will be to our counter,
clk_ctr, and our stall signal, o_wb_stall. Neither the
o_qspi_cs_n logic nor the
o_qspi_sck logic needs to change,
since both of these are already set appropriately on any bus request.
Let’s start by updating
clk_ctr. Before, on a read request, we set the counter to
14+NDUMMY+(!OPT_ODDR). Now, if pipe_req is
true, we’ll need to set it to
8 just before the operation ends.
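To make that concrete, here's a hedged sketch of the updated counter. It assumes that bus_request covers both initial and piped requests, that the counter otherwise counts down on ckstb, and that NDUMMY is 6. All three are assumptions for illustration, as is the module wrapper.

```verilog
// Illustrative sketch only, not the core's actual source.
module clk_ctr_sketch #(
		parameter [0:0]	OPT_ODDR = 1'b0,
		parameter	NDUMMY = 6	// assumed: check your data sheet
	) (
		input	wire		i_clk, i_reset,
		input	wire		bus_request, pipe_req, ckstb,
		output	reg	[4:0]	clk_ctr
	);

	// Address, mode, dummy, and data clocks for a brand new read
	localparam [4:0] START_COUNT = 14 + NDUMMY + (OPT_ODDR ? 0 : 1);

	initial	clk_ctr = 0;
	always @(posedge i_clk)
	if (i_reset)
		clk_ctr <= 0;
	else if (bus_request && !pipe_req)
		clk_ctr <= START_COUNT;	// a fresh read: send everything
	else if (bus_request)
		clk_ctr <= 5'd8;	// continuing: just eight more nibbles
	else if (ckstb && (clk_ctr != 0))
		clk_ctr <= clk_ctr - 1'b1;
endmodule
```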
At first, updating the stall signal is easy. We still raise the stall signal on a bus request, regardless of whether or not it’s a piped (burst) request.
Where things start to get difficult is when determining when to drop the stall line in order to accept this transaction.
It turns out that there are two separate pieces of logic required. First, if
we are in
OPT_ODDR mode and hence running at the system clock, then we’ll
need to drop the stall signal when
clk_ctr == 2 so that
o_wb_stall will be low when
clk_ctr == 1 as shown in Fig. 23 above.
Remember, we can’t make a mistake here, and timing is critical. One mistake where
o_wb_stall is low for one too many clock cycles, and we might
accidentally accept an extra request that we have no intention of processing.
On the other hand, if we are running slower than our clock speed, then we’ll
need to drop the stall signal while
clk_ctr == 1 as discussed above.
This needs to be done one clock before
ckstb when all of our states change,
and so we use the
ckpre signal for that purpose. Notice that, if OPT_ODDR is set,
ckstb in the above condition will always be true, so this next bit
of logic will get ignored.
Again, if you get confused by this logic at all, refer back to Figs. 11 or 12 above.
Verifying the piped reads follows much of the same logic as the original memory read verification: we use a poor man’s sequence. This sequence is only ever nine steps in length, since all the variable length stuff was handled above. These nine states represent the eight new steps on the QSPI bus, plus a final one to return a Wishbone acknowledgment.
Now we can define a shift register with eight states (plus one for the acknowledgment), and step through it every time a clock period completes. This should look very similar to the shift register associated with the poor man’s sequence for reading in the first place.
Of course, if the states last longer than a single clock, then we’ll need to make certain that any bus acknowledgments still don’t last any longer than a single clock.
Using this state sequence vector, we can now make assertions about this second part of our state machine. For example, on that last beat of the sequence, either the data is right, or the acknowledgments must be low–in which case we don’t care what’s in the data.
Now let’s look at the rest of the steps in the sequence. Prior to our
acknowledgment, we should be stalled until the end of the
SCK clock cycle.
Once we hit the end of the
SCK clock cycle, we should still be stalled for all
stages except the one before we are done. That one exception is the stage,
shown in Fig. 23 above when
clk_ctr == 1, where we might possibly accept another request.
Finally, unless we are acknowledging the last memory cycle, the acknowledgment line must also be low. (Remember, we checked for our own acknowledgment cycle earlier in this cascaded if statement.)
One last assertion is necessary to tie our
f_piperead vector to the clock
counter. On the very last cycle of the sequence,
clk_ctr should be at
zero, unless we are extending into an additional burst read following this
one in which case
clk_ctr should be eight.
For all other cycles, the
clk_ctr should specify which of the f_piperead
bits is on.
As one final step to know that our core truly passes, we’ll add a cover statement to cover the acknowledgment from one of these pipe reads.
The Startup Sequence
Our core now possesses all of the functionality necessary to read from the flash, just not any of the functionality necessary to get into the Quad I/O XIP read mode that all of our reads will start from. Once there, we can read at full speed (or slower) upon any request, and we can continue that read request as long as the master continues issuing subsequent read requests. Getting into this mode in the first place will be the topic of this section. Well, that and how to patch the logic for such a startup sequence into the logic we’ve already written above.
Before getting into the details, I should note that I’ve built more than one of these startup scripts before. Sadly, they all end up being very device dependent, often because different Flash devices support different reset commands, and some need special instructions to set chip specific configuration registers. Hence, while the previous two sections are all (fairly) device independent, and while they all apply to any flash device that supports the Quad I/O XIP read mode, things become quite device dependent in this section.
When I built my first startup script,
I built my startup sequence from a giant counter. After letting the
flash idle for a period of time,
per the Spansion
specification I was following for starting the
flash, I would then toggle the CS_n
line as a form of a reset sequence, and then issue a single read command.
Sometimes this required setting the Quad-SPI enable bit in the configuration register.
This all worked until I tried using the Micron
flash chip. In that case, toggling the
CS_n line without toggling
SCK wasn’t guaranteed to do anything useful.
Worse, before setting the Quad-SPI enable bit, you had to set the write enable bit. And, if that wasn’t
all, the Micron flash
required up to 10 cycles between the address and the
data. Not only that, but that number of cycles is clock rate dependent. If
you didn’t run at 100MHz, you might be able to use fewer dummy cycles–making
the number of dummy cycles not only vendor but also clock rate dependent. If
that wasn’t enough, my 100 MHz flash
implementation required setting the drive strength, measured in Ohms, in order
to actually get up to 100MHz, and that too requires setting the write enable bit.
Because of the number of times I’ve ended up rebuilding this startup script, I chose to rebuild it this last time using an array of startup micro-commands rather than a counter driven script. While this might not be as low logic as I like, it will at least be easy enough to adjust from one flash device to the next.
This, therefore, is the one piece of our “Universal” flash controller
that remains device dependent–not counting the number of dummy cycles,
the number of wait states on registered I/O, the device dependent
SCK rate, or ….
Here’s how our micro-control commands will be formatted.
We’ll use one bit to select between a command to be sent to the device, and some number of counts to wait idle before the next command. I call this the wait bit, M_WAITBIT, within the code and marked it as S for sleep in Fig. 25 above. If this bit is set, the other 10-bits of the command word will indicate the number of counts to remain idle with the CS_n line inactive (high). If the bit is not set, the CS_n line will be made active (low). Indeed, this sleep mode is currently the only way to set CS_n inactive between commands.
The next two bits, shown as M above, will select the mode the command will be in, whether normal SPI, quad write, or quad read.
The final 8-bits will record an 8-bit data byte to be sent to the device–in either high or low speed, or ignored in QUAD_READ mode.
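To make the field layout concrete, here's a small decoding sketch. It assumes the wait bit is the top bit of the 11-bit word, with the mode bits directly below it, as the description above suggests. The module and its port names are hypothetical, and the encoding of the two mode bits is deliberately left uninterpreted.

```verilog
// Illustrative sketch only: splits an 11-bit micro-command into fields.
module ucmd_decode_sketch (
		input	wire	[10:0]	i_cmd,
		output	wire		o_sleep,	// the S (wait) bit
		output	wire	[9:0]	o_count,	// idle counts, if sleeping
		output	wire	[1:0]	o_mode,		// the M bits, if not
		output	wire	[7:0]	o_data		// byte to send, if not
	);

	localparam	M_WAITBIT = 10;	// assumed: the top bit of the word

	assign	o_sleep = i_cmd[M_WAITBIT];
	assign	o_count = i_cmd[M_WAITBIT-1:0];
	assign	o_mode  = i_cmd[9:8];
	assign	o_data  = i_cmd[7:0];
endmodule
```

Note how the count field overlaps the mode and data fields. Which interpretation applies depends entirely on the wait bit.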
I’ll admit, this is even my second version of this micro-code interface. My first version was a basic bit-banging microcode interface. I switched to the more complicated command interface when the bit-banging one started to become difficult to maintain. Now, with all of the commands specifying 8-bit byte values, the command script has become much easier to read and check by eye.
The good news is that we will barely need to adjust anything else in our design to make this startup script work once it comes time to integrate it.
The startup script begins with the array of instructions, each 11-bits long.
These words are set within a giant initial block. In general, this block needs to start by placing the flash chip into a known state from which we can send an SPI command to enter the QSPI XIP read state,
and end with a Quad Read I/O command,
24-bits of address (I set these to zero in general), a mode command,
some number of dummy cycles as determined by your specification sheet, and
then reading one or two bytes for good measure.
Some chips will also require you to set the
Quad I/O bit in a configuration
register. That annoying Micron chip requires that
we first send a write enable, and then set the enhanced configuration
register, followed by sending the write enable again and then setting the
enhanced volatile configuration register before we can start our flash
command. In other words, check your
chip vendor’s data sheet to see what information needs to be sent.
The startup interface within our controller centers around
an internal signal I call
maintenance, because in this
(startup) mode the design is offline for
maintenance. Once the maintenance
flag clears, we’ll enter into our normal operations.
We both start out in maintenance mode, and we return to it upon any reset.
Then, whenever it is time to move forward to the next word, we step forward
one index into our microcode array,
m_cmd_index, stopping only when we get
to the last word in our sequence.
M_FIRSTIDX above is used to help speed us through formal verification,
making it so the design skips the first several commands (mostly sleep
commands) and then goes directly into the startup sequence. That way, we can use a
cover() statement to generate a
trace showing us the whole sequence.
But I’m getting ahead of myself.
The m_final register above will be true when we get to the end of the
sequence. More on that in a moment as well.
Now that we have a command index into our micro-command table, we’ll want to use it to read from our array of startup commands.
We’re also going to need a flag to tell us when we are on the last command
word. We’ll call this m_final.
Next, let’s implement our sleep or wait counter. This is the one that counts
down some number of sleep cycles, with
o_qspi_cs_n held high (inactive).
Of course, the counter resets to its longest count,
-1, on reset, and it
starts in the middle of a sleep cycle.
Then, when it’s time to step to the next state, and time to move to the
next micro-command word, the counter starts up only if the M_WAITBIT (i.e.
the sleep bit) is set within the command word and the sleep count is greater
than zero.
Once set, the timer counts down to zero. Likewise, the m_midcount flag
will reflect that we are waiting for the timer to complete.
Once the m_midcount flag clears, we can then move to the next microcode
instruction. This is also why the logic above depends upon a signal that
is itself only true if
!m_midcount: we only move forward to the next
instruction if our counter has reached zero.
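Here's a sketch of how such a sleep counter might be written. The m_counter and m_midcount names follow the text. The i_next input, a hypothetical one-cycle pulse marking the step to the next micro-command, and the module wrapper are assumptions on my part.

```verilog
// Illustrative sketch only, not the core's actual source.
module sleep_counter_sketch (
		input	wire		i_clk, i_reset,
		input	wire		i_next,		// hypothetical step pulse
		input	wire	[10:0]	m_cmd_word,
		output	reg	[9:0]	m_counter,
		output	reg		m_midcount
	);

	localparam	M_WAITBIT = 10;	// assumed position of the sleep bit

	// Start in the middle of the longest possible sleep
	initial	m_counter  = -1;
	initial	m_midcount = 1'b1;
	always @(posedge i_clk)
	if (i_reset)
	begin
		m_counter  <= -1;
		m_midcount <= 1'b1;
	end else if (m_midcount)
	begin
		// Count down, clearing the flag as we arrive at zero
		m_counter  <= m_counter - 1'b1;
		m_midcount <= (m_counter > 1);
	end else if (i_next && m_cmd_word[M_WAITBIT]
			&& (m_cmd_word[M_WAITBIT-1:0] != 0))
	begin
		// A sleep command with a non-zero count restarts the timer
		m_counter  <= m_cmd_word[M_WAITBIT-1:0];
		m_midcount <= 1'b1;
	end
endmodule
```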
What about the
CS_n line and the mode bits? Let’s set them here, as well as
m_bitcount to keep track of which bit within our eight that we are currently sending.
On every
ckstb, we’ll move forward to the next step in our sequence.
Once every instruction has been acted upon, if this is the final instruction, then let’s cause these values to stop toggling.
Otherwise if we are in the middle of a timer count, or if we are about to start
a timer count down, then again set the bits to idle.
CS_n is deactivated,
and the port is placed in a
NORMAL_SPI mode. The bit count is also left
Finally, if we aren’t mid byte, and if this isn’t the last byte, and we
aren’t in a sleep cycle or about to start one, then we can accept
a new byte to transmit.
CS_n is activated (lowered) automatically, and the
mode is drawn from the next two bits of the word. The bit count is set to
the number of remaining
SCK clock periods necessary to send this word,
either 1 for a two-cycle word, or 7 for an eight cycle word.
Well, almost. If we aren’t running in
OPT_ODDR mode, and we aren’t continuing
a previous command, then we’ll add in one extra clock cycle for
SCK to be
high before dropping.
But what data should be sent? That comes from the rest of the bits in the
micro-command word, bits
7:0 as outlined in Fig. 25 above.
On any new command to send data to the
flash chip, we’ll set the outgoing
m_dat to the top four bits of the word for the quad write mode.
Otherwise, if we will be transmitting in
NORMAL_SPI mode, then we’ll instead
set bit zero to the top bit, and the other three are don’t cares.
A second register is
then used to capture the remaining bits to be sent.
Finally, while we are within a word, we’ll want to shift that register
over by either one or four bits in order to grab the next bits to send.
The last wire to set is the clock register,
m_clk, that will be used to drive the
SCK pin. If we are in
OPT_ODDR mode, where we are
running our clock at the system clock rate, this is as simple as setting the
clock to be identical to the negated CS_n value.
Otherwise, the
m_clk pin will set the
o_qspi_sck and hence the SCK pin
directly, so we’ll need to spend a bit more time at this. On a reset, the
SCK clock wire needs to idle at one. Otherwise, whenever
m_clk is already
low, then the clock is raised on the
ckpos signal. Further, in the middle
of a count down, the clock is kept idle (high). Otherwise, the clock goes
low if there’s another bit (nibble) to be sent.
That’s the startup logic.
Since it doesn’t depend upon the inputs at all, it’s easily tested by a basic testbench. Alternatively, the one cover statement shown above will calculate a trace for us, showing what this startup routine does.
But how shall we integrate this within the rest of the design?
Actually, that’s the easy part, and part of the magic of using the
maintenance flag. First notice that there’s no feedback path from the
flash chip to this micro-code startup
design. That means that an extra clock cycle (or two) won’t affect our logic.
This makes it easy to adjust each of our basic controller output port logic
blocks to respond to the
maintenance flag when it is set, and to ignore
the startup registers if not.
For example, in the case of
o_qspi_cs_n, we’d have
In the case of
o_qspi_sck, we’d have
The same applies to
o_qspi_mod, the bits used to control the external I/O,
the Wishbone stall register,
and so forth and so on.
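The pattern might be sketched as follows. The m_cs_n and m_clk inputs are the startup values (m_cs_n is my assumed name for the startup chip select register), while run_cs_n and run_sck are hypothetical placeholders for whatever the normal read logic built above would otherwise assign. In the actual controller those assignments are written inline rather than passed through ports.

```verilog
// Illustrative sketch only: the maintenance flag selects between the
// startup script's outputs and the normal controller logic.
module startup_mux_sketch (
		input	wire	i_clk, i_reset,
		input	wire	maintenance,
		input	wire	m_cs_n, m_clk,		// from the startup script
		input	wire	run_cs_n, run_sck,	// hypothetical normal values
		output	reg	o_qspi_cs_n, o_qspi_sck
	);

	// Both lines idle high
	initial	{ o_qspi_cs_n, o_qspi_sck } = 2'b11;
	always @(posedge i_clk)
	if (i_reset)
		{ o_qspi_cs_n, o_qspi_sck } <= 2'b11;
	else if (maintenance)
		// Startup script in control
		{ o_qspi_cs_n, o_qspi_sck } <= { m_cs_n, m_clk };
	else
		// Normal operation: startup registers are ignored
		{ o_qspi_cs_n, o_qspi_sck } <= { run_cs_n, run_sck };
endmodule
```

Since nothing feeds back from the flash to the startup script, the extra register delay here is harmless.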
If we’ve done this all right, we can then get a cover trace showing that our startup script works using a simple cover(!maintenance),
as I mentioned above.
Only, this doesn’t practically work.
The first problem is that I start the script with a very long set of delays.
These are required by some
flash chips. The problem with these
long delays is that the formal tools
can’t practically work through that many cycles. So, to cut these delays
down, I introduced
M_FIRSTIDX above–as a way to start the startup sequence
in the middle–but only during formal verification.
The second problem was the delays within the control structure, and this is
a problem for the same reason as the long delays upon startup. To deal with
these, I arbitrarily kept the maximum number of counts to 3, but only during
formal verification.
My third problem was that even with all this help, the startup design
still didn’t pass the cover check.
If you’ve ever had to debug a
cover() failure, it can be quite annoying: the tools
provide no information to you telling you why the
cover() request failed. Instead, all you learn is that the cover
part of the proof failed.
The secret to solving problems like this with
cover() is to break the
cover() problem up into smaller problems, to help you bisect and find the
problem.
While this was my approach, I may have also gone a bit overkill at it, as you’ll see below.
This way, if
cover(m_cmd_index == 5'h12) passed, but a later cover() failed,
I could look at the number of steps between states and estimate how many
more steps the tools
needed to reach the ultimate
cover(!maintenance). When dividing the clock
by six, such that
CLK_DIV == 5, this meant checking 560 states before the
proof would complete!
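A bisecting set of covers along these lines might look like the sketch below. Only the 5'h12 index and the final cover(!maintenance) come from the discussion above. The other index values are arbitrary illustrations, and the module wrapper exists only so the fragment stands alone. In practice these statements would sit in the formal section of the controller itself, where m_cmd_index is driven by the startup logic.

```verilog
// Illustrative sketch only: intermediate cover() goals for bisecting a
// failing cover(!maintenance).
module startup_cover_sketch (
		input	wire	[4:0]	m_cmd_index,
		input	wire		maintenance
	);

`ifdef	FORMAL
	// Intermediate goals. The first of these to fail brackets the
	// point where the startup script stops making progress.
	always @(*)
		cover(m_cmd_index == 5'h04);	// arbitrary example index
	always @(*)
		cover(m_cmd_index == 5'h0a);	// arbitrary example index
	always @(*)
		cover(m_cmd_index == 5'h12);

	// The ultimate goal: the startup script has finished
	always @(*)
		cover(!maintenance);
`endif
endmodule
```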
In the end, I also created some poor man’s sequences to describe the various possible micro-commands and make certain that each were properly carried out. We’ll skip these, since they basically follow the same form as the others above.
The next step in implementing this core was to create an optional
configuration port through which arbitrary commands could be sent to the
flash chip. Further, I chose to use a parameter,
OPT_CFG, to control whether this arbitrary command port should
be integrated into the design.
There are several reasons why we might want such a port. First, if we don’t implement any start up sequences, the arbitrary command capability can be used to create a startup sequence to place us into the XIP mode where the flash will respond to a sequence starting with an address instead of a command. Second, arbitrary command sequences are necessary for erasing and programming the flash, should you want that capability. Finally, while debugging the I/O, to see what is working and what is not, arbitrary commands are an absolute necessity to get a perspective of what is going on either right or wrong.
Of course, to do this, the flash controller will need to be able to place the flash chip into a state where it would no longer respond to read requests. This will necessitate that we add at least two more states to our basic state diagram, as shown in Fig. 26 below.
In the new configuration mode state, any attempts to read from the flash memory will be erroneous–sort of like any requests to write to the flash were erroneous earlier. As before, such read attempts could be responded to with a bus error, although I have chosen to return an empty acknowledgment instead. This means that any software controller will be responsible for making certain reads from flash memory aren’t attempted during the configuration mode.
Of course, if you read through any flash chip specification, this will appear backwards. Most flash chips support many modes, of which the read mode we are using is a subset of the “Quad I/O read” mode. Instead, as far as our controller is concerned, our read mode is our primary reason for being. That’s why it is our primary mode in our state diagram above.
When I first started designing this configuration port, I was only
interested in implementing traditional SPI
instructions with this port: send 8-bits of data on MOSI, and
receive 8-bits of data on MISO.
The problem with this initial view is that switching back into our QSPI mode requires sending the following:
The address, written in QUAD output mode.
A mode nibble (sometimes byte) of 4'ha (or 8'ha0 as a byte). This needs to be sent in QUAD output mode, while driving all of the wires.
If the mode byte is not sent in its entirety across all four bit lanes, the flash chip will not return to XIP mode following this interaction.
Dummy bytes, where the clock is ticked. This can be sent in any I/O mode, but must allow the I/O direction to be switched.
Some amount of data, read in QUAD input mode, so that the flash chip fully places us into the XIP mode we want for everything else. This must also be done in QUAD input mode to avoid contention on the various wires.
In other words, in order to support an arbitrary command interface, we need to
support all three modes, NORMAL_SPI, a quad write mode, and
QUAD_READ, just to be
able to return our interface to the state where our logic above will apply.
Not only that, but some commands require 8-bits, some 16, some 24, and some
more bits–such as the command we need to send to return to XIP mode. To keep
this interface simple, I chose to only support 8-bit transactions, in a way
where larger/longer transactions could be composed from multiple 8-bit
transactions. That means that the configuration port
must support leaving the
CS_n line low at the end of every
transaction, and then only raising it later upon command. Further, at the
end of every transaction, the port should be stable:
o_qspi_cs_n will be
high or low as specified in the transaction, and
o_wb_data will be constant.
A traditional 8-bit SPI interaction.
This would be started upon a write request, but would end with o_qspi_cs_n still active (low).
o_wb_data would maintain, in its bottom 8-bits, the values read from the flash.
I called this a low speed configuration request.
To initiate such a request, one would write a single word to the control port. Of this word, the lower 8-bits would contain the data to be sent, the
CS_n bit would be low, the
S bit (Quad I/O rate) low, and the
M bit (Configuration mode) would be set high. Once the operation completed, you could then read the results back from the data word. A second write to the configuration port setting
M low would exit the configuration mode and raise (deactivate) the
CS_n line. The direction, or
D bit, would be a don’t care in this operation.
A Quad-SPI 8-bit interaction to write 8-bits to the port.
Fig 28: Sending 8-bits using Quad I/O
This is a two cycle
SCK request, also leaving
o_qspi_cs_n active (low) at the end. This two-cycle operation would begin, as before, by writing a command word to the configuration port. The bottom 8-bits of this command word would specify the data bits to be sent to the Quad-SPI port. Likewise, the
M bit would be high placing us into configuration mode, the
CS_n bit would be low, the
S (speed) bit would be high to send us into QSPI mode, and the
D (direction) bit would be high to indicate a write operation. As with the traditional request, the Quad-SPI port would be left with
o_qspi_cs_n active (low). Further, the port will be left with the mode bits set so as to continue this active write until either the next command, or until
o_qspi_cs_n is deactivated (raised).
A Quad-SPI 8-bit read interaction
Fig 29: Reading 8-bits using Quad I/O
This is essentially the same as the last interaction, only the goal is to read 8-bits of data from the port, four at a time. The big difference is that the direction bit,
D in Fig. 27, of the command word needs to be clear. As before, the I/O mode will be left in its last mode,
QUAD_READ, and the
o_qspi_cs_n line will be left active (low) until the next read.
A read from this configuration register port should return the last 8-bits read from the device.
Fig 30: Flash controller bus connections, showing two shared ports
Here, I got a bit greedy. I merged the two return ports together, as shown in Fig. 30. I set it up so that the bus return signals,
o_cfg_data, would be shared between the flash memory and configuration ports. I also placed the current configuration port state in bits
16:8, with the last 8-bits read placed into bits 7:0.
Much to my surprise, this came back to bite me later when I was working on improving the address decoding within AutoFPGA. Perhaps I shouldn’t have been surprised. The configuration port, as currently designed, rather breaks the rules of the Wishbone bus, specifically one request should return one acknowledgment only, and that acknowledgment should come back on the port where the request was made. This meant that I had a problem when my updated and improved AutoFPGA interconnect later looked for an acknowledgment specifically from the configuration port, when I was sending it over the flash memory port.
Finally, one last but necessary operation is to deactivate
o_qspi_cs_n and possibly, but not necessarily, to close the configuration port at the same time.
Deactivating o_qspi_cs_n simply means writing a word to the port with the
CS_n configuration bit set. Closing the configuration port means also writing a zero to the
M or mode bit of the configuration word, after which the design will return to its normal mode for reading from the flash memory.
Do be cautioned: for reasons of space within the design, the software driver must be careful to place the flash back into Quad I/O XIP mode. The flash controller does not do this automatically. On the other hand, this isn’t that hard to do from the configuration port.
When we get to our formal properties, we’ll need to cover each of these separate operations.
The transactions themselves are built around a configuration port
to our core, shown in Fig. 30 above, consisting of only an additional
Wishbone strobe bit,
i_cfg_stb, as well as
a set of “special” bits used to decode the instruction word shown in
Fig. 27 above and defined within the core below:
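As a sketch of what such definitions might look like, the fragment below infers the bit positions from the example command words used in this article (32'h10ff, 32'h1100, 32'h1a00, and 32'h1800). The localparam names, the module wrapper, and the output names are assumptions of mine, not necessarily those of the core.

```verilog
// Illustrative sketch only: decoding a configuration port write.
module cfg_word_sketch (
		input	wire	[31:0]	i_wb_data,
		output	wire		cfg_mode,	// M: stay in config mode
		output	wire		cfg_speed,	// S: Quad I/O rate
		output	wire		cfg_dir,	// D: 1 = write, 0 = read
		output	wire		cfg_cs_n,	// requested CS_n value
		output	wire	[7:0]	cfg_byte	// data byte to send
	);

	// Positions inferred from the example words in the text
	localparam	CFG_MODE   = 12;
	localparam	QSPEED_BIT = 11;
	localparam	DIR_BIT    =  9;
	localparam	USER_CS_n  =  8;

	assign	cfg_mode  = i_wb_data[CFG_MODE];
	assign	cfg_speed = i_wb_data[QSPEED_BIT];
	assign	cfg_dir   = i_wb_data[DIR_BIT];
	assign	cfg_cs_n  = i_wb_data[USER_CS_n];
	assign	cfg_byte  = i_wb_data[7:0];
endmodule
```

Under this layout, 32'h1a00 decodes as a high speed write of a zero byte with CS_n held low, while 32'h1100 stays in configuration mode and raises CS_n.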
To highlight how this interface might work, suppose we wanted to read the
manufacturer ID (SPI command
8'h9f) from the device. We would need to:
First exit from the Quad I/O XIP mode the core is normally in. That means we’d need to write, to the configuration port:
a. One word of a potential address: 32'h10ff. The
8'hff data bits are carefully chosen to send an “undefined” command to the SPI in the case that we aren’t in Quad I/O XIP mode. As a result, this is also a low-speed command. At high speed, these would be interpreted as 24 bits of an address, followed by two mode nibbles–all with the low-order bit held high. Since the last two bits are set, this will clear the mode word, so that the flash chip will exit XIP mode at the end of the command.
b. We’ll send one additional word for good effect, just to guarantee that we actually complete the read command. (On a Micron flash, you might need to send more.) Hence, we’ll write 32'h10ff to the port again to send another 8 clocks.
c. Writing a 32'h1100 to the port keeps it in the configuration mode, but deactivates the
CS_n bit–so that we can now transition to our next command.
Fig 31: Exiting from XIP mode
Fig 32: Sending a 9F via normal SPI mode
Another 8-clocks are necessary to read the manufacturers ID from the port, so we’ll send an additional
Fig 33: Reading the byte following the 9F via normal SPI mode
Writing 32'h1100 will clear the port and deactivate (raise) the
CS_n line, but leave the controller in its configuration mode.
Fig. 33 above shows all three of these transactions. First, the read. Notice how I’m only showing
io_qspi_dat here. This is the traditional SPI MISO channel. The bits in this channel are then accumulated into
o_wb_data, which is then read on the second configuration port transaction shown above. After the third transaction, the SPI port is returned to idle.
Once we are done with our configuration commands, whatever they might be, we’ll need to place the design back into Quad I/O read mode–so the controller can go back to what it was doing before. Doing this may require some device specific setup, as we discussed in the setup section. Once accomplished, it then requires sending a command to the controller from the configuration mode.
First, we send the Quad I/O read command,
Then the address. In our case, this is a simple dummy address–anything will work, so we send three bytes of zeros. The trick is–these need to be written to the port at high speed. Hence, we set the speed bit and the direction bit, so we write
32'h1a00 three times to the configuration port.
Now we send the mode bit, by writing
Depending upon your flash, you may need to clock it up to eight more times. (Thanks Micron!) These dummy cycles can be in either read or write mode, though, so we’ll write
SCK twice several times over.
We’ll then read one byte of data from the flash by writing a
32'h1800 to the port.
The configuration port is then closed by writing
32'h0 to the port.
Once complete, all of the above read commands that start in Quad XIP mode will work.
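The steps above can be collected into one list of configuration port writes. The sketch below does that in Python. The command byte, the mode byte and the number of dummy writes are deliberately left as parameters, since they depend on your flash. The function name and the bit constants are assumptions inferred from the values quoted in this post.

```python
CFG = 0x1000    # configuration mode (inferred bit position)
QUAD = 0x0800   # high speed, four wires (inferred bit position)
WRITE = 0x0200  # high speed write direction (inferred bit position)


def enter_quad_read(cmd, mode, dummy_writes):
    """Return the words to write to the configuration port, in order.

    cmd, mode and dummy_writes are device specific; take them from the
    data sheet of the flash being used.
    """
    words = [CFG | (cmd & 0xFF)]                     # command, normal SPI
    words += [CFG | QUAD | WRITE] * 3                # three zero address bytes
    words.append(CFG | QUAD | WRITE | (mode & 0xFF)) # mode byte, high speed
    words += [CFG | QUAD | WRITE] * dummy_writes     # two SCK cycles apiece
    words.append(CFG | QUAD)                         # read one byte
    words.append(0)                                  # close the port
    return words


if __name__ == "__main__":
    # 0xEB and 0xA0 are example values only, not taken from this post.
    for w in enter_quad_read(cmd=0xEB, mode=0xA0, dummy_writes=2):
        print(f"32'h{w:04x}")
```

Keeping the device specific values as arguments means the same script can serve more than one flash chip.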
At least, that’s the idea. We still need to make all of this capability happen. Further, we need to make it happen without disturbing any of the capability we’ve already built above.
To make the logic easier to read, I created a series of simplifying assignments.
The first one,
bus_request, is very similar to the logic we discussed earlier.
The difference now is that we only accept a
bus_request to read from
flash memory when we are not in our
After that come a couple more signals. First,
cfg_stb simplifies checking
for a bus request on this configuration port.
Many of these requests, such as reading from the port, releasing the port, or setting
CS_n high, can be acknowledged immediately. This includes any
request of the configuration port when our
OPT_CFG parameter is low,
describing the case where we haven’t built the configuration port into the
design at all. We’ll capture these empty interaction requests with
The rest of these simplifying assignments describe actual requests.
Well, not quite. They are user requests as long as the
CS_n bit is set
The other three commands will require some amount of flash interaction. Primary among these are the writes that place or keep us in configuration mode.
Here are the three types of interactions we’ll support from here:
cfg_hs_write, a high speed write request is made of the configuration port. This will cause 8-bits to be transmitted to the flash over two clock cycles.
This was shown in Fig. 28 above.
cfg_hs_read, a high speed read request is sent to the flash. This will create two
SCK clock cycles, after which the 8-bits read across those cycles can be read from the wishbone port. Note that, despite this being called a read command, it is actually a Wishbone bus write that commands a QSPI read. Therefore, a second Wishbone operation is still required to read the results back out.
This was shown in Fig. 29 above.
cfg_ls_write, this signals a basic SPI flash command. This will cause us to write 8-bits to the SPI port, and read 8-bits back, across 8 SPI clock cycles. These 8-bits can later be read from the configuration port via the Wishbone bus.
This was shown in Fig. 6 above.
Those are the three primary operations we need in order to support an arbitrary read/write configuration interface directly to the flash.
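As a rough model of how a configuration port access gets sorted into one of these three operations, or into the set of requests that can be acknowledged immediately, consider the following Python sketch. It is not the Verilog from the core. The bit positions are inferred from the example words in this post, and the "noop" label is a name chosen here for illustration.

```python
CFG, QUAD, WRITE, CS_N = 0x1000, 0x0800, 0x0200, 0x0100  # inferred bits


def classify(we, word, opt_cfg=True):
    """Classify one access to the configuration port.

    we      -- True for a bus write, False for a bus read
    word    -- the 32-bit value on the bus (ignored for reads)
    opt_cfg -- False if the configuration port was not built in
    """
    # Reads, writes that leave configuration mode, and writes that raise
    # CS_n need no flash interaction and can be acknowledged at once.
    if not opt_cfg or not we or not (word & CFG) or (word & CS_N):
        return "noop"
    if not (word & QUAD):
        return "cfg_ls_write"        # eight SCK cycles, normal SPI
    if word & WRITE:
        return "cfg_hs_write"        # two SCK cycles, four wires out
    return "cfg_hs_read"             # two SCK cycles, four wires in


if __name__ == "__main__":
    for w in (0x109F, 0x1A00, 0x1800, 0x1100, 0x0000):
        print(f"32'h{w:04x} -> {classify(True, w)}")
```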
I should point out that this is a second generation version of this interface. The first one was based upon bit-banging the various flash wires. I have since abandoned that interface in favor of this current one since: 1) “most” of these operations are already supported with our current basic logic, and 2) bit-banging over a slow bus driven by a serial debugging port is highly inefficient. (Can I highlight the highly in highly inefficient?)
That’s our vision. Now we just need to integrate this into the rest of the design.
The first step is to keep track of any enduring modes that will last beyond a single request. In particular, this subset of the design requires tracking four mode bits. One to determine whether or not the configuration mode of the interface is active,
one to determine the value of the chip select in this mode,
and the last two in order to determine the speed and direction of the I/O pins.
Note the use of the
OPT_CFG parameter. As before, if
OPT_CFG is not
set, then this lets the synthesis tool know that it can remove all of the
logic surrounding these values and replace them with constants–simplifying the
rest of the design along the way as well.
With these adjusted bits, we can now return to our basic design blocks.
The first one we’ll adjust to support this mode is the
we’ll add two more options–one for a low-speed request that will take
eight clock cycles, and one for a higher speed request that will take
only two clock cycles.
If OPT_ODDR is false, then we take one extra clock cycle after
CS_n becomes active for the clock to lower and begin our first cycle.
For the most part, the
SCK logic doesn’t change at all. It’s essentially
what it was before.
The chip select line needs adjustment, however. In particular, this line needs
to respond to both read commands, which we’ve discussed above, as well as
configuration writes. Hence, on a write to the configuration port,
CS_n is now
After the write to the configuration port, the chip select pin follows the last written value.
Otherwise the chip select is controlled in an identical fashion to what it was above.
Port direction control starts out as before: on a request to read from memory, we start out writing to the port, so we can send the address.
On the other hand, if we have a burst continuation or pipe request then we need to keep reading. Likewise, if there’s a configuration port request to read at high speed, then we also go into high speed read mode.
The next two adjustments are basic. On a high speed write request, we set all pins to outputs,
whereas on either a low-speed request or any time the bus remains in
configuration mode at low speed, then the port I/O modes transition back
to normal SPI, where
io_qspi_dat[3:2] are outputs,
io_qspi_dat[1] is an input, and
io_qspi_dat[0] is our output data pin.
The last piece of logic we’ve discussed before: After sending the address and the mode command, the wires should become all read wires. This only applies, however, if we aren’t already in configuration mode. If we are, we need to maintain whatever I/O standard we’ve been commanded to remain in.
The o_qspi_dat logic is the last of the
registers that needs to be adjusted.
As you may recall, we set this value any time
o_wb_stall was low,
The difference is that now we need to set the data bits associated with any
outgoing data. Note also that we don’t need to check the direction of the
operation, in case it is
QSPI_WRITE, since this will be
handled by the vendor-specific I/O drivers external to this
The rest of the data logic is as it was before. On a
ckstb signal, we shift
everything left by four. This includes when we are in
That’s why we wrote to every fourth pin above.
Alternatively, we might have shifted a variable number of bits on each clock, either one or four. I’ve chosen this approach to minimize the logic required, but we’ll have to check in a moment how effective this approach was.
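To see why placing one data bit in every fourth position lets a single shift-by-four serve both speeds, here is a small Python model. It assumes a 32-bit pipe whose top four bits drive the data wires, with bit zero of that nibble acting as the normal SPI output pin. The function names are illustrative, not taken from the core.

```python
MASK32 = 0xFFFFFFFF


def load_normal_spi(byte):
    """Spread a byte across the pipe: one data bit per nibble, MSB first."""
    pipe = 0
    for i in range(8):
        bit = (byte >> (7 - i)) & 1
        pipe |= bit << (28 - 4 * i)   # lands in bit 0 of each nibble
    return pipe


def load_quad(byte):
    """Place a byte in the top eight bits: two nibbles, two clocks."""
    return (byte & 0xFF) << 24


def shift_out(pipe, clocks):
    """Return the nibble driven on each clock, shifting left by four."""
    driven = []
    for _ in range(clocks):
        driven.append((pipe >> 28) & 0xF)
        pipe = (pipe << 4) & MASK32
    return driven


if __name__ == "__main__":
    slow = shift_out(load_normal_spi(0x9F), 8)
    print([n & 1 for n in slow])                   # serial bits, MSB first
    print([hex(n) for n in shift_out(load_quad(0x9F), 2)])
```

The cost of this choice is a wider load in normal SPI mode; the benefit is that the shift itself never changes.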
During our startup script, we can just copy the startup data into the top four
bits of the
data_pipe–leaving the rest of the bits as don’t cares.
These same top four bits are then used to drive our data wires,
Notice the use of the
4*(OPT_ODDR ? 0:1) expression above. This simply gives
us four dummy output bits for the case where we take an extra clock to drop the
SCK line after the
CS_n line goes low–as shown above in Fig. 12.
Moving on to the bus logic, we’ll start with the stall line since it doesn’t change much with this new capability. The big new difference is that, upon any configuration request, whether it be a regular SPI operation or a high speed one, the stall line goes high.
Our bus return logic needs to change just a touch as well. As before, we’ll want to acknowledge any request as soon as it completes.
Similarly, we want to acknowledge any memory write requests–requests that we are not going to act upon–immediately as well.
The one change is that, following a configuration write where
CS_n is either
not activated or deactivated, or following any read from the configuration
port, we’ll want to acknowledge such requests immediately.
One other signal changes to create this configuration port capability, and
that is the
o_wb_data signal containing the data to be returned to the
bus. Unlike before,
we now have to shift our data by either one bit or four bits, depending
upon the mode we are in. Here, we’ll use
o_qspi_mod, the bit that
determines whether we are in
NORMAL_SPI mode or either
QUAD_READ or QUAD_WRITE modes, to determine how many bits to shift in.
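A minimal Python model of this return-data shift might look like the following. The quad argument stands in for the mode bit, and the choice of input bit one in normal SPI mode is an assumption based on the usual MISO pin assignment. This is a sketch, not the core's logic.

```python
MASK32 = 0xFFFFFFFF


def shift_in(word, i_qspi_dat, quad):
    """Shift newly sampled input into the return word.

    quad=True  -- take all four input wires, shift by four
    quad=False -- take only input bit 1 (MISO), shift by one
    """
    if quad:
        return ((word << 4) | (i_qspi_dat & 0xF)) & MASK32
    miso = (i_qspi_dat >> 1) & 1
    return ((word << 1) | miso) & MASK32


if __name__ == "__main__":
    word = 0
    for bit in (0, 0, 1, 0, 0, 0, 0, 0):      # 0x20 arriving serially
        word = shift_in(word, bit << 1, quad=False)
    print(hex(word & 0xFF))

    word = 0
    for nibble in (0xB, 0xA):                 # 0xBA arriving four bits at a time
        word = shift_in(word, nibble, quad=True)
    print(hex(word & 0xFF))
```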
Of course, you don’t want to forget that, in
NORMAL_SPI mode, the incoming
data bit is bit one, as in
i_qspi_dat[1], and not bit zero or
i_qspi_dat[0]. This just follows from the typical
As a last step in this process, if we are in configuration mode, then we’ll set the next nine bits to indicate that fact so we can read back the mode we are in.
Sadly, these bits are somewhat ambiguous, since I merged the two bus return ports together as shown in Fig. 30. Following a proper data read from the flash memory, these bits may be set to anything–depending upon what was read from the memory. A configuration port read would then return this same value. However, without splitting the output between the two ports, something we chose not to do, we are stuck with this ambiguity.
Since there are three basic extended operations we are supporting in this
section, we’ll create three new
poor man’s sequences:
f_cfghsread. By now, though, you should
have the hang of these. First, there’s a logic block defining the sequence
logic, then another one defining how the rest of the core needs to behave
during the sequence, and lastly a cover statement to make sure the
acknowledgment at the end of the sequence can be reached.
I’ll admit, by the time I got to this point in my design process, I was feeling pretty good. My design was “working”, the logic did what I wanted in simulation, and all of the formal proofs were passing. I just needed to place it onto my hardware to try it out. What could possibly go wrong?
Sadly, everything could go wrong.
High speed I/O, such as at DDR rates and above, really requires for design stability purposes that the outputs be registered and that they go through a vendor specific I/O module, like this one for Xilinx or even this one for Intel. Registering the outputs, though, breaks all my logic above. Registering the inputs also costs another clock cycle.
If this design hadn’t become “FrankenIP” yet, it was about to do so now.
Yes, this was also the day I just gave up in frustration. I had worked this design to perfection, and now reality didn’t agree with me.
The next day, though, I’d figured out how to move forward.
The key is that only the inputs need to be delayed. None of the control logic
o_qspi_dat, is dependent upon
any inputs, whereas
o_wb_ack are. In other words, if I just
separated the read logic from the write logic by a programmable number of
clocks, then everything should work as before.
Let’s call this extra read delay,
RDDELAY, and make it a parameter.
That way it should be easy to re-target this design from one device with one I/O delay to another.
The next step was to delay all of the input data processing. If you recall,
the timing of the input data processing was dependent upon two signals:
read_sck–in addition to the more obvious
o_wb_ack we set as soon as the clock counter reached zero.
My first step, therefore, was to rename the
o_wb_ack logic so that it produced
an acknowledgment that would need to be delayed. I called this new signal
dly_ack. If the
RDDELAY was zero, the two would be identical.
If RDDELAY was non-zero, I’d delay
dly_ack using a shift register.
There are a couple of things to note about this logic. First, if the
aborts the transaction, then the acknowledgment delay shift register is set to
zero. Second, if
RDDELAY==1, the delay is just a single clock delay.
Otherwise, we have to reference values from
RDDELAY-2 down to zero. My
original plan was to use Verilog’s rules of assignment:
If an N-bit value is assigned to a less-than N-bit register, the upper bits
are ignored. Unfortunately, while this worked with some tools, it failed with
others. Eventually, I came up with the logic above that has (so far) worked
in all of my tools.
Finally, at the end of this shift,
o_wb_ack can be set to its delayed value.
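The behavior of this delayed acknowledgment can be modeled in a few lines of Python. This is a sketch of the idea only: a shift register RDDELAY entries long that is cleared on an abort. The class name and the abort argument are illustrative and not taken from the core.

```python
class AckDelay:
    """Delay dly_ack by rddelay clocks to produce o_wb_ack."""

    def __init__(self, rddelay):
        self.rddelay = rddelay
        self.pipe = [False] * rddelay   # newest entry first

    def step(self, dly_ack, abort=False):
        """Advance one clock and return o_wb_ack for that clock."""
        if abort:
            # An aborted transaction clears every pending acknowledgment.
            self.pipe = [False] * self.rddelay
            return False
        if self.rddelay == 0:
            return dly_ack              # no delay: the two are identical
        out = self.pipe[-1]             # oldest entry leaves the pipe
        self.pipe = [dly_ack] + self.pipe[:-1]
        return out


if __name__ == "__main__":
    d = AckDelay(3)
    print([d.step(x) for x in (True, False, False, False, False)])
```

With a delay of three, an acknowledgment entering on the first clock comes out on the fourth.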
I then repeated this logic with the read clock, creating a new value I called
actual_sck to describe the outgoing read clock, and then delayed
to describe the sample time on the input. Now, if I updated
our (now delayed)
read_sck signal, I could use the same logic as before.
The sticky parts, however, turned out to be the bus access.
The first problem was that my set of Wishbone formal interface properties counts the number of outstanding accesses, and in order to pass induction the flash controller has to assert that its own idea of how many accesses are outstanding matches that of the bus interface properties. Once I delayed acknowledgments through this pipe, my counts were all off. It might be that, while processing a QSPI flash read, some value gets acknowledged from a prior read.
Fixing this required maintaining a count of how many bus acknowledgments were in the pipeline.
Yes, I’ll admit to some amount of cringing as I created a for loop like this. I’ve just told too many individuals not to use for loops in their Verilog code. This loop, on the other hand, is actually somewhat short and so, if you look at the logic, it can be implemented with a simple lookup table. Of course, this value is also defined only in the formal context, so I really don’t need to be worried about meeting timing here either.
Those were the easy changes.
The harder change was the
At issue were the immediate acknowledgment signals, such as when you read from
the configuration register, or write to it without setting the
active. Similarly, I grouped the attempts to write to the
flash memory in this group. According
to our design above, all of these bus requests get acknowledged immediately.
o_wb_data gets changed immediately following any configuration
write, and so the
cfg_dir and so forth bits get set
immediately upon the write.
The formal tools again showed me this bug: If I set the configuration state on a bus request immediately following the read request, the outgoing read data might not match what was read from the flash. This would be a catastrophic error, violating the whole purpose in designing a flash controller–even if it would only ever be a very rare event.
I solved this problem in two steps. First, I adjusted the
logic to stall on any incoming request if
RDDELAY was greater than zero.
Second, I added a flag I called
xtra_stall to indicate that there was
an extra stall cycle, based upon the
RDDELAY value that needed to be
placed into the cycle.
If xtra_stall was true, the design would now wait for any
to clear the final pipeline before releasing the stall line.
xtra_stall calculation was simple if
xtra_stall = 0, it was a touch more complicated otherwise. The first problem
is that, unlike the acknowledgment, the extra stall had to be active if any
stall request was in the pipeline–not just if there was one at the end of the
pipeline. After writing this logic over and over a couple of times, I eventually chose to make it work with a pipeline similar to that of the one necessary
for synchronizing an asynchronous
reset. When this didn’t
work, I returned to a more traditional shift register configuration–such as
the one we used above.
The first part of this logic set a value,
not_done, indicating that we’d
want to stall an additional cycle. On any bus request, if
RDDELAY > 0, we’d
want to stall an additional cycle. Second, if any interaction with the
was ongoing, we’d want to stall an additional cycle. This includes not only
those cases where we haven’t yet gotten to the last state of the transaction,
clk_ctr > 1, but also those cases where we are on the last state, but
we’re taking multiple cycles there and we haven’t (yet) gotten to the last one.
Notice that this always block uses blocking assignments, i.e. it uses the
= sign. While I generally discourage the use of blocking assignments
within clocked always blocks, I use them religiously in any combinatorial
blocks–such as the one above. The rule, though, is that you cannot create
a latch in the process. Hence, the initial assignment that makes certain
not_done always has at least some value. Any subsequent assignments will
override that initial one, and are primarily written that way just to
keep things simple and easy to read.
Now, using this
not_done value, we can set the
stall_pipe and hence the
Notice that setting a value to
-1, according to Verilog’s rules, will set
all the bits in
stall_pipe. It should do this without error or warning.
Sadly, Verific’s parser (used by the major vendors) will create a warning
regarding truncating a 32-bit value to
RDDELAY bits. Still, it gets the job done.
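Here is a rough Python model of a stall pipe of this kind. It assumes the pipe is set to all ones whenever not_done is true, shifts in a zero otherwise, and reports an extra stall while any bit remains set. That last reduction is my assumption, not something the text spells out. The masking step also shows what assigning -1 to an RDDELAY-bit register amounts to.

```python
class StallPipe:
    """Hold the stall for a few clocks after not_done drops."""

    def __init__(self, rddelay):
        self.mask = (1 << rddelay) - 1  # rddelay bits wide
        self.pipe = 0

    def step(self, not_done):
        """Advance one clock and return xtra_stall."""
        if not_done:
            # -1 truncated to the register width sets every bit.
            self.pipe = -1 & self.mask
        else:
            # Shift a zero in at the bottom; the ones drain out the top.
            self.pipe = (self.pipe << 1) & self.mask
        return self.pipe != 0


if __name__ == "__main__":
    s = StallPipe(3)
    print([s.step(x) for x in (True, False, False, False, False)])
```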
The really fascinating part of this extension to handle I/O delays is how the formal sequences can be adjusted to handle things.
First, I expanded the various poor man’s sequence lengths by creating new length parameters equivalent to the originals plus the new length,
and so on.
Then, I adjusted the driving loop to make it so that the first half of the sequence proceeded at the rate of the SPI interface, but the second half, the half counting our new RDDELAY clock extensions, at the rate of the system clock.
Above, as before, we step the whole register any time the
SPI clock moves us
forward to the next step in our sequence. If
OPT_ODDR is true, ckstb will
be true on every clock, and so this sequence will step forward on every clock.
On the other hand, if
OPT_ODDR isn’t true, then we’ll step the new register
bits on every clock, rather than just once per
ckstb step above. Therefore,
we’ll step the last
RDDELAY couple of steps at the full system clock speed.
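To make the two stepping rates concrete, the following Python sketch counts how many system clocks a sequence takes from its first step to its last. It assumes the first n_spi steps advance only on ckstb, that ckstb arrives on every other clock when OPT_ODDR is false, and that the final rddelay steps advance on every clock. The function and argument names are hypothetical.

```python
def run_sequence(n_spi, rddelay, opt_oddr):
    """Return the system clocks needed to walk the whole sequence."""
    last = n_spi + rddelay - 1          # index of the final step
    pos, clocks = 0, 0
    while pos < last:
        # With ODDR, SCK toggles every clock; otherwise every other one.
        ckstb = opt_oddr or (clocks % 2 == 1)
        if pos >= n_spi:
            pos += 1                    # delay steps: every system clock
        elif ckstb:
            pos += 1                    # SPI steps: only on ckstb
        clocks += 1
    return clocks


if __name__ == "__main__":
    print(run_sequence(n_spi=4, rddelay=3, opt_oddr=True))
    print(run_sequence(n_spi=4, rddelay=3, opt_oddr=False))
```

Without ODDR, the SPI portion costs two clocks per step while the delay portion still costs one, which is the asymmetry described above.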
Some things don’t change. We still need to start the sequence on any request
to read from flash memory, as long as
we aren’t already in the middle of a read, at which point we’d start the
poor man’s sequence
for a pipelined read,
This might make more sense if you “saw” it in action, as shown in Fig. 35 below.
Notice how, for the first several steps of the
f_memread sequence, everything
takes two clocks. Indeed, it is lined up with the output data lines,
o_qspi_dat. The incoming lines,
i_qspi_dat, however are delayed by three
RDDELAY=3. This means that when, in the output time units, it would
be time to read
D[7:4], the data aren’t yet on
f_memread transitions once every other clock. Once it gets to
the end, at
f_memread in this case, it starts transitioning on every
RDDELAY clocks (3 in this case). Then, on the last clock,
o_wb_ack is true.
The same would play out in
f_piperead, the sequence for the continuation
read. Here, in Fig. 35, you can see the beginning of the pipelined read, and so
the port stays active. You may also notice that
are overlapping. We already dealt with some of this above.
Although these changes need to be applied to all of the various sequence vectors, at this point that’s about all that’s left.
Does this mean the design works? Well, sure, it had all of its functionality by this point, and it passed a formal verification check, but … did it work?
Did this formally verified design work on its first time out? Of course not, but it did come pretty close. What I’ve shared above is the result of my debugging work, after all of the pain associated with getting it working.
Care to hear it?
Most of the debugging took place over the configuration port, for the simple reason that the configuration port offers the external user complete control over the QSPI port, and hence complete control over the flash. Even better, I was able to control the configuration port from the debugging bus–allowing me to script commands to be sent to the flash and examine byte by byte any returns from the flash.
The first step was to shut off the start up sequence, by setting
OPT_STARTUP=0. This helps to keep the flash
from interfering with our debugging work on the configuration port.
These commands should place the flash into the right mode. However, when debugging this interaction, I had no real way of knowing (yet), since none of these commands returned responses.
The second step was to request the manufacturer ID from my device. This is an
8'h9f command, after which every byte clocked through the
interface will return an additional byte of the ID–eventually returning not
only the manufacturer, but also the product number and the size of the
If you remember from our previous discussions of
wbregs address data writes
data to the address given by
wbregs address reads the value from
address and returns it
as a result. Hence, this set of
commands first writes
8'h9f to the
8'h00 to the
and reads the returned result.
This is then repeated three more times, before we issue the command to
CS_n, while yet leaving the configuration port active. (Remember
the bit field definitions from Fig. 27 above?)
If all goes well, at this point the numbers should match those from the data sheet for your flash chip.
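The same exchange can be sketched in Python against a stand-in for the hardware. Everything about FakePort below is hypothetical: it only mimics a flash that answers the 8'h9f command with the ID bytes mentioned later in this post, so that the sequence of port writes and reads can be exercised without a board.

```python
class FakePort:
    """Hypothetical stand-in for the configuration port plus a flash."""
    ID = [0x20, 0xBA, 0x18, 0x10]

    def __init__(self):
        self.selected = False
        self.idx = None
        self.last = 0

    def write(self, word):
        # Leaving configuration mode or raising CS_n ends the command.
        if not (word & 0x1000) or (word & 0x0100):
            self.selected = False
            self.idx = None
            return
        byte = word & 0xFF
        if not self.selected:
            # First byte after CS_n goes active is the command byte.
            self.selected = True
            self.idx = 0 if byte == 0x9F else None
            self.last = 0
        elif self.idx is not None:
            self.last = self.ID[self.idx % len(self.ID)]
            self.idx += 1

    def read(self):
        return self.last


def read_id(port, nbytes=4):
    """Issue 8'h9f, then clock out nbytes of ID, one byte per write."""
    port.write(0x109F)            # command, normal SPI, CS_n active
    ident = 0
    for _ in range(nbytes):
        port.write(0x1000)        # eight more clocks
        ident = (ident << 8) | (port.read() & 0xFF)
    port.write(0x1100)            # raise CS_n, stay in configuration mode
    return ident


if __name__ == "__main__":
    print(hex(read_id(FakePort())))
```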
In my case, things didn’t go quite so well.
No, this didn’t surprise me either. While I had verified much of the controller’s functionality, I hadn’t verified that the Xilinx I/O driver was working with this design. As it turns out, there were some other bugs in the AutoFPGA configuration script for the flash controller as well.
A trace is worth a thousand LEDs in so many ways.
Further, because the commands were separated so far apart in time, I used the compressed version of the Wishbone scope, and so I was still able to capture (roughly) the entire ad-hoc interaction.
The problem was that it was the wrong ID. Looking at the trace again, I could
see that the right ID was getting returned, only that I had the wrong
RDDELAY value. This helped me get the final shifting for the ID right, so
it was now
0x20ba1810 as I was expecting.
By the way, if you ever have to debug this kind of interaction, I cannot recommend highly enough that you use this known ID value. The trace returned from the manufacturer ID request confirmed for me that my normal SPI transmit was working, and I could read off how to get the manufacturer ID back out.
However, when I turned
OPT_STARTUP back on, rebuilt the updated design and
loaded it onto the board–it still wasn’t working.
At this point, I switched to simulation–just to check that the design was
OPT_STARTUP like it was supposed to. (In hindsight, I should’ve
as soon as I was done with my
As you may recall, I had formally
almost all of the core–but not the startup sequence. I had committed that to
Sure enough, looking at the
generated trace showed that
startup sequence logic wasn’t doing what I wanted.
Once fixed, I went back and ran the design on the board again. When the design still didn’t work, I returned to the Micron data sheet to see if I was missing anything.
How did I figure this out? By using the flashid.sh script again. This time, after assuring myself that the manufacturer’s ID was (still) correct, I read the status register. This looked good. I read the flag status register. This looked good again. (Yes, I am trying to read random status registers from the chip to see what’s going on.) Reading the Non-volatile configuration register showed that I had not activated QSPI I/Os. I wrote a new value to this register. It didn’t change. I tried again, this time adding the “Write-Enable” command first. Now it changed. Now, when I sent the commands to enter QSPI XIP I/O read mode it worked!
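The lesson about the write-enable command can be captured in a small helper. The sketch below, in Python, only builds and records the configuration port writes. The value 0x06 is the usual write-enable opcode on SPI NOR flash, but confirm it for your part. The register-write opcode and payload in the usage lines are placeholders, not values from this post.

```python
WREN = 0x06  # usual write-enable opcode; an assumption to verify


def write_register(port_write, opcode, payload):
    """Send write-enable, then a register write, each in its own CS_n cycle.

    port_write -- callable taking one configuration-port word
    opcode     -- the register-write command from the data sheet
    payload    -- iterable of data bytes to follow the command
    """
    port_write(0x1000 | WREN)             # write enable, CS_n active
    port_write(0x1100)                    # raise CS_n to latch the enable
    port_write(0x1000 | (opcode & 0xFF))  # register write command
    for byte in payload:
        port_write(0x1000 | (byte & 0xFF))
    port_write(0x1100)                    # raise CS_n to start the write


if __name__ == "__main__":
    log = []
    # 0xB1 and the two payload bytes are placeholders for illustration.
    write_register(log.append, 0xB1, [0xFF, 0xFF])
    print([f"32'h{w:04x}" for w in log])
```

Passing the port as a callable makes it easy to record the words first and only later point the helper at real hardware.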
I quickly scripted up a C++
Then I scripted up a C++
My excitement, however, was short lived.
It only mostly worked.
I increased the drive strength on the FPGA pins in question.
Now it worked better, but still not consistently enough.
The problem was that every now and then, the flash chip would fail to return the data I had written to it. On a second read, however, it would then read the right value. This suggested to me that there may have been a synchronization problem between the two.
At this point, I started rolling up my sleeves to build a synchronization circuit to capture the bits in the middle of the eye. This would’ve been quite a fun project, and a fun one to blog about and explain.
Perhaps to my relief, perhaps to my displeasure, someone was kind enough to point out on twitter that the Micron flash chip had options for impedance matching that could be quite important at high speeds. The first change I made fixed everything.
Bummer. That synchronization post would’ve been fun. I might still write one later, but for another purpose.
Of course, I then had to go back and adjust my startup script to include this new setup command.
You might also note, I would test changes to the startup script first using the flashid.sh shell script. (Yes, I love the capability the debugging bus offers for scripting unknowns together to find a solution.)
I’ve now tested this new controller with both a Micron flash chip as well as a Winbond flash chip. I’ve also tested a sister controller to this one that uses Dual SPI mode (two data bits, not four, using both MISO and MOSI in a bidirectional fashion)–all with great success.
Even better, in spite of all the logic we dumped into this core, it still builds
into a rather small footprint, as shown in Fig. 36 to the right. In this
figure, the first line shows the number of CMOS gates, in total, that would be
used by this core with all options on,
The second line is the same, but limited to measuring the number of
NAND gates the design would use,
were it to use nothing but
and NOT gates.
The third line, marked as
iCE40, shows how many 4-input LUTs would be required
by a design with
OPT_CLKDIV=0. This may be a rather
misleading statistic, though, since yosys
is known to pack logic into the reset circuitry present in the iCE40
The last line is a conservative estimate of the number of 6-input LUTs that
would be required in a Xilinx design–the actual
number is likely going to be much lower. Even in that case, it looks like
we’ve done pretty well! Indeed, this
has a small logic footprint, just as we had desired from the beginning.
All that said, wow, that was a lot of ground to cover! We’ve now gone over
most of the
details in this flash
from the ground up. We discussed the basic requirements of a good
flash controller, and
how to build one that ran at a high speed. Yes, this does run roughly twice
the speed of the Xilinx default
flash controller–if your board allows you to run the
SCK pin in ODDR mode.
We also went through how to then modify that initial basic controller that we
started with to handle burst reads, getting into the
XIP read mode in the first place, sending arbitrary commands to the
and even how to handle I/O delays from using registered I/Os. Once we were
finished, I discussed all of the steps necessary to debug this new flash
No, I haven’t discussed the vendor specific I/O drivers. You should be able to find a decent discussion of them in the respective vendor literature. Instead, I’m trying to keep this blog somewhat vendor independent.
In practice, while I really like how easy it is to port this flash controller from one design to the next, the debugging bus that this depends upon is horrendously slow. Particularly slow are the steps necessary to determine if an erase step is required, or to determine that either the erase or programming steps were successful. Both of these are ideal tasks for a small program running within the FPGA, so if we continue this discussion that might be where we end up next.
Also, as more of a side note, I don’t normally write blog articles this long. This has taken several weeks to write, and is likely going to take you a long time to read. My apologies to you if this isn’t what you are looking for. I’ll try to keep future posts shorter. That said, my prayer for you is that this post will all be worth your while as well, so that you might either trust my own “Universal” flash controller now, or if not that you would at least have a good idea of where to start from when building your own.
One final note, there’s a reason why I’m calling this a “Universal” flash controller, with the “Universal” in quotations. As currently built, this controller will be able to properly interact with all of the flash chips I’ve seen to date. However, I haven’t tested it on every chip in order to be able to prove that it truly is Universal. Moreover, I am aware of other classes of flash devices for which I already know this controller will not work. Still, I like the term “Universal”–even if I have to place it in quotation marks.
God, who at sundry times and in divers manners spake in time past unto the fathers by the prophets, Hath in these last days spoken unto us by his Son, whom he hath appointed heir of all things, by whom also he made the worlds (Heb 1:1)