Why I like Formal: the ZipCPU and the ICO board
I’ve been working for some time, off and on, on a port of the ZipCPU to the ICO board. The port isn’t complete (yet), and it’s not paid for, so the work isn’t going very fast. However, I came across something very interesting in this process that I thought would be worth sharing.
At issue is the fact that the iCE40 doesn’t support distributed RAM. Unlike other FPGA’s, it only has block RAM.
To understand the distinction, consider the following piece of code from within the ZipCPU.
Background
At issue is how the ZipCPU reads from its register file.
The register file itself is declared as a memory of thirty two 32-bit words. Internally, these are split into two register sets–a user register set and a supervisor register set, but for now all that matters is that this is declared as a RAM.
The ZipCPU reads from this RAM after
decoding the registers
that will be used by the
instruction. Once the
register addresses,
dcd_A
and dcd_B
are available from the
decoder, the
CPU reads
from its register set.
Notice how this isn’t a clocked read. This will come back into play in a moment.
This is the simple part. The next step are the many paths from here to the input to the ALU.
From here, let’s focus on the second register’s path, what I call register B. This rather complicated path is shown in Fig 1 below.
To get from this register read to the input to the ALU, the value needs to go through some logic.
The ZipCPU has
16 registers in its working
set. Of these sixteen
registers,
one is the program counter
(register 15). This
program counter, however,
isn’t maintained in the
register file,
regset
. Instead, it is maintained in one of two separate registers:
ipc
for the supervisor or interrupt level program
counter,
and upc
for the user program
counter. The gie
flag (global interrupt enable) controls which of the two
program counters
is currently in use.
When this information comes from the
decoder,
dcd_pc
contains a pointer to the next
instruction
in the current
instruction
stream. If we want to read our own program
counter for this
instruction,
that is if the gie
flag coming from the decoder or dcd_gie
matches the
register set
in question, then we’ll want the value to be dcd_pc
. Otherwise, it can
only be the supervisor mode reading the the user mode program
counter.
We’ll pick from between these two program counter registers here.
We’ll need this value in another moment, so hang on to this thought. For
now, let’s return to the read from the
register set
above, where we left the result in w_op_Bv
.
The next thing you need to remember from the
ZipCPU’s
instruction set is that
(just about) every
instruction
can have an immediate attached. The two inputs, therefore, to any
instruction
are a register I call operand A
, and a second value B
which includes an
optional register, B
, plus an immediate. If no
register is read, the
instruction
only references the immediate.
To handle this selection, the
decoder
produces a flag, dcd_rB
. This will be true any time the B
operand comes
from a register.
There’s also another special register–the condition
codes. This register
contains a variety of
CPU
mode selection bits together with the traditional flags, zero (Z
),
carry (C
), negative (N
), and overflow (V
). As with the
program counter above,
there are two versions of this register–one for
supervisor mode and one for user mode.
In our next step, we are going to calculate the value of the B input in all respects except the immediate.
First, if there’s no register read, then the input (not including the immediate) is zero.
Second, in the case of a concurrent register write, the register value is the result of the register write. This allows the CPU to bypass the register file, so that ALU outputs can be immediately available for the next instruction–making the CPU run that much faster.
Third, if this is a reference to the condition codes register, then we’ll set that based upon some special flags.
Once we get past all of that logic, we can finally deal with the expected
situation–that the B
operand comes from the result of the register
we just read above.
As a final step, we’ll add the immediate value associated with the
instruction,
dcd_I
, to our value and register the result for the next stage.
Well, not quite. That’s the last step on that clock. Before we send
this value to the ALU,
we’ll need to double check that the value isn’t being written to the register
file
on the same clock (again). In this case,
wr_reg_ce
is a flag indicating that a value is being written to the
register file,
wr_reg_id
is the address of that register, and wr_gpreg_vl
is the value of that register.
If you got lost in the explanation above, relax. My whole point here is that this logic is complicated.
Because of this complexity, this was one of those pieces of logic that I was
most interested in formally verifying within the
ZipCPU. Along the way I came
across several pipeline hazards I wasn’t
expecting. For
example, what happens if you set r_op_Bv
as
the result of a register plus an immediate, only to have the register
updated on the next clock? Pipeline
logic,
primarily captured by the op_ce
flag, needs to prevent that from happening.
Here’s the problem: the iCE40 doesn’t support memory reads unless the result is immediately registered, as in:
While yosys will try to work around this problem using flip-flops and LUTs, it does so at the cost of about 3k or more LUTs on a 7.6k LUT device. That’s bigger than the ZipCPU itself! This so badly broke the bank for the ZipCPU that unless I could change things there was no way it would work on the iCE40.
The Solution
The basic solution to this problem is to move the register read logic one clock earlier–to shuffle logic from one clock to the next. In other words, I needed to read from the block RAM in the decode stage, before I got to the selection logic above. My fear was, how could I patch this change in and remain certain that the CPU would still work?
Hint: this is one of the reasons why I like formal methods.
So, rather than clocking the register address into
decoder
outputs, I left it as combinatorial logic alone–making two additional outputs
for that purpose. I could now read the register one clock earlier, and
clock the value in with the same control signal that was used to clock the
instruction
decoder
outputs, dcd_ce
:
But … what would happen if one of those registers was written on the exact same cycle? (This is actually pretty common.) How can we make sure we have the right value from the registers?
To do that, I captured the result of the last registers write, and created a registered flag to indicate that the correct value wasn’t from the registers set. You can see this pictorially in Fig 2.
See the difference at the top of the diagram? Let’s walk through this updated logic in steps below. Not only do we read from the register file on a clock, but we also keep track of whether the register we need is being written to the register file on this clock as well.
Now, on the next clock we can select which of the two values we should be using.
Finally, we have some code we can use to replace the original register file read, but this time the register file is read using clocked rather than combinational logic.
My fear in all of this, though, was that I would somehow get this logic wrong. I mean, there’s so much that can go wrong with this change, how can I rest assured that the result works?
Enter formal methods.
Within the ZipCPU
core
(on the dev branch currently),
there’s a set of properties rebuilding the logic we worked through above
describing the values going into the
ALU
(or memory,
divide, etc). Getting
the CPU and
these properties right took the majority of the work using formal methods on
this CPU
to date. What makes these properties special is that they aren’t clocked.
They don’t need to be.
Formal properties
don’t need to meet timing requirements like synthesizable logic does. As a
result, I can assert that both op_Av
and
op_Bv
have the right value using simpler logic than the logic we expressed
above. Even better, I can make that assertion independent of any
pipeline stalls or
hazards. The result is that the property tends to find hazards I wasn’t
expecting.
The end of this chain of logic is a short set of assertions.
The important part is in the middle. First, I assert that the value of the
A
register going into the
ALU is as it should
be, then the B
value.
Yes, there are a lot of caveats to this assertion–the values only need to be correct if the illegal instruction flag isn’t set, if the pipeline isn’t being cleared, if the debugging port isn’t going to force the pipeline to be cleared on the next clock, and so on.
The good news? Once I made the change outlined above, I could go back and
rerun the
formal proof–this
time with the NO_DISTRIBUTED_RAM
define
selected. Once the proof passed, I knew the change “worked”.
Conclusion
One of the topics I discuss in the two-day formal verification course I teach is the idea of moving logic between clocks. In the course, we adjust a flag indicating that a counter is non-zero to the clock before the counter’s value in question. To me, this is one of the very important uses of formal verification–being able to tell when you make a change, no matter how complex, that the properties you had asserted before remain true. In this case, I was able to prove that the ZipCPU maintained the correct register inputs, in spite of moving the logic earlier by one clock.
Does this mean that the ZipCPU works on the iCE40? Well, let’s just say that now it builds for the iCE40 whereas it wouldn’t build before. The project remains ongoing. I would like to implement the ZipCPU both on the ICO board as well as on Luke Valenty’s new TinyFPGA. Before I get there, though, I’ve still got a lot of peripheral simulation work to get through.
Simulation work? Yes. To avoid FPGA Hell, I’m not only running my designs through the (incomplete) formal proof, but also through simulation. You can see my process in the figure on the right.
God is our refuge and strength, a very present help in trouble. Ps 46:1