I’ve been working for some time, off and on, on a port of the ZipCPU to the ICO board. The port isn’t complete (yet), and it’s not paid for, so the work isn’t going very fast. However, I came across something very interesting in this process that I thought would be worth sharing.

At issue is the fact that the iCE40 doesn’t support distributed RAM. Unlike other FPGA’s, it only has block RAM.

To understand the distinction, consider the following piece of code from within the ZipCPU.

Background

At issue is how the ZipCPU reads from its register file.

The register file itself is declared as a memory of thirty two 32-bit words. Internally, these are split into two register sets–a user register set and a supervisor register set, but for now all that matters is that this is declared as a RAM.

	reg	[31:0]	regset	[0:31];

The ZipCPU reads from this RAM after decoding the registers that will be used by the instruction. Once the register addresses, dcd_A and dcd_B are available from the decoder, the CPU reads from its register set.

	assign	w_op_Av = regset[dcd_A];
	assign	w_op_Bv = regset[dcd_B];

Notice how this isn’t a clocked read. This will come back into play in a moment.

This is the simple part. The next step are the many paths from here to the input to the ALU.

From here, let’s focus on the second register’s path, what I call register B. This rather complicated path is shown in Fig 1 below.

Fig 1. Original path from the register read to Operand B

To get from this register read to the input to the ALU, the value needs to go through some logic.

The ZipCPU has 16 registers in its working set. Of these sixteen registers, one is the program counter (register 15). This program counter, however, isn’t maintained in the register file, regset. Instead, it is maintained in one of two separate registers: ipc for the supervisor or interrupt level program counter, and upc for the user program counter. The gie flag (global interrupt enable) controls which of the two program counters is currently in use.

When this information comes from the decoder, dcd_pc contains a pointer to the next instruction in the current instruction stream. If we want to read our own program counter for this instruction, that is if the gie flag coming from the decoder or dcd_gie matches the register set in question, then we’ll want the value to be dcd_pc. Otherwise, it can only be the supervisor mode reading the the user mode program counter.

We’ll pick from between these two program counter registers here.

	always @(*)
	if (dcd_B[4] == dcd_gie)
		w_pcB_v = dcd_pc;
	else
		w_pcB_v = upc;

We’ll need this value in another moment, so hang on to this thought. For now, let’s return to the read from the register set above, where we left the result in w_op_Bv.

The next thing you need to remember from the ZipCPU’s instruction set is that (just about) every instruction can have an immediate attached. The two inputs, therefore, to any instruction are a register I call operand A, and a second value B which includes an optional register, B, plus an immediate. If no register is read, the instruction only references the immediate.

To handle this selection, the decoder produces a flag, dcd_rB. This will be true any time the B operand comes from a register.

There’s also another special register–the condition codes. This register contains a variety of CPU mode selection bits together with the traditional flags, zero (Z), carry (C), negative (N), and overflow (V). As with the program counter above, there are two versions of this register–one for supervisor mode and one for user mode.

In our next step, we are going to calculate the value of the B input in all respects except the immediate.

First, if there’s no register read, then the input (not including the immediate) is zero.

	always @(*)
	if (!dcd_rB)
		w_op_BnI = 0;

Second, in the case of a concurrent register write, the register value is the result of the register write. This allows the CPU to bypass the register file, so that ALU outputs can be immediately available for the next instruction–making the CPU run that much faster.

	else if ((OPT_PIPELINED)&&(wr_reg_ce)&&(wr_reg_id == dcd_B))
		w_op_BnI = wr_gpreg_vl;

Third, if this is a reference to the condition codes register, then we’ll set that based upon some special flags.

	else if (dcd_Bcc)
		w_op_BnI = { w_cpu_info, w_op_Bv[22:16], 1'b0,
				(dcd_B[4]) ? w_uflags : w_iflags };

Once we get past all of that logic, we can finally deal with the expected situation–that the B operand comes from the result of the register we just read above.

	else
		w_op_BnI = w_op_Bv;

As a final step, we’ll add the immediate value associated with the instruction, dcd_I, to our value and register the result for the next stage.

	always @(posedge i_clk)
	if ((!OPT_PIPELINED)||(op_ce))
	begin
		if ((dcd_Bpc)&&(dcd_rB))
			r_op_Bv <= w_pcB_v + { dcd_I[29:0], 2'b00 };
		else
			r_op_Bv <= w_op_BnI + dcd_I;
	end else if ((OPT_PIPELINED)&&(op_rB)
			&&(wr_reg_ce)&&(op_Bid == wr_reg_id))
		r_op_Bv <= wr_gpreg_vl;

Well, not quite. That’s the last step on that clock. Before we send this value to the ALU, we’ll need to double check that the value isn’t being written to the register file on the same clock (again). In this case, wr_reg_ce is a flag indicating that a value is being written to the register file, wr_reg_id is the address of that register, and wr_gpreg_vl is the value of that register.

	assign	op_Bv = ((OPT_PIPELINED)&&(wr_reg_ce)
					&&(wr_reg_id == op_Bid)&&(op_rB))
			? wr_gpreg_vl: r_op_Bv;

If you got lost in the explanation above, relax. My whole point here is that this logic is complicated.

Because of this complexity, this was one of those pieces of logic that I was most interested in formally verifying within the ZipCPU. Along the way I came across several pipeline hazards I wasn’t expecting. For example, what happens if you set r_op_Bv as the result of a register plus an immediate, only to have the register updated on the next clock? Pipeline logic, primarily captured by the op_ce flag, needs to prevent that from happening.

Here’s the problem: the iCE40 doesn’t support memory reads unless the result is immediately registered, as in:

	always @(posedge i_clk)
		w_op_Av <= regset[dcd_A];

	always @(posedge i_clk)
		w_op_Bv <= regset[dcd_B];

While yosys will try to work around this problem using flip-flops and LUTs, it does so at the cost of about 3k or more LUTs on a 7.6k LUT device. That’s bigger than the ZipCPU itself! This so badly broke the bank for the ZipCPU that unless I could change things there was no way it would work on the iCE40.

The Solution

The basic solution to this problem is to move the register read logic one clock earlier–to shuffle logic from one clock to the next. In other words, I needed to read from the block RAM in the decode stage, before I got to the selection logic above. My fear was, how could I patch this change in and remain certain that the CPU would still work?

Hint: this is one of the reasons why I like formal methods.

So, rather than clocking the register address into decoder outputs, I left it as combinatorial logic alone–making two additional outputs for that purpose. I could now read the register one clock earlier, and clock the value in with the same control signal that was used to clock the instruction decoder outputs, dcd_ce:

	if (dcd_ce)
	begin
		pre_op_Av <= regset[dcd_preA];
		pre_op_Bv <= regset[dcd_preB];
	end

But … what would happen if one of those registers was written on the exact same cycle? (This is actually pretty common.) How can we make sure we have the right value from the registers?

To do that, I captured the result of the last registers write, and created a registered flag to indicate that the correct value wasn’t from the registers set. You can see this pictorially in Fig 2.

Fig 2. Adjusted path from the register read to Operand B

See the difference at the top of the diagram? Let’s walk through this updated logic in steps below. Not only do we read from the register file on a clock, but we also keep track of whether the register we need is being written to the register file on this clock as well.

	always @(posedge i_clk)
	if (dcd_ce)
	begin
		pre_rewrite_flag_A <= (wr_reg_ce)&&(dcd_preA == wr_reg_id);
		pre_rewrite_flag_B <= (wr_reg_ce)&&(dcd_preB == wr_reg_id);
		pre_rewrite_value  <= wr_gpreg_vl;
	end

Now, on the next clock we can select which of the two values we should be using.

	assign	w_op_Av = (pre_rewrite_flag_A) ? pre_rewrite_value : pre_op_Av;
	assign	w_op_Bv = (pre_rewrite_flag_B) ? pre_rewrite_value : pre_op_Bv;

Finally, we have some code we can use to replace the original register file read, but this time the register file is read using clocked rather than combinational logic.

My fear in all of this, though, was that I would somehow get this logic wrong. I mean, there’s so much that can go wrong with this change, how can I rest assured that the result works?

Enter formal methods.

Within the ZipCPU core (on the dev branch currently), there’s a set of properties rebuilding the logic we worked through above describing the values going into the ALU (or memory, divide, etc). Getting the CPU and these properties right took the majority of the work using formal methods on this CPU to date. What makes these properties special is that they aren’t clocked. They don’t need to be. Formal properties don’t need to meet timing requirements like synthesizable logic does. As a result, I can assert that both op_Av and op_Bv have the right value using simpler logic than the logic we expressed above. Even better, I can make that assertion independent of any pipeline stalls or hazards. The result is that the property tends to find hazards I wasn’t expecting.

The end of this chain of logic is a short set of assertions.

	always @(posedge i_clk)
	if ((f_op_insn)&&(!f_const_illegal)&&(!fc_illegal)&&(!clear_pipeline))
	begin
		if (((!wr_reg_ce)||(wr_reg_id!= { gie, `CPU_PC_REG }))
			&&(!dbg_clear_pipe)&&(!clear_pipeline))
		begin
			if ((fc_rA)&&(fc_Aid[3:1] != 3'h7))
				`PHASE_TWO_ASSERT(f_Av == op_Av);
			`PHASE_TWO_ASSERT(f_Bv == op_Bv);
		end
	end

The important part is in the middle. First, I assert that the value of the A register going into the ALU is as it should be, then the B value.

Yes, there are a lot of caveats to this assertion–the values only need to be correct if the illegal instruction flag isn’t set, if the pipeline isn’t being cleared, if the debugging port isn’t going to force the pipeline to be cleared on the next clock, and so on.

The good news? Once I made the change outlined above, I could go back and rerun the formal proof–this time with the NO_DISTRIBUTED_RAM define selected. Once the proof passed, I knew the change “worked”.

Conclusion

Fig 3. Development Process

One of the topics I discuss in the two-day formal verification course I teach is the idea of moving logic between clocks. In the course, we adjust a flag indicating that a counter is non-zero to the clock before the counter’s value in question. To me, this is one of the very important uses of formal verification–being able to tell when you make a change, no matter how complex, that the properties you had asserted before remain true. In this case, I was able to prove that the ZipCPU maintained the correct register inputs, in spite of moving the logic earlier by one clock.

Does this mean that the ZipCPU works on the iCE40? Well, let’s just say that now it builds for the iCE40 whereas it wouldn’t build before. The project remains ongoing. I would like to implement the ZipCPU both on the ICO board as well as on Luke Valenty’s new TinyFPGA. Before I get there, though, I’ve still got a lot of peripheral simulation work to get through.

Simulation work? Yes. To avoid FPGA Hell, I’m not only running my designs through the (incomplete) formal proof, but also through simulation. You can see my process in the figure on the right.