Debugging your soft-core CPU within an FPGA

We’ve already looked at the requirements for debugging a CPU in general, as well as how to debug a CPU within a Verilator-based simulation. Let’s now return to this topic and look at how to modify your soft-core CPU so that you can debug it once it has been placed within an FPGA.

Fig 1: Soft-Core CPU H/W Debugging Needs

When we discussed the general needs of a debugger, we used a figure similar to Fig 1 to describe a CPU’s debugging needs. We addressed the left column, debugging the CPU while in simulation, in a previous post. Today, the figure has been modified to highlight the right column and today’s focus: how to add the logic a soft-core CPU needs to support debugging.

As shown in the diagram, the basic operations we’re going to need to support are resetting, starting, halting, and stepping a CPU, as well as examining and changing CPU state registers. You may wish to review how the ZipCPU handles pipeline control, since the logic we discuss today needs to fit nicely into that context.

The H/W Debugging Interface

If you’ve never done this before, please don’t start by trying to implement GDB’s remote serial protocol within Verilog. The protocol is very powerful, and we’ll discuss how to use it later to connect your CPU to GDB. The problem is that the protocol is complex, and processing it within hardware will take a lot of work. Keep reading; there’s an easier way.

As a first step, think for a moment about what debugging your CPU will require. In particular, you’ll want to be able to read and write both registers and memory.

Reading from memory requires the address you wish to read from, as well as a strobe signal to indicate your desire to read.

If your address space is big enough, this same sort of command and interface can work for reading CPU registers just like it does for memory.

Writing to memory requires an address, a value, and a strobe to tell you when to write the value to the given address.

As with reading, if you can allocate an address for each register within your CPU, the same interface you used for reading registers could also work for writing registers within your CPU.

The approach can even be expanded to include not only register values, but also internal (debugging) state variables from within your CPU.

As a third example, a control register could also be used to tell the CPU when to execute an instruction, and when to hold in reset.
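To make the idea concrete, here is a minimal sketch of such an interface. It assumes a hypothetical slave with a two-bit address: address 0 is a control register (bit 0 holds the CPU in reset and bit 1 lets it run, both bit assignments invented for this example), while addresses 1 through 3 stand in for CPU registers. Reads need only an address and a strobe; writes add a value.

```verilog
// Hypothetical debug-bus slave: each item of interest has an address,
// so reads and writes share one strobe/address/data interface.
module dbg_regs_sketch (
	input	wire		i_clk,
	input	wire		i_stb,		// request strobe
	input	wire		i_we,		// 1 = write, 0 = read
	input	wire	[1:0]	i_addr,		// register address
	input	wire	[31:0]	i_data,		// value to write
	output	reg		o_ack,		// request completed
	output	reg	[31:0]	o_data,		// value read back
	output	wire		o_cpu_reset,	// control bit 0 (assumed)
	output	wire		o_cpu_run	// control bit 1 (assumed)
);
	// Address 0 is the control register; addresses 1-3 stand in
	// for CPU registers.  regs[0] is never used.
	reg	[31:0]	ctrl;
	reg	[31:0]	regs	[0:3];

	initial	ctrl = 32'h1;	// start with the CPU held in reset

	// Writes: address, value, and strobe
	always @(posedge i_clk)
	if (i_stb && i_we)
	begin
		if (i_addr == 2'b00)
			ctrl <= i_data;
		else
			regs[i_addr] <= i_data;
	end

	// Reads: address and strobe
	always @(posedge i_clk)
	if (i_stb)
		o_data <= (i_addr == 2'b00) ? ctrl : regs[i_addr];

	// Every request completes on the next clock
	initial	o_ack = 1'b0;
	always @(posedge i_clk)
		o_ack <= i_stb;

	assign	o_cpu_reset = ctrl[0];
	assign	o_cpu_run   = ctrl[1];
endmodule
```

Since the control register is just another address, starting or resetting the CPU is simply a bus write.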

Fig 2: Placing a CPU on the Debugging Bus

All of these interactions, therefore, are easily understood as things that could take place across a “bus” with both memory and memory-mapped peripherals on it. One might therefore consider giving the CPU a bus slave interface, and hooking it up to the debugging bus we’ve been working with (or something similar), as shown in Fig 2.

This approach has several advantages. First, the debugging bus can be used to debug both the peripherals and the memory the CPU will need to work with later. Second, if the CPU and debugging bus are each given the same view of the peripheral set, then no separate address map and decoder need to be created. Third, this approach creates a means, independent of the CPU, of reading and writing memory. This could be very important later when building a program loader for the CPU, since it would allow you to load the CPU’s program into memory and test it without relying on an internal ROM within the CPU, which would force the design to be resynthesized any time the program changes.

The downside of this approach is that, depending upon your implementation, the bus arbiter may slow the CPU’s access to memory by a clock (or two).
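As an illustration of where that extra clock comes from, here is a minimal sketch of a two-master arbiter with registered grants. It assumes, hypothetically, that the debugging bus wins ties and that each owner keeps the bus until it drops its cycle line. This is not the ZipCPU’s arbiter, just one simple way to build one.

```verilog
// Hypothetical two-master arbiter with registered grants.  The
// debugging bus gets priority when the bus is idle; an owner keeps
// the bus until it drops its cycle line.
module arbiter_sketch (
	input	wire	i_clk,
	input	wire	i_cpu_cyc,	// CPU requests the bus
	input	wire	i_dbg_cyc,	// debugging bus requests the bus
	output	reg	o_cpu_grant,
	output	reg	o_dbg_grant
);
	initial	o_cpu_grant = 1'b0;
	initial	o_dbg_grant = 1'b0;

	always @(posedge i_clk)
	if (!o_cpu_grant && !o_dbg_grant)
	begin
		// Idle: grant on the *next* clock.  This register is
		// where the extra clock of latency comes from.
		o_dbg_grant <= i_dbg_cyc;
		o_cpu_grant <= i_cpu_cyc && !i_dbg_cyc;
	end else if (o_cpu_grant && !i_cpu_cyc)
		o_cpu_grant <= 1'b0;	// CPU is done
	else if (o_dbg_grant && !i_dbg_cyc)
		o_dbg_grant <= 1'b0;	// debugger is done
endmodule
```

Because the grant is registered, a request waits one clock for its grant, and handing the bus from one master to the other costs an additional idle clock.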

Fig 3: ZipCPU's debugging interface

This was the approach taken by the ZipCPU, as shown in Fig 3, so we’ll use the ZipCPU as our example in the discussion below. The ZipCPU was given two address locations on the debugging bus: a control location and a data location. (Both are discussed and defined in the ZipCPU specification document.) A small wrapper around the ZipCPU proper, called ZipBones, connects to the control register of the debug slave port and controls the reset and halt lines into the CPU. These are used to implement the reset, halt, start, and step operations, as we’ll see shortly.

The ZipCPU also has a second wrapper with more functionality, called the ZipSystem, but since the logic within ZipBones is simpler, we’ll focus on it.

Our discussion will focus on the reads and writes of these two locations, the control and data ports, although you may wish to give your own CPU more registers than just these two.
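The code fragments that follow use a dbg_cmd_write strobe without defining it. A plausible decode, assuming a single address bit on the debug port with 0 selecting the control register and 1 selecting the data register, might look like this; the module name and the other two output names are hypothetical.

```verilog
// Hypothetical decode of a two-word debug port: address 0 is the
// control register, address 1 the data register.
module dbg_port_decode_sketch (
	input	wire	i_dbg_stb,	// request strobe
	input	wire	i_dbg_we,	// write enable
	input	wire	i_dbg_addr,	// 0 = control, 1 = data
	output	wire	dbg_cmd_write,	// write to the control register
	output	wire	dbg_data_write,	// write to the selected CPU register
	output	wire	dbg_data_read	// read from the selected CPU register
);
	assign	dbg_cmd_write  = i_dbg_stb &&  i_dbg_we && !i_dbg_addr;
	assign	dbg_data_write = i_dbg_stb &&  i_dbg_we &&  i_dbg_addr;
	assign	dbg_data_read  = i_dbg_stb && !i_dbg_we &&  i_dbg_addr;
endmodule
```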

Resetting, halting, and stepping the CPU

Let’s look at the control register for the ZipCPU for a moment. Writes to this control register have the side-effect of controlling the i_halt and i_rst (reset) lines within the CPU. These side effects will cause the ZipCPU to run, halt, step, or even reset as requested.

The first side effect to be discussed is the reset. Like many digital logic cores, the ZipCPU has a reset line going into it. Controlling this reset is also quite possibly the simplest interaction with the bus. Specifically, any time the control register is written with the reset bit set, the CPU is reset. Further, this reset line into the CPU is initialized high, to make sure that the CPU always starts from a reset state.

initial	cmd_reset = 1'b1;
always @(posedge i_clk)
	cmd_reset <= ((dbg_cmd_write)&&(i_dbg_data[`RESET_BIT]));

Inside the ZipCPU, this reset line causes the ZipCPU to reboot. While the reset only (re-)initializes a minimum of variables, it is enough to get the CPU to start from (nearly) known conditions. In particular, all error conditions, cache valid indications, and pipeline valid flags are cleared on reset. Further, the CPU is sent to a pre-programmed address. What doesn’t happen is re-initialization of the general-purpose registers (the program counter and flags registers are reset, though). This allows some amount of fault recovery in software, if desired, before all of the registers are set to known conditions.
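The reset policy just described can be sketched as follows. This is a hypothetical fragment, not the ZipCPU’s code: RESET_ADDRESS is an assumed value, i_advance stands in for whatever normally advances the program counter, and a single o_valid flag stands in for the pipeline’s many valid flags. The point is that the program counter and valid flag return to known values while the register file is left alone.

```verilog
// Hypothetical fragment of a CPU's reset behavior: the PC and the
// pipeline-valid flag are reset, but the register file is not.
module reset_sketch #(
	parameter [31:0] RESET_ADDRESS = 32'h0000_1000	// assumed value
) (
	input	wire		i_clk,
	input	wire		i_rst,
	input	wire		i_advance,	// an instruction issues (stand-in)
	input	wire		i_wr_en,	// register-file write port
	input	wire	[3:0]	i_wr_reg,
	input	wire	[31:0]	i_wr_val,
	output	reg	[31:0]	o_pc,
	output	reg		o_valid,	// pipeline-valid flag
	output	wire	[31:0]	o_r3		// register 3, for the example
);
	reg	[31:0]	regset	[0:15];

	initial	o_pc    = RESET_ADDRESS;
	initial	o_valid = 1'b0;

	// Program counter and valid flag: always return to known values
	always @(posedge i_clk)
	if (i_rst)
	begin
		o_pc    <= RESET_ADDRESS;
		o_valid <= 1'b0;
	end else if (i_advance)
	begin
		o_pc    <= o_pc + 32'd4;	// assumes 4-byte instructions
		o_valid <= 1'b1;
	end

	// General-purpose registers: deliberately left alone on reset,
	// so software can inspect them afterwards
	always @(posedge i_clk)
	if (i_wr_en)
		regset[i_wr_reg] <= i_wr_val;

	assign	o_r3 = regset[3];
endmodule
```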

The second control line going into the CPU is a master halt line, i_halt. This line, if set, will cause the CPU to halt in such a way that no instructions will go into the ALU, memory, or divide units, but instructions that have already entered these units will be allowed to finish. It does this by setting the stall logic associated with these units, as we discussed in our CPU pipeline signaling post.

The neat thing about the master halt line concept is that, when using it, the CPU is designed to halt at a stopping point between instructions. Instructions that have entered the ALU, memory, or divide stages are allowed to complete, but further instructions are not allowed to enter these stages. As a result, the CPU can be started, stepped, or halted simply by adjusting this master halt line (i.e. i_halt).
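The key property, that i_halt blocks new instructions but never aborts one already in flight, can be shown with a hypothetical two-clock execution unit. In this sketch i_halt appears only in the issue gate; the unit’s own state never looks at it. The module and port names are illustrative, not ZipCPU signals.

```verilog
// Hypothetical fragment: i_halt only blocks *entry* into the
// execution unit; an instruction already inside always finishes.
module halt_gate_sketch (
	input	wire	i_clk,
	input	wire	i_reset,
	input	wire	i_halt,		// master halt line
	input	wire	i_op_valid,	// an instruction is ready to issue
	output	wire	o_issue,	// it enters the unit this clock
	output	reg	o_busy,		// unit holds an instruction
	output	reg	o_done		// unit finished an instruction
);
	// The only place i_halt appears: the gate into the unit
	assign	o_issue = i_op_valid && !i_halt && !o_busy;

	initial	o_busy = 1'b0;
	initial	o_done = 1'b0;
	always @(posedge i_clk)
	if (i_reset)
	begin
		o_busy <= 1'b0;
		o_done <= 1'b0;
	end else begin
		o_busy <= o_issue;	// two-clock unit: accept ...
		o_done <= o_busy;	// ... then complete, halted or not
	end
endmodule
```

Raising i_halt while o_busy is set still lets o_done pulse on the next clock; after that the unit stays idle until i_halt drops.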

This i_halt line into the CPU is calculated from a couple of pieces of logic in the ZipBones wrapper. The first is the cmd_halt register, which is controlled by writes to the control register. On a reset, the CPU will start in a halted mode if the boolean parameter START_HALTED is set. Thereafter, any write to the halt bit in the ZipBones control register will set or clear this bit, with two exceptions: cmd_step and cpu_break.

`define	STEP_BIT	8
`define	HALT_BIT	10

always @(posedge i_clk)
	if ((i_rst)||(cmd_reset))
		cmd_halt <= (START_HALTED);
	else if (dbg_cmd_write)
		cmd_halt <= ((i_dbg_data[`HALT_BIT])&&(!i_dbg_data[`STEP_BIT]));
	else if ((cmd_step)||(cpu_break))
		cmd_halt <= 1'b1;

The first exception is the cmd_step logic. If the halt bit is set at the same time the CPU is instructed to step forward by one instruction, then the halt request is ignored until cmd_step has been true for one clock. We’ll come back to this exception in a moment.

The second exception is the cpu_break signal, shown in Fig. 3 as the hardware break signal. This is the signal the CPU creates when it has encountered an unrecoverable fault, such as trying to execute an unimplemented instruction while in the supervisor (i.e. interrupt) state. Other faults within the supervisor state will trigger this break as well, such as the break instruction, a divide-by-zero fault from within supervisor mode, or a Wishbone bus error. The cmd_halt register captures that fault, and then holds the CPU in a halted state for the debugger to come by and examine it. (Alternatively, the CPU could be programmed to just reboot.)

On that note, let’s return to the step bit. If the step bit is ever set, the ZipBones wrapper will release the halt line for one clock and then immediately set it again. This will cause one instruction to enter the ALU/memory pipeline stage. It works in conjunction with the cmd_halt logic above: if cmd_step is ever true, the cmd_halt register will be set on the next clock.

initial	cmd_step  = 1'b0;
always @(posedge i_clk)
	cmd_step <= (dbg_cmd_write)&&(i_dbg_data[`STEP_BIT]);
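Putting the reset, halt, and step fragments together gives the following sketch. STEP_BIT and HALT_BIT use the bit positions from the code above, while RESET_BIT is an assumed position. Driving the CPU’s i_halt directly from cmd_halt is a simplification for this example, not necessarily what ZipBones does.

```verilog
// Hypothetical ZipBones-style halt/step control, assembled from the
// fragments above.
module halt_step_sketch #(
	parameter [0:0]	START_HALTED = 1'b1
) (
	input	wire		i_clk,
	input	wire		i_rst,
	input	wire		dbg_cmd_write,	// write to the control register
	input	wire	[31:0]	i_dbg_data,
	input	wire		cpu_break,	// unrecoverable CPU fault
	output	wire		o_halt		// drives the CPU's i_halt
);
	localparam	RESET_BIT = 6,		// assumed position
			STEP_BIT  = 8,
			HALT_BIT  = 10;

	reg	cmd_reset, cmd_halt, cmd_step;

	initial	cmd_reset = 1'b1;
	always @(posedge i_clk)
		cmd_reset <= dbg_cmd_write && i_dbg_data[RESET_BIT];

	initial	cmd_halt = START_HALTED;
	always @(posedge i_clk)
	if (i_rst || cmd_reset)
		cmd_halt <= START_HALTED;
	else if (dbg_cmd_write)
		cmd_halt <= i_dbg_data[HALT_BIT] && !i_dbg_data[STEP_BIT];
	else if (cmd_step || cpu_break)
		cmd_halt <= 1'b1;

	initial	cmd_step = 1'b0;
	always @(posedge i_clk)
		cmd_step <= dbg_cmd_write && i_dbg_data[STEP_BIT];

	// A step clears cmd_halt for exactly the one clock in which
	// cmd_step is set, letting a single instruction through.
	assign	o_halt = cmd_halt;
endmodule
```

Writing the control register with both the step and halt bits set clears cmd_halt for exactly one clock, after which cmd_step forces it high again; writing it with the halt bit clear lets the CPU run until a cpu_break arrives.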

While this description may sound simple, the devil is in the details. For example, what happens when the CPU is in the middle of an atomic operation? What if an interrupt comes in while the debugger has the CPU halted? (It gets ignored.) What if the CPU is in the middle of executing a pair of instructions from a compressed instruction set word? (The ZipCPU has no ability to restart a compressed instruction word mid-way through …) What if the CPU is loading a cache line, and the memory is slow to respond? (i.e. broken)

All of these details can make this halt line difficult to implement.

Clearing the Cache

Before we move on to gaining access to CPU registers, the control register offers one more big capability: clearing the cache.

initial	cmd_clear_pf_cache = 1'b0;
always @(posedge i_clk)
	cmd_clear_pf_cache <= (dbg_cmd_write)&&(i_dbg_data[`CLEAR_CACHE_BIT]);

This is one of those annoying details that you may not think of initially. If the CPU is halted, the debugger is free to change memory, right? Hence, the debugger might wish to swap a normal instruction for a BREAK instruction or vice versa. The problem lies in whether the CPU has already read that instruction into its cache. If the instruction the debugger wishes to change is already in the cache, then the CPU might not notice the fact that the debugger has changed that memory. (The ZipCPU cache has no bus snooping capability … yet.)

This command also clears the CPU’s pipeline, for essentially the same reason: the instruction the debugger wishes to change might already be within the pipeline, just waiting to execute. We discussed how this is done earlier, when we looked at how the ZipCPU implements its pipeline logic.
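Clearing a cache of this sort usually amounts to clearing every line’s valid bit. The sketch below assumes a hypothetical prefetch cache with one valid bit per line and a single stand-in pipeline-valid flag; i_clear_cache would be driven by something like cmd_clear_pf_cache. It is not the ZipCPU’s prefetch.

```verilog
// Hypothetical prefetch-cache fragment: one valid bit per cache line,
// plus a stand-in pipeline-valid flag.  A clear command invalidates
// both, forcing instructions to be fetched again from memory.
module pf_clear_sketch #(
	parameter	LGLINES = 2		// 4 lines, for the example
) (
	input	wire			i_clk,
	input	wire			i_reset,
	input	wire			i_clear_cache,
	input	wire			i_fill,		// a line was just loaded
	input	wire	[LGLINES-1:0]	i_fill_line,
	input	wire			i_issue,	// instruction entered pipeline
	output	reg [(1<<LGLINES)-1:0]	o_line_valid,
	output	reg			o_pipe_valid
);
	initial	o_line_valid = 0;
	always @(posedge i_clk)
	if (i_reset || i_clear_cache)
		o_line_valid <= 0;		// forget everything cached
	else if (i_fill)
		o_line_valid[i_fill_line] <= 1'b1;

	initial	o_pipe_valid = 1'b0;
	always @(posedge i_clk)
	if (i_reset || i_clear_cache)
		o_pipe_valid <= 1'b0;		// flush anything already fetched
	else if (i_issue)
		o_pipe_valid <= 1'b1;
endmodule
```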

Reading and Setting Registers

While a proper bus protocol makes sense for reading from CPU registers, as we discussed above, the ZipCPU’s debug implementation isn’t quite a full bus implementation. Perhaps this interaction is ready for redesign. For now, I’ll just explain it as it is.

The ZipCPU control register contains a set of six address bits. Writes to the control register can be used to set these six address bits, as well as other flags such as those we discussed above. These bits then become the Wishbone address of a register within the CPU. Thereafter, reads from (or writes to) the ZipCPU data register will read (or write) the CPU register addressed by these six bits.
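This addressing scheme can be sketched as follows, assuming a dbg_cmd_write strobe like the one used in the code above and a hypothetical i_data_write strobe for writes to the data register. In this sketch only the low five bits select one of 32 registers; what the sixth bit selects isn’t modeled.

```verilog
// Hypothetical control-register address latch: a control write saves
// six address bits; later data-port accesses use them to pick a CPU
// register, so many reads or writes can follow one control write.
module dbg_addr_sketch (
	input	wire		i_clk,
	input	wire		dbg_cmd_write,	// write to the control register
	input	wire		i_data_write,	// write to the data register
	input	wire	[31:0]	i_dbg_data,
	output	reg	[5:0]	cmd_addr,	// saved register address
	output	wire	[4:0]	o_cpu_dbg_reg,	// register the CPU should access
	output	wire		o_cpu_dbg_we	// write that register
);
	initial	cmd_addr = 6'd0;
	always @(posedge i_clk)
	if (dbg_cmd_write)
		cmd_addr <= i_dbg_data[5:0];	// upper bits: halt, step, ...

	assign	o_cpu_dbg_reg = cmd_addr[4:0];
	assign	o_cpu_dbg_we  = i_data_write;
endmodule
```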

Remember how we discussed earlier that a register read from a bus is just a big case statement? The same is true of the ZipCPU. The only difference within the ZipCPU is that 28 of the 32 ZipCPU registers are stored in an on-chip RAM, while the other four are collected from a set of control and status bits and the two program counters. Reading from the bus is therefore almost the same as the big case statement we discussed earlier:

always @(posedge i_clk)
begin
	// 28 registers are normal, and can be read from a memory
	o_dbg_reg <= regset[i_dbg_reg];

	// The PC is a bit different
	if (i_dbg_reg[3:0] == `CPU_PC_REG)
		o_dbg_reg <= w_debug_pc;
	else if (i_dbg_reg[3:0] == `CPU_CC_REG)
	begin
		// As is the flags register
		o_dbg_reg[14:0] <= (i_dbg_reg[4])?w_uflags:w_iflags;
		o_dbg_reg[15] <= 1'b0;
		o_dbg_reg[31:23] <= w_cpu_info;
		o_dbg_reg[`CPU_GIE_BIT] <= gie;
	end
end

Writes are a touch more difficult, since the debugger needs to insert any register writes into the processing chain of the CPU.

The ZipCPU handles such writes by creating a module parallel with the ALU and memory units. This module (really only a register and about four lines of code) is only active while the CPU is halted.

always @(posedge i_clk)
	dbgv <= (!i_rst)&&(i_halt)&&(i_dbg_we)&&(r_halted);
always @(posedge i_clk)
	dbg_val <= i_dbg_data;

Likewise, the register (and value) the ALU would’ve written upon completion is modified during a halt as well:

always @(posedge i_clk)
	if (adf_ce_unconditional)
		// A normal register write, if the CPU is running
		alu_reg <= op_R;
	else if ((i_halt)&&(i_dbg_we))
		// A debug register write, requiring the CPU to be halted
		alu_reg <= i_dbg_reg;

This then sets the write values on the clock before writeback. (The adf_ce_unconditional flag is a piece of the ZipCPU’s pipeline logic that we may come back and address in more detail later in a post on pipelining.)
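To see how these pieces meet at writeback, the following sketch combines the dbgv/dbg_val registers and the alu_reg multiplexer above with a hypothetical one-clock ALU. Here i_alu_ce stands in for adf_ce_unconditional, and the probe port exists only so the example can be checked; neither is ZipCPU code.

```verilog
// Hypothetical writeback fragment: debug writes (dbgv/dbg_val) share
// the register-file write port with a one-clock ALU.
module dbg_wb_sketch (
	input	wire		i_clk,
	input	wire		i_rst,
	input	wire		i_halt,		// master halt line
	input	wire		r_halted,	// pipeline has fully drained
	input	wire		i_dbg_we,	// debug write request
	input	wire	[4:0]	i_dbg_reg,
	input	wire	[31:0]	i_dbg_data,
	input	wire		i_alu_ce,	// instruction enters the ALU
	input	wire	[4:0]	op_R,		// its destination register
	input	wire	[31:0]	i_alu_value,	// its (precomputed) result
	input	wire	[4:0]	i_probe,	// register to observe
	output	wire	[31:0]	o_probe
);
	reg	[31:0]	regset	[0:31];
	reg		dbgv, alu_valid;
	reg	[31:0]	dbg_val, alu_result;
	reg	[4:0]	alu_reg;

	// The "debug unit": a register and a valid flag, only while halted
	initial	dbgv = 1'b0;
	always @(posedge i_clk)
		dbgv <= (!i_rst)&&(i_halt)&&(i_dbg_we)&&(r_halted);
	always @(posedge i_clk)
		dbg_val <= i_dbg_data;

	// Stand-in one-clock ALU
	initial	alu_valid = 1'b0;
	always @(posedge i_clk)
		alu_valid <= (!i_rst)&&(i_alu_ce);
	always @(posedge i_clk)
		alu_result <= i_alu_value;

	// Destination register: from the instruction, or from the debugger
	always @(posedge i_clk)
	if (i_alu_ce)
		alu_reg <= op_R;
	else if ((i_halt)&&(i_dbg_we))
		alu_reg <= i_dbg_reg;

	// Writeback: one write port, two possible sources
	always @(posedge i_clk)
	if (alu_valid || dbgv)
		regset[alu_reg] <= (dbgv) ? dbg_val : alu_result;

	assign	o_probe = regset[i_probe];
endmodule
```

Because the debug path only activates while the CPU is halted, it never competes with the ALU for the single write port.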

Finally, so that the debugger can know when this write has occurred, the ZipCPU holds the stall line high any time the CPU hasn’t completely halted.

assign	o_dbg_stall = !r_halted;

You may notice, if you look at the code, that there’s no acknowledgement line. Indeed, the acknowledgement line is generated at the bottom of the ZipBones file, based upon the fact that any request made to the CPU is successful as long as the CPU isn’t stalled.
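One way to generate such an acknowledgement is sketched below: any request that isn’t stalled is acknowledged on the following clock. Letting only data-port requests stall is an assumption of this sketch, not something taken from ZipBones; the reasoning is that control writes presumably need to get through while the CPU is still running, or there would be no way to halt it.

```verilog
// Hypothetical acknowledgement logic: any request that is not stalled
// is acknowledged on the next clock.  Only data-port requests (which
// need the CPU) stall, and only until the CPU has halted.
module dbg_ack_sketch (
	input	wire	i_clk,
	input	wire	i_reset,
	input	wire	i_dbg_stb,	// request strobe
	input	wire	i_dbg_addr,	// 0 = control, 1 = data
	input	wire	r_halted,	// CPU has completely halted
	output	wire	o_dbg_stall,
	output	reg	o_dbg_ack
);
	assign	o_dbg_stall = i_dbg_addr && !r_halted;

	initial	o_dbg_ack = 1'b0;
	always @(posedge i_clk)
	if (i_reset)
		o_dbg_ack <= 1'b0;
	else
		o_dbg_ack <= i_dbg_stb && !o_dbg_stall;
endmodule
```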

Conclusion

You may notice that the logic above depends upon only a couple of wires, and that these wires have very little logic associated with them. This is how digital design should be. The trick to every problem is knowing how to make it simple.

In our case, the problem was simplified by first creating a debugging bus to give us access to our hardware and peripherals, along with understanding several pipeline strategies, and then by understanding how a simple CPU can use such a strategy.

This still leaves us with many more ZipCPU topics to discuss, such as how to add or remove peripherals by simply adding or removing parameters from an AutoFPGA command line. However, we are going to postpone that discussion until after I have the opportunity to discuss AutoFPGA at ORCONF this year.

In the meantime, then, I’d like to turn this blog’s attention to the DSP topics of both sine wave generation and digital filtering. We’ll come back to the ZipCPU later, if for no other reason than that I’ve been asked to discuss how to modify GCC to support a new CPU backend.