The ZipCPU meets blinky
If you’ve never heard of “blinky” before, it’s the name given to a piece of software or even an FPGA design that simply blinks an LED. We’ve discussed building blinky before, as well as the more advanced (but no less fun) project of moving an active LED back and forth across a set of LEDs like one of my favorite TV shows as a kid, “Knight Rider”.
Even better, we’ve also discussed how to create a general purpose I/O controller, which could then be used to run a blinky program from within a CPU. In that particular article, we also measured how fast a CPU could toggle an I/O pin as part of such a blinky program. The resulting toggle rates, between 1 and 47MHz, are fairly impressive for a soft-core running within an FPGA with a 100MHz system clock.
Today, let’s return to blinky again, but this time let’s compare and contrast several approaches to the problem of toggling four separate LEDs: one at 1Hz, one at 2Hz, one at 3Hz, and a fourth one at 5Hz.
FPGA blinky
The easiest way to toggle four separate LEDs at once on an FPGA is to create four separate blinky modules. Each module would have a counter of some number of bits, say MSB+1 bits, so it might be defined as reg [MSB:0] counter. We could then add a different step to the counter on every clock within each module, where the step size is 2^(MSB+1)/CLOCK_FREQUENCY * BLINK_FREQUENCY, and so create blinking LEDs at any frequency we want.
Internally, this might look something like Fig. 1 on the right. Fig. 1 shows four separate logic blocks, each similar to the block above, and each toggling an LED at its own rate.
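As a rough illustration of the arithmetic, here is a host-side C model of one such blinky module. It is only a sketch: the real design would be Verilog, MSB is assumed to be 31, and the names blink_step and blink_led are made up for this example. The LED is taken to be the top bit of the counter, which is one common choice and an assumption here.

```c
#include <stdint.h>

/* Step added to a 32-bit counter on every clock so that the counter
 * wraps blink_hz times per second: 2^32 * blink_hz / clock_hz.
 * Assumes MSB = 31 and clock_hz != 0; names are illustrative only. */
uint32_t blink_step(uint32_t clock_hz, uint32_t blink_hz)
{
	return (uint32_t)((((uint64_t)1 << 32) * blink_hz) / clock_hz);
}

/* Model of one blinky module after nclocks clock ticks.  The counter
 * wraps naturally at 32 bits, and the LED is its top bit. */
int blink_led(uint32_t step, uint64_t nclocks)
{
	uint32_t counter = (uint32_t)(step * nclocks);

	return (int)((counter >> 31) & 1u);
}
```

With a 100MHz clock the 1Hz step truncates to 42, so the LED runs very slightly slow. A wider counter shrinks that error.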
Of course, this doesn’t tie our LEDs together in phase. What if we wanted all of them to have the same phase, so that they all turned on together at the top of a second?
In that case, we’d need to multiply a common counter, set to step at 2^(MSB+1)/CLOCK_FREQUENCY, by our blink frequency to get the result.
If you’ve never built a design like this before, then I would encourage you to try this. Remember to formally verify it first, and then run it in simulation. The tutorial should help you there if you have any questions.
For a next level challenge, consider removing the multiplies and replacing them with shifts and adds. Since we’re only multiplying by 2, 3, or 5 above, this should be fairly easy.
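If you’d like to check your answer to that challenge, the following host-side C sketch models the phase-aligned approach with the multiplies replaced by shifts and adds. It assumes a 32-bit common counter that wraps once per second and an LED taken from the top bit of each product. The function names and the LED bit assignments are hypothetical.

```c
#include <stdint.h>

/* Multiplication by small constants using only shifts and adds.
 * Results wrap at 32 bits, which is exactly what we want here. */
uint32_t times2(uint32_t c) { return c << 1; }
uint32_t times3(uint32_t c) { return (c << 1) + c; }
uint32_t times5(uint32_t c) { return (c << 2) + c; }

/* Four phase-aligned LEDs derived from one common counter.
 * Bit 0 blinks at 1Hz, bit 1 at 2Hz, bit 2 at 3Hz, bit 3 at 5Hz. */
unsigned aligned_leds(uint32_t counter)
{
	unsigned leds = 0;

	leds |= ((counter         >> 31) & 1u) << 0;
	leds |= ((times2(counter) >> 31) & 1u) << 1;
	leds |= ((times3(counter) >> 31) & 1u) << 2;
	leds |= ((times5(counter) >> 31) & 1u) << 3;
	return leds;
}
```

Because every LED is derived from the same counter, all four products return to zero together whenever the common counter wraps.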
If you work with FPGAs at all, this test should be fairly basic–perhaps even too easy. If you haven’t, this is a fun place to start.
Today, though, let’s take it up a notch.
CPU Blinky, polled
Moving from simple logic to a CPU is a big step within an FPGA. Even if we use a resource minimized CPU, such as the ZipCPU, you’ll still need a lot of additional infrastructure. At a minimum, you’ll need a ROM to store startup instructions (I like using a flash memory device), a RAM to hold any local variables, a timer to determine what 1Hz is and a GPIO controller to actually set any LED values. We’ll also use a Programmable Interrupt Controller (PIC) as part of our solution. Tying all of these together will require some type of memory and peripheral bus (I like Wishbone), such as is shown in Fig. 3 on the right.
When we last discussed blinking an LED from a CPU, we discussed how to turn the LED on by writing to a particular register from a C program,
and then writing again to turn it off.
If you wanted to toggle an LED across some time period, you could wait in a for loop, and then toggle your LED–as shown below and in Fig. 4 on the left.
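A minimal sketch of such a loop follows. It is written so it can also be tried on a host machine: the GPIO register is passed in as a pointer, and the write-mask-in-the-top-16-bits convention of the GPIO controller is assumed. The function name, the LED bit, and the ntoggles argument are all assumptions for illustration, since a real blinky would simply loop forever.

```c
/* Hypothetical LED on GPIO output bit 0.  The upper half-word of a
 * write selects which output bits the write is allowed to change. */
#define LED_BIT  0x0001u
#define LED_ON   ((LED_BIT << 16) | LED_BIT)
#define LED_OFF  (LED_BIT << 16)

/* Toggle the LED ntoggles times, spinning waittime loop iterations
 * between toggles.  waittime plays the role of WAITTIME. */
void busy_blink(volatile unsigned *gpio, unsigned waittime,
		unsigned ntoggles)
{
	unsigned led = 0;

	for (unsigned k = 0; k < ntoggles; k++) {
		/* Uncalibrated delay: volatile keeps the compiler
		 * from optimizing the empty loop away. */
		for (volatile unsigned w = 0; w < waittime; w++)
			;
		led = !led;
		*gpio = led ? LED_ON : LED_OFF;
	}
}
```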
Getting the value for WAITTIME just right might take some work. Worse, caches are notorious for providing fast but unpredictable wait times for both instructions and data.
But how might we handle four LEDs, each toggling at a different rate?
We could use a timer! Remember our work building an interval timer some time ago? Let’s use it now. We’ll suppose the address of this timer is kept in the constant pointer _timer, and we’ll have it create a repeating interval of one millisecond.
We can then use our interrupt controller, with a register address at _pic, to determine if this timer has tripped:
If the interrupt has tripped, we know a millisecond has passed, so we can increment our millisecond counter, count_ms, and then toggle each of our LEDs.
Of course, polling the interrupt controller like this is really the wrong way to do this, but let’s come back to this thought in the next section when we discuss how to do this with interrupts.
One of the neat things about this approach compared to the uncalibrated for loop above is that we can now know exactly how many clocks take place between interrupts. We also know that, should the CPU be late in updating an LED, the sequence will at least maintain its frequency rather than drifting later and later and so running slower, since the timer restarts itself every millisecond in this case.
For now, let’s look at this toggle_leds_on_ms() function. How should this function work? The same way as before! We’ll multiply our counter by the toggle rate, divide by the number of counts per second, and then grab the resulting bit of interest from the number of times the counter wraps.
Remember the way we constructed our GPIO controller: the top 16 bits on read are any inputs to our design, while the bottom 16 bits are outputs. To be able to set particular output bits and not others, we write a mask of the outputs we wish to adjust to the top 16 bits of the register–hence the (which_led << 16) logic from above. That way we can leave the other outputs alone, adjusting only the ones we want to change.
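The following host-side sketch pulls these two ideas together. It is not the original toggle_leds_on_ms(), only one plausible way to do the same arithmetic: the helper names are made up, the four LEDs are assumed to sit on GPIO output bits 0 through 3, and the LED state is taken to be the parity of the number of half periods that have elapsed.

```c
#include <stdint.h>

/* LED state for a given rate: count_ms * rate_hz / 1000 counts whole
 * blink periods, so doubling it counts half periods.  The low bit of
 * that count says which half of the period we are in. */
static unsigned led_state(uint32_t count_ms, unsigned rate_hz)
{
	return (unsigned)(((uint64_t)count_ms * rate_hz * 2u) / 1000u) & 1u;
}

/* Build a GPIO write word: the mask of outputs to change goes in the
 * top 16 bits, the new values for those outputs in the bottom 16. */
uint32_t gpio_word(unsigned which_led, unsigned on)
{
	return ((uint32_t)which_led << 16) | (on ? which_led : 0u);
}

/* The word to write to the GPIO register for all four LEDs,
 * assumed here to be output bits 0..3 at 1, 2, 3 and 5 Hz. */
uint32_t leds_for_ms(uint32_t count_ms)
{
	static const unsigned rate_hz[4] = { 1, 2, 3, 5 };
	uint32_t leds = 0;

	for (unsigned k = 0; k < 4; k++)
		if (led_state(count_ms, rate_hz[k]))
			leds |= 1u << k;
	return (0x000fu << 16) | leds;
}
```

On the real hardware the polling loop would write something like this word to the GPIO register once per millisecond.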
Interrupt driven CPU Blinky
What if we could shut the CPU down, though, when nothing was changing? This is the purpose of the ZipCPU’s WAIT instruction. We can get access to it from C without needing any assembly by calling the built-in zip_wait() function. This instruction sets the SLEEP bit in the ZipCPU’s control register. Further, if the ZipCPU isn’t already in user mode (interrupts are enabled only in user mode), it puts the ZipCPU into user mode. Then, when an interrupt trips, the ZipCPU will return to supervisor mode (where interrupts are disabled) so that we can process the interrupt.
The ZipCPU specification discusses how to do this using a wait_for_interrupt function. This function primarily deals with setting up the interrupt controller, but once done it issues a zip_wait() instruction for exactly this purpose.
Now we can rewrite our code from above to wait for an interrupt. Once done, the CPU will sleep between its top-of-the-millisecond computations. The biggest difference in the code below is that we now issue a wait_for_interrupt() call at the top of every loop. Following that call, things should look about the same as before.
Not bad, huh?
This program now functions exactly the same as the last one, save that the ZipCPU is inactive while waiting for the interrupt. This can have two advantages. First, the ZipCPU will stop using the bus, allowing any non-CPU logic to transfer data without contention from the ZipCPU. Second, since the ZipCPU will stop issuing instructions, it can be placed into a lower power state. Stopping the CPU clock at this point might even be an option to lower power–as long as any interrupt source kept clocking, and as long as the interrupt controller could restart the CPU’s clock. Still, it is doable, although it does depend upon how much work you want to do to keep your power down.
What if we wanted to get really fancy, though, and create a multitasking blinky? One where we had a separate program to toggle each of several LEDs?
Multi Blinky
To build the multi-tasking blinky, let’s step back and just look at the question of how to run multiple software programs on a piece of hardware.
The easiest way to do this, software-wise, might be to build multiple CPUs, as shown in Fig. 7 on the right. Easy, that is, until the two programs need to communicate with each other … but that’s a story for another day. Each CPU might then run a separate program toggling an LED.
Let’s build such a program now. We’ll start by removing the counter from our toggle_leds_one_ms() function. Instead, we’ll make the total number of milliseconds that have passed into a global variable. We’ll then adjust our toggle_leds_one_ms function so that it just reads and references this global milliseconds counter. We’ll call this new function toggleled.
Notice our care to only read the global milliseconds value once. Since reads are atomic and since we copied the milliseconds value, we don’t have to worry about it changing mid-routine as a result of any routine that might interrupt this processing.
A second thing to notice is that the design we are basing this off of has a Special Purpose I/O register, _spio. Setting LEDs in this _spio register is just like using the _gpio register above, save that the LED area is now the lower 8-bits, and the “adjust-these-bits” area is the 8-bits above that. Hence, our reference above to _spio and (led_bits<<8).
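To make the read-once point concrete, here is a small host-side sketch. It is not the article’s toggleled() function: it only computes the word that would be written to _spio, and the name spio_word along with its arguments are hypothetical.

```c
#include <stdint.h>

/* Global millisecond counter.  In the multi-tasking design this would
 * be updated by the supervisor on every timer interrupt. */
volatile uint32_t milliseconds;

/* Compute the word to write to the SPIO register for one LED:
 * the LED values live in the low 8 bits, and the mask selecting
 * which LEDs to adjust lives in the 8 bits above them. */
uint32_t spio_word(uint32_t led_bits, unsigned rate_hz)
{
	/* Read the shared counter exactly once.  Everything below
	 * works from this private copy, so an interrupt arriving
	 * mid-routine cannot change the answer half way through. */
	uint32_t now = milliseconds;
	unsigned on = (unsigned)(((uint64_t)now * rate_hz * 2u) / 1000u) & 1u;

	return (led_bits << 8) | (on ? led_bits : 0u);
}
```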
Finally, as long as something exists to count and keep the milliseconds counter for us, we can write a program to toggle our first LED at 1 Hz as simply as,
Now if each CPU on our circuit board had such a program, they could all toggle their LEDs together.
There are a couple of problems with this approach. First and foremost, the ZipCPU project has been built around small FPGAs and low logic. (I never really had the budget for much more.) Four CPUs, whether on four circuit boards, four chips on the same circuit board, or four CPUs within the same FPGA, have never been within my budget. I’d like to run each of these four separate programs on the same CPU instead.
The solution to this problem is to virtually switch CPUs over time, as shown in Fig. 8 below, through a process called time-sharing.
Here’s how it works: First, the CPU will start out running one program–we’ll call this context 1 or C1 for short. Then, after some period of time (we’ll call it a quantum and set it to 1ms), an interrupt will return the CPU to its supervisory task, S. The supervisory task will switch programs to the second context, C2, which will then run for the next quantum.
The key to the whole operation is that each “program” needs to believe that it owns the CPU. In order to support this, CPUs are typically designed so that their entire state is captured within a set of registers. This register set, defining what the CPU is up to, is called a context. Each context contains its own set of registers, R0-R15 on the ZipCPU, to include the stack pointer SP (also known as R13), condition codes CC (also known as R14), and the address of the next instruction, often called the program counter PC or equivalently R15 on the ZipCPU.
Therefore, in order to switch from running one program to another, all the CPU needs to do is to write the current context (i.e. register set) to memory, and then to read the stored context for the next program from memory. This is often called a context switch.
To facilitate context switching, the ZipCPU maintains two copies of its register set: one for the supervisor and one for user programs. Of these two, the supervisor context is never swapped. Indeed, it is the supervisor context that swaps user contexts.
Outside of the CPU, each context will also need an area of memory for its local variables. This is called the stack, and the stack pointer will point to a location within this memory–starting at the end of the memory. As memory is allocated, the stack pointer will grow towards lower memory as shown by the upwards arrow in Fig. 10.
We can use a simple array of integers to capture the information contained in a ZipCPU register set.
To swap between one program and the next, we’ll just swap register sets. The two programs will never know what happened, because when the next program is activated, all of its data will be right there in its registers just as it was when it left off.
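As a host-side model of that idea, the sketch below represents a context as an array of sixteen 32-bit values and swaps tasks by copying that array out of, and then back into, a stand-in for the user register set. On the ZipCPU itself the copies are done by toolchain built-ins acting on the real user registers. The array user_regs and the function names here are assumptions made so the example can run anywhere.

```c
#include <stdint.h>

#define NREGS 16	/* R0-R15: includes SP (R13), CC (R14), PC (R15) */

typedef struct {
	uint32_t r[NREGS];
} task_context;

/* Stand-in for the hardware user register set (host model only). */
static uint32_t user_regs[NREGS];

/* Save the running task: copy the user registers out to memory. */
void model_save_context(task_context *task)
{
	for (int k = 0; k < NREGS; k++)
		task->r[k] = user_regs[k];
}

/* Resume a task: copy its stored registers back in. */
void model_restore_context(const task_context *task)
{
	for (int k = 0; k < NREGS; k++)
		user_regs[k] = task->r[k];
}

/* A context switch is just a save followed by a restore. */
void swap_tasks(task_context *from, const task_context *to)
{
	model_save_context(from);
	model_restore_context(to);
}
```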
Let’s walk through how we might do this.
The first step will be to assign a stack to each task, as shown in Fig. 10 above. This is a place of memory designed to hold each task’s local variables.
Normally, I’d use malloc() to do this or even the C++ new operator. Today, we’re going to try to do this without the C-library. As a result, we’ll need an alternative. I’m going to call that alternative ugly_malloc(). It will act the same as malloc(), except that there’s no free() call and so the memory will never return.
This ugly_malloc() function is built around a pointer to the end of our program’s fixed memory locations. The AutoFPGA linker scripts define this pointer as _top_of_heap. We can grab that here,
Just getting to the point where _heap even gets this value takes a lot of work, much of which we will skip here. That work starts in the AutoFPGA configuration script that describes where the ZipCPU’s memory is even located on a particular hardware architecture. That script defines the _top_of_heap pointer and makes certain it’s aligned. The next step takes place in the bootloader, since _heap is a variable whose value might change. Such variables need to be set initially. So, this bootloader copies the initial value into _heap before starting our program.
Finally, we can now write this ugly_malloc() function. This function works by just grabbing the next nbytes from the heap, and then incrementing our _heap pointer to the next unallocated section of memory. Since the ZipCPU cannot (yet) read from unaligned memory, we’ll also need to make certain this pointer remains aligned.
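A bump allocator of this kind might be sketched as follows. So that it can run on a host, this version carves its memory out of a fixed array rather than starting at _top_of_heap, and heap_next plays the role of _heap. The array size and the bounds check are additions for the host model, not something taken from the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Host stand-in for the memory above _top_of_heap. */
static uint32_t heap_area[256];
static char *heap_next = (char *)heap_area;

/* Hand out the next nbytes of the heap.  There is no free(). */
void *ugly_malloc(size_t nbytes)
{
	char *result = heap_next;
	size_t remaining = sizeof(heap_area)
			- (size_t)(heap_next - (char *)heap_area);

	/* Host-only safety check: refuse requests that don't fit. */
	if (nbytes > remaining)
		return NULL;

	/* Round the size up to a multiple of four bytes so the
	 * next allocation also starts on a word boundary. */
	nbytes = (nbytes + 3u) & ~(size_t)3u;
	heap_next += nbytes;
	return result;
}
```

Rounding the size up, rather than rounding the pointer, keeps every returned address word aligned as long as the starting address was aligned.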
Using this ugly_malloc() function, we can now allocate some local variable space for each of our programs. This will be the “stack” space used by these programs, and so we’ll use this to set the SP register for each of the programs.
This function allocates a section of memory nbytes in length, and then returns a pointer to the end of it. Our tasks won’t actually write to this end value, but will instead back up this pointer by however much space they need as they need it. This is illustrated in Fig. 11 by an upwards arrow within each task’s stack memory area showing that the stack grows upwards towards low memory.
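The pointer arithmetic involved is small enough to model on a host. In this sketch, stack_top returns the address one past the end of a block and push_word shows how a task backs the pointer up before storing anything. Both names are hypothetical, and push_word is only a model of what compiled code does when it subtracts from SP and then stores.

```c
#include <stdint.h>

/* Initial stack pointer for a block of nbytes starting at base:
 * the address just past the end of the block. */
uint32_t *stack_top(void *base, unsigned nbytes)
{
	return (uint32_t *)((char *)base + nbytes);
}

/* Model of using the stack: back the pointer up first, then
 * store.  The very top address is therefore never written. */
uint32_t *push_word(uint32_t *sp, uint32_t value)
{
	sp--;
	*sp = value;
	return sp;
}
```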
The astute observer will notice that the stack spaces, which are illustrated in Fig. 10, both for the supervisor and each of the user contexts, are not unlimited. If they grow too far, the stack will overflow into other memory regions causing … lots of problems. Picking how much space to allocate for the stack is therefore quite important, and a problem for which I don’t (yet) have a good solution.
Since we’ve chosen not to use any system libraries for this code, we’ll need to write our own memzero routine so that we can start our tasks off with a clean slate. This memzero() (should-be a) library function is also pretty basic,
Much as one might expect.
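For reference, a byte-at-a-time routine of this sort could look like the sketch below. The exact signature is an assumption.

```c
/* Clear nbytes of memory starting at dest, one byte at a time. */
void memzero(void *dest, unsigned nbytes)
{
	char *p = (char *)dest;

	while (nbytes-- > 0)
		*p++ = 0;
}
```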
Incidentally, we could do this operation four times faster if we took advantage of the fact that our memory is both word aligned and an integer number of words. For now, we’ll leave this as an exercise for the student to try later.
Now that we have this background under our belt, it’s time to build our multi-tasking blinky.
Starting at the top, the CPU will begin by executing a bootloader. This function copies our program into memory (if necessary), and then initializes any global variables. Once complete, the bootloader will call our main() function–giving the appearance that this is where our program starts. Once in main(), we will first define all of our tasks (i.e. contexts), as well as a current task pointer to point to the task that is currently active. The tasks themselves in this simple example consist of nothing more than the task_context structure we defined above containing the register values of each running program.
I’ve also declared a heartbeats variable as well. We’ll use this to debug our program and to determine if anything has gone wrong and we need to enter into the debugger. Specifically, if ever the heartbeats counter stops counting, we’ll know our program is dead.
There are actually a couple of ways of implementing this heartbeats idea.

- One way, shown above, is to create a variable on the stack to hold this value.

  The problem with this approach is that the compiler might move this variable into a register. When it’s in a register, we’ll need to use the CPU’s debugging interface to read it in order to know if the heartbeats have ever stopped, rather than reading this value from memory using the debugging bus.

  This leads to two big problems. The first is knowing whether heartbeats is in a register vs being in memory. The second problem is knowing which register heartbeats is kept within. Until the ZipCPU supports a source-level debugger, this will require examining some (dis)assembly to see how the compiler allocated it. You can use zip-objdump -D intdemo > intdemo.txt to examine this (dis)assembly. Since I find myself doing this so often, there’s a make option in the Makefile, make intdemo.txt, which does almost exactly this. The make option puts some other information into intdemo.txt as well, so feel free to try it out yourself and see what you think.

- A second way to handle the heartbeats value would be to declare it as a volatile unsigned value. If we do that, the heartbeats value will be forced into local (stack) memory. We can then use our debugging bus to read it even while the CPU is running.

  The problem is, which memory address will heartbeats get placed into? Typically, as long as heartbeats is the first value declared in main(), it’ll always be at the same place on the stack, but finding this place the first time might take some work.

- A third option would be to declare heartbeats as a global variable. Were we to do this instead, wbregs has an option where, if given a map file, you can read this value by name.
But let’s get back to our program. We just created memory to hold NTASKS register sets (contexts). Now let’s give them some initial values. The most important values to provide are the stack pointer and the program counter.
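As a sketch of what that initialization involves, the model below fills in a context given an entry address and a top-of-stack address. Registers are modeled as plain 32-bit numbers so the example is host-runnable. The function name init_task and the choice to zero every other register (including CC) are assumptions.

```c
#include <stdint.h>

#define NTASKS 4

/* Register indices within a ZipCPU context. */
enum { REG_SP = 13, REG_CC = 14, REG_PC = 15 };

typedef struct {
	uint32_t r[16];
} task_context;

/* Give a task a clean slate: every register zero except for the
 * stack pointer (top of the task's stack) and the program counter
 * (address of the task's first instruction). */
void init_task(task_context *task, uint32_t entry_pc, uint32_t stack_top)
{
	for (int k = 0; k < 16; k++)
		task->r[k] = 0;
	task->r[REG_SP] = stack_top;
	task->r[REG_PC] = entry_pc;
}
```

On the ZipCPU, entry_pc would be the address of the task’s function and stack_top the pointer returned by the stack allocation step, both cast to integers.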
Once we’ve done all that, we’ve almost finished our startup processing. There are only a couple steps left.
The first is to choose to start the first task, and then to load the user registers from that task pointer. The ZipCPU toolchain provides a zip_restore_context() function to make things easier. This function expands into some code to copy the register values from the address given, in this case a pointer to current->r[0], into the user register set, uR0 through uPC.
Once done, we can then issue a “return-to-usermode” instruction, assembly RTU or zip_rtu() from C, to switch from supervisor to user mode.
Only, we can’t do that just yet. ZipCPU programs only exit user mode on interrupts, exceptions (i.e. faults), and traps (system calls). Since our program shouldn’t be creating any exceptions (if it works), and since we’re not issuing any system calls, we’ll need another way to grab control back from userspace: interrupts.
Let’s therefore set our timer to interrupt us every millisecond.
It’s now time to write our main logic loop. We’ll start out by clearing every other LED, and incrementing our heartbeats counter. We can then run the user task.
Once we issue the RTU instruction, the ZipCPU will switch to using the user-register set. Since everything is captured by this context, switching to this user-register set will feel like switching which program is running.
Unlike many other CPUs which have a single register set, the ZipCPU maintains the supervisor’s context while running in user mode. That means that the supervisor program counter, stack pointer, and indeed all of the supervisor registers are maintained until the CPU returns to supervisor mode from user mode. What that means is that, on an interrupt, the CPU will continue running this supervisor function where it left off.
The more traditional approach would be for the CPU to suddenly jump to an interrupt service routine. The address of such a routine would be kept in a special memory location that the CPU could look up and start from when the context switch needed to take place.
By just switching register sets, the ZipCPU is kind of unique in this way. I personally find it easier to write multi-tasking programs as a result.
On our return, the first thing we’ll do is set those LEDs we just cleared–every other LED is now set to indicate we are in supervisor mode. I’ve found this to be a really useful way of debugging what goes wrong when things don’t work: if these LEDs are on, the CPU is in supervisor mode, else it is in user mode.
We’ll turn these off again before we leave supervisor mode.
Our next step is to check the user-mode CC register to see if we left user mode as a result of some form of exception. If so, we’ll call a panic() function–more on that later. It’s important to enter the panic() function as soon as possible once we detect an exception, so that we can debug anything that went wrong with as little change to the system as possible.
At long last, it’s now finally time to check the interrupt controller to see if an interrupt has taken place. If so, we’ll increment the milliseconds counter that all of the tasks are using.
We’ll then want to acknowledge this interrupt, so we don’t get interrupted again until the next millisecond interrupt.
Our last step is to swap tasks. This is done in three parts. The first part copies the user mode registers into the task_context memory. In this case, since we’ve kept a pointer to our current task context this is fairly straightforward.
The second step is to decide which task to call next. This is often called “CPU scheduling” or “task scheduling”, and many articles have been written on this topic. We’ll just keep it simple here and move to the “next” task in our list. In hindsight, it might’ve been easier to maintain a current task_id index as well, but I’ll leave that to you to do.
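Both flavors of this round-robin choice are sketched below, one working from the current task pointer and one from a task index. The function names are hypothetical, and the context structure is repeated only so the example stands on its own.

```c
#include <stdint.h>

typedef struct {
	uint32_t r[16];
} task_context;

/* Pointer flavor: step to the next context in the array, wrapping
 * back to the first one after the last. */
task_context *next_task(task_context *tasks, int ntasks,
		task_context *current)
{
	current++;
	if (current >= &tasks[ntasks])
		current = &tasks[0];
	return current;
}

/* Index flavor: the same idea using a task_id instead. */
int next_task_id(int task_id, int ntasks)
{
	task_id++;
	if (task_id >= ntasks)
		task_id = 0;
	return task_id;
}
```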
The final step in a ZipCPU context switch is to restore the registers from the new “current” task back into the user register set.
Task swapping like this is more often an assembly function, and so these two builtin-function calls simply implement what would be those assembly instructions. They are easily identifiable from the ZipCPU disassembly since these are the only instructions that will reference the uR register set. If you are really interested in actually seeing their definition, you can find it in the GCC ZipCPU patchset. Here it is for saving the context, and here again for restoring the context.
The function basically works by reading four registers at a time from either memory or the user register set, then writing them to the user register set or memory. Indeed, the function is little more than a memory copy.
That leaves us with only one loose end to return to: the panic() function.
The purpose of this function is just to tell us that something has gone wrong, and we need to do some debugging. It also helps us start that process by helping us identify which problem has taken place. I mean, we’re writing a blinky function therefore it should be obvious if there’s a problem–the LEDs won’t blink like they should. But how to start debugging next?
- We’ve chosen to set certain LEDs to indicate we are in supervisor mode (interrupts disabled), and others to indicate we are in user mode (interrupts enabled).

  These supervisor mode indicator LEDs should blink so fast that they appear to be lit dimly. If they ever turn off, or start shining brightly, then we can identify which mode we were in when the CPU stopped.

- The purpose of the panic() function is to help us diagnose what happened to a broken subtask. In this case, if we just stop the LEDs from blinking, we might not be able to tell the difference from a CPU freeze above.

  Therefore we’ll blink all of our LEDs, either all on or all off together, to indicate a user exception took place.
This particular implementation of panic() uses a system power-up counter. This counter increases on every system clock, starting at power up, until the top bit is set. Once the top bit is set, the power up counter keeps that bit set and becomes a rolling 31-bit counter with the other bits.
We can grab bit 28 of this counter as an indication that we need to change the LEDs.
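The counter’s behavior and the bit-28 test can be modeled on a host as below. This is only a sketch: the real counter is a hardware peripheral, and the function names are hypothetical. The LED word follows the SPIO layout described earlier (LED values in the low 8 bits, the mask of LEDs to adjust in the 8 bits above).

```c
#include <stdint.h>

/* One clock tick of the power-up counter: count upwards, but once
 * the top bit has been set it stays set while the lower 31 bits
 * keep rolling over. */
uint32_t pwr_next(uint32_t count)
{
	return (count + 1u) | (count & 0x80000000u);
}

/* Panic pattern: all eight LEDs on while bit 28 of the counter is
 * set, all eight off while it is clear. */
uint32_t panic_spio_word(uint32_t pwrcount)
{
	return (pwrcount & (1u << 28)) ? 0xffffu : 0xff00u;
}
```

Bit 28 changes state every 2^28 clocks, so the actual blink rate depends on the system clock frequency.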
While we could’ve used the timer for this, understand that we are in a panic() situation. We want to leave the CPU state as un-changed as possible so that we can diagnose whatever fault took place. In particular, we’ll want to leave the user register set untouched. We also don’t know if the fault was associated with interrupt processing or task swapping or something else. For all of these reasons, this code has been kept as simple as possible.
Another thing we could’ve done in this panic() routine would’ve been to issue a simulation NHALT (hardware NOOP) instruction to halt any simulator at the fault itself. By turning tracing on and then running the simulator like this, it’s fairly easy to figure out what went wrong. (Easy, perhaps, but still pretty intense–examining a trace is not trivial.) Alternatively, we could’ve triggered any CPU-focused Wishbone Scope. This is typically how I debug the CPU if a bug makes it to hardware and the user register set doesn’t tell me enough to know the cause of any fault.
Video
Want to see how we did? Check out the video below.
In this video you’ll see a MAX-1000 board with 8 LEDs. Half of the LEDs are toggling at rates of 1Hz, 2Hz, 3Hz, and 5Hz. These are the far left LED, and every other LED to the right. The other four LEDs are shining dimly, indicating that the CPU truly is handling interrupts and swapping tasks as desired.
Conclusions
Since we’ve already covered both how to make an LED blink from Verilog, as well as from C, it only made sense that we’d discuss how to blink an LED in some more advanced fashions–such as by using an interrupt or from a multi-tasking program.
While blinking an LED might seem like an exceptionally trivial task, let’s consider what we’ve learned: We learned how easy it was to blink an LED from Verilog. Even synchronizing multiple blinking LEDs to within a clock period was fairly easy. Blinking an LED on a CPU has the advantage that it doesn’t typically take (much) more logic resources–but only after you’ve already paid for the CPU, its boot code, its bootloader, memory, the system bus and the interconnect/crossbar used to hold everything together.
Of course, this would be more valuable if we were doing something more than just blinking an LED. Still, we’ve demonstrated how to build an interrupt driven task, as well as how to split the CPU’s time across multiple independent task contexts, by using an interrupt to tell us when to switch contexts. Both of these capabilities are very powerful and can be used outside of a simple LED context.
Where the CPU starts to have an advantage over the FPGA fabric is where you need something to perform complex sequencing operations–such as performing a complicated startup script, or performing a complex script periodically. Depending on the complexity of the task, adding it to what a CPU is already doing might be cheaper than performing it in the fabric of the FPGA. At the same time, adding a CPU to an FPGA just to blink an LED is truly overkill.
For my thoughts are not your thoughts, neither are your ways my ways, saith the LORD. For as the heavens are higher than the earth, so are my ways higher than your ways, and my thoughts than your thoughts. (Is 55:8-9)