If you’ve never heard of “blinky” before, it’s the name given to a piece of software or even an FPGA design that simply blinks an LED. We’ve discussed building blinky before, as well as the more advanced (but no less fun) project of moving an active LED back and forth across a set of LEDs like one of my favorite TV shows as a kid, “Knight Rider”.
Even better, we’ve also discussed how to create a general purpose I/O controller, which could then be used to run a blinky program from within a CPU. In that particular article, we also measured how fast a CPU could toggle an I/O pin as part of such a blinky program. The resulting toggle rates, between 1 and 47MHz, are fairly impressive for a soft-core running within an FPGA with a 100MHz system clock.
Today, let’s return to blinky again, but this time let’s compare and contrast several approaches to the problem of toggling four separate LEDs: one at 1Hz, one at 2Hz, one at 3Hz, and a fourth one at 5Hz.
The easiest way to toggle four separate LEDs at once on an FPGA is to create four separate blinky modules. Each module would have a counter of some number of bits, say MSB+1 bits, so it might be defined as reg [MSB:0] counter;. We could then add a different step in each module, where the step size is 2^(MSB+1)/CLOCK_FREQUENCY * BLINK_FREQUENCY, and so create blinking LEDs at any frequency we want.
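To make the step-size arithmetic concrete, here is a small C model of one such counter. It is only a sketch of the idea and not the Verilog itself: it assumes a 32-bit counter (MSB = 31) and a 100MHz clock, and the names blinky_step and led_after are made up for this example.

```c
#include <stdint.h>

#define CLOCK_FREQUENCY 100000000u  /* assumed 100MHz system clock */

/* Step size: 2^(MSB+1) / CLOCK_FREQUENCY * BLINK_FREQUENCY, with MSB = 31.
 * Multiplying before dividing keeps the rounding error small. */
static uint32_t blinky_step(uint32_t blink_hz)
{
    return (uint32_t)((((uint64_t)1 << 32) * blink_hz) / CLOCK_FREQUENCY);
}

/* Model the counter after nclocks clock ticks and return its top bit,
 * which is the bit that would drive the LED. */
static unsigned led_after(uint32_t blink_hz, uint64_t nclocks)
{
    uint32_t counter = (uint32_t)(blinky_step(blink_hz) * nclocks);
    return counter >> 31;
}
```

With only 32 bits, the 1Hz step truncates from about 42.9 down to 42, so this model blinks roughly 2% slow. A wider counter shrinks that error.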
Internally, this might look something like Fig. 1 on the right. Fig. 1 shows four separate logic blocks, each similar to the block above, and each toggling an LED at its own rate.
Of course, this doesn’t tie our LEDs together in phase. What if we wanted all of them to have the same phase, so that they all turned on together at the top of a second?
In that case, we’d need to multiply a common counter, set to step at
2^(MSB+1)/CLOCK_FREQUENCY, by our blink frequency to get the result.
If you’ve never built a design like this before, then I would encourage you to try this. Remember to formally verify it first, and then run it in simulation. The tutorial should help you there if you have any questions.
For a next level challenge, consider removing the multiplies and replacing them with shifts and adds. Since we’re only multiplying by 2, 3, or 5 above, this should be fairly easy.
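For anyone wanting a hint on that challenge, the arithmetic might look like the following. It is written in C rather than Verilog, purely as a sketch of the idea, and the function names are invented for this example.

```c
#include <stdint.h>

/* Multiply by small constants using only shifts and adds.
 * Results wrap modulo 2^32, just as a fixed-width counter would. */
static uint32_t times2(uint32_t x) { return x << 1; }
static uint32_t times3(uint32_t x) { return (x << 1) + x; }
static uint32_t times5(uint32_t x) { return (x << 2) + x; }
```

The wrap-around is a feature here: only the top bit of each product matters, and it still toggles at the multiplied rate.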
If you work with FPGAs at all, this test should be fairly basic–perhaps even too easy. If you haven’t, this is a fun place to start.
Today, though, let’s take it up a notch.
CPU Blinky, polled
Moving from simple logic to a CPU is a big step within an FPGA. Even if we use a resource minimized CPU, such as the ZipCPU, you’ll still need a lot of additional infrastructure. At a minimum, you’ll need a ROM to store startup instructions (I like using a flash memory device), a RAM to hold any local variables, a timer to determine what 1Hz is and a GPIO controller to actually set any LED values. We’ll also use a Programmable Interrupt Controller (PIC) as part of our solution. Tying all of these together will require some type of memory and peripheral bus (I like Wishbone), such as is shown in Fig. 3 on the right.
In its simplest form, a CPU blinky consists of writing to the GPIO controller to turn an LED on, and then writing again to turn it off.
If you wanted to toggle an LED across some time period, you could wait in a for loop, and then toggle your LED–as shown below and in Fig. 4 on the left.
Getting the value for WAITTIME just right might take some work. Worse, caches are notorious for providing fast but unpredictable wait times for both instructions and data.
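A minimal sketch of such a loop might look like the code below. Several things here are assumptions made so the example can run anywhere: the GPIO register is passed in as a pointer, the WAITTIME value is a placeholder, the loop stops after a fixed number of toggles, and the register is assumed to take a mask of the bits to change in its upper 16 bits.

```c
#define WAITTIME 100000u  /* placeholder: must be calibrated on real hardware */

/* Burn time in an uncalibrated loop. The volatile keeps the compiler
 * from optimizing the loop away. */
static void wait_loop(unsigned count)
{
    for (volatile unsigned i = 0; i < count; i++)
        ;
}

/* Toggle one LED ntoggles times. A real blinky would loop forever. */
static void polled_blinky(volatile unsigned *gpio, unsigned led,
                          unsigned ntoggles)
{
    unsigned on = 0;

    for (unsigned k = 0; k < ntoggles; k++) {
        on = !on;
        /* Upper half selects the bit to change, lower half is its value */
        *gpio = (led << 16) | (on ? led : 0u);
        wait_loop(WAITTIME);
    }
}
```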
But how might we handle four LEDs each toggling at a different rate?
We could use a timer! Remember our work building an interval timer some time ago? Let’s use it now. We’ll suppose the address of this timer is known to our program, and we’ll have it create a repeating interval of one millisecond.
If the interrupt
has tripped, we know a millisecond has passed, so we can increment our
count_ms, and then toggle each of our LEDs.
One of the neat things about this approach compared to the uncalibrated
for loop above is that we can now know exactly how many clocks take
place between interrupts.
We also know that, should the CPU be late in updating an LED, the sequence will at least maintain its frequency rather than drifting later and later, since the timer restarts itself every millisecond in this case.
For now, let’s look at this toggle_leds_on_ms() function. How should this function work? The same way as before! We’ll multiply our counter by the toggle rate, divide by the number of counts per second, and then grab the bit of interest from the result.
Remember the way we constructed our GPIO controller: the top 16 bits on read are any inputs to our design, while the bottom 16 bits are outputs. To be able to set particular output bits and not others, we write a mask of the outputs we wish to adjust to the top 16 bits of the register. This is the reason for the (which_led << 16) logic. That way we can leave the other I/O registers alone, while adjusting only the ones we want to change.
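Here is one plausible way to write that arithmetic in C. Treat it as a sketch: the helper names led_is_on and gpio_word are invented for this example, and the choice to count half-periods, so that every LED turns on together at the top of each second, is my reading of the description rather than a known implementation.

```c
#define COUNTS_PER_SECOND 1000u  /* one count per millisecond */

/* Return nonzero when an LED blinking at rate_hz should be lit,
 * count_ms milliseconds after the top of a second. */
static unsigned led_is_on(unsigned count_ms, unsigned rate_hz)
{
    /* Number of half-periods elapsed so far within this second */
    unsigned half_periods =
        ((count_ms % COUNTS_PER_SECOND) * 2u * rate_hz) / COUNTS_PER_SECOND;

    return !(half_periods & 1u);  /* lit during the even half-periods */
}

/* Build the word to write to the GPIO register: a mask of the bits to
 * change in the top 16 bits, and their new values in the bottom 16. */
static unsigned gpio_word(unsigned which_led, unsigned on)
{
    return (which_led << 16) | (on ? which_led : 0u);
}
```

Reducing the counter modulo one second before multiplying keeps the product from overflowing, which works because every rate is a whole number of Hertz.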
Interrupt driven CPU Blinky
What if we could shut the CPU down, though, when nothing was changing?
This is the purpose of the ZipCPU’s WAIT instruction. We can get access to it from C without needing any assembly by calling the zip_wait() function. This instruction sets the SLEEP bit in the condition codes register. Further, if the ZipCPU isn’t already in user mode (interrupts are enabled only in user mode), it puts the CPU into user mode. Then, when an interrupt trips, the ZipCPU will return to supervisor mode (where interrupts are disabled) so that we can process the interrupt.
One way to do this is with a wait_for_interrupt() function. This function primarily deals with setting up the interrupt controller, but once done it issues a zip_wait() instruction for exactly this purpose.
Now we can rewrite our code from above to wait for an interrupt. Once done, the CPU will sleep between its top-of-the-millisecond computations. The biggest difference in the code below is that we now issue a wait_for_interrupt() call at the top of every loop. Following that call, things should look about the same as before.
Not bad, huh?
This program now functions exactly the same as the last one, save that the ZipCPU is inactive while waiting for the interrupt. This can have two advantages. First, the ZipCPU will stop using the bus, allowing any non-CPU logic to transfer data without contention from the ZipCPU. Second, since the ZipCPU will stop issuing instructions, it can be placed into a lower power state. Stopping the CPU clock at this point might even be an option to lower power–as long as any interrupt source kept clocking, and as long as the interrupt controller could restart the CPU’s clock. Still, it is doable, although it does depend upon how much work you want to do to keep your power down.
What if we wanted to get really fancy, though, and create a multitasking blinky? One where we had a separate program to toggle each of several various LEDs?
To build the multi-tasking blinky, let’s step back and just look at the question of how to run multiple software programs on a piece of hardware.
The easiest way to do this, software-wise, might be to build multiple CPUs, as shown in Fig. 7 on the right. Easy, that is, until the two programs need to communicate with each other … but that’s a story for another day. Each CPU might then run a separate program toggling an LED.
Let’s build such a program now. We’ll start by removing the counter from our toggle_leds_one_ms() function. Instead, we’ll make the total number of milliseconds that have passed into a global variable.
We’ll then write a new version of our toggle_leds_one_ms function that just reads and references this global milliseconds counter.
Notice our care to only read the global milliseconds value once. Since reads are atomic and since we copied the milliseconds value, we don’t have to worry about it changing mid-routine.
A second thing to notice is that the design we are basing this off of has a Special Purpose I/O register, _spio. Setting LEDs in this _spio register is just like using the GPIO register above, save that the LED area is now the lower 8-bits, and the “adjust-these-bits” area is the 8-bits above that. Hence, the mask is now shifted up by 8 bits rather than 16.
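Putting those two observations together, a sketch of such a function might look like this. The name toggle_led is hypothetical, spio_model is an ordinary variable standing in for the real _spio register so the example can run on any machine, and the half-period arithmetic is an assumption about how the LED state is derived.

```c
static volatile unsigned milliseconds;  /* incremented elsewhere, once per ms */
static volatile unsigned spio_model;    /* stand-in for the _spio register */

static void toggle_led(unsigned which_led, unsigned rate_hz)
{
    unsigned now = milliseconds;  /* read the shared counter exactly once */
    unsigned half_periods = ((now % 1000u) * 2u * rate_hz) / 1000u;
    unsigned on = !(half_periods & 1u);

    /* Mask of bits to adjust in bits 15:8, new LED values in bits 7:0 */
    spio_model = (which_led << 8) | (on ? which_led : 0u);
}
```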
Finally, as long as something exists to count and keep the milliseconds counter for us, a program to toggle our first LED at 1 Hz becomes quite simple.
Now if each CPU on our circuit board had such a program, they could all toggle their LEDs together.
There are a couple of problems with this approach. First and foremost, the ZipCPU project has been built around small FPGAs and low logic. (I never really had the budget for much more.) Four CPUs, whether on four circuit boards, four chips on the same circuit board, or four CPUs within the same FPGA, has never been within my budget. I’d like to run each of these four separate programs on the same CPU instead.
The solution to this problem is to virtually switch CPUs over time, as shown in Fig. 8 below, through a process called time-sharing.
Here’s how it works: First, the CPU will start out running one program–we’ll call this context 1 or C1 for short. Then, after some period of time, which we’ll call a quantum, an interrupt will return the CPU to its supervisory task. The supervisory task will switch programs to the second context, C2, which will then run for the next quantum.
The key to the whole operation is that each “program” needs to believe that it owns the CPU. In order to support this, CPUs are typically designed so that their entire state is captured within a set of registers. This register set, defining what the CPU is up to, is called a context. Each context contains its own set of registers, R0 through R15 on the ZipCPU, to include the stack pointer (also known as R13), condition codes (also known as R14), and the address of the next instruction, often called the program counter (R15 on the ZipCPU).
Therefore, in order to switch from running one program to another, all the CPU needs to do is to write the current context (i.e. the register set) to memory, and then read the stored context for the next program from memory. This is often called a context switch.
To facilitate context switching, the ZipCPU maintains two copies of its register set: one for the supervisor and one for user programs. Of these two, the supervisor context is never swapped. Indeed, it is the supervisor context that swaps user contexts.
Outside of the CPU, each context will also need an area of memory for its local variables. This is called the stack, and the stack pointer will point to a location within this memory–starting at the end of the memory. As memory is allocated, the stack pointer will grow towards lower memory as shown by the upwards arrow in Fig. 10.
To swap between one program and the next, we’ll just swap register sets. The two programs will never know what happened, because when the next program is activated, all of its data will be right there in its registers just as it was when it left off.
Let’s walk through how we might do this.
The first step will be to assign a stack to each task, as shown in Fig. 10 above. This is a place of memory designed to hold each task’s local variables.
Normally, I’d use malloc() to do this or even the C++ new operator. Today, we’re going to try to do this without the C-library. As a result, we’ll need an alternative. I’m going to call that alternative ugly_malloc(). It will act the same as malloc(), except that there’s no free() call and so the memory will never be returned.
Our ugly_malloc() function is built around a pointer to the end of our program’s fixed memory locations. The AutoFPGA linker scripts define this as _top_of_heap, which we can use as the initial value of our own _heap pointer.
Just getting to the point where _heap even gets this value takes a lot of work, much of which we will skip here. That work starts in the linker script that describes where the ZipCPU’s memory is even located on a particular hardware architecture. That script defines the _top_of_heap pointer and makes certain it’s aligned. The next step takes place in the bootloader. _heap is a variable whose value might change. Such variables need to be set initially. So, this bootloader copies the initial value into _heap before starting our program.
Finally, we can now write this ugly_malloc() function. This function works by just grabbing the next nbytes from the heap, and then incrementing our _heap pointer to the next unallocated section of memory. Since the ZipCPU cannot (yet) read from unaligned memory, we’ll also need to make certain this pointer remains aligned.
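The following is a host-side sketch of such an allocator. It is a model under stated assumptions rather than the real code: a static array stands in for the memory above _top_of_heap, heap_ptr plays the role of _heap, and there is no check for running off the end of the arena.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the memory above _top_of_heap */
static uint32_t arena[1024];
static char *heap_ptr = (char *)arena;  /* plays the role of _heap */

/* Hand out the next nbytes of the heap, rounded up to a multiple of
 * four bytes so every allocation stays word aligned. There is no
 * free(), so this memory is never returned. */
static void *ugly_malloc(size_t nbytes)
{
    void *result = heap_ptr;

    heap_ptr += (nbytes + 3u) & ~(size_t)3u;
    return result;
}
```

Rounding the size up, rather than fixing the pointer afterwards, means the pointer is aligned on every call as long as it started out aligned.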
With this ugly_malloc() function, we can now allocate some local variable space for each of our programs. This will be the “stack” used by these programs, and so we’ll use this to set the stack pointer for each of the programs.
This function allocates a section of memory nbytes in length, and then returns a pointer to the end of it. Our tasks won’t actually write to this end value, but will instead back up the stack pointer by however much space they need as they need it. This is illustrated in Fig. 11 by an upwards arrow within each task’s stack memory area showing that the stack grows upwards towards low memory.
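As a sketch of that end-of-block arithmetic, consider the two helpers below. Both names are invented for this example, and the block is assumed to start on a word boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Given the base of an nbytes-long block, return a pointer just past
 * its end, rounded down to a word boundary. This is the initial stack
 * pointer: nothing is ever stored at this address itself. */
static uint32_t *stack_top(void *base, size_t nbytes)
{
    return (uint32_t *)((char *)base + (nbytes & ~(size_t)3u));
}

/* Pushing a value backs the stack pointer up first, then stores.
 * The stack therefore grows towards lower addresses. */
static void push(uint32_t **sp, uint32_t value)
{
    *--(*sp) = value;
}
```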
The astute observer will notice that the stack spaces, which are illustrated in Fig. 10, both for the supervisor and each of the user contexts, are not unlimited. If they grow too far, the stack will overflow into other memory regions causing … lots of problems. Picking how much space to allocate for the stack is therefore quite important, and a problem for which I don’t (yet) have a good solution.
Since we’ve chosen not to use any system libraries for this code, we’ll need to write our own memzero routine so that we can start our tasks off with a clean slate. This memzero() (should-be a) library function is also fairly simple, much as one might expect.
Incidentally, we could do this operation four times faster if we took advantage of the fact that our memory is both word aligned and an integer number of words. For now, we’ll leave this as an exercise for the student to try later.
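For reference, here is a sketch of both versions. The signatures are my own guess at what such a routine would look like, and memzero_words is only one possible answer to the exercise. It assumes the pointer is word aligned and the length is a multiple of four bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Clear nbytes of memory, one byte at a time. */
static void memzero(void *dest, size_t nbytes)
{
    unsigned char *p = dest;

    while (nbytes-- > 0)
        *p++ = 0;
}

/* Clear whole words instead of bytes. Assumes dest is word aligned
 * and that nbytes is a multiple of four. */
static void memzero_words(void *dest, size_t nbytes)
{
    uint32_t *p = dest;

    for (size_t i = 0; i < nbytes / sizeof(uint32_t); i++)
        p[i] = 0;
}
```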
Now that we have this background under our belt, it’s time to build our multi-tasking blinky.
Starting at the top, the CPU will begin by executing a bootloader. This bootloader copies our program into memory (if necessary), and then initializes any global variables. Once complete, the bootloader will call our main() function–giving the appearance that this is where our program starts. Once in main(), we will first define all of our tasks (i.e. contexts), as well as a current task pointer to point to the task that is currently active. The tasks themselves in this simple example consist of nothing more than the task_context structure we defined above containing the register values of each running program.
I’ve also declared a
heartbeats variable as well. We’ll use this to debug
our program and to determine if anything has gone wrong and we need to enter
into the debugger.
Specifically, if ever the
heartbeats counter stops
counting, we’ll know our program is dead.
There are actually a couple of ways of implementing this heartbeats counter.
One way, shown above, is to create a variable on the stack to hold this value.
The first problem with this approach is that the compiler might move this variable into a register. When it’s in a register, we’ll need to use the CPU’s debugging interface to read it in order to know if the heartbeats have ever stopped, rather than reading this value from memory using the debugging bus.
There are two big problems with this approach. The first is knowing whether heartbeats is in a register vs being in memory. The second problem is knowing which register heartbeats is kept within. Until the ZipCPU supports a source-level debugger, this will require examining some (dis)assembly to see how the compiler allocated it. You can use zip-objdump -D intdemo > intdemo.txt to examine this (dis)assembly. Since I find myself doing this so often, there’s a make option in the Makefile to make intdemo.txt which does almost exactly this. The make option puts some other information into intdemo.txt as well, so feel free to try it out yourself and see what you think.
A second way to handle the heartbeats value would be to declare it as a volatile unsigned value. If we do that, the heartbeats value will be forced into local (stack) memory. We can then use our debugging bus to read it even while the CPU is running.
The problem is, which memory address will heartbeats get placed into? Typically, as long as heartbeats is the first value declared in main(), it’ll always be at the same place on the stack, but finding this place the first time might take some work.
A third option would be to declare heartbeats as a global variable.
Were we to do this instead, wbregs has an option where, if given a map file, you can read this value by name.
But let’s get back to our main() program. We just created memory to hold our register sets (contexts). Now let’s give them some initial values. The most important values to provide are the stack pointer and the program counter.
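A sketch of that initialization follows. The structure layout, the init_task name, and the placeholder task are all assumptions made for illustration. The registers are modeled as uintptr_t so the example also runs on a 64-bit host, where on the ZipCPU itself each register would simply be a 32-bit word.

```c
#include <stdint.h>

#define NREGS  16
#define REG_SP 13  /* stack pointer */
#define REG_PC 15  /* program counter */

/* Hypothetical layout: one saved word per user register */
typedef struct {
    uintptr_t regs[NREGS];
} task_context;

typedef void (*task_entry)(void);

/* Placeholder task body. A real task would toggle its LED forever. */
static void task_one(void)
{
    for (;;)
        ;
}

static void init_task(task_context *ctx, void *stack_top, task_entry entry)
{
    for (int i = 0; i < NREGS; i++)
        ctx->regs[i] = 0;                     /* start from a clean slate */
    ctx->regs[REG_SP] = (uintptr_t)stack_top; /* stack starts at the top */
    ctx->regs[REG_PC] = (uintptr_t)entry;     /* first instruction to run */
}
```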
Once we’ve done all that, we’ve almost finished our startup processing. There are only a couple steps left.
The first is to choose to start the first task, and then to load the user registers from that task pointer. The ZipCPU toolchain provides a zip_restore_context() function to make things easier. This function expands into some code to copy the register values from the address given, in this case a pointer to the current task’s context, into the user register set.
Once done, we can then issue a “return-to-usermode” instruction, RTU in assembly or zip_rtu() from C, to switch from supervisor to user mode.
Only, we can’t do that just yet. ZipCPU programs only exit user mode on interrupts, exceptions (i.e. faults), and traps (system calls). Since our program shouldn’t be creating any exceptions (if it works), and since we’re not issuing any system calls, we’ll need another way to grab control back from userspace: interrupts.
It’s now time to write our main logic loop. We’ll start out by clearing every other LED, and incrementing our heartbeats counter. We can then run the user task.
Once we issue the RTU instruction, the CPU will switch to using the user-register set. Since everything is captured by this context, switching to this user-register set will feel like switching which program is running.
Unlike many other CPUs which have a single register set, the ZipCPU maintains the supervisor’s context while running in user mode. That means that the supervisor program counter, stack pointer, and indeed all of the supervisor registers are maintained until the CPU returns to supervisor mode from user mode. What that means is that, on an interrupt, the CPU will continue running this supervisor function where it left off.
The more traditional approach would be for the CPU to suddenly jump to an interrupt service routine. The address of such a routine would be kept in a special memory location that the CPU could look up and start from when the context switch needed to take place.
By just switching register sets, the ZipCPU is kind of unique in this way. I personally find it easier to write multi-tasking programs as a result.
On our return, the first thing we’ll do is set those LEDs we just cleared–every other LED is now set to indicate we are in supervisor mode. I’ve found this to be a really useful way of debugging what goes wrong when things don’t work: if these LEDs are on, the CPU is in supervisor mode, else it is in user mode.
We’ll turn these off again before we leave supervisor mode.
Our next step is to check the user-mode condition codes register to see if we left user mode as a result of some form of exception. If so, we’ll call a panic() function–more on that later. It’s important to do this as soon as possible once we detect an exception, so that we can debug anything that went wrong with as little change to the system as possible.
Our last step is to swap tasks.
This is done in three parts. The first part copies the user mode registers into our task_context memory. In this case, since we’ve kept a pointer to our current task context this is fairly straightforward.
The second step is to decide which task to call next. This is often called “CPU scheduling” or “task scheduling”, and many articles have been written on this topic. We’ll just keep it simple here and move to the “next” task in our list. In hindsight, it might’ve been easier to maintain a current task_id index as well, but I’ll leave that to you to do.
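With a task_id index, moving to the “next” task reduces to a single line. This is a minimal sketch, assuming four tasks and a hypothetical next_task helper.

```c
#define NTASKS 4  /* assumed: one task per blinking LED */

/* Round-robin scheduling: run each task in turn, wrapping at the end. */
static unsigned next_task(unsigned task_id)
{
    return (task_id + 1u) % NTASKS;
}
```

Every task gets the same share of the CPU under this scheme, which is fine when all of the tasks do about the same amount of work.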
The third step restores the user registers from the context of the newly chosen task. Task swapping like this is more often an assembly function, and so these two builtin-function calls simply implement what would be those assembly instructions. They are easily identifiable from the ZipCPU disassembly since these are the only instructions that will reference the uR register set. If you are really interested in actually seeing their definition, you can find both the context save and the context restore in the GCC ZipCPU patchset.
The function basically works by reading four registers from either memory or the user register set, then writing them to the user register set or memory. Indeed, the function is little more than a memory copy.
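To illustrate the “little more than a memory copy” point, here is a host-side model of the two operations. It is not the compiler builtin: an ordinary array stands in for the user register set, and the function names and structure layout are assumptions for this example.

```c
#include <stdint.h>

#define NREGS 16

typedef struct {
    uint32_t regs[NREGS];  /* hypothetical layout: one word per register */
} task_context;

static uint32_t user_regs[NREGS];  /* stand-in for the user register set */

/* Copy the user registers out to memory, four at a time. */
static void save_context(task_context *ctx)
{
    for (int i = 0; i < NREGS; i += 4) {
        ctx->regs[i]     = user_regs[i];
        ctx->regs[i + 1] = user_regs[i + 1];
        ctx->regs[i + 2] = user_regs[i + 2];
        ctx->regs[i + 3] = user_regs[i + 3];
    }
}

/* Copy a stored context back into the user registers, four at a time. */
static void restore_context(const task_context *ctx)
{
    for (int i = 0; i < NREGS; i += 4) {
        user_regs[i]     = ctx->regs[i];
        user_regs[i + 1] = ctx->regs[i + 1];
        user_regs[i + 2] = ctx->regs[i + 2];
        user_regs[i + 3] = ctx->regs[i + 3];
    }
}
```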
That leaves us with only one loose end to return to: the panic() function.
The purpose of this function is just to tell us that something has gone wrong, and we need to do some debugging. It also helps us start that process by helping us identify which problem has taken place. I mean, we’re writing a blinky function, so it should be obvious if there’s a problem–the LEDs won’t blink like they should. But how to start debugging next?
We’ve chosen to set certain LEDs to indicate we are in supervisor mode (interrupts disabled), and others to indicate we are in user mode (interrupts enabled).
These supervisor mode indicator LEDs should blink so fast that they appear to be lit dimly. If they ever turn off, or start shining brightly, then we can identify which mode we were in when the CPU stopped.
The purpose of the panic() function is to help us diagnose what happened to a broken subtask. In this case, if we just stop the LEDs from blinking, we might not be able to tell the difference from a CPU freeze above.
Therefore we’ll blink all of our LEDs, either all on or all off together, to indicate a user exception took place.
This particular implementation of panic() uses a system power-up counter. This counter increases on every system clock, starting at power up, until the top bit is set. Once the top bit is set, the power up counter keeps that bit set and becomes a rolling 31-bit counter with the other bits.
We can grab bit 28 of this counter as an indication that we need to change the LEDs.
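Both halves of that description can be modeled in a few lines of C. This is a sketch with invented names: pwrcount_tick models one clock of the counter as described, and panic_leds builds an LED word assuming the mask sits in bits 15:8 with the LED values in bits 7:0.

```c
#include <stdint.h>

/* Model one clock tick of the power-up counter: count up until the top
 * bit sets, then keep that bit set while the low 31 bits keep rolling. */
static uint32_t pwrcount_tick(uint32_t count)
{
    uint32_t next = count + 1u;

    if (count & 0x80000000u)
        next |= 0x80000000u;
    return next;
}

/* All eight LEDs on, or all off together, following bit 28 of the
 * counter. The mask of LEDs to adjust is always all eight. */
static uint32_t panic_leds(uint32_t pwrcount)
{
    return (pwrcount & (1u << 28)) ? 0xffffu : 0xff00u;
}
```

Since panic_leds only reads the counter, a loop around it needs no timer and no interrupts.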
While we could’ve used the interval timer for this, understand that we are in a panic() situation. We want to leave the CPU state as unchanged as possible so that we can diagnose whatever fault took place. In particular, we’ll want to leave the user register set untouched. We also don’t know if the fault was associated with interrupt processing or task swapping or something else. For all of these reasons, this code has been kept as simple as possible.
Another thing we could’ve done in this panic() routine would’ve been to issue a simulation NHALT (hardware NOOP) instruction to halt any simulator at the fault itself. By turning tracing on and then running the simulator like this, it’s fairly easy to figure out what went wrong. (Easy, perhaps, but still pretty intense–examining a trace is not trivial.) Alternatively, we could’ve triggered any CPU-focused Wishbone scope. This is typically how I debug the CPU if a bug makes it to hardware and the user register set doesn’t tell me enough to know the cause of any fault.
Want to see how we did? Check out the video below.
In this video you’ll see a MAX-1000 board with eight LEDs. Half of the LEDs are toggling at rates of 1Hz, 2Hz, 3Hz, and 5Hz. These are the far left LED, and every other LED to the right. The other four LEDs are shining dimly, indicating that the CPU truly is handling interrupts and swapping tasks as desired.
Since we’ve already covered both how to make an LED blink from Verilog, as well as from C, it only made sense that we’d discuss how to blink an LED in some more advanced fashions–such as by using an interrupt or from a multi-tasking program.
While blinking an LED might seem like an exceptionally trivial task, let’s consider what we’ve learned: We learned how easy it was to blink an LED from Verilog. Even synchronizing multiple blinking LEDs to within a clock period was fairly easy. Blinking an LED on a CPU has the advantage that it doesn’t typically take (much) more logic resources–but only after you’ve already paid for the CPU, its boot code, its bootloader, memory, the system bus and the interconnect/crossbar used to hold everything together.
Of course, this would be more valuable if we were doing something more than just blinking an LED. Still, we’ve demonstrated how to build an interrupt driven task, as well as how to split the CPU’s time across multiple independent task contexts, by using an interrupt to tell us when to switch contexts. Both of these capabilities are very powerful and can be used outside of a simple LED context.
Where the CPU starts to have an advantage over the FPGA fabric is where you need something to perform complex sequencing operations–such as performing a complicated startup script, or performing a complex script periodically. Depending on the complexity of the task, adding it to what a CPU is already doing might be cheaper than performing it in the fabric of the FPGA. At the same time, adding a CPU to an FPGA just to blink an LED is truly overkill.
For my thoughts are not your thoughts, neither are your ways my ways, saith the LORD. For as the heavens are higher than the earth, so are my ways higher than your ways, and my thoughts than your thoughts. (Is 55:8-9)