If you’ve never heard of “blinky” before, it’s the name given to a piece of software or even an FPGA design that simply blinks an LED. We’ve discussed building blinky before, as well as the more advanced (but no less fun) project of moving an active LED back and forth across a set of LEDs like one of my favorite TV shows as a kid, “Knight Rider”.

Even better, we’ve also discussed how to create a general purpose I/O controller, which could then be used to run a blinky program from within a CPU. In that particular article, we also measured how fast a CPU could toggle an I/O pin as part of such a blinky program. The resulting toggle rates, between 1 and 47MHz, are fairly impressive for a soft-core running within an FPGA with a 100MHz system clock.

Today, let’s return to blinky again, but this time let’s compare and contrast several approaches to the problem of toggling four separate LEDs: one at `1Hz`, one at `2Hz`, one at `3Hz`, and a fourth one at `5Hz`.

The easiest way to toggle four separate LEDs at once on an FPGA is to create four separate blinky modules. Each module would have a counter of some number of bits, say `MSB+1` bits, so it might be defined as `reg [MSB:0] counter;`. We could then add a different step in each module, where the step size is `2^(MSB+1)/CLOCK_FREQUENCY * BLINK_FREQUENCY`, and so create blinking LEDs at any frequency we want.
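To make the step-size arithmetic concrete, here’s a minimal host-side sketch in C rather than Verilog. It assumes a 32-bit counter (so `MSB` is 31), and the names `blinky_step()` and `blinky_tick()` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Step added to a 32-bit counter on every clock so that the counter's
 * top bit completes blink_hz on/off cycles per second (MSB = 31 here). */
uint32_t blinky_step(uint32_t clock_hz, uint32_t blink_hz)
{
    assert(clock_hz != 0);
    /* 2^(MSB+1) / CLOCK_FREQUENCY * BLINK_FREQUENCY, in 64-bit integer math */
    return (uint32_t)((((uint64_t)blink_hz) << 32) / clock_hz);
}

/* One clock tick: advance the counter and return the LED value. */
int blinky_tick(uint32_t *counter, uint32_t step)
{
    assert(counter != 0);
    *counter += step;               /* wraps modulo 2^32, like the register would */
    return (int)(*counter >> 31);   /* the LED is the counter's top bit */
}
```

At 100MHz, the 1Hz step works out to 42 after truncation (the exact value is about 42.95), so this LED would blink a bit slow, at roughly 0.98Hz. A wider counter shrinks that error.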

Internally, this might look something like Fig. 1 on the right. Fig. 1 shows four separate logic blocks, each similar to the block above, and each toggling an LED at its own rate.

Of course, this doesn’t tie our LEDs together in phase. What if we wanted all of them to have the same phase, so that they all turned on together at the top of a second?

In that case, we’d need to multiply a common counter, set to step at `2^(MSB+1)/CLOCK_FREQUENCY`, by our blink frequency to get the result.
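Here’s a rough C model of that idea, again as a host-side sketch and not the Verilog itself. It assumes a 32-bit common counter that wraps once per second; `phase_locked_led()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Given the common once-per-second counter, return the LED value for
 * one blink rate. The multiply wraps modulo 2^32, which is exactly
 * what keeps every LED locked to the same phase reference. */
int phase_locked_led(uint32_t common_counter, uint32_t blink_hz)
{
    assert(blink_hz != 0);
    uint32_t phase = common_counter * blink_hz;
    return (int)(phase >> 31);      /* top bit of the scaled phase */
}
```

Since every LED is derived from the same counter, all of them change state together whenever that counter wraps.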

If you’ve never built a design like this before, then I would encourage you to try this. Remember to formally verify it first, and then run it in simulation. The tutorial should help you there if you have any questions.

For a next level challenge, consider removing the multiplies and replacing them with shifts and adds. Since we’re only multiplying by 2, 3, or 5 above, this should be fairly easy.

If you work with FPGAs at all, this test should be fairly basic–perhaps even too easy. If you haven’t, this is a fun place to start.

Today, though, let’s take it up a notch.

Moving from simple logic to a CPU is a big step within an FPGA. Even if we use a resource minimized CPU, such as the ZipCPU, you’ll still need a lot of additional infrastructure. At a minimum, you’ll need a ROM to store startup instructions (I like using a flash memory device), a RAM to hold any local variables, a timer to determine what 1Hz is and a GPIO controller to actually set any LED values. We’ll also use a Programmable Interrupt Controller (PIC) as part of our solution. Tying all of these together will require some type of memory and peripheral bus (I like Wishbone), such as is shown in Fig. 3 on the right.

When we last discussed blinking an LED from a CPU, we discussed how to turn the LED on by writing to a particular register from a C program,

and then writing again to turn it off.

If you wanted to toggle an LED across some time period, you could wait in a for loop, and then toggle your LED–as shown below and in Fig. 4 on the left.

Getting the value for WAITTIME just right might take some work. Worse, caches are notorious for providing fast but unpredictable wait times for both instructions and data.

But how might we handle four LEDs, each toggling at a different rate?

We could use a timer! Remember our work building an interval timer some time ago? Let’s use it now. We’ll suppose the address of this timer is kept in the constant pointer `_timer`, and we’ll have it create a repeating interval of one millisecond.

We can then use our interrupt controller, with a register address at `_pic`, to determine if this timer has tripped:

If the interrupt has tripped, we know a millisecond has passed, so we can increment our millisecond counter, `count_ms`, and then toggle each of our LEDs.

Of course, polling the interrupt controller like this is really the wrong way to do this, but let’s come back to this thought in the next section when we discuss how to do this with interrupts.

One of the neat things about this approach compared to the uncalibrated `for` loop above is that we can now know exactly how many clocks take place between interrupts. We also know that, should the CPU be late in processing an LED update, the sequence will at least maintain its frequency rather than drifting later and later, and so slower, since the timer restarts itself every millisecond.

For now, let’s look at this `toggle_leds_on_ms()` function. How should this function work? The same way as before! We’ll multiply our counter by the toggle rate, divide by the number of counts per second, and then grab the resulting bit of interest from the number of times the counter wraps.
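One way to read that description is sketched below. This is my interpretation, not necessarily the original routine: it counts half-periods, so that the LED completes `rate_hz` full on/off cycles per second. The name `led_is_on()` and the constant `COUNTS_PER_SECOND` are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define COUNTS_PER_SECOND 1000u     /* one timer interrupt per millisecond */

/* Return 1 if an LED blinking at rate_hz should be lit at count_ms. */
int led_is_on(uint32_t count_ms, uint32_t rate_hz)
{
    assert(rate_hz != 0);
    /* Half-periods elapsed so far: there are two per blink cycle */
    uint32_t half_periods = (2u * rate_hz * count_ms) / COUNTS_PER_SECOND;
    return (int)(half_periods & 1u);    /* odd half-period: LED on */
}
```

Note that the 32-bit product will eventually overflow (after several days at 5Hz), producing a one-time glitch in the blink pattern.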

Remember the way we constructed our GPIO controller: the top 16 bits on read are any inputs to our design, while the bottom 16 bits are outputs. To be able to set particular output bits and not others, we write a mask of the outputs we wish to adjust to the top 16 bits of the register–hence the `(which_led << 16)` logic from above. That way we can leave the other I/O registers alone, while adjusting only the ones we want to change.
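The masked-write convention can be modeled on a host machine as follows. This is only a sketch of the behavior described above; `gpio_led_command()` and `gpio_model_write()` are hypothetical helpers, and the second one merely imitates what the controller would do with the written word.

```c
#include <assert.h>
#include <stdint.h>

/* Build the word to write: mask in the top 16 bits, new values below. */
uint32_t gpio_led_command(uint32_t which_led, int on)
{
    assert((which_led & ~0xffffu) == 0);    /* only 16 output bits exist */
    return (which_led << 16) | (on ? which_led : 0u);
}

/* Host-side model of the controller's response to such a write:
 * only the output bits named in the mask are changed. */
uint32_t gpio_model_write(uint32_t outputs, uint32_t command)
{
    uint32_t mask = command >> 16;
    assert((outputs & ~0xffffu) == 0);
    return ((outputs & ~mask) | (command & mask)) & 0xffffu;
}
```

For example, turning on LED bit 2 while outputs 0, 1, and 3 are already set leaves those three untouched.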

What if we could shut the CPU down, though, when nothing was changing?

This is the purpose of the ZipCPU’s `WAIT` instruction. We can get access to it from C without needing any assembly by calling the built-in `zip_wait()` function. This instruction sets the `SLEEP` bit in the ZipCPU’s control register. Further, if the ZipCPU isn’t already in user mode (interrupts are enabled only in user mode), it puts the ZipCPU into user mode. Then, when an interrupt trips, the ZipCPU will return to supervisor mode (where interrupts are disabled) so that we can process the interrupt.

The ZipCPU specification discusses how to do this using a `wait_for_interrupt` function. This function primarily deals with setting up the interrupt controller, but once done it issues a `zip_wait()` instruction for exactly this purpose.

Now we can rewrite our code from above to wait for an interrupt. Once done, the CPU will sleep between its top-of-the-millisecond computations. The biggest difference in the code below is that we now issue a `wait_for_interrupt()` call at the top of every loop. Following that call, things should look about the same as before.

This program now functions exactly the same as the last one, save that the ZipCPU is inactive while waiting for the interrupt. This can have two advantages. First, the ZipCPU will stop using the bus, allowing any non-CPU logic to transfer data without contention from the ZipCPU. Second, since the ZipCPU will stop issuing instructions, it can be placed into a lower power state. Stopping the CPU clock at this point might even be an option to lower power–as long as any interrupt source kept clocking, and as long as the interrupt controller could restart the CPU’s clock. Still, it is doable, although it does depend upon how much work you want to do to keep your power down.

What if we wanted to get really fancy, though, and create a multitasking blinky? One where we had a separate program to toggle each of several LEDs?

To build the multi-tasking blinky, let’s step back and just look at the question of how to run multiple software programs on a piece of hardware.

The easiest way to do this, software-wise, might be to build multiple CPUs, as shown in Fig. 7 on the right. Easy, that is, until the two programs need to communicate with each other … but that’s a story for another day. Each CPU might then run a separate program toggling an LED.

Let’s build such a program now. We’ll start by removing the counter from our `toggle_leds_one_ms()` function. Instead, we’ll make the total number of milliseconds that have passed into a global variable.

We’ll then adjust our `toggle_leds_one_ms` function so that it just reads and references this global `milliseconds` counter. We’ll call this new function `toggleled`.

Notice our care to only read the global `milliseconds` value once. Since reads are atomic and since we copied the `milliseconds` value, we don’t have to worry about it changing mid-routine as a result of any routine that might interrupt this processing.

A second thing to notice is that the design we are basing this off of has a Special Purpose I/O register, `_spio`. Setting LEDs in this `_spio` register is just like using the `_gpio` register above, save that the LED area is now the lower 8-bits, and the “adjust-these-bits” area is the 8-bits above that. Hence, our reference above to `_spio` and `(led_bits<<8)`.
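To illustrate both points, the single read of `milliseconds` and the `_spio` bit layout, here’s a host-side sketch. It returns the word that would be written to `_spio` instead of touching any hardware. The function name `spio_led_word()` and the half-period arithmetic are my own assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Global tick count, incremented elsewhere once per timer interrupt. */
volatile uint32_t milliseconds;

/* Compute the word to write to the SPIO register for one LED. */
uint32_t spio_led_word(uint32_t led_bits, uint32_t rate_hz)
{
    assert((led_bits & ~0xffu) == 0);   /* LEDs live in the lower 8 bits */
    uint32_t now = milliseconds;        /* read once: one consistent value */
    uint32_t half_periods = (2u * rate_hz * now) / 1000u;
    uint32_t value = (half_periods & 1u) ? led_bits : 0u;
    /* "Adjust-these-bits" mask goes in the 8 bits above the LED values */
    return (led_bits << 8) | value;
}
```

Everything after the first line of the body works from the local copy `now`, so it doesn’t matter if the global changes part way through.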

Finally, as long as something exists to count and keep the `milliseconds` counter for us, we can write a program to toggle our first LED at 1 Hz as simple as,

Now if each CPU on our circuit board had such a program, they could all toggle their LEDs together.

There are a couple of problems with this approach. First and foremost, the ZipCPU project has been built around small FPGAs and low logic. (I never really had the budget for much more.) Four CPUs, whether on four circuit boards, four chips on the same circuit board, or four CPUs within the same FPGA, have never been within my budget. I’d like to run each of these four separate programs on the same CPU instead.

The solution to this problem is to virtually switch CPUs over time, as shown in Fig. 8 below, through a process called time-sharing.

Here’s how it works: First, the CPU will start out running one program–we’ll call this context 1 or `C1` for short. Then, after some period of time (we’ll call it a quantum and set it to `1ms`), an interrupt will return the CPU to its supervisory task, `S`. The supervisory task will switch programs to the second context, `C2`, which will then run for the next quantum.

The key to the whole operation is that each “program” needs to believe that it owns the CPU. In order to support this, CPUs are typically designed so that their entire state is captured within a set of registers. This register set, defining what the CPU is up to, is called a context. Each context contains its own set of registers, `R0`-`R15` on the ZipCPU, to include the stack pointer `SP` (also known as `R13`), condition codes `CC` (also known as `R14`), and the address of the next instruction, often called the program counter `PC` or equivalently `R15` on the ZipCPU. Therefore, in order to switch from running one program to another all the CPU needs to do is to write the current context (i.e. register set) to memory, and then to read the stored context for the next program from memory. This is often called a context switch.

To facilitate context switching, the ZipCPU maintains two copies of its register set: one for the supervisor and one for user programs. Of these two, the supervisor context is never swapped. Indeed, it is the supervisor context that swaps user contexts.

Outside of the CPU, each context will also need an area of memory for its local variables. This is called the stack, and the stack pointer will point to a location within this memory–starting at the end of the memory. As memory is allocated, the stack pointer will grow towards lower memory as shown by the upwards arrow in Fig. 10.

We can use a simple array of integers to capture the information contained in a ZipCPU register set.
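A minimal sketch of such a structure might look like the following. The register indices come from the description above (`SP` is `R13`, `CC` is `R14`, `PC` is `R15`); the `REG_*` macro names and the `task_reg()` accessor are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define NREGS  16
#define REG_SP 13   /* stack pointer, R13 */
#define REG_CC 14   /* condition codes, R14 */
#define REG_PC 15   /* program counter, R15 */

/* One saved register set: everything needed to resume a task. */
typedef struct task_context {
    int r[NREGS];
} task_context;

/* Illustrative accessor with bounds checking. */
int *task_reg(task_context *task, int regnum)
{
    assert(task != NULL && regnum >= 0 && regnum < NREGS);
    return &task->r[regnum];
}
```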

To swap between one program and the next, we’ll just swap register sets. The two programs will never know what happened, because when the next program is activated, all of its data will be right there in its registers just as it was when it left off.

Let’s walk through how we might do this.

The first step will be to assign a stack to each task, as shown in Fig. 10 above. This is a place of memory designed to hold each task’s local variables.

Normally, I’d use `malloc()` to do this or even the C++ `new` operator. Today, we’re going to try to do this without the C-library. As a result, we’ll need an alternative. I’m going to call that alternative `ugly_malloc()`. It will act the same as `malloc()`, except that there’s no `free()` call and so the memory will never return.

This `ugly_malloc()` function is built around a pointer to the end of our program’s fixed memory locations. The AutoFPGA linker scripts define this pointer as `_top_of_heap`. We can grab that here,

Just getting to the point where `_heap` even gets this value takes a lot of work, much of which we will skip here. That work starts in the AutoFPGA configuration script that describes where the ZipCPU’s memory is even located on a particular hardware architecture. That script defines the `_top_of_heap` pointer and makes certain it’s aligned. The next step takes place in the bootloader, since `_heap` is a variable whose value might change. Such variables need to be set initially. So, this bootloader copies the initial value into `_heap` before starting our program.

Finally, we can now write this `ugly_malloc()` function. This function works by just grabbing the next `nbytes` from the heap, and then incrementing our `_heap` pointer to the next unallocated section of memory. Since the ZipCPU cannot (yet) read from unaligned memory, we’ll also need to make certain this pointer remains aligned.
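Here’s a host-runnable sketch of such a bump allocator. A static array stands in for the memory above `_top_of_heap`, and `heap_ptr` plays the role of the `_heap` pointer; both substitutions are assumptions made so the example can run anywhere. The overflow check is also my addition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host stand-in for the free memory above _top_of_heap (word aligned). */
static uint32_t heap_area[1024];
static char *heap_ptr = (char *)heap_area;

/* Hand out the next nbytes of the heap. There is no free(). */
void *ugly_malloc(size_t nbytes)
{
    /* Round up to a multiple of 4 so the next allocation stays aligned */
    size_t rounded = (nbytes + 3u) & ~(size_t)3u;
    size_t remaining = (size_t)((char *)heap_area + sizeof(heap_area) - heap_ptr);
    assert(rounded <= remaining);   /* sketch only: refuse to run off the end */

    void *result = heap_ptr;
    heap_ptr += rounded;
    return result;
}
```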

Using this `ugly_malloc()` function, we can now allocate some local variable space for each of our programs. This will be the “stack” space used by these programs, and so we’ll use this to set the `SP` register for each of the programs.

This function allocates a section of memory `nbytes` in length, and then returns a pointer to the end of it. Our tasks won’t actually write to this end value, but will instead back up this pointer by however much space they need as they need it. This is illustrated in Fig. 11 by an upwards arrow within each task’s stack memory area showing that the stack grows upwards towards low memory.
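As a sketch of the idea, the following allocates a block and hands back the address one past its end. To keep the example self-contained on a host machine, the C library’s `malloc()` stands in for `ugly_malloc()`, and the name `alloc_stack_top()` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate nbytes of stack and return the initial stack pointer value,
 * which is the address just past the end of the block. */
void *alloc_stack_top(size_t nbytes)
{
    assert(nbytes > 0 && (nbytes & 3u) == 0);   /* keep SP word aligned */
    char *base = malloc(nbytes);    /* stand-in for ugly_malloc() */
    if (base == NULL)
        return NULL;
    return base + nbytes;           /* the stack grows down from here */
}
```

A task pushing data onto its stack would first decrement this pointer and then write, so the returned address itself is never written.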

The astute observer will notice that the stack spaces, which are illustrated in Fig. 10, both for the supervisor and each of the user contexts, are not unlimited. If they grow too far, the stack will overflow into other memory regions causing … lots of problems. Picking how much space to allocate for the stack is therefore quite important, and a problem for which I don’t (yet) have a good solution.

Since we’ve chosen not to use any system libraries for this code, we’ll need to write our own `memzero` routine so that we can start our tasks off with a clean slate. This `memzero()` (should-be a) library function is also pretty basic,

Much as one might expect.
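For reference, a byte-at-a-time version might look like the sketch below. The exact signature is an assumption on my part.

```c
#include <assert.h>
#include <stddef.h>

/* Set nbytes of memory, starting at dst, to zero, one byte at a time. */
void memzero(void *dst, size_t nbytes)
{
    assert(dst != NULL || nbytes == 0);
    unsigned char *p = dst;
    while (nbytes-- > 0)
        *p++ = 0;
}
```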

Incidentally, we could do this operation four times faster if we took advantage of the fact that our memory is both word aligned and an integer number of words. For now, we’ll leave this as an exercise for the student to try later.

Now that we have this background under our belt, it’s time to build our multi-tasking blinky.

Starting at the top, the CPU will begin by executing a bootloader. This function copies our program into memory (if necessary), and then initializes any global variables. Once complete, the bootloader will call our `main()` function–giving the appearance that this is where our program starts. Once in `main()`, we will first define all of our tasks (i.e. contexts), as well as a `current` task pointer to point to the task that is currently active. The tasks themselves in this simple example consist of nothing more than the `task_context` structure we defined above containing the register values of each running program.

I’ve also declared a `heartbeats` variable as well. We’ll use this to debug our program and to determine if anything has gone wrong and we need to enter into the debugger. Specifically, if ever the `heartbeats` counter stops counting, we’ll know our program is dead.

There’s actually a couple ways of implementing this `heartbeats` idea.

1. One way, shown above, is to create a variable on the stack to hold this value.

The problem with this approach is that the compiler might move this variable into a register. When it’s in a register, we’ll need to use the CPU’s debugging interface to read it in order to know if the `heartbeats` have ever stopped, rather than reading this value from memory using the debugging bus.

That leaves two big problems. The first is knowing whether `heartbeats` is in a register vs being in memory. The second problem is knowing which register `heartbeats` is kept within. Until the ZipCPU supports a source-level debugger, this will require examining some (dis)assembly to see how the compiler allocated it. You can use `zip-objdump -D intdemo > intdemo.txt` to examine this (dis)assembly. Since I find myself doing this so often, there’s a `make` option in the Makefile to `make intdemo.txt` which does almost exactly this. The `make` option puts some other information into `intdemo.txt` as well, so feel free to try it out yourself and see what you think.

2. A second way to handle the `heartbeats` value would be to declare it as a `volatile unsigned` value. If we do that, the `heartbeats` value will be forced into local (stack) memory. We can then use our debugging bus to read it even while the CPU is running.

The problem is, which memory address will `heartbeats` get placed into? Typically, as long as `heartbeats` is the first value declared in `main()`, it’ll always be at the same place on the stack, but finding this place the first time might take some work.

3. A third option would be to declare `heartbeats` as a global variable.

Were we to do this instead, `wbregs` has an option where, if given a map file, you can read this value by name.

But let’s get back to our program. We just created memory to hold `NTASKS` register sets (contexts). Now let’s give them some initial values. The most important values to provide are the stack pointer and the program counter.
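A host-side sketch of that initialization is shown below. Since host pointers may be wider than the ZipCPU’s 32-bit registers, this model stores registers as `uintptr_t`; that substitution, along with the names `host_task_context`, `init_task()`, and `idle_task()`, is an assumption for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define REG_SP 13   /* stack pointer, R13 */
#define REG_PC 15   /* program counter, R15 */

/* Host model of a register set; real ZipCPU registers are 32 bits wide. */
typedef struct { uintptr_t r[16]; } host_task_context;

typedef void (*task_entry)(void);

/* Placeholder task body; a real task would loop forever blinking an LED. */
void idle_task(void) { }

/* Start from a clean slate, then set the two registers that matter most. */
void init_task(host_task_context *task, task_entry entry, void *stack_top)
{
    assert(task != NULL && entry != NULL && stack_top != NULL);
    memset(task, 0, sizeof(*task));
    task->r[REG_SP] = (uintptr_t)stack_top; /* where its stack starts */
    task->r[REG_PC] = (uintptr_t)entry;     /* first instruction to run */
}
```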

Once we’ve done all that, we’ve almost finished our startup processing. There are only a couple steps left.

The first is to choose to start the first task, and then to load the user registers from that task pointer. The ZipCPU toolchain provides a `zip_restore_context()` function to make things easier. This function expands into some code to copy the register values from the address given, in this case a pointer to `current->r[0]`, into the user register set, `uR0` through `uPC`.

Once done, we can then issue a “return-to-usermode” instruction, assembly `RTU` or `zip_rtu()` from C, to switch from supervisor to user mode.

Only, we can’t do that just yet. ZipCPU programs only exit user mode on interrupts, exceptions (i.e. faults), and traps (system calls). Since our program shouldn’t be creating any exceptions (if it works), and since we’re not issuing any system calls, we’ll need another way to grab control back from userspace: interrupts.

Let’s therefore set our timer to interrupt us every millisecond.

It’s now time to write our main logic loop. We’ll start out by clearing every other LED, and incrementing our `heartbeats` counter. We can then run the user task.

Once we issue the `RTU` instruction, the ZipCPU will switch to using the user-register set. Since everything is captured by this context, switching to this user-register set will feel like switching which program is running.

Unlike many other CPUs which have a single register set, the ZipCPU maintains the supervisor’s context while running in user mode. That means that the supervisor program counter, stack pointer, and indeed all of the supervisor registers are maintained until the CPU returns to supervisor mode from user mode. What that means is that, on an interrupt, the CPU will continue running this supervisor function where it left off.

The more traditional approach would be for the CPU to suddenly jump to an interrupt service routine. The address of such a routine would be kept in a special memory location that the CPU could look up and start from when the context switch needed to take place.

By just switching register sets, the ZipCPU is kind of unique in this way. I personally find it easier to write multi-tasking programs as a result.

On our return, the first thing we’ll do is set those LEDs we just cleared–every other LED is now set to indicate we are in supervisor mode. I’ve found this to be a really useful way of debugging what goes wrong when things don’t work: if these LEDs are on, the CPU is in supervisor mode, else it is in user mode.

We’ll turn these off again before we leave supervisor mode.

Our next step is to check the user-mode `CC` register to see if we left user mode as a result of some form of exception. If so, we’ll call a `panic()` function–more on that later. It’s important to enter the `panic()` function as soon as possible once we detect an exception, so that we can debug anything that went wrong with as little change to the system as possible.

At long last, it’s now finally time to check the interrupt controller to see if an interrupt has taken place. If so, we’ll increment the `milliseconds` counter that all of the tasks are using.

We’ll then want to acknowledge this interrupt, so we don’t get interrupted again until the next millisecond interrupt.

Our last step is to swap tasks. This is done in three parts. The first part copies the user mode registers into the `task_context` memory. In this case, since we’ve kept a pointer to our `current` task context this is fairly straightforward.

The second step is to decide which task to call next. This is often called “CPU scheduling” or “task scheduling”, and many articles have been written on this topic. We’ll just keep it simple here and move to the “next” task in our list. In hindsight, it might’ve been easier to maintain a current `task_id` index as well, but I’ll leave that to you to do.

The final step in a ZipCPU context switch is to restore the registers from the new “current” task back into the user register set.
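The three steps can be modeled on a host machine as follows. In this sketch an ordinary array, `user_regs`, stands in for the user register set, and plain `memcpy()` calls stand in for the save and restore operations; `switch_task()` is a hypothetical name, and none of this is the actual ZipCPU code.

```c
#include <assert.h>
#include <string.h>

#define NTASKS 4
#define NREGS  16

typedef struct task_context { int r[NREGS]; } task_context;

/* Host stand-in for the user register set, uR0 through uPC. */
int user_regs[NREGS];

/* Save the running task, pick the next one round-robin, restore it. */
task_context *switch_task(task_context tasks[], task_context *current)
{
    assert(current >= tasks && current < tasks + NTASKS);

    memcpy(current->r, user_regs, sizeof(user_regs));   /* 1: save */

    current++;                                          /* 2: schedule */
    if (current >= tasks + NTASKS)
        current = tasks;

    memcpy(user_regs, current->r, sizeof(user_regs));   /* 3: restore */
    return current;
}
```

When the last task in the list gives up the CPU, the pointer wraps back around to the first.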

Task swapping like this is more often an assembly function, and so these two builtin-function calls simply implement what would be those assembly instructions. They are easily identifiable from the ZipCPU disassembly since these are the only instructions that will reference the `uR` register set. If you are really interested in actually seeing their definition, you can find it in the GCC ZipCPU patchset. Here it is for saving the context, and here again for restoring the context. The function basically works by reading four registers from either memory or the user register set, then writing them to the user register set or memory. Indeed, the function is little more than a memory copy.

That leaves us with only one loose end to return to: the `panic()` function.

The purpose of this function is just to tell us that something has gone wrong, and we need to do some debugging. It also helps us start that process by helping us identify which problem has taken place. I mean, we’re writing a blinky function therefore it should be obvious if there’s a problem–the LEDs won’t blink like they should. But how to start debugging next?

1. We’ve chosen to set certain LEDs to indicate we are in supervisor mode (interrupts disabled), and others to indicate we are in user mode (interrupts enabled).

These supervisor mode indicator LEDs should blink so fast that they appear to be lit dimly. If they ever turn off, or start shining brightly, then we can identify which mode we were in when the CPU stopped.

2. The purpose of the `panic()` function is to help us diagnose what happened to a broken subtask. In this case, if we just stop the LEDs from blinking, we might not be able to tell the difference from a CPU freeze above.

Therefore we’ll blink all of our LEDs, either all on or all off together, to indicate a user exception took place.

This particular implementation of `panic()` uses a system power-up counter. This counter increases on every system clock, starting at power up, until the top bit is set. Once the top bit is set, the power up counter keeps that bit set and becomes a rolling 31-bit counter with the other bits.

We can grab bit 28 of this counter as an indication that we need to change the LEDs.
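The counter’s behavior, and the way bit 28 can drive the LEDs, can be modeled as below. This is a host-side sketch built only from the description above; `pwrcount_next()` and `panic_leds()` are hypothetical names, and the real counter is of course a hardware register.

```c
#include <assert.h>
#include <stdint.h>

/* Model one system clock of the power-up counter: count up normally,
 * but once the top bit is set it stays set while the low 31 bits roll. */
uint32_t pwrcount_next(uint32_t pwrcount)
{
    uint32_t next = pwrcount + 1u;
    if (pwrcount & 0x80000000u)
        next |= 0x80000000u;
    return next;
}

/* All LEDs on while bit 28 is set, all off while it is clear. */
uint32_t panic_leds(uint32_t pwrcount, uint32_t all_leds)
{
    assert(all_leds != 0);
    return ((pwrcount >> 28) & 1u) ? all_leds : 0u;
}
```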

While we could’ve used the timer for this, understand that we are in a `panic()` situation. We want to leave the CPU state as un-changed as possible so that we can diagnose whatever fault took place. In particular, we’ll want to leave the user register set untouched. We also don’t know if the fault was associated with interrupt processing or task swapping or something else. For all of these reasons, this code has been kept as simple as possible.

Another thing we could’ve done in this `panic()` routine would’ve been to issue a simulation `NHALT` (hardware NOOP) instruction to halt any simulator at the fault itself. By turning tracing on and then running the simulator like this, it’s fairly easy to figure out what went wrong. (Easy, perhaps, but still pretty intense–examining a trace is not trivial.) Alternatively, we could’ve triggered any CPU-focused Wishbone Scope. This is typically how I debug the CPU if a bug makes it to hardware and the user register set doesn’t tell me enough to know the cause of any fault.

## Video

Want to see how we did? Check out the video below.

In this video you’ll see a MAX-1000 board with 8 LEDs. Half of the LEDs are toggling at rates of 1Hz, 2Hz, 3Hz, and 5Hz. These are the far left LED, and every other LED to the right. The other four LEDs are shining dimly, indicating that the CPU truly is handling interrupts and swapping tasks as desired.

## Conclusions

Since we’ve already covered both how to make an LED blink from Verilog, as well as from C, it only made sense that we’d discuss how to blink an LED in some more advanced fashions–such as by using an interrupt or from a multi-tasking program.

While blinking an LED might seem like an exceptionally trivial task, let’s consider what we’ve learned: We learned how easy it was to blink an LED from Verilog. Even synchronizing multiple blinking LEDs to within a clock period was fairly easy. Blinking an LED on a CPU has the advantage that it doesn’t typically take (much) more logic resources–but only after you’ve already paid for the CPU, its boot code, its bootloader, memory, the system bus and the interconnect/crossbar used to hold everything together.

Of course, this would be more valuable if we were doing something more than just blinking an LED. Still, we’ve demonstrated how to build an interrupt driven task, as well as how to split the CPU’s time across multiple independent task contexts, by using an interrupt to tell us when to switch contexts. Both of these capabilities are very powerful and can be used outside of a simple LED context.

Where the CPU starts to have an advantage over the FPGA fabric is where you need something to perform complex sequencing operations–such as performing a complicated startup script, or performing a complex script periodically. Depending on the complexity of the task, adding it to what a CPU is already doing might be cheaper than performing it in the fabric of the FPGA. At the same time, adding a CPU to an FPGA just to blink an LED is truly overkill.