Now that you’ve seen the ZipCPU by itself, and now that you’ve built its tool chain, let’s look together at what you can do with the ZipCPU as part of a larger design: ZBasic. Today, I’d like to show you how to run the ZBasic design within a Verilator-based simulation environment, one that simulates a QSPI flash, a serial port, and even (optionally) an SD-Card. If all goes well, we’ll run the ZipCPU’s CPU-Test program, and then even play 4x4x4 Tic-tac-toe.
But first, let’s start with a little history.
When I first started out with the ZipCPU, my goal was to demonstrate it on a cheap hobbyist board. After my first development, on a Digilent Basys-3, I then built demonstrations for an Xess.com XuLA2-LX25 board, and then the Digilent CMod S6, the Arty, and most recently Nexys Video boards. You can still find most of these builds on-line in the XuLALX25SoC, S6SoC, OpenArty, and VideoZip repositories. Indeed, my main Github page still highlights the OpenArty project. Many of these boards are peripheral rich, and even for those that aren’t I purchased peripherals (mostly from Digilent) to have something fun and new to work with on each board. I have found it to be a fun exercise to learn how to build the RTL code to support a new peripheral and I would commend that exercise to every RTL student.
I then ran into the problem of supporting someone who didn’t have the peripherals I had. How could or should they use the ZipCPU, if their hardware didn’t match the hardware of one of the demonstration designs?
So, I backed up and took a look at all the designs I had. Almost all of them had some type of serial flash, some amount of block RAM, and a serial port. Why not then make a design that had only these peripherals?
That was, and still is, the purpose of the ZBasic design.
Because the design is intended to be generic, it has neither DDR3 SDRAM nor any other type of external RAM chip. These interfaces tend to be board specific, and I wanted this distribution to be as basic and as simple as possible. What that means, though, is that the main design requires 1MB of on-chip block RAM. Well, “requires” is a harsh word; what I mean to say is that the design as currently configured on github will try to infer 1MB of block RAM. While few chips have this much RAM, it allows the design to have access to an abundance of RAM without worrying about the interface to the RAM. Even better, this amount of RAM can be easily adjusted by changing only one number in the block RAM configuration file and then rebuilding the design. If that’s not enough, by just adding your own user code and configuration file, you can add whatever additional hardware to the distribution you want, SDRAMs included.
Your first task in using the ZBasic design will be building the toolchain for the ZipCPU: binutils, GCC, and newlib. I’ll assume you’ve already done that; if not, you’ll need to back up a step. I’m also going to assume that the toolchain is in your path, as we discussed when building it. The next step is to clone the ZBasic repository and build it. Since this repository doesn’t include a copy of GCC, it’s fairly light and a straightforward clone will work.
Shall we run our first test? This test will require two windows, and a little bit of timing to do right. In your first window, go ahead and type the following, but don’t hit return on that last line yet or you might miss some of the simulation output. This will run the main simulation “test-bench” wrapper, and apply it to my (more modern) CPU test software, once you hit return (don’t do it yet).
In your second window, type the following–but don’t hit return. When you do (eventually) hit return, this will connect you to the running ZBasic simulation.
Ok, now you can hit return in the first window and then the second. You should see the results of the CPU test, such as Fig 2 illustrates.
Be aware, however, that there’s a reason VCD tracing (the -d option) is turned off by default: your VCD file could easily top 11GB.
Alternatively, you could have just started the design on its own without giving a program to the ZipCPU. As the ZipCPU is configured within the ZBasic design, it starts up in a halted configuration. (This is optional–it can be configured to start immediately on power up–see the spec for more details.) If you give a program name as an argument, the simulation wrapper will load the program into memory and then clear the halt bit from the debugging interface. On the other hand, if you give the simulation driver no program name,
then you’ll need to load the ZipCPU’s program into memory–just as you would need to do on actual FPGA hardware. This is done with the zipload program found in the sw/board subdirectory. We’ll also give this program the ‘-r’ switch, to indicate that the ZipCPU should be started once the program is loaded into memory.
Unlike the CPU test, which only tests the CPU itself, 4x4x4 tic-tac-toe (tttt) exercises the C library as well, with typical library system calls. These calls get routed, via board specific glue logic, to the simulated serial port.
To try this out, change directory into the sw/board directory, and build tttt.
If it doesn’t clone tttt (I’ve had mixed success with git submodules so far–all probably due to a problem lying somewhere between my keyboard and my chair …), feel free to clone tttt right there in that directory.
Once you have it cloned, you’ll need to adjust a couple of lines within the sw/board/tttt/src/Makefile to tell tttt where the C-library is. Therefore, open the Makefile in your favorite editor and replace the lines,
with these lines,
This will make certain the cross-compiler environment variables are properly set to build tttt for the ZipCPU. (If you had instead cd’d into tttt and issued a make command, it would have built tttt for your local/host architecture.)
Now we can play. Ready?
As before, we’ll type in the command to start the simulator in one window,
and connect to the simulated serial port from another window,
When you hit return on the two (in sequence), the
telnet window will show
You should be able to just type your move in as a series of three numbers, each 1-4, as in 1 1 1. Have fun!
Be careful: although the computer isn’t unbeatable, he does play a pretty mean game!
The glue logic supporting the C library includes a startup file. For many CPUs this is an assembly file called crt0.s. Not for the ZipCPU. For the ZipCPU, this file is written in C. It contains two routines.
The first routine, _start, starts the CPU by setting the stack pointer to the end of memory, and then jumping to a second routine, _bootloader. This is really an assembly routine with a thin veneer of a C wrapper, but it’s placed within the C file anyway. When you strip away the cruft, it does little more than set the stack pointer and jump to _bootloader.
The second routine within this file is the _bootloader routine that is called from the _start function above. This is the routine I’d like to demonstrate for this homework lesson.
The _bootloader function itself is really nothing more than a series of memory copy routines. These are based around a couple of assumptions. First, the flash is non-volatile (i.e. like a ROM) and so upon startup instructions can be found there. The second assumption is that the block RAM is faster than the flash. Hence, we want to move our instructions (and data) from the flash into block RAM before starting any program.
First, the _bootloader copies memory from the flash into block RAM. This section is framed by an #ifdef _BOARD_HAS_KERNEL_SPACE, so that any high priority (kernel) functions can be placed into block RAM.
Second, the bootloader copies from the flash into any SDRAM the board might have, as defined by the _sdram pointer. Since the ZBasic design doesn’t have any SDRAM, this second memory copy ends up continuing the write into block RAM.
The third section of memory is the BSS section. This is a memory section whose initial contents are all zeros. The _bootloader fulfills this commitment by writing zeros to all of the memory locations within this section.
However, if you look inside the file, you’ll actually see two choices for how to handle these memory copies. The first choice is applied if USE_DMA is defined. This is set earlier in the file, and only if _HAVE_ZIPSYS_DMA is defined. Otherwise, the CPU performs the copies itself, one word at a time.
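To make the three steps concrete, here’s a minimal host-side sketch of the CPU-only (no DMA) path. It is not the actual ZBasic bootloader: the function names and arguments are hypothetical, and the section boundaries that the real code would get from linker-provided pointers are passed in as plain arguments so that the sketch can run on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words from src to dst, one word at a time. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		dst[i] = src[i];
}

/* Fill nwords 32-bit words at dst with zeros. */
static void zero_words(uint32_t *dst, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		dst[i] = 0;
}

/*
 * Hypothetical model of the three bootloader steps.  The flash image is
 * assumed to hold the kernel section first, followed by the remaining
 * code and data.  All names and the layout are illustrative assumptions.
 */
void bootloader_sketch(const uint32_t *flash_image,
		uint32_t *kernel_ram, size_t kernel_words,
		uint32_t *main_ram, size_t data_words, size_t bss_words)
{
	assert(flash_image != NULL && kernel_ram != NULL && main_ram != NULL);

	/* 1. Kernel section: flash -> block RAM */
	copy_words(kernel_ram, flash_image, kernel_words);
	/* 2. Remaining code and data: flash -> SDRAM, or block RAM if none */
	copy_words(main_ram, flash_image + kernel_words, data_words);
	/* 3. BSS: zero the region that follows the copied data */
	zero_words(main_ram + data_words, bss_words);
}
```

On a board without SDRAM, the second destination would simply be another region of block RAM, which matches the behavior described above.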
For this homework assignment, turn on tracing with the -d flag (you do have roughly 4GB available, right?) and run the 4x4x4 tic-tac-toe program (tttt). To keep it from taking up too much room on your hard drive, kill it as soon as the game instructions start coming up (i.e. type Ctrl-C on the screen where you typed main_tb -d ...).
Copy the trace file to a new name so that the next run won’t overwrite it. Then comment out the USE_DMA define in the bootloader’s source file by placing //s at the beginning of the line.
You can then rebuild in sw/zlib by typing make in that directory, but you’ll also need to do a make clean wherever the application software gets built before you can re-issue make there. Once done, you can issue a make tttt. This will propagate the change through the library and into the application software.
Now run the simulator again, still with the -d option. Kill it (Ctrl-C) as before when the characters start getting printed to the terminal. Then rename this second trace.vcd file as well, so that you end up with one trace from each run.
Now that you have two comparison files, pull them both up in your waveform viewer. Let’s look specifically at the serial output line o_wbu_uart_tx from the top level, and then, from within the top level, the wb_stb line, the flash select line, and the bkram_sel (block RAM select) line. As you may recall, wb_stb will be true anytime a request is being made across the bus. The other two lines indicate when the address associated with this request is referencing either the flash or the block RAM.
See a difference?
Here’s my figure for running without the DMA,
and again for running with the DMA controller,
If you look at the far right, when o_wbu_uart_tx starts toggling, that’s when the first characters of the game are being sent to the serial port. This doesn’t happen until the _bootloader has finished. Here, you can see that it takes about nine seconds to copy everything from the flash when using the DMA within the _bootloader, whereas it takes over twenty seconds without! You can also see a big difference in the flash select and bkram_sel lines. What’s going on there?
Let’s drill one level deeper and look at what’s going on by zooming in. Let’s also add the flash_ack and bkram_ack lines. These are the acknowledgment lines from these two peripherals, and they indicate when a request has been completed.
What’s not as readily apparent in this figure is the context: it begins in the middle of a transaction. A value has already been requested from the flash controller by the time my screen capture starts. Once the flash acknowledges the transaction, that is when flash_ack goes high, the data becomes available to the CPU and it immediately turns around and writes to the block RAM. Since the block RAM is quite fast, it acknowledges its transaction almost immediately. (Remember, transaction requests only take place when wb_stb is high, and so Fig 6 only shows two transaction requests.) The CPU then issues a read request of the flash and … everything stalls again waiting for the flash to respond.
This is very different from what happens when the DMA is turned on. For that case, you can see what happens in Fig 6 below.
In this case, the DMA reads multiple items from the flash in a back to back fashion. You can see all of the acknowledgments in the flash_ack line in Fig 6. During this time, the block RAM is idle. Once the DMA finishes reading roughly 1k words from the flash, it then bursts these to the block RAM. Look at the wb_stb line to see this: it’s nearly a constant signal, indicating that one request after another is being made. In a similar fashion, but unlike the flash’s response, the block RAM’s acknowledgment signal is also a constant high, since the block RAM can respond to one request per clock. As a result, this portion of the copy goes by very quickly.
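The difference between the two traces can be captured in a rough cycle-count model. The sketch below is only that, a model: every latency parameter is a hypothetical placeholder rather than a measurement from the ZBasic simulation, and the function names are mine. It assumes the CPU copy pays the full flash latency on every word, while a burst copy pays it once per burst and then receives the remaining words back to back.

```c
#include <assert.h>
#include <stdint.h>

/* CPU copy: each word is one full flash read followed by one RAM write. */
uint64_t cpu_copy_cycles(uint64_t nwords, uint64_t flash_latency,
		uint64_t ram_latency)
{
	return nwords * (flash_latency + ram_latency);
}

/*
 * Burst (DMA-style) copy: read up to burst_len words from the flash,
 * then write them all to the block RAM, and repeat until done.
 *   flash_latency: cycles until the first word of a burst arrives
 *   flash_next:    cycles for each following word of the same burst
 *   ram_latency:   cycles for the first RAM write of a burst; the rest
 *                  are assumed to complete at one per clock
 */
uint64_t dma_copy_cycles(uint64_t nwords, uint64_t burst_len,
		uint64_t flash_latency, uint64_t flash_next,
		uint64_t ram_latency)
{
	uint64_t cycles = 0;

	assert(burst_len > 0);
	while (nwords > 0) {
		uint64_t n = (nwords < burst_len) ? nwords : burst_len;

		/* Only the first read of the burst pays the full latency */
		cycles += flash_latency + (n - 1) * flash_next;
		/* Writes stream into the block RAM at one per clock */
		cycles += ram_latency + (n - 1);
		nwords -= n;
	}
	return cycles;
}
```

Plugging in your own numbers for the flash and block RAM latencies shows how quickly the per-word flash latency comes to dominate the CPU-driven copy.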
How do I change the amount of block RAM?
Since I know this is going to come up, let me show you how easy it is to change the amount of block RAM in this device.
The relevant line of the block RAM configuration file defines a tag, @LGMEMSZ, specifies that it is a numerical tag with the @$ prefix, and then gives it the value of 20. This tag is used to specify that the base two logarithm of the block RAM memory size is twenty, meaning it should have 1MB, or 2^20 bytes, of block RAM. The key itself is unique to this block RAM configuration file, so you aren’t likely to find it elsewhere. It basically defines a local variable within an AutoFPGA context.
However, with a bit of math and some substitution (remember, AutoFPGA is primarily a copy/paste utility with a calculator and address assignment built in), this number becomes the amount of block RAM called for in the design. You can change this one number, and then run make autodata from the main directory (assuming you have AutoFPGA installed and in your path), and the design will immediately be reconfigured for the new memory size. Yes, you’ll still need to run make from the main directory again once you’ve done this, so that this newly configured design has a chance to build.
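The “bit of math” is simple enough to check by hand. Here’s a small sketch of it in C; the function names are hypothetical, and the address-line calculation assumes a 32-bit, word-addressed bus, which is my assumption rather than something stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Block RAM size in bytes for a given log-base-two size (20 -> 1MB). */
uint32_t bkram_bytes(unsigned lgmemsz)
{
	assert(lgmemsz < 32);	/* keep the shift within a 32-bit value */
	return (uint32_t)1 << lgmemsz;
}

/* Number of 32-bit words held by a memory of that size. */
uint32_t bkram_words(unsigned lgmemsz)
{
	assert(lgmemsz >= 2);	/* need at least one whole word */
	return bkram_bytes(lgmemsz) / 4;
}

/* Word-address lines needed, assuming a 32-bit, word-addressed bus. */
unsigned bkram_address_lines(unsigned lgmemsz)
{
	assert(lgmemsz >= 2 && lgmemsz < 32);
	return lgmemsz - 2;	/* two byte-select bits are dropped */
}
```

With the value of 20 discussed above, this works out to 1,048,576 bytes, or 262,144 words. Dropping the value to 16 would ask for 64kB instead, which is closer to what many smaller FPGAs can actually provide.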
So what changes when you reconfigure? First, the @MAIN.INSERT tag in that same file is used to tell AutoFPGA what to place into your main design file. In this case, it’s a reference to a module which implements a block RAM device that is parameterized by its size. @THIS.LGMEMSZ is used to control this parameter. It’s also used to connect that design parameter to the number of address lines fed to this component, and changing this size may cause the other peripherals on the bus to be shuffled around to minimize the required bus logic.
Second, all of the addresses will (may) be re-assigned as I just mentioned. This includes more than just the block RAM. These new addresses can be found listed in the files that AutoFPGA generates.
Third, the linker definition file will have changed, which will adjust the _bkram pointer used by the _bootloader we discussed above.
Finally, the @SIM.LOAD tag defines the software necessary to load a program into this memory, given its new location and length.
The result of all of this is that, following an AutoFPGA based reconfigure, all that is required is to rebuild the project and we have a new amount of memory at a (potentially) different location.
Now that you know how to run the ZBasic demonstration, the next step will be to show how simple and easy it is to add a new component using AutoFPGA, and then to demonstrate how we can integrate this component into our simulation and ultimately the FPGA design as a whole.
That will be our next step in this series, although there’s really a lot of information we can come back to–such as how the DMA controller works in the first place.
Wilt thou play with him as with a bird? or wilt thou bind him for thy maidens? (Job 41:5)