That’s a good question, and worthy of a blog post in its own right.
The ZipOS was an “Operating System” of my own design (if you can even call it that) that I then used on the S6SoC project to demonstrate the multi-tasking capability of the ZipCPU. As demonstrated, it consisted of a couple of small but primary components.
There was the O/S Kernel. You can find an example of it here for the S6SoC project. The key to getting it to operate in real time for the doorbell application of the S6SoC project was to split the kernel into both realtime and non-realtime components.
The real-time “kernel”, such as it was, required all tasks to be loaded and defined at build time. It wasn’t very generic, in that you couldn’t start a new task at a later time. Worse, you had to know the stack size required of all of the tasks at build time as well. That was always a key limitation whenever I wanted to use it for a new project.
Part of the problem was due to my (non-existent) malloc implementation at the time. It worked great, as long as you never wanted to “free” any memory.
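To make the limitation concrete, here is a minimal sketch of an allocate-only ("bump") allocator of the kind described above. It is not the ZipOS code: the names bump_malloc and bump_free and the 4kB arena size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4kB arena; the real size and placement are assumptions. */
#define HEAP_BYTES 4096

static uint32_t heap[HEAP_BYTES / sizeof(uint32_t)];
static size_t   heap_used = 0;   /* bytes handed out so far */

/* Allocate by bumping an offset; returns NULL once the arena is exhausted. */
void *bump_malloc(size_t nbytes)
{
	/* Round up to a 4-byte multiple so every block stays 32-bit aligned. */
	size_t rounded = (nbytes + 3u) & ~(size_t)3u;

	if (rounded < nbytes || rounded > HEAP_BYTES - heap_used)
		return NULL;

	void *ptr = (uint8_t *)heap + heap_used;
	heap_used += rounded;
	assert(heap_used <= HEAP_BYTES);
	return ptr;
}

/* "Freeing" is a no-op: nothing is ever returned to the arena. */
void bump_free(void *ptr)
{
	(void)ptr;
}
```

Since bump_free() does nothing, every allocation permanently consumes arena space. That is acceptable when all tasks are created once at startup, and unworkable once tasks need to come and go.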
Did I say “real-time”? What made this kernel “real-time” wasn’t anything more than the fact that it met its realtime requirements. I managed that by placing the real-time critical parts of the system into RAM, and leaving the rest running from Flash. (RAM was a very precious resource on the S6, since there was only ever 16kB of block RAM and no off chip RAM. Putting software into RAM was therefore a big deal, but given how slow the flash operated, it was a necessity.)
The second key component of the ZipOS was the System Pipe architecture, syspipe. This was a piece I was particularly proud of, in that it was a multiprocessing O/S component that could be shared across all user tasks.
The SysPipe component allowed one process to write into it, and another process to read out of it. In many ways, this was a basic AXI-stream protocol – just written in software. If there wasn’t enough space within it to write, the writing task would either block or raise an error flag (and cause a panic if I recall correctly). Reads were similar. If there wasn’t enough data within the pipe, the reading task would block until another task filled the pipe with sufficient data.
Indeed, the pipe task was kind of unique, in that while one process wrote into the pipe, another process might try reading from it, and the second process would then block until the write was finished–meaning that “two threads” might be active with the pipe at the same time, even though only one thread would ever actually be active while the other one was blocking.
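The pipe behavior described above can be sketched as a small ring buffer. This is an illustrative model rather than the actual syspipe source: the type syspipe_t, the function names, and the 8-byte capacity are all assumptions. A short return count stands in for the point at which the kernel would block the calling task.

```c
#include <assert.h>
#include <stddef.h>

#define PIPE_LEN 8   /* hypothetical capacity in bytes, a power of two */

typedef struct {
	unsigned char buf[PIPE_LEN];
	size_t head;    /* total bytes ever written */
	size_t tail;    /* total bytes ever read    */
} syspipe_t;

static size_t syspipe_used(const syspipe_t *p)  { return p->head - p->tail; }
static size_t syspipe_space(const syspipe_t *p) { return PIPE_LEN - syspipe_used(p); }

/* Copy in as much as fits; a return value < len means the writer must wait. */
size_t syspipe_write(syspipe_t *p, const unsigned char *src, size_t len)
{
	size_t n = len < syspipe_space(p) ? len : syspipe_space(p);

	for (size_t i = 0; i < n; i++)
		p->buf[(p->head + i) % PIPE_LEN] = src[i];
	p->head += n;
	assert(syspipe_used(p) <= PIPE_LEN);
	return n;
}

/* Copy out what is available; a return value < len means the reader must wait. */
size_t syspipe_read(syspipe_t *p, unsigned char *dst, size_t len)
{
	size_t n = len < syspipe_used(p) ? len : syspipe_used(p);

	for (size_t i = 0; i < n; i++)
		dst[i] = p->buf[(p->tail + i) % PIPE_LEN];
	p->tail += n;
	return n;
}
```

In a real kernel, the scheduler would suspend the caller on a short count and resume it once the other side of the pipe had made progress, which is how two tasks can be "inside" the pipe at once while only one of them is running.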
A third piece was the traps (i.e. syscalls) that were defined. Do note the key word in that statement: “defined.” Not all of the defined traps were implemented. For example, there were traps for getting and returning semaphores that were never implemented in the Kernel.
Those are the big components of the ZipOS.
Development stalled, however, when I tried to make the ZipOS more generic. In many ways, this reveals how far the ZipOS was from a general purpose Operating System.
The fact that all tasks, along with their stack sizes and pipe connections, needed to be known at startup didn’t bode well for larger or more generic systems.
Frankly, this was quite cumbersome to work with.
The lack of a good free() system call meant that all memory was allocated once and never released. While this might be great for a small, embedded task-managing system that only ever had 16kB of RAM, it was never generic enough to be used in a system that would start new tasks after it had started.
It wasn’t just tasks that were the problem. At the time, I didn’t have a good means of adding and removing new devices into the O/S. The methods I had weren’t generic enough to easily move on to the next project.
So, rather than fixing the ZipOS, I’ve been spending my time fixing the foundations that made the ZipOS difficult to work with.
One of the bigger problems was the challenge of building a bus interconnect for a design. It seems like every design I use has a different set of peripherals, and they all get mapped to new addresses with every new project. It’s not just the addresses that get remapped from one FPGA design to another, the interrupts get remapped as well.
I needed a means of generating a design, connecting all the peripherals, and providing an address mapping from the FPGA design portion to the software that was controlling it. This became the project AutoFPGA. You can read an overview of the project here.
A second big problem I had was the lack of a good memory allocation scheme. Anyone who has studied memory allocation will know this is a field of study in and of itself. I needed to fix the malloc() issue, but didn’t want to get stuck in a lifetime of studying memory allocation to come up with the perfect scheme.
My solution was to adapt NEWLIB for the ZipCPU. Most of my projects today now use this NEWLIB port. The NEWLIB port is now a part of the ZipCPU repository, and these ZBasic instructions should be enough to get you started with it.
A third big problem with the ZipOS was that it didn’t have any support for a disk drive or other non-volatile storage of any type. Sure, I had my flash controller, but this worked better as a ROM than as a disk drive which could hold any type of file system.
This was fixed in one of my recent SONAR projects. In that project, my customer didn’t have a Linux setup. He couldn’t load software onto the ZipCPU in my usual debugging bus fashion. Worse, without the debugging bus, I couldn’t really load a bootloader design into the flash. I needed a new way of delivering software to him.
My solution was to use my SDSPI SD-Card controller together with the FATFS library. I placed the bootloader for this project into RAM, and then used it to read a file from the SD-Card into memory. That file was the program that my customer then ran.
The pieces from this solution can be found in a couple of places. First, the files that made my SDSPI controller work with FATFS are posted in the software directory of the SDSPI controller. Many of the other pieces are hidden in the dev branch of my videozip project. While I’d like to move them out of the dev branch and into the master branch, that’ll require pulling the board out and making sure they work within that project again–something I haven’t (yet) had the opportunity to do.
Sadly, the size of the required bootloader means that it won’t fit in the block RAM of the smaller FPGAs I have. Those will still need to have their bootloader initially loaded into flash. Other than that problem, they should work quite nicely.
The Memory Management Unit
That gets us past most of the problems with the original ZipOS, but without a Memory Management Unit (MMU) I’ll struggle to allocate new tasks at runtime.
One solution, similar to the one I used in the original ZipOS, would be to give every task its own stack and memory area. In this case, I would again need to know how much to give each task at allocation time. I suppose I could give a generous amount to each task, but this would only work if I had a generous amount of RAM to give to every task.
The second problem with the solution above was that there was no way to tell if a task overran its memory area, nor was there a means of allocating more memory to a task if it had done so.
Both of these problems can be solved by adding an MMU to the ZipCPU.
I have built such an MMU. It has yet to be successfully integrated into the CPU. Well, I shouldn’t quite say it like that. I did manage to get the CPU to build (once) with the MMU attached, but then ran into the problem where I needed a test program to test out and try this MMU and that’s where the MMU project ran out of steam. I just … wasn’t certain what kind of program to write in order to exercise it well.
The second problem the MMU had was that the caches operated on Virtual memory addresses within the MMU boundary. In other words, the caches would only ever know virtual memory addresses. Worse, a piece of memory in the cache might get overwritten by a second virtual memory address to the same physical memory address if ever a single physical memory address existed in two virtual memory places. Worse, addresses had context associated with them that would also need to be checked and … I never managed to come up with an acceptable solution at the time to fix this problem.
With a bit of hindsight, I now understand more of MMU design. The MMU I built was a fully associative MMU, using much more logic than was really required. I’m tempted, now, to take another start at the MMU and to place it between the CPU and its caches. I’m also tempted to build the TLB within the MMU as a one-way cache rather than a fully associative one. I haven’t put my hand to that task yet, though, because the next task I’ve been working on has really taken precedence for me.
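To illustrate why a one-way (direct mapped) TLB needs so much less logic than a fully associative one, here is a minimal software model of the lookup. The 4kB page size, the 16-entry table, and every name in it are assumptions for illustration, not the layout of the ZipCPU's MMU. The context check mentioned earlier is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS   12u   /* assumed 4kB pages */
#define TLB_ENTRIES 16u   /* assumed table size, must be a power of two */

typedef struct {
	bool     valid;
	uint32_t vpage;   /* virtual page number, stored as the tag */
	uint32_t ppage;   /* physical page number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Direct mapped: the low bits of the page number select exactly one slot. */
static uint32_t tlb_index(uint32_t vpage)
{
	return vpage & (TLB_ENTRIES - 1u);
}

/* Install a mapping, evicting whatever previously occupied the slot. */
void tlb_insert(uint32_t vaddr, uint32_t paddr)
{
	uint32_t vpage = vaddr >> PAGE_BITS;
	tlb_entry_t *e = &tlb[tlb_index(vpage)];

	e->valid = true;
	e->vpage = vpage;
	e->ppage = paddr >> PAGE_BITS;
}

/* Returns true on a hit and writes the translated address to *paddr. */
bool tlb_lookup(uint32_t vaddr, uint32_t *paddr)
{
	uint32_t vpage = vaddr >> PAGE_BITS;
	const tlb_entry_t *e = &tlb[tlb_index(vpage)];

	assert(paddr != NULL);
	if (!e->valid || e->vpage != vpage)
		return false;   /* miss: a real MMU would fault or walk a table */
	*paddr = (e->ppage << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1u));
	return true;
}
```

A lookup here is one indexed read and one tag comparison, where a fully associative design compares the tag against every entry at once. The cost is that two pages sharing the same low page-number bits evict each other.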
The ZipCPU, as originally designed and built, is tightly coupled to a pipelined implementation of the Wishbone bus. The last couple of projects I’ve worked on, however, have required an AXI3 bus, not a Wishbone bus. My most recent work, therefore, has been to make the ZipCPU bus agnostic so that it can be used with either a Wishbone or an AXI4 bus. (If you build the master right, then converting from AXI3 to AXI4 and back becomes easy. The key trick is for the master to only use one AXI ID to solve the WID problem and then to avoid issuing any burst requests longer than 16 beats.)
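The 16-beat rule can be sketched in software as a routine that chops a long transfer into bursts a master could issue on either AXI3 or AXI4. This is a model for illustration only: burst_t and split_bursts are hypothetical names, and the sketch ignores AXI's separate rule that a burst may not cross a 4kB address boundary.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BEATS 16u   /* AXI3 limit; AXI4 INCR bursts may be up to 256 beats */

typedef struct {
	uint32_t addr;    /* address of the first beat */
	uint8_t  axlen;   /* AxLEN encoding: number of beats minus one */
} burst_t;

/*
 * Split `beats` transfers of `beat_bytes` each, starting at `addr`, into
 * bursts of at most MAX_BEATS.  Writes up to `max_out` descriptors into
 * `out` and returns how many bursts the whole transfer needs.
 */
unsigned split_bursts(uint32_t addr, unsigned beats, unsigned beat_bytes,
		burst_t *out, unsigned max_out)
{
	unsigned count = 0;

	assert(beat_bytes > 0);
	while (beats > 0) {
		unsigned n = beats < MAX_BEATS ? beats : MAX_BEATS;

		if (count < max_out) {
			out[count].addr  = addr;
			out[count].axlen = (uint8_t)(n - 1u);
		}
		count++;
		addr  += n * beat_bytes;
		beats -= n;
	}
	return count;
}
```

For example, a 40-beat transfer of 32-bit words becomes three bursts of 16, 16, and 8 beats. Because AxLEN never exceeds 15, the same requests fit in AXI3's four-bit length field.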
I’ve therefore been rebuilding the ZipCPU since last fall to be bus agnostic.
I now have instruction fetch routines for AXI4. The basic fetch routines are in this AXILFETCH file, and include a single instruction fetch (FETCH_LIMIT=1), a dual instruction fetch (FETCH_LIMIT=2), and even a full instruction fetch pipeline and FIFO (FETCH_LIMIT > 2). There’s even now an AXI4 instruction cache implementation that mimics my original Wishbone pipelined instruction cache implementation–although the updated CPU is still missing a full AXI4 wrapper. There is, however, an AXI4-lite wrapper that I’d like to try out in the near future.
I’m also working on an AXI data cache. I expect it will be very similar to my original Wishbone data cache, but it’s still a work in progress and hasn’t (yet) been posted. I’m still hoping to outperform Microblaze again, but this time using AXI4, with or without the data cache.
That leaves the issue of atomic accesses when using AXI4, something the AXI4 specification calls “exclusive access”.
This last weekend, I worked on and posted an updated AXI block RAM controller, which, unlike Xilinx’s block RAM controller, can actually handle exclusive access. (Yes, I would like to blog about this soon …) A CPU requires atomic access support from its bus in order to make an operating system work, and when using the AXI4 bus that support needs to come from the slave–and not so much the bus. Therefore, I now have a slave which can offer this support. That may even keep the CPU from needing a trap to handle semaphores in the future.
The key step to making this task happen was updating my AXI4 (full) properties to fully check the exclusive access handshake. Much to my surprise (not), several bus components stopped working now that the exclusive properties are being fully tested–with the key failing component being the AXI4 firewall.
Exclusive access when using AXI4 requires a bit of a change to interface the CPU core with its memory modules.
AXI4 exclusive access works in two steps, four from the standpoint of the CPU. 1) The CPU receives a LOCK instruction, warning it that an exclusive access request is coming. 2) The first instruction following the LOCK request is a read, which then sets the ARLOCK field in an AXI4 (not lite) memory request. 3) Once the value is read, the CPU operates on it however it needs to. In the ZipCPU’s atomic access scheme, there’s only one ALU instruction allowed here, which is sufficient for most operations. 4) The value is then written back to memory using a request with AWLOCK set. If all goes well, the slave responds with an EXOKAY response. If, however, something else has accessed the memory between the original read and this final write, the slave is required not to write the value to memory and to instead return an OKAY response. This then needs to trigger the CPU to go back and re-start the operation from the LOCK instruction.
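The read, modify, write, and retry sequence above can be modeled in a few lines of C. This is a deliberately simplified sketch: it tracks a single reservation with no AXI ID, and the function names are hypothetical. It is not the block RAM controller's logic, only an illustration of the handshake as seen by software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { RESP_OKAY, RESP_EXOKAY } resp_t;

#define MEM_WORDS 16u

static uint32_t mem[MEM_WORDS];
static bool     monitor_valid;   /* is a reservation outstanding? */
static uint32_t monitor_addr;    /* word index being monitored    */

/* Read with ARLOCK set: return the data and arm the monitor. */
uint32_t exclusive_read(uint32_t idx)
{
	assert(idx < MEM_WORDS);
	monitor_valid = true;
	monitor_addr  = idx;
	return mem[idx];
}

/* Ordinary write: always lands, and cancels a matching reservation. */
void normal_write(uint32_t idx, uint32_t value)
{
	assert(idx < MEM_WORDS);
	mem[idx] = value;
	if (monitor_valid && monitor_addr == idx)
		monitor_valid = false;
}

/* Write with AWLOCK set: only lands if the reservation is still intact. */
resp_t exclusive_write(uint32_t idx, uint32_t value)
{
	assert(idx < MEM_WORDS);
	if (!monitor_valid || monitor_addr != idx)
		return RESP_OKAY;    /* failed: memory is left untouched */
	mem[idx] = value;
	monitor_valid = false;
	return RESP_EXOKAY;
}

/* Software view of LOCK, load, one ALU op, store; retried until EXOKAY. */
uint32_t atomic_increment(uint32_t idx)
{
	uint32_t value;

	do {
		value = exclusive_read(idx) + 1u;
	} while (exclusive_write(idx, value) != RESP_EXOKAY);
	return value;
}
```

The do/while loop plays the role of the CPU restarting from the LOCK instruction: an OKAY response to the exclusive write means the update was dropped, so the whole sequence runs again with a fresh read.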
I have most of the basic AXI-lite memory unit converted to a version that now uses this exclusive access mechanism. However, that has meant changing the memory interface properties used to verify the other memory routines, and so those other routines now need to be reverified together with the CPU core (again).
I still need to modify the CPU core to handle the re-start part of the operation as well. I’m hoping this can be as simple as the memory unit reading into the program counter register, but I haven’t checked that part either.
I haven’t (yet) resolved how to make the bus and CPU endian-agnostic. Wishbone doesn’t require any particular endian requirement, although the ZipCPU is big endian by nature. AXI4 on the other hand is very definitely little endian in spite of what the specification says. Perhaps you may recall discussing this issue before? Swapping endianness isn’t nearly as trivial as it sounds. I’m expecting this issue to rear its head once I try to start using these routines–which should be in the next couple of months. Indeed, I’m already aware of some bugs that may need to be addressed early on.
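One way to see why the swap isn't trivial is to model where a byte sits on a 32-bit data bus under each convention. The sketch below assumes that a little-endian bus places the lowest-addressed byte in bits [7:0], and that a big-endian bus places it in bits [31:24]. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit word. */
uint32_t bswap32(uint32_t w)
{
	return (w >> 24) | ((w >> 8) & 0x0000ff00u)
	     | ((w << 8) & 0x00ff0000u) | (w << 24);
}

/* Bit offset of the byte at `addr` within a 32-bit bus word. */
unsigned lane_shift_little(uint32_t addr) { return 8u * (addr & 3u); }
unsigned lane_shift_big(uint32_t addr)    { return 8u * (3u - (addr & 3u)); }

/* Extract the byte found at a given lane offset. */
uint8_t byte_at(uint32_t word, unsigned shift)
{
	assert(shift <= 24u && (shift % 8u) == 0u);
	return (uint8_t)(word >> shift);
}
```

Byte-swapping each bus word keeps every byte at its original address, since byte_at(w, lane_shift_big(a)) equals byte_at(bswap32(w), lane_shift_little(a)) for each address a. The numeric value of the 32-bit word changes, though, so a bus-side swap alone leaves word accesses wrong unless the CPU and tool chain agree on the same byte order.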
In preparation for this, I’ve made modifications to both the compiler and binutils to handle building the ZipCPU tool chain for either endianness. I have yet to test these modifications. Depending on what happens when I try out my bus-based hacks, I may or may not need to continue with updating the tool chain.
Building the ZipCPU for another bus also requires another wrapper, so I now have an AXI-lite wrapper for the ZipCPU to test out. Unlike my previous wrappers, this one is now formally verified (and I’ve discovered bugs in the previous wrappers in the process–so I’ll now be formally verifying all my wrappers). That means there’s now a new ZipBones wrapper, and a (work in progress) updated ZipSystem wrapper.
I’m also moving around the control register address allocation for the ZipCPU. Previously, the debugging interface required two interactions to read a register. 1) First, you’d need to write to the CPU debug control register to set the address of the register you wanted to inspect, and then 2) you’d need to read the data value from the CPU data register. This required two round trips across the bus. This process then needed to be repeated for every register within the CPU just to display the debug screen. Because this involved both writing and then reading, the debugging bus was never able to use its burst access capability. On a bad day, the operation could be painfully slow. I’m therefore allocating addresses for every register in the ZipCPU’s register set, so that you can read all registers with one group-read command. This should drastically improve the performance of the debugger when reading registers across a slow serial port–a long needed change.
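The difference between the two schemes can be counted with a small bus model. Everything here is an assumption made for illustration: the address map (DBG_CTRL, DBG_DATA, DBG_REGBASE), the register count, and the function names are hypothetical and are not the ZipCPU's actual debug interface.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS 32u   /* assumed register count */

/* Hypothetical address map, used only by this model. */
enum { DBG_CTRL = 0, DBG_DATA = 1, DBG_REGBASE = 32 };

static uint32_t cpu_regs[NREGS];   /* stand-in for the CPU's register file */
static uint32_t selected;          /* register chosen through DBG_CTRL */
unsigned bus_requests;             /* separate bus requests issued so far */

static void bus_write(uint32_t addr, uint32_t value)
{
	bus_requests++;
	if (addr == DBG_CTRL)
		selected = value % NREGS;
}

static uint32_t bus_read(uint32_t addr)
{
	bus_requests++;
	return (addr == DBG_DATA) ? cpu_regs[selected] : 0u;
}

/* One request returning `n` consecutive words, as a burst read would. */
static void bus_read_burst(uint32_t addr, uint32_t *dst, unsigned n)
{
	assert(addr >= DBG_REGBASE && addr + n <= DBG_REGBASE + NREGS);
	bus_requests++;
	for (unsigned i = 0; i < n; i++)
		dst[i] = cpu_regs[addr - DBG_REGBASE + i];
}

/* Old scheme: select a register, then read it, once per register. */
void dump_regs_old(uint32_t *dst)
{
	for (uint32_t i = 0; i < NREGS; i++) {
		bus_write(DBG_CTRL, i);
		dst[i] = bus_read(DBG_DATA);
	}
}

/* New scheme: every register has its own address, so one burst suffices. */
void dump_regs_new(uint32_t *dst)
{
	bus_read_burst(DBG_REGBASE, dst, NREGS);
}
```

With 32 registers, the old scheme issues 64 separate requests against a single burst for the new one. Over a slow serial link, where each request pays its own round trip, that count largely determines how long the dump takes.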
Bus Width Adjustments
All of the above bus agnostic work is designed to make the ZipCPU work on an AXI4 bus. The project that needs it, however, involves connecting the ZipCPU to a controller that originally had two 32-bit AXI3 bus ports. One of those ports, that of an AXI3 DMA, now needs to be made to be 64-bits wide. Indeed, the whole core bus infrastructure of that project will require a bus whose (minimum) size is 64-bits or this DMA will have to run in a crippled fashion. This has left me with some problems–now resolved below.
The debugging bus I use has only ever been 32-bits.
I now have an AXI-lite bus upsizer which should work nicely for this debugging interface, so that it can now issue read or write requests on any AXI4 (or AXI4-lite) bus that’s at least 32-bits wide.
The ZipCPU’s original Wishbone memory modules, both instruction and data, were only ever 32-bits. This was always a sore point when working with a 64-bit wide DDR3 SDRAM memory, such as the one the Arty A7 has, since the memory device could read twice as fast as the CPU could ever receive the results. Sure, my Wishbone to AXI bridge “works” and works nicely, but half the bandwidth of the memory was lost in the 32-bit to 64-bit conversion.
The ZipCPU’s new AXI4 memory modules can now be parameterized by bus width, so they should be able to handle an arbitrary sized memory bus.
That still leaves me with a 64-bit bus needing to control a 32-bit peripheral.
To handle that problem, I recently built a 64-bit to 32-bit AXI4 downsizer. This appears to work in initial formal testing, but the design hasn’t yet passed either an induction check or any simulation. That’ll be coming soon.
Indeed, this project is now getting much closer to testing. I’m expecting to have the project, with the updated ZipCPU, in test before this summer.
No, the ZipCPU project hasn’t stalled. It’s still going strong. However, while I’ve made many of the changes above, they’re still waiting on some simulation tests to shake out the last bugs remaining in them. Until that happens, I’m not likely going to merge the new zipcore branch into the master.
Perhaps the ZipOS project itself has stalled. However, it’s now being given the foundations it was missing from the beginning. It’s my hope, therefore, that when I return to it I’ll be able to make much more progress on it than ever before.
Prepare thy work without, and make it fit for thyself in the field; and afterwards build thine house. (Prov 24:27)