Planning an Intermediate Design Tutorial

I’ve been known to wander through an FPGA forum or two, and I’ve seen some common, recurring themes. One of them is, “My design isn’t working and I don’t know why not.” It’s the reason I dedicated this blog to keeping individuals out of “FPGA Hell”, as I called it.

Indeed, I saw such a post again just this morning. Someone had a MicroBlaze design that wasn’t starting. It worked on an older board, but not on the newer revision of the board. What might be wrong?

If all you know is that, “My design doesn’t start,” you don’t have much to debug from.

This was one of the reasons why one of my first blog topics was how to build a debugging bus followed by the wishbone scope. Why? Because this is how I debug problems like that. Using the two of those, I can typically trace anything going wrong down to a trace between two interfaces. At that point, you can then visually “see” what’s going on.

Sadly, if all you have are the vendor tools, it’s very hard to “see” what’s going on. Worse, I find myself quick to blame someone else’s code when I don’t see a problem in my own–even if I can’t figure out what the problem is.

It was for this reason that I dedicated the blog to keeping individuals out of “FPGA Hell”. One of the problems associated with blogging, however, is that my articles 1) tend not to be arranged in any particular order, and 2) tend not to get updated over time.

This was my first reason for writing a beginner’s tutorial.

The second reason for writing a beginner’s tutorial was in response to problems I’ve seen with the more traditional instruction. For example, I’ve seen students confuse “testbench” constructs with “synthesizable” constructs and then wonder why their design doesn’t work. I’ve seen students create bench tests that provide less test coverage of their code than “modern” swimwear.

Indeed, I once had the same problem in my own designs: my own poor test coverage left me chasing bugs over late nights, through gigabyte traces containing the bug … somewhere … within them. It was specifically for this reason that I fell in love with formal verification so quickly–it finds the bugs that my own testbenches were always missing.

When no one listened to me hollering about the way I felt things “should” be done, I decided to try writing a tutorial myself to help teach what’s missing.

So far, that tutorial has been well received. Sure, I’ve had some welcome but less than flattering comments. Perhaps the biggest one regards Verilator and the C++ nature of the tutorial. Why should C++ be required when your goal is to learn Verilog and FPGA design? I get it. A similar comment regards the makefiles we used. However, you can’t do things like this VGA simulator without some basic software background, and a lot of folks are coming into the FPGA community with that background–much like me. For them, at least, it makes sense.

That said, there’s a strong need in the community for teaching materials that will teach “From blinky to AXI,” and while my own tutorial gets past blinky, it doesn’t make it anywhere near AXI.

So let me present some of my thoughts today regarding how this might be fixed.

The problem with the intermediate tutorial

There are a couple of reasons why I have yet to start on an intermediate FPGA tutorial. One is that I sell my services and … things have been quite busy as of late. (Sorry, but this blog is a hobby of mine rather than something that puts food on the table for my family.) The second reason is that the next step really requires a lot of design-ware that few students would like to build.

Allow me to explain.

Many of the FPGA designs I’ve worked with involve some kind of bus master together with several bus slaves. The common task, then, for the FPGA designer is to build a new bus slave. A classic example of this might be to create a new piece of hardware to add to a CPU’s capabilities, such as is shown in Fig. 1 below.

Fig 1. A typical CPU based design

The task of the student is then to build this new slave. Perhaps he has several such slaves he’d like to build.

Were I to build this the way my mathematical background requires, I’d want to teach everything from the bus master, to the S(D)RAM memory controller, to the bus interconnect before the student gets to their first bus slave. You know, learn multiplication before square roots. In this case, that’d be …

BORING!

I mean, seriously, would you want to know how to build a lock and dam just to go canoeing on the river?

Here’s another example design that’s common among FPGAs: you want to process data, say an image perhaps. That means you want to read the image from memory (there’s not enough room in block RAM to store most images), process it however, and then store it back into memory again.

Fig 2. A basic processing pipeline

While this might be a typical signal or image processing application, there’s a missing piece to it: the design usually begins and ends with Matlab or, in my case, Octave.

First, you build your design in Octave and prove that your algorithm works.
Along the way, you discover how to measure the performance of your algorithm, and you learn how to communicate (i.e. plot) that performance.
Now you want to put it on an FPGA. So, you implement your data processing algorithm for the FPGA.
It’s now the big moment: as you are synthesizing your brand-new algorithm in order to place it into an FPGA design, you suddenly realize that you have no way of getting your data set into or out of memory. Worse, even if you could get the data into memory, you’d have no easy way to move it between Octave and your design.

You can see the problem illustrated below in Fig. 3.

Fig 3. A traditional data streaming problem

Of course, what I haven’t mentioned is that the end goal of this sort of stream processing task is typically not to process the data within memory, but rather to receive the data on some signal or video feed, process it, and then to forward the output back to a similar feed.

Fig 4. From input to output

Where’s the tutorial to teach that?

There isn’t one (that I know of). (I don’t really know of that many.)

Hence, the reality is that a lot of individuals end up using the vendor tools and vendor design components and have no idea what’s going wrong when they don’t work.

As an example, a recent Xilinx user wrote that he’d written a lot of data into his Xilinx stream processor and no data ever came out. Why not? Eventually, after some back and forth, he realized he’d never marked the end of the data packet. Now, without using your own code, or at least something that’s open source, how would you ever find a bug like that?
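To make that failure mode concrete, here is a toy Python model of a packet-oriented stream stage. It is only a sketch under stated assumptions: the name `packet_stage`, the `(data, last)` beat format, and the hold-until-last behavior are all hypothetical, and none of it is taken from any vendor's core.

```python
def packet_stage(beats):
    """Toy model of a packet-oriented stream stage (hypothetical, not a vendor core).

    `beats` is a sequence of (data, last) pairs. The stage buffers data and
    only releases a packet downstream once it sees a beat with last=True.
    Returns (released_packets, still_pending_data).
    """
    pending = []
    released = []
    for data, last in beats:
        pending.append(data)
        if last:
            # End-of-packet marker seen: the whole packet goes downstream.
            released.append(pending)
            pending = []
    return released, pending


# Three beats, but the end of the packet is never marked:
released, pending = packet_stage([(1, False), (2, False), (3, False)])
print(released, pending)
```

With no end marker, everything written in stays stuck in `pending` and nothing is released, which matches the symptom described above. With your own code, a trace of the `last` flag makes that visible in minutes.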

My point in all of this is simply that, when your goal is data processing, you don’t really want to build all of that miracle glueware shown in Fig. 3 above–just like you didn’t want to build the interconnect and the CPU shown in the example before that, in Fig. 1.

Yes, I understand that most FPGA vendors provide cores and logic that can handle all of this middleware. Personally, I have a couple of problems with using these cores.

First, and perhaps most important, if you ever need to switch hardware vendors, you’ll have to tear apart your design and rebuild it for the new cores using their new interfaces.

This includes switching design flows, even for the same hardware. For example, if there were an open source tool chain, would you be able to get by without the vendor supplied cores? How about if you wanted to use an open source simulator, like Verilator? Know of any good open source interconnects? Or tools to connect your components to said open source interconnects?

Or have you not noticed that I have ZipCPU based designs for iCE40, ECP5, Xilinx, and Intel? Yes, it is doable.

Second, the purpose of the intermediate tutorial we are discussing would be for learning. It’s one thing to use a vendor’s core when you have a product that’s due on a tight schedule. Sure, I get it, go for it. I’m not knocking that at all. On the other hand, if you want to learn design, then doesn’t it make sense to spend your time learning how to build your own versions of the basic building blocks before you turn around and use those from a vendor?

Worse, wouldn’t it be a shame if you learned how to design using vendor based building blocks but then … had to switch tools and discovered that you no longer knew anything because you could no longer use the cores and tools you were familiar with? For example, have you ever tried simulating Xilinx’s AXI interconnect using the fastest simulator out there? (Hint: it’s not Vivado.)

My point here is simply this: there’s a need for instruction material that goes past basic serial port I/O in a vendor independent fashion.

Tutorial Goals

As always, one of the goals of the tutorial is to have the widest applicability possible. That means it needs to share FPGA design concepts and strategies in a vendor independent fashion. That means I can’t really use vendor code in my tutorial. That includes all the vendor glueware, bus interconnects, etc.

My apologies to all of you big-named vendors out there. On the other hand, after trying to answer questions from clueless forum posters, wouldn’t you rather have customers who knew how to debug their own designs?

So, here was my thought: Using entirely open source tools, so that the design components could be verified with SymbiYosys and then simulated using Verilator, create a set of lessons similar to those shown in Fig. 5 below.

Fig 5. Proposed intermediate tutorial structure

The lessons would use AutoFPGA to connect all of the parts and pieces together. In every lesson, the goal would be to be able to formally verify any new components, then to run the design in a Verilator based simulation, and then in actual hardware.

The lesson sequence would start out by discussing some of your basic slave design components.

(1) The first lesson would start out by creating a very simple “blinky” design, but this time using the AutoFPGA generated Wishbone bus. Commands sent from the host over the debugging bus would be used to turn LEDs on and off.

For those who don’t recall the debugging bus articles, a “debugging bus” is my term for a bus, internal to an FPGA, that can be accessed and commanded from host (PC) software. Even better, I like to run my debugging bus software over a network, allowing me to interact with either my design or its simulation from anywhere on my local network.

(2) The second lesson would involve simply creating an audio tone. This would be very similar to the first lesson, but might involve a couple of bus addresses, to allow the developer to control amplitude and frequency from their external computer as one example. The tone itself could be played using a basic PWM controller.

(3) The third lesson would be quick, just showing how to connect a block RAM to the bus as well as how to verify RAM based slaves.

(4) We’d then discuss building a “bus scope”. If you’ve read my blog much, you’ll know that I use what I call a “wishbone scope”. You’d be amazed at the bugs you can find and diagnose using something like this.

This is somewhat different from using vendor tools (chipscope, ILA, etc), simply because it is bus based. This will allow you greater control of the scope, eventually allowing you to control it from your embedded CPU–but we’re not there yet.

(5) The final slave in this section of the course would be your basic xSPI flash memory controller. This could be done either with SPI or QSPI.

Voila, the first section of an intermediate Verilog tutorial.
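As a rough illustration of the tone lesson, the sketch below models a square-wave tone generator feeding a PWM output, with one loop iteration per clock. Everything in it is an assumption made for illustration: the name `pwm_tone`, the two "registers" `freq_step` and `amplitude`, and the 16-bit phase and 8-bit PWM widths are hypothetical. The real lesson would be written in Verilog, with the two values set over the bus.

```python
def pwm_tone(freq_step, amplitude, nclocks, phase_bits=16, pwm_bits=8):
    """Software model of a square-wave tone played through a PWM output.

    freq_step -- hypothetical bus register: added to the phase every clock
    amplitude -- hypothetical bus register: PWM level while the tone is high
    Returns a list holding the one-bit PWM output for each clock.
    """
    phase_mask = (1 << phase_bits) - 1
    pwm_mask = (1 << pwm_bits) - 1
    phase = 0
    pwm_counter = 0
    out = []
    for _ in range(nclocks):
        # Square wave: the top bit of the phase accumulator selects high/low.
        tone_high = (phase >> (phase_bits - 1)) & 1
        level = amplitude if tone_high else 0
        # PWM: output is high while the free-running counter is below the level.
        out.append(1 if pwm_counter < level else 0)
        pwm_counter = (pwm_counter + 1) & pwm_mask
        phase = (phase + freq_step) & phase_mask
    return out


bits = pwm_tone(freq_step=0x0123, amplitude=200, nclocks=4096)
print(sum(bits), "of", len(bits), "clocks high")
```

In this model the tone frequency is the clock frequency times `freq_step` divided by 2 to the power `phase_bits`, so writing a larger step raises the pitch and writing a larger amplitude raises the average duty cycle.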

The second section of the tutorial would focus on bus masters, rather than bus slaves.

It would begin with a set of lessons on creating a video output.

(6) The first might discuss outputting a fixed test pattern.
(7) That lesson would be followed with a stream processing lesson where a “sprite” of some type would be added to the video stream.
(8) The next lesson would be a generic one on asynchronous FIFOs. Although this is a video independent topic, it’s necessary background for the next lesson.
(9) The final video lesson would be on how to stream pixels from a (fixed) memory location to the video controller and hence to the VGA screen.

All of this would be simulatable using Verilator. Perhaps painfully simulatable, but simulatable nonetheless.
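The plan above does not say how the asynchronous FIFO would be built. One common technique, assumed here purely for illustration, is to pass the read and write pointers between clock domains in Gray code, so that only one bit changes per increment. The helper names `bin2gray`, `gray2bin`, and `one_bit_apart` below are hypothetical.

```python
def bin2gray(value):
    """Convert a binary pointer value to its Gray code."""
    return value ^ (value >> 1)


def gray2bin(gray):
    """Convert a Gray code back to binary by XOR-ing all right shifts."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


def one_bit_apart(a, b):
    """True if a and b differ in exactly one bit position."""
    return bin(a ^ b).count("1") == 1


# Check a 4-bit pointer, including the wrap from 15 back to 0.
DEPTH = 16
for ptr in range(DEPTH):
    nxt = (ptr + 1) % DEPTH
    assert one_bit_apart(bin2gray(ptr), bin2gray(nxt))
    assert gray2bin(bin2gray(ptr)) == ptr
print("every pointer increment changes exactly one Gray-coded bit")
```

The one-bit property is what makes the pointer safe to sample in the other clock domain: a sample taken mid-transition is either the old value or the new one, never an unrelated value. Note that the wrap-around case only holds when the pointer range is a power of two.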

(10) We’ll then move back to a quick bus slave, to learn how to control a basic SPI A/D controller.

(11) Once we have a way to ingest samples, the next lesson would discuss how to record samples from something like an A/D controller to memory, in order to later be read out using the debugging bus.

At this point, you should be able to ingest your pipeline processing algorithm into a design.

(12) Finally, before getting into CPU design, we’d work our way through a basic hardware controller–something that could read “instructions” from a memory, and then use them to control a hardware output. In this case, it should be possible to build a basic music box–perhaps something that could play “Music Box Dancer”?

That would end the basic set of lessons on building bus masters.
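The music box controller might look something like the following software model. This is a sketch only, and the instruction format is invented for illustration: a hypothetical 32-bit word carrying a duration in the upper 16 bits and a tone setting in the lower 16 bits, with an all-zero word ending the tune. The names `note`, `HALT`, and `run_music_box` are likewise assumptions.

```python
HALT = 0x00000000  # hypothetical encoding: an all-zero word ends the tune


def note(duration, freq_step):
    """Pack one hypothetical instruction: duration in ticks, tone setting."""
    return ((duration & 0xFFFF) << 16) | (freq_step & 0xFFFF)


def run_music_box(memory):
    """Model a controller that fetches instruction words from memory.

    Returns the tone setting driven to the hardware output on each tick.
    A tone setting of 0 with a nonzero duration acts as a rest.
    """
    output = []
    pc = 0  # address of the next instruction to fetch
    while pc < len(memory):
        word = memory[pc]
        pc += 1
        if word == HALT:
            break
        duration = (word >> 16) & 0xFFFF
        freq_step = word & 0xFFFF
        # Hold this setting on the output for `duration` ticks.
        output.extend([freq_step] * duration)
    return output


tune = [note(2, 100), note(1, 0), note(3, 200), HALT, note(5, 999)]
print(run_music_box(tune))
```

The controller here is a bus master in miniature: it generates its own addresses, reads memory, and acts on what it finds, which is the skill the CPU lessons then build on.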

The next group of lessons would focus on building a CPU. This would not be about ISA design (Sorry, I know my limits), but rather on implementing some increasingly complex CPUs.

(13) The first lesson would discuss just a simple, very basic, microcontroller. I haven’t yet decided what ISA I’d use, or if there’s one available that has a nice tool suite with it, but you get the idea.

In this lesson, the student would implement such a simple CPU.

(14) A CPU really needs a debugging interface, so we’d add one. This would allow us to start and stop our CPU, using the same debugging bus that’s supported us so far, and perhaps even read registers and state from it.

(15) That would lead us right into building an ELF program loader. This could be just something basic that can read a compiled file and load it into either flash or (block) RAM. Of course, this would also require an ELF-based tool chain, and so likely a proper ISA as well.

(16) How fast your CPU works is really dependent upon where its memory is found, and linker scripts provide the means of adjusting where your memory is found within your design. We’ll discuss how to read, write, and adjust linker scripts so you have an idea of what’s going on within your design.

The student should be able to load a CPU (either theirs or mine) so that it runs from Flash, block RAM, block ROM, or … wherever.

(17) This really then feeds nicely into understanding how a bootloader works. Once you know how to place program instructions (wherever), it’s important to be able to copy them to where you need them.

(18) We can then move to a lesson on pipelined CPUs.

This one I haven’t quite figured out yet, but I’d like to offer something more complex than the basic state-machine based microcontroller.

While one option might be to use the ZipCPU here, I want an option that presents some amount of learning to the student–rather than just following a script.

Perhaps one option might be adding a special instruction (or two, or four) to the CPU. Another option might be to restructure the CPU for some purpose (such as MMU integration as an example). We’ll see.

(19) The next lesson would focus on how to build a cache controller. I’d provide the ZipCPU to anyone who wished to use it at this point, although I could understand why a student might rather wish to use their own. I’m not (yet) set on this course of action.

Of course, as with all of these designs, part of the lesson would discuss how to go about formally verifying the design.

(20) Finally, we’d discuss the FAT filesystem, so that the CPU could access files on an external SD Card.

Yes, the course will show how this can be done from simulation too.
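As a taste of what the FAT lesson involves, here is a minimal sketch of decoding the first sector of a FAT volume, which can be run against an SD card image on the host just as easily as in simulation. The field offsets follow the standard FAT boot sector layout. The function name `parse_boot_sector` and the dictionary it returns are hypothetical, and the sketch ignores partition tables and everything past the boot sector.

```python
import struct


def parse_boot_sector(sector):
    """Decode the basic geometry fields of a FAT boot sector (sketch only)."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("missing 0x55 0xAA boot sector signature")

    # BIOS parameter block, little-endian, starting at byte offset 11.
    (bytes_per_sector, sectors_per_cluster, reserved_sectors, num_fats,
     root_entries, total16, _media, fat_size16) = struct.unpack_from(
        "<HBHBHHBH", sector, 11)
    (total32,) = struct.unpack_from("<I", sector, 32)
    (fat_size32,) = struct.unpack_from("<I", sector, 36)  # FAT32 only

    if bytes_per_sector == 0:
        raise ValueError("bytes per sector is zero")

    # The 16-bit fields are zero when the 32-bit fields are in use.
    fat_size = fat_size16 if fat_size16 != 0 else fat_size32
    total_sectors = total16 if total16 != 0 else total32

    # Fixed-size root directory (FAT12/16); zero sectors on FAT32.
    root_dir_sectors = (root_entries * 32 + bytes_per_sector - 1) // bytes_per_sector
    first_data_sector = reserved_sectors + num_fats * fat_size + root_dir_sectors

    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "total_sectors": total_sectors,
        "fat_size": fat_size,
        "first_data_sector": first_data_sector,
    }
```

To try it, read the first 512 bytes of a volume image opened in binary mode and pass them in. The `first_data_sector` value is the starting point for turning a cluster number into a sector address, which is where the file-reading logic would begin.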

I think this progression builds nicely, one lesson upon the next, with the bus components being built growing ever more capable. Further, while I like this progression of lessons, I’ve noted that with all of my tutorials students have found it valuable to pick up in the middle as their interest and needs dictate.

Hardware required

Unlike the beginner’s tutorial, once we get past lesson one, special hardware will be required for the following lessons. Well, either that or the student might choose (instead) to build the design in the simulator alone and just go on.

Fig 6. Proposed hardware requirements

Judging from the hardware we’ve discussed above, to complete all of the lessons you’d need:

A Flash memory controller (most FPGAs have this)
A VGA port.

While I’d love to do HDMI, and while HDMI isn’t really all that much more difficult, the I/O’s required for HDMI are a touch more challenging to do in a generic hardware fashion.

Perhaps this dual Pmod VGA peripheral would serve our purpose well here.
While I’d like to avoid any external RAM controllers, the Video lesson really requires a significant amount of RAM (typically external) in order to handle streaming from a memory buffer.

This may require an AXI lesson mid-tutorial.

Perhaps the best way to handle this in an intermediate course would be to offer vendors an opportunity to post and share how to interact with their xDDR SDRAM controllers. I know I have several SDRAM controllers available to work from, and even an SRAM controller. I would also expect that the litedram authors might be willing to support this project as well. (Maybe I should ask them?)
A simple PWM based audio controller, such as might be used with this PMod audio device.
A SPI based A/D. For this, I was thinking of something similar to Digilent’s audio PMod, for which I already have both (verified) example code as well as a decent emulator.
Finally, in order to use the SD card, and to read and parse a FAT filesystem, you’d need a design with an SD card reader on it. My current SD card controller is SPI based, so I might start there. On the other hand, one of my current projects is to upgrade that controller to be fully SDIO compliant, so we might do even better.

For those that do not have an SD card on their board, there does exist a PMod SD which might work nicely for this purpose as well.

Sadly, not all entry level development boards have all of this hardware, so whether this list is too aggressive is an important question.

“From Blinky to AXI” … where’s the AXI?

So I started by saying that a tutorial “From Blinky to AXI” would be a valuable contribution. Sadly, the proposed tutorial above doesn’t (yet) discuss AXI. I haven’t quite decided on how to handle that.

I could leave the AXI work for a future “AXI only” tutorial, or I might work it into this tutorial, or I might just leave it within the blog. Another approach might be to provide the bus with a WB to AXI-lite bridge, and then to build all of the devices using AXI-lite. I’m not quite certain right now.

I am open to ideas, if you’d like to share them.

Conclusions

Whew! While I like this overview of what would be next, I’m not certain any (potential) students would be interested in something quite this intensive.

I’d love to hear your thoughts. I intend to create a Reddit post with this article, and hear any comments that might be shared. Feel free to join in the discussion.

My own first thought is to note that, while AutoFPGA handles multiple slave integration easily, it doesn’t (yet) handle multiple masters. The good news is that I’m going to have to fix that already for a contract I’m working on, so that’s likely to get fixed quickly.

Of course, all of this would only be if the Lord wills–so we’ll have to see if any of this ever gets off of the ground in the first place, but your thoughts would be welcome either way.