It’s been a while since I’ve discussed AutoFPGA. If you remember from my introduction to it, it’s a System on a Chip composition tool designed to compose a design together from multiple components. While most of the work it does consists of copying tags from a configuration file to one of several output files, it will also assign addresses to peripherals on a bus and create an interconnect to connect those peripherals together.
I’ve now used AutoFPGA on several of my projects. I used it first on my VideoZip project. I maintain a ZipCPU simulation test capability in ZBasic, another project that uses AutoFPGA. My iCE40 designs, ICOZip for the icoboard and TinyZip for the TinyFPGA BX, both use AutoFPGA. Even OpenArty is slowly getting an AutoFPGA upgrade.
Why? Because (when done right) it makes it easy to compose systems from other components. Components may be added to or removed from a design simply by adding or removing them from the AutoFPGA command line and rebuilding.
Why not other tools? Because I really didn’t want to learn another language besides the Verilog, C++, make, and so forth that I already work with. But aren’t AutoFPGA scripts a new language in themselves? To some extent this is true, although the goal of AutoFPGA remains to be a tool that does its job and gets out of the way.
However, when I went to build a design for the TinyFPGA BX, I discovered a big hole in AutoFPGA’s capabilities. While it has always created linker scripts, the script it created lacked the flexibility required to handle designs as diverse as the very spartan TinyFPGA BX and the more full-featured Nexys Video board from Digilent.
Understanding the problem
The CPU within an SoC needs access to memory for several purposes. It needs a place to store its instructions, another space for global data structures, another space for allocable data structures commonly called a heap, and finally a stack space to support local variables.
One common arrangement of the address space to support these various needs contains a keep-out region near zero, followed by instructions, global data, the heap, and then a stack, in that order, as shown in Fig 1. The stack typically starts at the end of memory and grows downwards with each subroutine call, whereas the heap typically starts at the end of global variable memory and grows upwards.
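To make the two directions of growth concrete, here is a toy model of that layout, a sketch assuming a single flat memory and made-up addresses: the heap pointer moves up from the end of global data, the stack pointer moves down from the end of memory, and any request that would make the two collide is refused. The AddressSpace type and its methods are hypothetical, for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy model of one flat memory: globals end at end_of_globals,
// the heap grows upward from there, and the stack grows downward from the
// end of memory. Addresses are plain integers; nothing is really allocated.
struct AddressSpace {
    uint32_t end_of_globals;  // first byte after global data
    uint32_t end_of_memory;   // one past the last byte of memory
    uint32_t heap_top;        // next free heap byte (moves up)
    uint32_t stack_ptr;       // current stack pointer (moves down)

    AddressSpace(uint32_t globals_end, uint32_t mem_end)
        : end_of_globals(globals_end), end_of_memory(mem_end),
          heap_top(globals_end), stack_ptr(mem_end) {
        assert(globals_end <= mem_end);
    }

    // Bytes still free between the top of the heap and the stack.
    uint32_t free_bytes() const { return stack_ptr - heap_top; }

    // Grow the heap upward; refuse if it would run into the stack.
    bool heap_alloc(uint32_t nbytes) {
        if (nbytes > free_bytes())
            return false;
        heap_top += nbytes;
        return true;
    }

    // Push a stack frame downward; refuse if it would run into the heap.
    bool push_frame(uint32_t nbytes) {
        if (nbytes > free_bytes())
            return false;
        stack_ptr -= nbytes;
        return true;
    }
};
```

Nothing in hardware enforces this check, of course; on a real CPU a stack that grows too far simply overwrites the heap, which is why the layout keeps the two at opposite ends of the free space.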
However, FPGA systems tend not to have one monolithic type of memory. They typically have several memory types within any design. These basic memories include:
Flash: This is a slow, non-volatile memory. It is great for holding the initial instructions that get a program off the ground. Since it is slow to access, it may not be ideal to execute programs from, although small designs may need to do just that.
Block RAM: This is the ideal type of RAM you’d want to use in any system. It is fast. It’s simple to use and create. The cost to access one part of block RAM is the same as the cost to access any other part of block RAM.
The big downside of block RAM? You only have a limited amount of it. For example, the iCE40HX8k FPGA typically has only about 8kB of usable block RAM. Yes, the data sheet will claim 16kB of block RAM. Realistically, some of that 16kB will be used elsewhere in the design, so the most you are likely to get is about 8kB.
SRAM: This is similar to block RAM, but not quite as fast or as simple to work with. Like block RAM, it is volatile. However, it tends to be off chip and slower to access. It is fairly easy to build a controller for, and it is also cheap enough that you can have more memory than block RAM. The drawback is the technology within: SRAM tends to use more power and take more room than the dynamic types of RAM.
One cool feature of SRAM is that if you ignore it, and don’t cut the power, the memory doesn’t change. As a result, Digilent once had a Nexys board design that allowed you to load the SRAM with one FPGA configuration, and then swap FPGA configurations. Sadly, the board with this capability is no longer actively marketed and there may only be a small number of these boards left. As I understand the story, Digilent struggled to get the SRAM chips they needed to continue manufacturing the boards, and so they were forced to switch to SDRAM.
SDRAM: The big granddaddy of all RAM devices tends to be the SDRAM.
By this I mean not only the simpler SDRAMs, but also the DDR, DDR2, and DDR3 SDRAMs. Since these RAM devices store each bit on a capacitor, the memory can be made compact, making them inexpensive to manufacture and therefore some of the cheapest RAM devices to purchase. The biggest drawbacks to SDRAM are that the controllers tend to be complex, and the access latency tends to be high. How hard are the controllers? Well, let’s just say that I have yet to complete my first working DDR3 SDRAM controller. I know it’s possible, since LiteDRAM has built some awesome SDRAM controllers. Other than that, SDRAMs tend to be high-volume, low-cost devices.
HyperRAM: A newcomer to the digital design space is the HyperRAM chip. HyperRAMs really belong in the SDRAM category above, since they tend to be built from SDRAMs internally. The big difference is that HyperRAMs have a simpler interface that is easy to build a controller for. Likewise, HyperRAMs tend to have lower latencies than many other DDR SDRAM solutions, since the complex SDRAM array control is handled within the HyperRAM chip itself.
Block RAM Only
The simplest memory configuration we might build would be a block RAM only configuration. This configuration would be built as though there were no other memories in the system. It would typically consist of a keep-out address range near zero, addresses for the various peripherals, and then the block RAM’s own address range.
Of course, the problem with this configuration is that block RAM is both limited and volatile: it won’t have the values we need within it when we power up our new design, or later when we reset our design. Still, this is a great memory model if you are first bringing up your CPU, and you haven’t yet debugged any other types of memory.
Why would I do this? Because it seems like few processors measure their Dhrystone performance in the absence of their bus. Were I to build a system like this, I might be able to measure the speed of the ZipCPU’s instruction set independent of the bus implementation.
Of course, the problem with both of these designs is that block RAM is scarce. What else might we use?
Block RAM and Flash ROM together
Most FPGA boards have a SPI flash of some type which can be used as a ROM. The flash itself exists to store the FPGA’s power-up configuration, but typically about 75% of the flash is left over once that is done. Hence, you get this ROM memory for “free” with the price of your board.
When block RAM isn’t enough, or alternatively when you want your program to run from non-volatile memory, this flash is available to you. Indeed, some FPGA boards don’t really have much more than block RAM and flash devices to act as memories as discussed above. Examples of these boards include the TinyFPGA BX, the iCEBreaker board, and the CMod S6. This leads to a memory space such as Fig. 4 below.
The original linker script I used for my CMod S6 design placed all of the CPU instructions in flash following the FPGA’s configuration, and all the data memory into the block RAM. This configuration is shown in Fig. 5 below.
In this figure, the ‘D’ below the block RAM represents global data, ‘H’ represents the heap, and the ‘S’ represents the stack. The flash memory area would start following the FPGA’s configuration data. This would then be followed by a bootloader ‘B’, traditional instructions ‘Insns’, and any constant program data ‘Const’. The purpose of the bootloader was to copy any pre-initialized global data from flash into the beginning of the block RAM.
When the design failed to meet my real-time speed requirements, driven by the need for an audio output, I came back and placed certain instructions, those in critical sections of my code that needed to run at high speed, into the block RAM, copying them from their original location in flash. This new configuration is shown in Fig 6 below, with the ‘K’ section denoting these high-speed instructions that needed to be copied to block RAM by the bootloader.
While I managed to solve this challenge, the solution I found won’t necessarily work for all designs. Imagine, for example, if I wanted to load the C-library into block RAM. It’s not going to fit, no matter how you try to squeeze it. (It’s not a pair of Levi’s.) Therefore, given that flash is slow, you might wish to move up to a faster RAM type: SDRAM.
Flash and SDRAM
Some of my larger boards, such as my Arty A7 and my Nexys Video, have a DDR3 SDRAM as well. The XuLA2-LX25 SoC I have also works with an SDRAM, just not a DDR3 SDRAM. Either way, an SDRAM chip provides a lot of memory, enough for programs to copy themselves from the flash device to the SDRAM. This could easily fit the model above, only we would now replace the block RAM with SDRAM. Not only that, but for speed we could also copy our instructions from the extremely slow flash onto the SDRAM.
But what about that block RAM? How might we use it now?
Flash, Block RAM, and SDRAM
Alternatively, we could place certain memory sections, at our discretion, within the block RAM. I’ve often done this with the stack, but you could also do this with any kernel memory that needs to be low-latency.
Flash, Block RAM, and HyperRAM
Now, just when you think you have everything figured out, someone will give you an auxiliary memory chip, such as this HyperRAM from 1BitSquared, and you’ll wonder how to integrate it with the rest of your system. It may never be a permanent fixture of any given design, or it may be the SDRAM the iCEBreaker was lacking. Either way, you now need to quickly and easily reconfigure the design you once had working.
My whole point is that, in the realm of reconfigurable memory spaces, the place where you want to keep all the various parts of your software programs will likely keep changing.
AutoFPGA was just given an upgrade to handle exactly that issue.
The basic Linker Script File
The linker scripts that I build tend to have four parts to them. First, the script describes a pointer to the first instruction the CPU will execute. The second block declares the various memories on board. The third part declares some fixed pointers that can then be referenced from within my code. Finally, the fourth part describes how the various components of my design will be laid out in memory. Let’s take a look at what this might look like.
The important part of this section is the ENTRY() command, which specifies that the entry point will be _start. The linker resolves this label to the address of the _start symbol within your code. For the ZipCPU, this is always the first instruction in the instruction address space.
As for the legalese, if you don’t like my legalese then feel free to replace it with your own. The legalese in the generated files is copied from a separate text file, introduced through the @LEGAL= tag in an AutoFPGA configuration file. Further, as the owner of AutoFPGA, I assert no ownership rights over the designs you create with it, just over the code itself, which is released under its own open-source license.
The second section is the MEMORY section. This section lists the address location and length of every physical memory component within the system.
The comment in this section was added by AutoFPGA. It is one of many throughout the various generated files to help guide you through the process of creating and updating your design.
The MEMORY section contains a list of all peripherals that contain an @SLAVE.TYPE key with a MEMORY value. If you recall, AutoFPGA works off of configuration files containing @KEY=VALUE statements. The @SLAVE.TYPE key currently supports four types of peripherals, MEMORY among them. What makes MEMORY peripherals different is that they are included in the MEMORY section above. You can read more about the other peripheral types in my earlier writing on AutoFPGA.
The ORIGIN value is assigned by AutoFPGA. The LENGTH value, indicating the size of the peripheral, is given by the @NADDR tag times the byte-width of the bus the peripheral is on. Hence an @NADDR of 0x8000 will create a LENGTH of 0x20000, as shown above, for a 32-bit wide bus.
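The LENGTH arithmetic is simple enough to check in a few lines. The helper names below are hypothetical and this is not AutoFPGA’s source; it only restates the rule above, assuming @NADDR counts bus words.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the LENGTH of a MEMORY region is the peripheral's
// @NADDR value (a count of bus words) times the bus width in bytes.
constexpr uint64_t memory_length(uint64_t naddr, unsigned bus_width_bits) {
    return naddr * (bus_width_bits / 8u);
}

// One past the last byte of the region, i.e. ORIGIN + LENGTH.
constexpr uint64_t memory_end(uint64_t origin, uint64_t naddr,
                              unsigned bus_width_bits) {
    return origin + memory_length(naddr, bus_width_bits);
}

// The example from the text: @NADDR=0x8000 on a 32-bit bus.
static_assert(memory_length(0x8000, 32) == 0x20000,
              "a 32-bit bus carries 4 bytes per word");
```

The same 0x8000-word peripheral placed on a 64-bit bus would need twice the address range, which is one reason the LENGTH is computed per design rather than written by hand.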
The names given above come from either the @LD.NAME tag within the peripheral or, if that is absent, the peripheral’s own name as given in its configuration.
The point is that as your design is composed and the memories are given addresses, AutoFPGA supports this reconfiguration by creating and populating the MEMORY section of the linker script.
The next section contains a variety of symbol declarations and assignments. If your C/C++ code declares these symbols (typically as extern) and uses them, the linker will resolve them to the values given below.
First, all of the MEMORY peripherals are given names and values pointing to the beginning of their memory regions.
Second, if there is an @LD.DEFNS tag within any component, its value will be copied into this section as well.
Together, the sections above tell the linker that we have three types of memory: block RAM, flash, and SDRAM. They identify the origins of those memories and their lengths, and then create symbols so that your code can access these values.
The remaining symbols, including _top_of_stack, are used by the startup code and bootloader, for example to load items from flash into a high-speed kernel memory (the block RAM, if used), or otherwise into regular RAM (i.e. an SDRAM). Finally, the top of the stack is set to be the end of the block RAM section in this design.
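As a sketch of what this symbol section boils down to, the following hypothetical C++ helper emits one origin symbol per memory, plus a _top_of_stack at the end of the chosen stack region, using GNU ld’s ORIGIN() and LENGTH() built-ins. The _&lt;name&gt; naming convention, the Region struct, and the example addresses are assumptions, not necessarily what AutoFPGA writes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// A memory region as it would appear in the MEMORY section.
struct Region {
    std::string name;   // e.g. "bkram", "flash", "sdram"
    uint64_t origin;    // ORIGIN(name)
    uint64_t length;    // LENGTH(name)
};

// Emit one symbol per region pointing at its origin, plus a top-of-stack
// symbol at the end of the chosen stack region. The "_<name>" naming is a
// hypothetical convention used only for this sketch.
std::string symbol_block(const std::vector<Region> &regions,
                         const std::string &stack_region) {
    std::string out;
    for (const Region &r : regions)
        out += "_" + r.name + " = ORIGIN(" + r.name + ");\n";
    out += "_top_of_stack = ORIGIN(" + stack_region + ") + LENGTH("
           + stack_region + ");\n";
    return out;
}

// The numeric value the linker would compute for _top_of_stack.
uint64_t top_of_stack(const std::vector<Region> &regions,
                      const std::string &stack_region) {
    for (const Region &r : regions)
        if (r.name == stack_region)
            return r.origin + r.length;
    assert(false && "stack region not found");
    return 0;
}
```

Writing the symbols in terms of ORIGIN() and LENGTH(), rather than as literal numbers, means they follow the MEMORY section automatically whenever the addresses change.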
These are just symbols assigned to values. We haven’t described any real linking yet. Those instructions are found in the next section.
This last section describes where the various segments of your program need to be placed into memory. In this example, I define a new memory section starting at the origin of the block RAM, aligned on units of 4 octets, and filled with a series of segments.
There are also a series of assignments in this section. These define both values that will be used by the bootloader, such as _bss_image_end, and an ending value, _top_of_heap, which will then be the pointer to the beginning of the heap.
A simple pair of lines within your C++ code will allow you to get the value of this _top_of_heap symbol, and to initialize a heap pointer with it.
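What might that look like? Here is a hedged sketch, assuming the symbol is declared as a character array and the heap is a simple bump allocator that never frees. Because _top_of_heap normally comes from the linker script, the sketch defines a stand-in array under that name so it can run on a host; bump_alloc is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// On the target, _top_of_heap is defined by the linker script, and the
// program would only declare it:
//     extern "C" char _top_of_heap[];
// So this sketch can also run on a host, it instead DEFINES a stand-in
// array under that name. Remove this definition when linking for real.
extern "C" {
alignas(16) char _top_of_heap[1024];
}

// The heap pointer, initialized from the (stand-in) linker symbol.
static char *heap = _top_of_heap;

// Hypothetical bump allocator: hands out memory upward from _top_of_heap,
// rounded up to pointer alignment, and never frees anything. The capacity
// check only makes sense for the host stand-in, whose size is known.
void *bump_alloc(std::size_t nbytes) {
    const std::size_t align = sizeof(void *);
    std::size_t used = static_cast<std::size_t>(heap - _top_of_heap);
    std::size_t avail = sizeof(_top_of_heap) - used;
    if (nbytes > avail)
        return nullptr;
    nbytes = (nbytes + align - 1) & ~(align - 1);
    if (nbytes > avail)
        return nullptr;
    void *p = heap;
    heap += nbytes;
    return p;
}
```

On the target, the upper bound would instead come from the stack pointer or another linker symbol, since the linker-provided _top_of_heap carries no size of its own.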
But what about those sections? Here are some of their basic meanings:
*(.start) *(.boot): These two segments are ZipCPU-specific. The *(.start) segment is used by the ZipCPU to make certain the startup code is the first set of instructions following the reset address, which is typically the beginning of the SECTIONS area (although not in this case). The most important part of this startup code is that it sets the stack pointer that everything else will depend upon, and then jumps to the bootloader. When the bootloader returns, the startup code jumps to your main() function. If main() returns, it halts the CPU.
*(.kernel): I created this ZipCPU specific section to support my S6SoC project. Any code placed in this section will be copied to the fastest RAM in the project (block RAM), in case the CPU has code that must run at high speed.
*(.text*): These sections contain the instructions for the program in question. Now that we have all the nastiness above out of the way, we can actually place these sections, with the *(.text.startup) section placed into memory first.
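*(.rodata*) *(.strings) *(.data) *(COMMON): These sections contain the read-only (i.e. const) data used by my program, any strings within the program, and finally the global data structures: those with initial values (.data), along with any uninitialized “common” globals that the compiler has left for the linker to allocate (COMMON).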
The bootloader needs to copy these sections into their places, but nothing else is required.
*(.bss): The final section is the BSS segment. Unlike the other segments above, where the bootloader just needs to copy them into place, the BSS segment needs to be cleared to all zeros. This is where any uninitialized global variables within your program will be placed.
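The bootloader’s job for these sections can be sketched on a host as follows, assuming the copied sections are contiguous and that the script exports their load address, run address, and end. Apart from _bss_image_end, the field names are hypothetical stand-ins for whatever symbols a real script provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical description of what a bootloader needs from the linker
// script. Only _bss_image_end is named in the text; the other fields stand
// in for whatever symbols a real script would export.
struct BootImage {
    const uint8_t *load_start;  // where the image is stored in flash
    uint8_t *run_start;         // where it was linked to run (block RAM)
    uint8_t *data_end;          // end of the copied sections, start of .bss
    uint8_t *bss_image_end;     // end of .bss (cf. _bss_image_end)
};

// Copy the pre-initialized sections from flash into RAM, then clear .bss.
void bootload(const BootImage &img) {
    assert(img.run_start <= img.data_end && img.data_end <= img.bss_image_end);
    std::size_t copy_len = static_cast<std::size_t>(img.data_end - img.run_start);
    std::memcpy(img.run_start, img.load_start, copy_len);
    std::memset(img.data_end, 0,
                static_cast<std::size_t>(img.bss_image_end - img.data_end));
}
```

Note that the .bss bytes are never read from flash at all: only their extent matters, which is why they cost nothing in the stored image.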
There’s one other thing you need to know about this section: the }> bkram notation. This means that the section just described should be allocated a place in the bkram device. Something else you might see is }> bkram AT>flash. This means that the section needs to be placed into bkram, and that your code needs to be linked as though the section were in bkram. However, it is first placed into the flash area, and left there for your bootloader to copy it into bkram.
Now that you know what the various sections of this file are, and how the segments within your program will be allocated among them, what happens if we want to do something else?
Multiple Linker Configurations
Originally, AutoFPGA created only one linker script, called board.ld, and adjusted it based upon the peripherals available to it. For example, it could handle designs with some combinations of memory, but couldn’t really do much with others. This worked great for some designs, such as those with a massive amount of SDRAM as shown in Fig. 7 or 8 above, but horribly for others, such as Fig. 2 through 6 above.
As an example, if I wanted a design to run from block RAM alone, such as to test the CPU itself apart from its memory peripherals with the form in Fig. 2 above, this one-size-fits-all linker script would have been inadequate. Likewise, if I had a design that didn’t have enough room in RAM to copy the various program segments into (imagine the C-library here), the stock linker script wouldn’t work either. While I could create a script by hand for each of these scenarios, as I was starting to do in my TinyZip design, that script would then need to be updated by hand every time the addresses in the MEMORY region changed.
This was getting annoying.
@LD.DEFNS: If present, these definitions will be added to the definitions section of the new linker script.
Well, sort of. What if a design has multiple linker script configuration files? In this case, the components that have no @LD.FILE tags will have their @LD.DEFNS tags copied to all linker scripts, while the components with an @LD.FILE tag will have their @LD.DEFNS tag copied into the linker script defined by that component only.
@LD.SCRIPT: This tag, containing the SECTIONS component above, will be copied verbatim into the linker script associated with the @LD.FILE tag in the same component, although with variable substitution applied. So, for example, if our design creates a RESET_ADDRESS tag within the peripheral named zip, then we might reference @$(zip.RESET_ADDRESS) to get a copy of that address at this location.
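The two rules above, where @LD.DEFNS ends up and how @$(component.KEY) references get replaced, can be sketched as follows. This is an illustration under stated assumptions, not AutoFPGA’s implementation: tags are stored without their leading @, the Component struct and function names are hypothetical, and unknown references are simply left in place.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// A component as a hypothetical generator might see it after parsing.
// Tag names are stored without the leading '@' (e.g. "LD.FILE").
struct Component {
    std::string name;                         // e.g. "zip"
    std::map<std::string, std::string> tags;  // e.g. "LD.FILE" -> "bkram.ld"
};

// Replace each @$(comp.KEY) with the value of KEY in component comp.
// References that cannot be resolved are copied through unchanged.
std::string substitute(const std::string &text,
                       const std::vector<Component> &comps) {
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t start = text.find("@$(", pos);
        if (start == std::string::npos)
            break;
        std::size_t close = text.find(')', start);
        if (close == std::string::npos)
            break;
        // ref looks like "zip.RESET_ADDRESS"
        std::string ref = text.substr(start + 3, close - start - 3);
        std::size_t dot = ref.find('.');
        const std::string *value = nullptr;
        if (dot != std::string::npos) {
            for (const Component &c : comps) {
                if (c.name != ref.substr(0, dot))
                    continue;
                auto it = c.tags.find(ref.substr(dot + 1));
                if (it != c.tags.end())
                    value = &it->second;
            }
        }
        out += text.substr(pos, start - pos);
        out += value ? *value : text.substr(start, close - start + 1);
        pos = close + 1;
    }
    out += text.substr(pos);
    return out;
}

// @LD.DEFNS text for one linker script: components without @LD.FILE go to
// every script, components with @LD.FILE go only to their own script.
std::string defns_for(const std::string &ld_file,
                      const std::vector<Component> &comps) {
    std::string out;
    for (const Component &c : comps) {
        auto defns = c.tags.find("LD.DEFNS");
        if (defns == c.tags.end())
            continue;
        auto file = c.tags.find("LD.FILE");
        if (file == c.tags.end() || file->second == ld_file)
            out += defns->second + "\n";
    }
    return out;
}

// @LD.SCRIPT text for one linker script, with variable substitution applied.
std::string script_for(const std::string &ld_file,
                       const std::vector<Component> &comps) {
    std::string out;
    for (const Component &c : comps) {
        auto file = c.tags.find("LD.FILE");
        auto script = c.tags.find("LD.SCRIPT");
        if (file != c.tags.end() && file->second == ld_file
                && script != c.tags.end())
            out += substitute(script->second, comps);
    }
    return out;
}
```

Leaving unresolved references in place, rather than silently deleting them, makes a typo show up as a linker syntax error instead of as a quietly wrong address.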
Several former linker tags have kept their functionality, but now have new names.
@LD.NAME: This is the name of the memory component, as found in the linker script. In the example above, we had names of bkram, flash, and sdram. This tag was renamed as part of this upgrade.
Each entry in the MEMORY section of a linker script can also take a permission string, which the binutils documentation calls a set of attributes. So far, I’ve only used rx and wx, for executable ROM and executable RAM respectively. Other possible attributes can be found in the binutils documentation. AutoFPGA does nothing more than copy them from your design file to the MEMORY section of the linker script.
Remember, AutoFPGA is primarily a copy-paste tool with the ability to compose bus interconnects, and with a limited variable substitution and expression evaluation capability sprinkled within. Similarly, another of AutoFPGA’s goals was that, when its work was done, the computer-generated files would be comprehensible, rather than your more typical computerese.
@LD.ENTRY: If present, this will define the entry symbol for a given linker script. If not specified, this will default to the _start symbol as above.
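Putting these tags together, a hypothetical generator might emit the ENTRY() line and one MEMORY entry per memory like this, assuming GNU ld syntax and permission strings such as wx. The MemoryTags struct and the function names are illustrative only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical per-memory data gathered from a component's tags.
struct MemoryTags {
    std::string ld_name;  // from @LD.NAME (or the component's own name)
    std::string perm;     // GNU ld attribute string, e.g. "wx"
    uint64_t origin;      // assigned by AutoFPGA
    uint64_t length;      // @NADDR times the bus width in bytes
};

// ENTRY() line: use @LD.ENTRY if given, otherwise fall back to _start.
std::string entry_line(const std::string &ld_entry) {
    return "ENTRY(" + (ld_entry.empty() ? std::string("_start") : ld_entry) + ")";
}

// One MEMORY entry in GNU ld syntax. The permission string is copied
// through verbatim; nothing here checks that it is a valid attribute set.
std::string memory_line(const MemoryTags &m) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), " : ORIGIN = 0x%08llx, LENGTH = 0x%08llx",
                  static_cast<unsigned long long>(m.origin),
                  static_cast<unsigned long long>(m.length));
    return m.ld_name + "(" + m.perm + ")" + buf;
}
```

Copying the permission string through untouched mirrors the copy-paste philosophy described above: the linker, not the generator, is the one that validates it.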
This updated method of generating custom linker scripts has now worked so well for me that I have several linker scripts defined for the AutoFPGA upgrade to my OpenArty project: one for block RAM only, another for flash plus block RAM, and I’ll be adding a third for flash, block RAM, and SDRAM support. Even better, using this approach, adding support for a HyperRAM controller should be just as simple as copying the controller components to my RTL directory (or a subdirectory of it) and adding the HyperRAM AutoFPGA linker script configuration to my design.
Working with one CPU design across many different hardware components and capabilities can be a challenge. It can be difficult to take a basic design and rapidly configure it for a new set of hardware, or to maintain support across several different hardware implementations. AutoFPGA can handle many of these reconfiguration needs, making it easier to move a design from one hardware configuration to another.
Of course, the unwritten reality of this article is that I don’t really want to spend my time writing linker scripts. I would rather be spending my time getting my new HyperRAM to work. This is just my way of trying to simplify the massive configuration challenges I have along the way.
Let him that stole steal no more: but rather let him labour, working with his hands the thing which is good, that he may have to give to him that needeth. (Eph 4:28)