AutoFPGA's linker script support gets an update

It’s been a while since I’ve discussed AutoFPGA. If you remember from my introduction to it, it’s a System on a Chip composition tool designed to compose a design together from multiple components. While most of the work it does consists of copying tags from a configuration file to one of several output files, it will also assign addresses to peripherals on a bus and create an interconnect to connect those peripherals together.

I’ve now used AutoFPGA on several of my projects. I used it first on my VideoZip project. I maintain a ZipCPU simulation test capability in ZBasic, another project that uses AutoFPGA. My iCE40 designs, ICOZip for the icoboard and TinyZip for the TinyFPGA BX, both use AutoFPGA. Even OpenArty is slowly getting an AutoFPGA upgrade.

Why? Because (when done right) it makes it easy to compose systems from other components. Components may be added to or removed from a design simply by adding or removing them from the AutoFPGA command line and rebuilding.

Why not other tools? Because I really didn’t want to learn another language besides the Verilog, C++, make, and so forth that I already work with. But aren’t AutoFPGA scripts a new language in themselves? To some extent this is true, although the purpose of AutoFPGA remains being a tool that does its job and gets out of the way.

However, when I went to build a design for the TinyFPGA BX, I discovered a big hole in AutoFPGA’s capabilities. While it has always created linker scripts, the script it has created hasn’t had the flexibility required to handle such diverse designs as the very spartan TinyFPGA BX as well as the more full featured Nexys Video board from Digilent.

Understanding the problem

To understand the problem, we’ll need to take a look at how memory is used within an FPGA SoC design.

The CPU within an SoC needs access to memory for several purposes. It needs a place to store its instructions, another space for global data structures, another space for allocable data structures commonly called a heap, and finally a stack space to support local variables.

Fig 1. A common address space layout

One common arrangement of the address space to support these various purposes contains a keep out region near zero, followed by code instructions, data structures, the heap, and then a stack in that order, as shown in Fig 1. The stack typically starts at the end of memory and grows downwards with each subroutine call, whereas the heap typically starts at the end of global variable memory and grows upwards with each malloc() call.

However, FPGA systems tend not to have one monolithic type of memory. They typically have several memory types within any design. These basic memories include:

Flash memory

This is a slow non-volatile memory. It is great for initial instructions to get a program off the ground. Since it is slow to access, it may not be ideal to execute programs from, although small designs may need to do just that.

The two big details you need to know are that flash is slow, and it is very difficult to write to as part of a program. This makes it really good as a ROM memory, but not so great for other purposes.
Block RAM

This is the ideal type of RAM you’d want to use in any system. It is fast. It’s simple to use and create. The cost to access one part of block RAM is the same as the cost to access any other part of block RAM.

The big downside of block RAM? You only have a limited amount of it. For example, the iCE40HX8k FPGA typically has only about 8kB of usable block RAM. Yes, the data sheet will claim 16kB of block RAM. Realistically, some of that 16kB will be used elsewhere in the design, so the most you are likely to get is probably going to be about 8kB of block RAM.
Static RAM (SRAM)

This is similar to block RAM, but not quite as fast or as simple to work with. Like block RAM, it is volatile. However, it tends to be off chip, slower to access, fairly easy to build a controller for, and it is also cheap enough that you can have more of it than block RAM. The drawback is the technology within: SRAM tends to use more power and take more room than the dynamic types of RAM.

One cool feature of SRAM is that if you ignore it, and don’t cut the power, the memory doesn’t change. As a result, Digilent once had a Nexys board design that allowed you to load the SRAM with one FPGA configuration, and then swap FPGA configurations. Sadly, the board with this capability is no longer actively marketed and there may only be a small number of these boards left. As I understand the story, Digilent struggled to get the SRAM chips they needed to continue manufacturing the boards, and so they were forced to switch to SDRAM.
Synchronous, Dynamic Random Access Memory (SDRAM)

The big granddaddy of all RAM devices tends to be the SDRAM.

By this I’m going to include not only the simpler SDRAMs, but also the DDR, DDR2, and DDR3 SDRAMs. Since these RAM devices are built out of capacitors, the memory can be made compact, and so they are inexpensive to manufacture, and therefore some of the cheapest RAM devices to purchase. The biggest drawbacks to SDRAM are that the controllers tend to be complex, and the access latency tends to be high. How hard are the controllers? Well, let’s just say that I have yet to complete my first working DDR3 SDRAM controller. I know it’s possible, since LiteDRAM has built some awesome SDRAM controllers. Other than that, SDRAMs tend to be high volume low cost devices.
HyperRAM

A newcomer to the digital design space is the HyperRAM chip. These really belong in the SDRAM category above, since they tend to be built from SDRAMs internally. The big difference is that HyperRAMs have a simpler interface that is easy to build a controller for. Likewise, HyperRAMs tend to have lower latencies than many other DDR SDRAM solutions, since the complex SDRAM array control is handled within the HyperRAM chip itself.

Ok, so that’s what we have to play with. What might an FPGA address space look like with these various types of RAMs?

Block RAM Only

The simplest memory configuration we might build would be a block RAM only configuration. This configuration would be built as though there were no other memories in the system. It would typically consist of a keep-out address range near zero, addresses for the various peripherals, then the block RAM address itself.

Fig 2. Block RAM and peripherals only

Of course, the problem with this configuration is that block RAM is both limited and volatile: it won’t have the values we need within it when we power up our new design, or later when we reset our design. Still, this is a great memory model if you are first bringing up your CPU, and you haven’t yet debugged any other types of memory.

I’ll admit I’ve even thought about segmenting the block RAM into both a read only component, or block ROM if you will, and a volatile block RAM component.

Fig 3. Block RAM and Block ROM

Why would I do this? Because it seems like few processors measure their Dhrystone performance in the absence of their bus. Were I to build a system like this, I might be able to measure the speed of the ZipCPU’s instruction set independent of the bus implementation.

Of course, the problem with both of these designs is that block RAM is scarce. What else might we use?

Block RAM and Flash ROM together

Most FPGAs have a SPI flash of some type which can be used as a ROM. The flash itself exists for the purpose of storing the FPGA’s power up configuration, but typically there’s 75% of the flash left over once that is done. Hence, you get this ROM memory for “free” with the price of your board.

When block RAM isn’t enough, or alternatively when you want your program to run from non-volatile memory, this flash is available to you. Indeed, some FPGA boards don’t really have much more than block RAM and flash devices to act as memories as discussed above. Examples of these boards include the TinyFPGA BX, the iCEBreaker board, and the CMod S6. This leads to a memory space such as Fig. 4 below.

Fig 4. Flash (ROM) and Block RAM

The original linker script I used for my CMod S6 design placed all of the CPU instructions in flash following the FPGA’s configuration, and all the data memory into the block RAM. This configuration is shown in Fig. 5 below.

Fig 5. Flash based instruction layout

In this figure, the ‘D’ below the block RAM represents global data, ‘H’ represents the heap, and the ‘S’ represents the Stack memory. Likewise the CPU’s flash memory area would start following the FPGA’s configuration data, shown here as FPGA. This would then be followed by a bootloader ‘B’, traditional instructions Insns and any constant program data Const. The purpose of the bootloader was to move any pre-initialized global data, shown here as D, to the beginning of the block RAM.

When the design failed to meet my real-time speed requirements, driven by the need for an audio output, I then came back and placed certain instructions, those in critical sections of my code that needed to run at high speed, into the block RAM–copying them from their original location in flash. This new configuration is shown in Fig 6 below, with the K section denoting these high speed instructions that needed to be copied to block RAM by the bootloader B.

Fig 6. Placing critical instructions in Block RAM

While I managed to solve this challenge, it was a challenge that needed to be solved, and the solution I found won’t necessarily work for all designs. Imagine, for example, if I wanted to load the C-library into block RAM. It’s not going to fit no matter how you try to squeeze it. (It’s not a pair of Levi’s.) Therefore, given that flash is slow, you might wish to move up to a larger RAM type that is still faster than flash: SDRAM.

Flash and SDRAM

Some of my larger devices, such as my Arty A7 or my Nexys Video boards, have a DDR3 SDRAM as well. The XuLA2-LX25 SoC I have also works with an SDRAM, just not a DDR3 SDRAM. Either way, an SDRAM chip provides a lot of memory, allowing programs to copy themselves from the flash device to the SDRAM device. This could easily fit the model above, only we would now replace the block RAM with SDRAM. Not only that, for speed we could copy our instructions from the extremely slow flash onto the SDRAM.

Fig 7. Copying all data to the SDRAM

But what about that block RAM? How might we use it now?

The classic answer would be to use all of the block RAM on your device as caches for the CPU. This would mitigate the latency found within the SDRAM.

Flash, Block RAM, and SDRAM

Alternatively, we could place certain memories, at our discretion, within the block RAM. I’ve often done this with the stack memory, but you could also do this with any kernel memory that needed to be low-latency as well.

Fig 8. Placing the stack and critical instructions into Block RAM

Flash, Block RAM, and HyperRAM

Now, just when you think you have everything figured out, someone will give you an auxiliary memory chip, such as this HyperRAM from 1BitSquared, and you’ll wonder how to integrate it with the rest of your system. It may never be a permanent fixture to any given design, or it may be the SDRAM the iCEBreaker was lacking. Either way, you now need to quickly and easily reconfigure the design you once had working.

My whole point is that, in the realm of reconfigurable memory spaces, the place where you want to keep all the various parts of your software programs will likely keep changing.

AutoFPGA was just given an upgrade to handle exactly that issue.

The basic Linker Script File

The linker scripts that I build tend to have four parts to them. First, the script describes a pointer to the first instruction the CPU will execute. The second block declares the various memories on board. The third part declares some fixed pointers that can then be referenced from within my code. Finally, the fourth part describes how the various components of my design will be laid out in memory. Let’s take a look at what this might look like.

The following is an AutoFPGA generated script to handle a block RAM only configuration on the Arty platform.

Binutils supports script comments delimited by /* and */. The generated script therefore begins with a block of legalese comments, followed by the entry point for your program.

/*******************************************************************************
*
* Filename:	./bkram.ld
*
* Project:	OpenArty, an entirely open SoC based upon the Arty platform
*
*---- Skipped comments
/*******************************************************************************
*/
ENTRY(_start)

The important part of this section is the ENTRY() command, which specifies that the CPU entry point will be _start. This label will be set by the linker to point to the entry point in your code. For the ZipCPU, this is always the first instruction in the instruction address space.

As for the legalese, if you don’t like my legalese then feel free to replace it with your own. The legalese in the AutoFPGA output files is copied from a file I typically call legalgen.txt, and introduced through AutoFPGA via a @LEGAL= tag in the global.txt file. Further, as the owner of AutoFPGA, I assert no ownership rights over the designs you create with it, just over the AutoFPGA code itself–which is released under GPLv3.

The second section is the MEMORY section. This section lists the address location and length of every physical memory component within the system. The comment you see in this section below was added by AutoFPGA. It is one of many throughout the various AutoFPGA generated files to help guide you through the process of creating and updating AutoFPGA configuration files.

MEMORY
{
	/* To be listed here, a slave must be of type MEMORY.  If the slave
	* has a defined name in its @LD.NAME tag, it will be listed here
	* under that name.  The permissions are given by the @LD.PERM tag.
	* If no permission tag exists, a permission of 'r' will be assumed.
	*/
	   bkram(wx) : ORIGIN = 0x05000000, LENGTH = 0x00020000
	   flash(rx) : ORIGIN = 0x06000000, LENGTH = 0x01000000
	   sdram(wx) : ORIGIN = 0x08000000, LENGTH = 0x08000000
}

This MEMORY section contains a list of all peripherals that contained a @SLAVE.TYPE key with a MEMORY value. If you recall, AutoFPGA works off of configuration files containing @KEY=VALUE statements. The @SLAVE.TYPE key currently supports one of four types of peripherals: SINGLE, DOUBLE, OTHER, and MEMORY. What makes MEMORY peripherals different is that they are included in the linker script MEMORY section above. You can read more about this in my AutoFPGA icd.txt file.

The ORIGIN value is assigned by AutoFPGA when AutoFPGA assigns addresses. The LENGTH value, indicating the size of the peripheral, is given by the @NADDR tag times the byte-width of the bus the peripheral is on. Hence an @NADDR of 0x8000 will create a LENGTH of 0x20000 as shown above for a 32-bit wide bus.
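The arithmetic behind the LENGTH value can be sketched in a few lines of C. This is only an illustration of the rule as described here, not AutoFPGA’s own code, and the function name ld_length is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute a linker script LENGTH (in bytes) from
 * a peripheral's @NADDR value and the width of its bus in bits.
 * This mirrors the rule described in the text; it is not AutoFPGA code. */
uint32_t ld_length(uint32_t naddr, unsigned bus_bits)
{
	/* The bus width is assumed to be a whole number of bytes */
	assert(bus_bits != 0 && (bus_bits % 8) == 0);
	return naddr * (bus_bits / 8);
}
```

With a 32-bit bus, ld_length(0x8000, 32) yields 0x20000, matching the bkram entry in the MEMORY section above.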

The names given above come from either the @LD.NAME tag within the peripheral, or the peripheral’s name itself as found within its @PREFIX tag.

The point is that as your design is composed, and the memories given addresses, AutoFPGA supports this reconfiguration by creating and populating the MEMORY section of the linker script.

The next section contains a variety of symbol declarations and assignments. These symbol names, if defined and used within your C/C++ code, will be replaced with the values given below.

First, all of the MEMORY peripherals are given names and values pointing to the beginning of their memory regions.

/* For each defined memory peripheral, we also define a pointer to that
* memory.  The name of this pointer is given by the @LD.NAME tag within
* the memory peripheral's configuration
*/
_bkram    = ORIGIN(bkram);
_flash    = ORIGIN(flash);
_sdram    = ORIGIN(sdram);

Second, if there is an @LD.DEFNS tag within the AutoFPGA script, its value will be copied into this section as well.

/* LD.DEFNS */
_kram  = 0; /* No high-speed kernel RAM */
_ram   = ORIGIN(bkram);
_rom   = 0;
_top_of_stack = ORIGIN(bkram) + LENGTH(bkram);

Together, the sections above tell the linker that we have three types of memories: block RAM, flash, and SDRAM. They identify the origins of those memories and their lengths, and then create symbols so that your code can access these values.

Next, the _kram, _ram, _rom, and _top_of_stack symbols are used by the ZipCPU’s bootloader to load items from ROM into a high-speed kernel RAM (i.e. block RAM, if used) or otherwise into regular RAM (i.e. an SDRAM). Finally, the top of the stack is set to be the end of the block RAM section in this design.

These are just symbols assigned to values. We haven’t described any real linking yet. Those instructions are found in the next section.

This last section describes where the various segments of your program need to be placed into memory. In this example, I define a new memory section starting at the origin of the block RAM, aligned on units of 4 octets, and filled with a series of segments.

/* LD.SCRIPT */
SECTIONS
{
       .ramcode ORIGIN(bkram) : ALIGN(4) {
               _boot_address = .;
               _kram_start = .;
               _kram_end = .;
               _ram_image_start = . ;
               *(.start) *(.boot)
               *(.kernel)
               *(.text.startup)
               *(.text*)
               *(.rodata*) *(.strings)
               *(.data) *(COMMON)
               }> bkram
       _ram_image_end = . ;
       .bss : ALIGN_WITH_INPUT {
               *(.bss)
               _bss_image_end = . ;
               } > bkram
       _top_of_heap = .;
}

This section also contains a series of assignments. These define values that will be used by the bootloader, such as _ram_image_start and _bss_image_end, as well as an ending value, _top_of_heap, which then serves as the pointer to the beginning of the heap.

A simple pair of lines within your C++ code, such as,

extern int _top_of_heap[1];

int *heap = _top_of_heap;

will allow you to get the value of _top_of_heap, and to initialize the heap pointer with it.
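To make the use of such a heap pointer concrete, here is a minimal sketch of a bump allocator that hands out memory starting at the heap pointer and growing upwards. The name bump_alloc is hypothetical, and a static array stands in for the _top_of_heap symbol so that the sketch builds on a desktop machine. It never frees memory, and its limit is an assumed fixed size rather than the current stack pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* On the target this would be:  extern int _top_of_heap[1];
 * Here a static array stands in for it so the sketch builds on a host. */
static uint32_t fake_heap[64];

static unsigned char *heap_ptr = (unsigned char *)fake_heap;
static unsigned char *heap_limit = (unsigned char *)fake_heap + sizeof(fake_heap);

/* Hypothetical bump allocator: returns 4-byte aligned memory, or NULL
 * once the (assumed) limit is reached.  Nothing is ever freed. */
void *bump_alloc(size_t nbytes)
{
	size_t rounded;
	void *result;

	assert(heap_ptr <= heap_limit);

	rounded = (nbytes + 3u) & ~(size_t)3u;	/* round up to 4 bytes */
	if (rounded < nbytes)			/* size overflowed */
		return NULL;
	if (rounded > (size_t)(heap_limit - heap_ptr))
		return NULL;			/* out of heap */
	result = heap_ptr;
	heap_ptr += rounded;			/* the heap grows upwards */
	return result;
}
```

On real hardware the limit would more likely be derived from the stack pointer or from the end of the memory region, which is a design decision this sketch leaves open.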

But what about those sections? Here are some of their basic meanings:

*(.start) *(.boot): These two segments are ZipCPU specific segments. The *(.start) segment is used by the ZipCPU to make certain the startup code is the first set of instructions following the reset address–which is typically the beginning of the SECTIONS area although not in this case. The most important part of this startup code is that it sets the stack pointer that everything else will depend upon, and then jumps to the bootloader. When the bootloader returns, it then jumps to your main() function. When main() returns, it halts the CPU.

The *(.boot) code is another ZipCPU section where I place the bootloader instructions.

Both of these need to come early in the code order, primarily for the times when I need to copy instructions from flash to RAM–although they aren’t necessarily used in this example.
*(.kernel): I created this ZipCPU specific section to support my S6SoC project. Any code placed in this section will be copied to the fastest RAM in the project (block RAM), in case the CPU has code that must run at high speed.

Both the *(.kernel) section as well as the *(.start) and *(.boot) sections are unknown to the binutils linker or GCC. The code to be placed in these sections must specifically be marked as such.
*(.text*): These sections contain the instructions for the program in question. Now that we have all the nastiness above out of the way, we can actually place these sections, with the *(.text.startup) section among these placed into memory first.
*(.rodata*) *(.strings) *(.data) *(COMMON): These sections contain the read-only (i.e. const) data used by my program, any strings within the program, and finally any global data structures with initial values.

The bootloader needs to copy these sections into their places, but nothing else is required.
*(.bss): The final section is the BSS segment. Unlike the other segments above, where the bootloader just needs to copy them into place, the BSS segment needs to be cleared to all zeros. This is where any uninitialized global variables within your program will be placed.
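As a sketch of what marking code for the .kernel section can look like, GCC’s section attribute places a function into a named section. The function name and body below are hypothetical; only the .kernel section name comes from the linker script above. The preprocessor guard is an assumption added so the sketch still compiles on hosts that don’t use ELF object files.

```c
#include <assert.h>
#include <stdint.h>

/* Place the following function into the .kernel section when building
 * for an ELF target with GCC or a compatible compiler.  Elsewhere the
 * macro expands to nothing so that the sketch still compiles. */
#if defined(__GNUC__) && defined(__ELF__)
#define KERNEL_CODE __attribute__((section(".kernel")))
#else
#define KERNEL_CODE
#endif

/* Hypothetical time-critical routine: scale one audio sample by a
 * non-negative Q15 fixed-point gain */
KERNEL_CODE int16_t scale_sample(int16_t sample, int16_t gain_q15)
{
	assert(gain_q15 >= 0);
	return (int16_t)(((int32_t)sample * gain_q15) >> 15);
}
```

The same approach would apply to the .start and .boot sections, although startup code is often written in assembly, where a .section directive does the same job.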

There’s one other thing you need to know about this section, the }> bkram notation. This means that the section just described should be allocated a place in the bkram device. Something else you might see is }> bkram AT>flash. This means that the section needs to be placed into bkram, and that your code needs to be linked as though the section were placed into bkram. However, it is first placed into the flash memory area, and left there for your bootloader to copy it into bkram.
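The copy-then-clear work this leaves to the bootloader might be sketched as follows. This is not the ZipCPU’s actual bootloader: the function name boot_copy is hypothetical, and the addresses are passed as plain arguments so the sketch can run on a host. On the target they would come from linker symbols such as _ram_image_start, _ram_image_end, and _bss_image_end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bootloader core: copy an image from its load address
 * (e.g. flash) to the address it was linked to run from (e.g. bkram),
 * then zero the BSS region that follows it. */
void boot_copy(const uint32_t *load_addr,	/* where the image is stored */
		uint32_t *image_start,		/* where it must run from */
		uint32_t *image_end,		/* end of the copied image */
		uint32_t *bss_end)		/* end of the zeroed region */
{
	uint32_t *dst = image_start;

	assert(image_start <= image_end && image_end <= bss_end);

	/* Copy code and initialized data from their load address */
	while (dst < image_end)
		*dst++ = *load_addr++;

	/* Clear the BSS segment, assumed to follow the image directly */
	while (dst < bss_end)
		*dst++ = 0;
}
```

Word-at-a-time copies match the ALIGN(4) alignment used in the script above. A real bootloader would also have to be careful not to overwrite itself if it runs from the memory it is filling.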

Now that you know what the various sections of this file are, and how the segments within your program will be allocated among them, what happens if we want to do something else?

Multiple Linker Configurations

Originally, AutoFPGA created one linker script, called board.ld, and adjusted it based upon the peripherals available to it. For example, it could handle designs with Flash and SDRAM, but couldn’t really do much with Flash and block RAMs. This worked great for some designs, such as those with a massive amount of RAM as shown in Fig. 7 or 8 above, but horribly for others, such as Fig. 2 through 6 above.

As an example, if I wanted a design to run from block RAM alone, such as to test the CPU itself apart from its memory peripherals with the form in Fig. 2 above, this one size fits all linker script would have been inadequate. Likewise, if I had a design that didn’t have enough room in RAM to copy the various program segments into (imagine the C-library here), the stock linker script wouldn’t work either. While I could create a script by hand for each of these scenarios, such as I was starting to do in my TinyZip design, that script would then need to be updated by hand every time the addresses in the MEMORY region changed.

This was getting annoying.

To deal with this, I just recently created some new AutoFPGA tags for creating linker scripts:

@LD.FILE: If present in a given configuration file, AutoFPGA will create a linker script and write it out to the named file.
@LD.DEFNS: If present, these definitions will be added to the definitions section of the new linker script.

Well, sort of. What if a design has multiple linker script configuration files? In this case, the components that have no @LD.FILE tags will have their @LD.DEFNS tags copied to all linker scripts, while the components with an @LD.FILE tag will have their @LD.DEFNS tag copied into the linker script defined by that component only.
@LD.SCRIPT: This tag, containing the SECTIONS component above, will be copied verbatim into the linker script associated with the @LD.FILE tag in the same component, although with variable substitution applied. So, for example, if our design creates a RESET_ADDRESS tag within the peripheral named zip (i.e. having a PREFIX tag of zip), then we might reference @$(zip.RESET_ADDRESS) to get a copy of what that address was here in this location.
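The rule for which @LD.DEFNS values end up in which linker script can be modeled in a few lines of C. This is a sketch of the rule as described above, not AutoFPGA’s implementation; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one AutoFPGA component's linker tags */
struct component {
	const char *ld_file;	/* value of @LD.FILE, or NULL if absent */
	const char *ld_defns;	/* value of @LD.DEFNS, or NULL if absent */
};

/* Returns 1 if this component's @LD.DEFNS text belongs in the linker
 * script named script_name, and 0 otherwise. */
int defns_apply(const struct component *c, const char *script_name)
{
	assert(c != NULL && script_name != NULL);
	if (c->ld_defns == NULL)
		return 0;	/* nothing to copy */
	if (c->ld_file == NULL)
		return 1;	/* no @LD.FILE: copied to every script */
	/* Otherwise only the component's own script gets the text */
	return strcmp(c->ld_file, script_name) == 0;
}
```

A component without an @LD.FILE tag therefore contributes its definitions to both bkram.ld and board.ld, while a component naming bkram.ld contributes to that script alone.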

Several former linker tags have kept their functionality, but now have new names.

@LD.NAME: This is the name of the memory component, as found in the linker script. In the example above, we had names of bkram, flash, and sdram. This tag used to be called @LDSCRIPT.NAME.
@LD.PERM: The MEMORY section of a linker script requires a permission string. The binutils documentation calls this a set of attributes. So far, I’ve only used rx and wx for executable ROM and executable RAM respectively. Other possible attributes can be found in the binutils documentation. AutoFPGA does nothing more than copy them from your design file to the MEMORY section of the linker script.

Remember, AutoFPGA is primarily a copy-paste tool with the ability to compose bus interconnects, and a limited variable substitution and expression evaluation capability sprinkled within. Similarly, another of the goals of AutoFPGA was that when its work was done, the computer generated files would be comprehensible, rather than your more typical computerese.
@LD.ENTRY: If present, this will define the entry symbol for a given linker script. If not specified, this will default to the _start symbol as above.

This updated method of generating custom linker scripts has now worked so well for me that I have several linker scripts defined for the AutoFPGA upgrade to my OpenArty project: one for block RAM only, another for flash plus block RAM, and I’ll be adding a third for flash, block RAM, and SDRAM support. Even better, using this approach, adding support for a HyperRAM controller should be just as simple as copying the controller components to my RTL directory (or a subdirectory of it) and adding the HyperRAM AutoFPGA linker script configuration to my design.

Conclusion

Working with one CPU design across many different hardware components and capabilities can be a challenge. It can be difficult to take a basic design and rapidly configure it for a new set of hardware, or to maintain support across several different hardware implementations. AutoFPGA can handle many of these reconfiguration needs, to make reconfiguring designs from one hardware configuration to another easier.

Even better, AutoFPGA’s linker script generation just got an upgrade to help it deal with the need for multiple different memory configurations–either between designs or even within the same design.

Of course, the unwritten reality of this article is that I don’t really want to spend my time writing linker scripts. I would rather be spending my time getting my new HyperRAM to work. This is just my way of trying to simplify the massive configuration challenges I have along the way.