There have now been several projects I’ve worked on where I’ve butted up against the maximum resource utilization of an FPGA chip. If you ask a manager how to deal with this problem, they will tell you that you need a bigger FPGA. That answer is problematic, though, for two primary reasons:

  • If you already have your board designed, built, and tested–perhaps even in production–and you wish to add features to it, you will be stuck with the FPGA resources you already have. Building, or even just assembling, a new board with a bigger FPGA can be an expensive proposition.

  • I’m a hobbyist. I typically can’t afford a bigger FPGA.

At times like these, a thorough code scrubbing can often help you squeeze that last Lookup-Table (LUT) out of your FPGA. To do that, I’ve come up with several strategies for reducing my LUT count on an FPGA.

What is a LUT?

A LUT is a Look-Up Table. Modern FPGAs are built out of large arrays of these lookup tables. Using a lookup table, you can build any logic you want–so long as you don’t exceed the number of inputs the lookup table supports.

As an example, the 7-series Xilinx FPGAs are composed of “configurable logic blocks”, each of which contains two “slices”, and each of those “slices” contains four 6-input LUTs. Each of these LUTs can handle either one six-input lookup, or two five-input lookups–as long as the two share the same inputs.
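
To make that concrete, consider a couple of hypothetical functions (a sketch of my own, assuming a through f are wires already declared elsewhere):

wire	w6, w5a, w5b;

// Six inputs: exactly one 6-input LUT
assign	w6 = (a & b & c) | (d & e & f);

// Two five-input functions over the same five inputs:
// these two can pack together into a single LUT
assign	w5a = (a & b) | (c & d & e);
assign	w5b = (a ^ b ^ c) & (d | e);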

The Altera Cyclone IV, on the other hand, has only 4-input LUTs.

The point being that every FPGA implements your logic via a combination of LUTs.

Chips differ by the capability of their LUTs, as well as by the number of LUTs on board. In general, the more LUTs you have, the more logic your chip can do, but also the more your FPGA chip is going to cost. It’s all a tradeoff.

To get the most logic for a given price, the FPGA design engineer needs to be able to code efficiently, and pack their code into the fewest LUTs possible.

The ZipCPU and the S6SoC

When the ZipCPU was first built, it only supported 32-bit bytes. This meant that the smallest unit of individually addressable memory was the full bus width: 32 bits. This has all kinds of consequences when you wish to operate on 8-bit values, or when computer code depends upon 4*sizeof(char)==sizeof(int).

In my case, the light dawned when I realized that I would need to rewrite the entire C-library to support 32-bit bytes, that the Dhrystone benchmark depended upon 8-bit bytes (Ref), and that Coremark depended upon 8-bit bytes as well. Figuring that this was only the tip of the iceberg regarding the problems I was going to have, I decided to byte the bullet and add byte-wise addressing instructions to the ZipCPU.

The problem with this was that the ZipCPU already worked, and worked well, on some very small FPGAs–and the Spartan 6/LX4 on the CMod S6 was one of them. With only 2400 six-input LUTs to work with in total, and only about 50 to spare, could the additional logic necessary to support the new instructions fit onto the S6?

I actually set my goals higher. Not only did I want the new instructions to fit, but I wanted to make some other upgrades as well. Most of these other upgrades centered on the slow flash speed. Because the S6/LX4 has hardly any RAM, almost all of the instructions had to fit within the flash. Flash access, however, is slow. Hence, when I started my upgrades, the ZipCPU took 52 clocks per instruction when running from the flash. Ouch! Could something be done about this as well?

So these were my goals:

  • The obvious: I wanted to be able to support the new byte and halfword access instructions: LH (load halfword), SH (store halfword), LB (load byte), and SB (store byte).

  • I also wanted to upgrade the S6SoC to be able to use the compressed instruction set extension of the ZipCPU. Using that extension, the assembler can (sometimes) pack two instructions into one instruction word. If the S6SoC could be made to handle these instructions, then I wouldn’t need to compile two versions of the C-library: one with and one without them.

  • I wanted to upgrade the fetch routine, so that it could handle fetching two words at a time.

    Because of the way the QSPI flash is set up, the first word you fetch from flash will always cost 22 flash clocks: six for the address, eight to allow the flash to look up your address, and another eight to read the data back out. Since I was generating the flash clock signal from logic, the flash actually ran at half the system clock rate, costing me 44 system clocks every time I wished to fetch an instruction. At the time, the S6SoC had absolutely no pipelining, so it fetched a new instruction after executing every instruction.

    If I could fetch two instructions at a time, the pair would cost only 60 clocks–the second word adds only another eight flash clocks, for 2×(22+8)=60 system clocks in all–and yet provide me with two instruction words. This should raise the instruction speed from one instruction every 52 clocks to one instruction every 38 clocks or so.

  • I wanted to update the flash controller so that it would work off of an 80MHz flash clock instead of a 40MHz one. Basically, I wanted to run the flash at the speed of the rest of the FPGA. Since the flash was rated for 100MHz, this shouldn’t be a problem, right?

    This should reduce the number of clocks necessary to load two instructions from 60 clocks to 30, speeding the instruction issue rate up to about one instruction every 23 clocks.

    In the end, supporting the compressed instructions got me down to one instruction every 18 clocks, but I’m getting ahead of myself.

As the update went along, things went so well I started to get greedy. So, I added more goals to my upgrade:

  • Now, I wanted to add the divide unit into the design.

  • I wanted a full multiply.

  • I also wanted to partially pipeline the CPU itself. I still wanted to avoid the expensive collision detection logic, but partially pipelining the CPU would speed the CPU up by another 60% or so (from 5 clocks per ALU instruction down to 3).

In the end, I got everything I wanted save the LOCK instruction. (The LOCK instruction holds the bus cycle line high while reading a value, running one ALU operation on it, and then writing the value back. It’s also useful for atomically reading 64-bit values.)

Strategies to Reduce Logic Usage

Here’s how I managed to fit so many upgrades into 2400 LUTs, given that I was already using 2345 of them. (Yes, that’s right, on a device with 2400 LUTs, I managed to use all of them.)

Here are some of the strategies I’ve used over time to keep my LUT usage down:

Start by counting your logic.

Your design tools should be able to tell you how many LUTs you are using. Some tools will even break this out nicely between design modules, helping you to focus on the part that needs work.

Find out how to do this, and then track your work throughout your design.

While one strategy is to set your sights on the biggest components, I also got a lot of savings by looking at how many LUTs the smaller components consumed. When you are starting out at 2345 LUTs out of 2400, every LUT matters.

Now, understand that the synthesizer is going to map all of your logic into LUTs. Every LUT can handle some number of inputs. Look up how your hardware defines a LUT, and then count the number of inputs to every flip-flop in your design: count the wires to each flip-flop enable, and the number of alternatives within every block. Work to bring that number down to eight or less where possible, and you’ll be amazed both at how much more easily your design fits onto your device, and at the logic speeds you can achieve.
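
As a quick illustration of that counting (with hypothetical signals): the flip-flop below depends upon sel[1:0], a, b, c, and d, for six inputs total, so it should map into a single 6-input LUT.

always @(posedge i_clk)
	case(sel)	// sel[1:0] plus four data inputs: six total
	2'b00: q <= a;
	2'b01: q <= b;
	2'b10: q <= c;
	2'b11: q <= d;
	endcase

Widen that to an 8:1 mux, though, and you are up to eleven inputs feeding the same flip-flop, and hence multiple LUTs.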

So, for some simple numbers to think about with respect to Xilinx’s CLB structure:

  • 5 inputs: If you can get your logic down to five inputs or less, you may be able to pack two such functions into a single LUT–provided the two share the same inputs.

  • 6 inputs: Anything that requires six inputs will require one LUT. Be aware of busses of logic where every bit requires six inputs … you are likely to need one LUT for every bit in that bus.

  • 7 inputs: Requires two LUTs and an F7 mux. This still fits within one slice, so it doesn’t impact your timing (much) yet.

  • 8 inputs: Requires four LUTs, plus the F7 and F8 muxes to combine them. This still fits within one slice.

  • 9 inputs and more: These don’t fit nicely into a LUT, nor do they fit nicely into a set of four LUTs, and so the synthesis tool is going to have to try to figure out how to string LUTs together to accomplish your logic. If your logic isn’t an add, subtract, or compare that can use the carry chain, you may wish to do your best to avoid this amount of logic.

As you go through your logic: force every LUT to justify its existence.

Use block RAM anywhere you can.

Look for memories that are implemented in LUTs, and turn them into block RAMs. Anywhere you can do this will yield amazing savings. Indeed, switching from LUTs to block RAM has often been the difference between failure and success in my designs.

For example, the following logic isn’t likely to make it into block RAM:

always @(posedge i_clk)
	if (A)
		mem[addrA] <= valA;
	else if (B)
		mem[addrB] <= valB;
	else if (C)
		mem[addrC] <= valC;

However, with a couple of simple manipulations, you can often trade a clock of delay for the capability you need. The following, for example, will fit in block RAM:

// Capture the logic we need ...
always @(posedge i_clk)
	wr <= ((A)||(B)||(C));
always @(posedge i_clk)
	if (A)
	begin
		wAddr <= addrA;
		wData <= valA;
	end else if (B)
	begin
		wAddr <= addrB;
		wData <= valB;
	end else if (C)
	begin
		wAddr <= addrC;
		wData <= valC;
	end

// So we can implement it with simplicity on the next clock,
// using block RAM.
always @(posedge i_clk)
	if (wr)
		mem[wAddr] <= wData;

This was an important part of getting my SD-SPI and RMII ethernet modules to work, although it didn’t help the S6SoC in particular. Still, it allowed me to run the SD-SPI controller with a 200MHz clock – so lowering your LUT usage is likely to increase your maximum speed as well.

Be very aware of nested if’s

My first attempt at any design tends to look like a giant if-then-else logic block handling states and state transitions. They’re just so easy to understand and follow. As an example, the simple wishbone master controller we built the other day started with a giant always block with several conditions.

The problem with these large logic blocks is that there are often states where you don’t care what certain values are for some variables. In these cases, you can simplify the logic for those states.

For example, imagine something like …

always @(posedge i_clk)
if (RESET)
begin
	valid <= 1'b0;
	B <= 0; // Really a don't care
end else if (CE)
begin
	valid <= (input is valid);
	B <= (... something of value);
end

Let’s count this logic: each of the bits in valid and B depends upon both RESET and CE, hence a minimum of one LUT each. Further, the flip-flop clock enable line also requires a LUT for each flip-flop. So, we’re talking about 2 LUTs for the valid signal, and another 2 LUTs for every bit in B.

If the only value that really depends upon the RESET is the valid signal, which indicates whether or not B contains a valid value, why must B be set on reset? In this case, you want to separate your logic into multiple logic blocks, often one block per variable. You can then end up with something like …

always @(posedge i_clk)
if (RESET)
	valid <= 1'b0;
else if (CE)
	valid <= (input is valid);

always @(posedge i_clk)
if (CE)
	B <= (... something of value);

If CE is a simple wire by itself, and not composed of more logic, then it won’t cost any LUTs to calculate. Likewise, if the value you wish to store in B comes from logic already calculated elsewhere in your design, then you have just spared yourself one LUT for every bit in B. Even if there are several inputs to the bits of B, if you just managed to drop from six inputs to five for each bit, then you may have cut the number of LUTs used to calculate B in half.

The impact of this can be significant.

Two examples of updated and simplified logic would be my simplified QSPI flash controller, when compared with the giant always block in the basic flash controller. Even though the simplified controller can only handle read requests, it’s amazingly light on logic.

Likewise, my recent rewrite of my UART transmitter and receiver saved me many LUTs by breaking up the logic in that big always block. Perhaps you might enjoy looking through the GIT history of this UART controller, to see how easy they were to read back when they used a massive state transition block, and then to compare that with what they look like today.

Reduce your requirements if you can

Very few UART controllers truly need to handle 5, 6, and 7-bit words, parity, break setting and detection, parity error and frame error detection, hardware flow control, and so forth. If you know your application well, and you have code designed to handle things that will never happen, then remove that code.
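
One way to keep such removals painless (a sketch of my own, with made-up names, rather than anything from the actual UART code) is to gate each optional feature behind a parameter, so the synthesizer strips the logic out entirely whenever the feature is disabled:

parameter [0:0]	OPT_PARITY = 1'b0;

generate if (OPT_PARITY)
begin : GEN_PARITY
	reg	parity;

	// None of this logic exists in the design unless
	// OPT_PARITY is set
	always @(posedge i_clk)
		if (start_of_word)
			parity <= 1'b0;
		else if (bit_strobe)
			parity <= parity ^ tx_bit;
end endgenerate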

In my case, the UART transmitter and receiver had way more functionality than I needed. So, I built a lighter UART transmitter and receiver with only the capabilities I would ever use. Since I had long since removed the UART FIFO, there wasn’t much more I could do after that.

The point being: if you’ve over-designed your product, remove all that extra, unused logic when you get tight on resources.

Avoid bit selects if you can

I made a lot of adjustments to the divide unit to get it to fit. Most of these adjustments followed the advice above, but one very powerful adjustment was to get rid of a statement like …

if (lots_of_conditions)
	o_quotient[current_bit] <= (logic to calculate a bit);

and, with some understanding that every bit needed to be set anyway, I was able to rearrange that statement into,

if (lots_of_conditions)
	o_quotient <= { o_quotient[30:0], (logic to calculate a bit) };

Think about the LUT usage required to implement this statement. The first statement requires many LUTs per bit to determine not only whether that bit is going to be set, but also what the bit will be set to. The second example still requires some LUTs to determine whether or not a bit is going to be set, but those LUTs can be shared across all bits. Further, only the bottom bit has to have the special set of LUTs to determine what the bit needs to be set to.

That one optimization by itself saved me many, many LUTs. In the end, I think I saved about 100 LUTs out of a 300 LUT component, but you can look at the git log to see the details of the other changes that were made along the way.

Manage your Reset Logic

If you follow Xilinx’s guidance, you’ll find they highly recommend against having a global reset. They explain pretty clearly how it impedes all of the logic within your design.

In my own implementations, such as within the ZipCPU, I tend to create _valid lines stating whether or not a group of logic lines is actually valid. These lines then get the reset signal, while all the other logic ignores it. Indeed, many of my ZipCPU implementations don’t use a global (or even a local) reset at all. Sure, the CPU has a reset line … but I usually leave it unconnected, and instead command the reset from some amount of local logic. For example, both the watchdog timer and the debug port can command a reset. Further, on a really small design with no debug port (i.e., the S6SoC again), I used a button for that purpose.
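
On a design like the S6SoC, that locally commanded reset can be as simple as an OR of the various request lines. Something like the following (signal names invented for illustration):

// No global reset: the CPU's reset is commanded locally, whether
// by the watchdog, the debug port, or a button press
assign	cpu_reset = (watchdog_timeout)||(dbg_cmd_reset)||(i_btn);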

Application Specific Optimizations

Some optimizations, though, are by nature application specific. A good example of this is a FIR filter I built some time ago in support of my universal resampling project. (Any rate in, to any lower rate out; out-of-band rejection greater than 70 dB; etc.) When building the FIR, I couldn’t afford to use block RAMs to hold the coefficients, because there was no time for a RAM lookup: every coefficient needed to be used on every clock, and the entire design needed to run at the full clock rate.

So, I made each tap settable from a single address: each write to that address would push all of the coefficients through memory, like a giant shift register.
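
In outline, that tap-loading logic looked something like this (a sketch only, not the actual filter code, assuming tap has been declared as an array of NTAPS coefficient registers):

integer	k;

// Writing a new coefficient to the one tap address pushes
// every other coefficient one position down the line
always @(posedge i_clk)
	if (i_wr_tap)
	begin
		tap[0] <= i_tap;
		for(k=1; k<NTAPS; k=k+1)
			tap[k] <= tap[k-1];
	end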

In the end, even that wasn’t enough: I had to go farther and fix the FIR taps to constant values within the filter. Because of the nature and structure of the resampling problem, this approach worked.

I then managed to reduce the logic required even further by forcing the filter to be symmetric. Indeed, the Parks-McClellan filter design algorithm, used for designing my lowpass anti-aliasing filters, only designs such linear phase (i.e. symmetric) filters anyway. Thus, with this trick, I was able to use half as many hardware multiplies.

Looking over my taps, I then realized I could do more: some of the taps were zero, some had only one bit set in their coefficient, and the mid-point had already been scaled to (2^N-1), so it could use a subtract instead of a multiply.
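
Each of those observations eliminates a hardware multiply. As a sketch (with invented names and widths, not my actual filter code):

parameter	N = 10;	// hypothetical midpoint scale
wire signed [15:0]	sample;
wire signed [15+N:0]	prod_onebit, prod_mid;

// A tap of zero contributes nothing: drop that tap entirely.
// A tap with only one bit set is just a shift:
assign	prod_onebit = sample <<< 5;	// tap = 2^5
// A (2^N-1) tap becomes a shift and a subtract:
assign	prod_mid = (sample <<< N) - sample;	// tap = 2^N-1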

If DSP usage is an issue, you might wish to consider this approach.

Keep large comparisons out of state transition logic

Don’t place large comparisons in your state transition diagrams if you don’t need to.

As an example, the logic below compares A against a 32-bit value, and C against a 24-bit value.

always @(posedge i_clk)
	A <= A + 1'b1;
always @(posedge i_clk)
	C <= C + 7;

always @(posedge i_clk)
if (A == 32'h00)
	B <= iB;
else if (C == 24'h084)
	B <= ~B;

If we are smart about things, we can move the comparisons one clock earlier and calculate intermediate values. Since A and C advance by known amounts on every clock, we know one clock ahead of time when each comparison is about to match.

always @(posedge i_clk)
	zA <= &A;	// when A is all ones, A rolls over to zero next clock
always @(posedge i_clk)
	mC <= (C == 24'h07d);	// C will be 24'h084 on the next clock

These two intermediate values then relieve the pressure on the logic setting the value of B,

always @(posedge i_clk)
if (zA)
	B <= iB;
else if (mC)
	B <= ~B;

This was another technique that I used when simplifying the onboard UART. To see how it works, look at the baud counter logic within the receiver, and pay attention to the zero_baud_counter logic. At one time, the state machine had depended upon tests of whether or not baud_counter was zero. You can see from the receiver source how not only was the baud_counter logic removed from the main always block, but the test for zero was pulled out into zero_baud_counter as well. Now, the main loop only references the zero_baud_counter strobe to know when to step the logic forward.
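
Boiled down, the idea looks something like this (a simplification of the approach, not the receiver’s actual code; CLOCKS_PER_BAUD stands in for a parameter):

// Run the baud counter in its own little block ...
always @(posedge i_clk)
	if (zero_baud_counter)
		baud_counter <= CLOCKS_PER_BAUD - 1'b1;
	else
		baud_counter <= baud_counter - 1'b1;

// ... and precompute the test for zero, so the state machine
// only ever needs to examine this single bit
always @(posedge i_clk)
	zero_baud_counter <= (baud_counter == 1);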

If all else fails, ask for help

Finally, if all else fails, ask someone with more experience to look over your code. Perhaps they might see something you are missing?

On second thought … maybe not. They might tell me that the way I format my code is all wrong.

Testing the changes

Since one focus of this blog is on testing and debugging, how did I manage to prove that these new changes worked?

First, all of these changes were tested initially within a simulation. My favorite simulation test was to see if I could get the S6SoC to run all the way from boot up to sending Hello World out the serial port.

Second, the new instruction set also included special NOOP encodings. These allowed me to place NOOPs into the code that would cause the simulator to print to the screen, whether characters, strings, register values, or a full register dump. The same code, when run within the hardware, would just ignore these NOOP instructions. This let me “debug by printf” all over the place when using the simulator.

Third, the simulator also printed out a lot of trace information. This was useful for tracing memory values that hadn’t been set appropriately back to wherever they were set. Further, by matching the trace timing with GTKwave’s timing, I was able to look up within GTKwave any values that weren’t printed in the trace, to find the problem.

Finally, for the bugs the first three items didn’t catch, I did have a wishbone scope on board. Since I didn’t have room on board for my favorite debug interface, I set the CPU up so that, on any reset (i.e., button press), it would dump the scope’s contents and the register values on startup (neither was cleared by the reset). While the result is that the Hello World program’s output doesn’t start out looking pretty, I can also use that startup output to find whatever caused the CPU to halt and need to be restarted manually.

So, it was all doable and straightforward. Indeed, even today, if you don’t have the CMod S6, you can still test how the S6SoC would run on that CMod S6 by using the simulator.

How did it work?

The changes outlined above were the difference between success and failure: between getting the updated S6SoC working with the new instruction set and the compressed instruction extension, and having to abandon that hardware (and my bragging rights) entirely. Had I not made the optimizations above, I might never have gotten the CPU to work with 8-bit bytes, and hence I might never have had C-library support. Now, however, not only does the CPU run with 8-bit byte support, but it also has a divide unit, and it runs 3x faster than it ever did before.

What does all this mean? If I disable illegal instruction detection for the LOCK instruction, then I can play 4x4x4 Tic-Tac-Toe on my CMod S6.

So is minimizing LUT usage important?

Tell me your experiences!