Logic usage and decoding return results with cascaded multiplexers
About a week ago or so, I was working on a Wishbone cross-bar interconnect. It seemed like a fun project to toy around with. However, at one point in my design, things just weren’t working. Personally, I was convinced yosys was at fault, but I couldn’t figure out what the fault was.
So I set out about the task of creating a yosys issue.
The problem with creating a yosys issue is that most issues are ignored unless you can provide a minimal, complete, verifiable example of the error. For me, this is usually about 60-90 lines of code. However, usually when I turn in this 60-90 lines, Clifford tends to school me: either pointing out how the fault was my own, or otherwise he reduces my “minimal” example to 15 lines of code.
This time I was determined to create a better example of this bug myself first. Who knows, perhaps I might have even been able to find the bug?
So I set up all of the configurable parameters just the way I wanted, and ran yosys. At the prompt, I typed:
Sure enough, if I squinted my eyes just right at the output in log.txt
,
I felt like I could understand what was going on.
For example, I found things like:
This appeared to describe a
multiplexer, where the output,
\Y
, would be set to \A
is \S
was false, and to \B
otherwise.
If I looked at line 227 within my source file, sure
enough, I could recognize some of the variables within my design: r_stb[0]
,
r_we[0]
, and even i_mwe[0]
. The $ternary$../../rtl/wbxbar.v:227$1146_Y
value really didn’t make much sense to me, but if I searched on this string
within the log.txt
file, I could often find where that was an input to
another block, meaning that it was an intermediate piece of logic.
Then, as I started trying to back out what was going on in this giant file of logic, I found a chain of several muxes all in a row. Sure, I had to do some scratching through the output to try to put it back together. No, it wasn’t easy, but I was convinced I was going to do this! If you adjust the output format to make it easier to read, you can then see (roughly) what I found:
This is really simplified, but what I found was certainly a long chain of these multiplexers, perhaps as many as sixteen in a row.
This new knowledge left me wondering, was I writing my logic properly?
What’s wrong with cascaded if’s
Okay, I’ll admit that I use cascaded if’s in my code all the time. A classic example of when I might use such a cascaded if is when decoding a Wishbone return within an interconnect.
The basic logic is something like,
While I don’t normally use a for loop for this logic, like I did in my crossbar code, the result would roughly be the same.
Now, consider what this would look like once it was turned into logic. It would become something similar to,
And so on …. as this would continue through processing eight separate slaves.
Immediately I was concerned: long chains of logic like this will impact the clock rate this logic can run at! In general, as part of design, you want to minimize your longest path in order to achieve high speed. This logic, however was anything but minimum!
But could it be improved?
Looking over this logic tree, it was very much a one-sided tree, much like the one shown in Fig. 1 at the right.
In this figure, the inputs are at the bottom left, and the intersections within the tree represent the multiplexers yosys placed into my design. At the top, then is the result of this logic.
What if I could balance the tree?
Were I able to balance the tree, I might then use fewer multiplexers, the decoding logic would get far simpler, and the longest path far shorter, perhaps as short as Fig 2 on the left.
Was it doable?
In this example, I was issuing requests to one of several slaves, never to more than one slave at a time, then listening for the requests in return. So I got to thinking, what if I also created an index to describe the one slave that would respond? In that fashion, the above code could be simplified to something closer to:
Alternatively, I could’ve written this output in a case statement–had I known how many slaves. An example of such a case statement for four slaves might look something like,
Surely that would be better, right?
It was time to run some tests.
The basic logic test
Some time ago, I had the wonderful opportunity to meet Jan Gray. While you may know me for advertising a resource constrained 32-bit CPU that I call the ZipCPU, Jan has been known for stuffing many, many CPUs within an FPGA.
Much as I hate to admit it, his CPUs are often much smaller than mine. So, I asked him, how do you get your CPUs that small? I found his answer very enlightening.
He said, and I’m paraphrasing here, to start with the clock rate you want to achieve. Then start piecing small bits of logic together here and there, and just run these tiny pieces of logic through the synthesizer–never the whole design. Look at what the synthesizer does with these tiny pieces, the number of LUTs they create, their timing performance, and then optimize these small pieces at the low level. In this way, you can gradually work your way up to the whole.
No, that’s not a direct quote. About a year and a half has passed since that conversation, but his point remains: optimize your design piecemeal at the very low level, and learn optimization there. It will then be easier to create a larger design from the lessons you’ve learned.
So, let’s do that here!
Imagine if you will that we have four sources of data, a
through d
. Each
of these sources has a select line, asel
through dsel
, telling us when
their respective data words, adata
through ddata
are valid. Our goal
will be to set our own output, odata
, to contain the data from the one valid
input data word.
The code for this would look something like the one below.
Let’s run this through yosys and see what we can learn.
So we now know that this code would cost us 96 LUTs–essentially 3 LUTs for every outgoing data bit.
Can we do better?
Suppose we instead had an index
, which could indicate for us which of the
four inputs we wanted to set our output to. Such an example might look like
this one:
This time, yosys tells us that we could calculate our outgoing data using only two LUTs per output bit.
If you find yourself like me, scrounging for every LUT you can in order to maintain your reputation as a log-logic designer, 32 LUTs is a big difference. It might get even bigger if this sort of savings could be replicated across the design.
Moving to the CPU
So I took a look within the ZipCPU. As you may recall, the ZipCPU has a couple of functional units. There’s the ALU, which can perform basic arithmetic, the memory unit, which can load values from memory and stores them back again, and the divide unit. Not shown in Fig. 3, is the fact that the external debugging port can also write values to the registers within the CPU. I’ve also dreamed of creating a floating point unit, so let’s include that in our discussion as well.
Now, each of these three units can return a valid
signal, indicating that it
has an item to write to the register file, a data
signal, containing the
data to be written, and similarly, many of the units can return an error signal.
The logic, then, to determine when to write a value to a register looks something like the following:
Does this look at all like the logic cascade we were dealing with above?
How about the logic to determine which of the various values needs to be written to the register file? That looks like another logic cascade:
Well, actually, it’s worse. If you look at the code I referenced above,
you’ll notice that I calculated the value twice, once for wr_gpreg_vl
and
another time for wr_spreg_vl
. I had to do this just to keep any updates to
the program counter
and the status register
off of the critical path.
This logic doesn’t need to be so complex!
The ZipCPU is very much an in order processor. Sure, it is pipelined, but by the time you get to this stage in the pipeline, you already know what component will be producing a value. Hence, I could’ve used an index instead.
So let’s instead create a simple test, just to estimate what the impact of this change will be. Let’s say that this is our current logic.
If you compare this to the actual logic above, you should see some strong similarities.
So let’s ask, how much logic does this use? A quick run through yosys yields the following result:
Can we do better?
Sure we can! Let’s add in just a little bit of precursor logic. In this
case, imagine we know we will have to write to the
ALU
on the clock after prealu
is true, and likewise for the other values.
This means we’ll need to create an index
first from these pre*
values,
and so change our test case to:
Once we have this three bit index, we can now create logic based around a case
statement to decode our results. As before ov
will tell us when our output is
valid and hence when we need to write to the register file,
whereas odata
will be the value we write to the register file.
What kind of logic usage might this garner?
Any guesses?
110 LUTs! That’s a savings of nearly 20%!
What about that SB_DFFSR
line? That’s new. Those three SB_DFFSR
s
are the
flip-flops
holding the three bits of our index. In most all of my designs, I’ve been
LUT
starved, not
flip-flops
starved, so I’m not really worried about these.
Resource measurement is not just for iCE40s
Did you know you could measure resource usage for Xilinx chips, Intel chips, as well as ASIC gates using this approach with yosys?
If you wanted to do this for a Xilinx design, for example, all you’d need to change is the synth_ice40 line to a synth_xilinx line. Indeed, in my experience, you’ll also get a much faster result using yosys than using any of the Vendor tools–Vivado, ISE, or even Quartus. Feel free to check me out on this.
So our basic design would require 65 6-input LUTs, one 5-input LUT, and 32 3-input LUTs. The 32 MUXF7s are needed to combine 64 of those 6-input LUTs together in order to make 32 7-input LUTs.
If you now swap design approaches to use the index, the LUT usage drops dramatically:
If you figure that the MUXF7s are really there for free, then we dropped our logic needs from 98 6-input LUTs down to 78–again this is about a 20% improvement or so.
Ever had to try to answer the question, “how complex is your design?” and the questioner wants your answer in gates? We can do that too! In this case, just run synth, a generic synthesis routine, followed by abc -g cmos2, to map your design to traditional CMOS gates, and then stat to report the results.
That is, our pre-optimized design will require 293 NAND gates, 70 NOR gates, and another 70 NOT gates.
Similarly, after our optimization, you’d get:
If we don’t count the flip-flops, then you can see that we’ve dropped from 433 gates down to 393. This time the savings isn’t nearly as much, but 10% fewer gates translates to area in the ASIC business, and lower area translates to lower cost.
Conclusion
Perhaps you’ve picked up on a couple of points from this discussion. First, if you are concerned about minimum area and minimum logic, then it’s important to know a bit about what’s happening under the hood. Second, that yosys dump command was what tipped me off to what was going on under the hood. Yosys also supports a show command for those who would rather view this information graphically. Third, you can compare various design techniques in seconds when using yosys. It’s really quite a valuable lesson.
I’ve just rewritten how AutoFPGA handles its address decoding as a result. While I’m still working through some of the pains of that rewrite, it turns out that it wasn’t all that hard to do.
Adjusting how the ZipCPU determines whether or not to write back, and which value to write back to the register file is likely to be next. Perhaps you might want to consider at this point, what formal properties could be used in a switch like this, and how easy (or hard) would it be to convince yourself that the updated code still worked as before?
I should also mention, there is another competing theory out there espoused by the major vendors. This theory is that you should build your design, count up the gates you need, and buy your chip. If your algorithm doesn’t fit in the chip, then you need to buy a bigger chip and possibly even build a bigger board. This also applies to much of the vendor supplied IP. The major vendors have no incentive to build low-logic designs, if their low logic designs would keep you from buying more of their product.
Think about it.
Oh, one more thing, remember that bug that sent me looking into the yosys dump in the first place? Yeah, that one. It was my own fault. The problem was just obscured by the for loop I was using, and the fact that for loops in Verilog don’t work like they do in C++.
It is naught, it is naught, saith the buyer: but when he is gone his way, then he boasteth. (Prov 20:14)