Logic usage and decoding return results with cascaded multiplexers

About a week ago or so, I was working on a Wishbone cross-bar interconnect. It seemed like a fun project to toy around with. However, at one point in my design, things just weren’t working. Personally, I was convinced yosys was at fault, but I couldn’t figure out what the fault was.

So I set out about the task of creating a yosys issue.

The problem with creating a yosys issue is that most issues are ignored unless you can provide a minimal, complete, verifiable example of the error. For me, this is usually about 60-90 lines of code. However, usually when I turn in this 60-90 lines, Clifford tends to school me: either pointing out how the fault was my own, or otherwise he reduces my “minimal” example to 15 lines of code.

This time I was determined to create a better example of this bug myself first. Who knows, perhaps I might have even been able to find the bug?

So I set up all of the configurable parameters just the way I wanted, and ran yosys. At the prompt, I typed:

$ yosys -ql log.txt
yosys> read -sv wbxbar.v
yosys> dump
yosys> exit

Sure enough, if I squinted my eyes just right at the output in log.txt, I felt like I could understand what was going on.

For example, I found things like:

  attribute \src "../../rtl/wbxbar.v:227"
  cell $mux $ternary$../../rtl/wbxbar.v:227$1146
    parameter \WIDTH 1
    connect \Y $ternary$../../rtl/wbxbar.v:227$1146_Y
    connect \S \r_stb [0]
    connect \B \r_we [0]
    connect \A \i_mwe [0]
  end

This appeared to describe a multiplexer, where the output, \Y, would be set to \A is \S was false, and to \B otherwise.

	always @(*)
	if (S)
		Y = B;
	else
		Y = A;

If I looked at line 227 within my source file, sure enough, I could recognize some of the variables within my design: r_stb[0], r_we[0], and even i_mwe[0]. The $ternary$../../rtl/wbxbar.v:227$1146_Y value really didn’t make much sense to me, but if I searched on this string within the log.txt file, I could often find where that was an input to another block, meaning that it was an intermediate piece of logic.

Then, as I started trying to back out what was going on in this giant file of logic, I found a chain of several muxes all in a row. Sure, I had to do some scratching through the output to try to put it back together. No, it wasn’t easy, but I was convinced I was going to do this! If you adjust the output format to make it easier to read, you can then see (roughly) what I found:

  cell $mux ...
    connect \Y $one$
    connect \S av
    connect \B bdata
    connect \A adata
  end

  cell $mux ...
    connect \Y $two$
    connect \S cv
    connect \B cdata
    connect \A $one$
  end

  cell $mux ...
    connect \Y myoutput
    connect \S dv
    connect \B ddata
    connect \A $two$
  end

This is really simplified, but what I found was certainly a long chain of these multiplexers, perhaps as many as sixteen in a row.

This new knowledge left me wondering, was I writing my logic properly?

What’s wrong with cascaded if’s

Okay, I’ll admit that I use cascaded if’s in my code all the time. A classic example of when I might use such a cascaded if is when decoding a Wishbone return within an interconnect.

The basic logic is something like,

always @(posedge i_clk)
begin
	o_ack  = 0;
	o_data = 0;
	for(i=0; i<num_slaves; i=i+1)
	begin
  		if (slave_ack[i])
		begin
    			o_ack = 1;
    			o_data = slave_data[i];
		end
	end
end

While I don’t normally use a for loop for this logic, like I did in my crossbar code, the result would roughly be the same.

Now, consider what this would look like once it was turned into logic. It would become something similar to,

always @(*)
begin
	tmp_ack[0]  = 1;
	tmp_data[0] = 0;
	if (slave_ack[0])
	begin
  		tmp_ack[0] = 1;
  		tmp_data[0] = slave_data[0];
	end
end

always @(*)
begin
	if (slave_ack[1])
	begin
  		tmp_ack[1] = 1;
  		tmp_data[1] = slave_data[1];
	end else begin
  		tmp_ack[1] = tmp_ack[0];
  		tmp_data[1] = tmp_data[0];
	end
end

always @(*)
begin
	if (slave_ack[2])
	begin
  		tmp_ack[2] = 1;
  		tmp_data[2] = slave_data[2];
	end else begin
  		tmp_ack[2] = tmp_ack[1];
  		tmp_data[2] = tmp_data[1];
	end
end

And so on …. as this would continue through processing eight separate slaves.

Immediately I was concerned: long chains of logic like this will impact the clock rate this logic can run at! In general, as part of design, you want to minimize your longest path in order to achieve high speed. This logic, however was anything but minimum!

But could it be improved?

Fig 1: A series of unbalanced multiplexers

Looking over this logic tree, it was very much a one-sided tree, much like the one shown in Fig. 1 at the right.

In this figure, the inputs are at the bottom left, and the intersections within the tree represent the multiplexers yosys placed into my design. At the top, then is the result of this logic.

What if I could balance the tree?

Fig 2: A balanced multiplexer tree

Were I able to balance the tree, I might then use fewer multiplexers, the decoding logic would get far simpler, and the longest path far shorter, perhaps as short as Fig 2 on the left.

Was it doable?

In this example, I was issuing requests to one of several slaves, never to more than one slave at a time, then listening for the requests in return. So I got to thinking, what if I also created an index to describe the one slave that would respond? In that fashion, the above code could be simplified to something closer to:

always @(posedge i_clk)
begin
	o_ack  = slave_ack[slave_index];
	o_data = slave_data[slave_index];
end

Alternatively, I could’ve written this output in a case statement–had I known how many slaves. An example of such a case statement for four slaves might look something like,

always @(posedge i_clk)
case(slave_index)
	2'b00: { o_ack, o_data } <= { slave_ack[0], slave_dat[0] };
	2'b01: { o_ack, o_data } <= { slave_ack[1], slave_dat[1] };
	2'b10: { o_ack, o_data } <= { slave_ack[2], slave_dat[2] };
	2'b11: { o_ack, o_data } <= { slave_ack[3], slave_dat[3] };
endcase

Surely that would be better, right?

It was time to run some tests.

The basic logic test

Some time ago, I had the wonderful opportunity to meet Jan Gray. While you may know me for advertising a resource constrained 32-bit CPU that I call the ZipCPU, Jan has been known for stuffing many, many CPUs within an FPGA.

Much as I hate to admit it, his CPUs are often much smaller than mine. So, I asked him, how do you get your CPUs that small? I found his answer very enlightening.

He said, and I’m paraphrasing here, to start with the clock rate you want to achieve. Then start piecing small bits of logic together here and there, and just run these tiny pieces of logic through the synthesizer–never the whole design. Look at what the synthesizer does with these tiny pieces, the number of LUTs they create, their timing performance, and then optimize these small pieces at the low level. In this way, you can gradually work your way up to the whole.

No, that’s not a direct quote. About a year and a half has passed since that conversation, but his point remains: optimize your design piecemeal at the very low level, and learn optimization there. It will then be easier to create a larger design from the lessons you’ve learned.

So, let’s do that here!

Imagine if you will that we have four sources of data, a through d. Each of these sources has a select line, asel through dsel, telling us when their respective data words, adata through ddata are valid. Our goal will be to set our own output, odata, to contain the data from the one valid input data word.

The code for this would look something like the one below.

module testa(asel, adata, bsel, bdata, csel, cdata, dsel, ddata, odata);
        input   wire            asel, bsel, csel, dsel;
        input   wire    [31:0]  adata, bdata, cdata, ddata;
        output  reg     [31:0]  odata;

        always @(*)
        if (asel)
                odata = adata;
        else if (bsel)
                odata = bdata;
        else if (csel)
                odata = cdata;
        else
                odata = ddata;

endmodule

Let’s run this through yosys and see what we can learn.

% yosys
yosys> read -sv test.v
# ... lots of stuff
yosys> synth_ice40 -top testa
# ... skipping even more stuff, then ...
#
2.42. Printing statistics.

=== testa ===

   Number of wires:                 73
   Number of wire bits:            228
   Number of public wires:           9
   Number of public wire bits:     164
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 96
     SB_LUT4                        96

2.43. Executing CHECK pass (checking for obvious problems).
checking module testa..
found and reported 0 problems.

yosys>

So we now know that this code would cost us 96 LUTs–essentially 3 LUTs for every outgoing data bit.

Can we do better?

Suppose we instead had an index, which could indicate for us which of the four inputs we wanted to set our output to. Such an example might look like this one:

module testb(index, adata, bdata, cdata, ddata, odata);
        input   wire    [1:0]   index;
        input   wire    [31:0]  adata, bdata, cdata, ddata;   
        output  reg     [31:0]  odata;

        always @(*)
        case(index)
        2'b00: odata = adata;
        2'b01: odata = bdata;
        2'b00: odata = cdata;
        2'b01: odata = ddata;
        endcase
endmodule

This time, yosys tells us that we could calculate our outgoing data using only two LUTs per output bit.

2.42. Printing statistics.
=== testb ===

   Number of wires:                 38
   Number of wire bits:            194
   Number of public wires:           6
   Number of public wire bits:     162
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 64
     SB_LUT4                        64

If you find yourself like me, scrounging for every LUT you can in order to maintain your reputation as a log-logic designer, 32 LUTs is a big difference. It might get even bigger if this sort of savings could be replicated across the design.

Moving to the CPU

Fig 3: The ZipCPU has four basic data calculation units: ALU, memory, divide and the (not yet built) FPU

So I took a look within the ZipCPU. As you may recall, the ZipCPU has a couple of functional units. There’s the ALU, which can perform basic arithmetic, the memory unit, which can load values from memory and stores them back again, and the divide unit. Not shown in Fig. 3, is the fact that the external debugging port can also write values to the registers within the CPU. I’ve also dreamed of creating a floating point unit, so let’s include that in our discussion as well.

Now, each of these three units can return a valid signal, indicating that it has an item to write to the register file, a data signal, containing the data to be written, and similarly, many of the units can return an error signal.

The logic, then, to determine when to write a value to a register looks something like the following:

	assign	wr_reg_ce = (dbgv)||(mem_valid)
				||((!clear_pipeline)&&(!alu_illegal)
					&&(((alu_wR)&&(alu_valid))
						||((div_valid)&&(!div_error))
						||((fpu_valid)&&(!fpu_error))));

Does this look at all like the logic cascade we were dealing with above?

How about the logic to determine which of the various values needs to be written to the register file? That looks like another logic cascade:

	assign	wr_gpreg_vl = ((mem_valid) ? mem_result
				:((div_valid|fpu_valid))
					? ((div_valid) ? div_result:fpu_result)
				:((dbgv) ? dbg_val : alu_result));

Well, actually, it’s worse. If you look at the code I referenced above, you’ll notice that I calculated the value twice, once for wr_gpreg_vl and another time for wr_spreg_vl. I had to do this just to keep any updates to the program counter and the status register off of the critical path.

This logic doesn’t need to be so complex!

The ZipCPU is very much an in order processor. Sure, it is pipelined, but by the time you get to this stage in the pipeline, you already know what component will be producing a value. Hence, I could’ve used an index instead.

So let’s instead create a simple test, just to estimate what the impact of this change will be. Let’s say that this is our current logic.

module testcpu(dbgv, clear, wb, aluv, memv, divv, fpuv,
		dbgerr, aluerr, memerr, diverr, fpuerr,
                dbgdata, aludata, memdata, divdata, fpudata,
                ov, odata);
        input   wire            dbgv, clear, wb, aluv, memv, divv, fpuv;
        input   wire            dbgerr, aluerr, memerr, diverr, fpuerr;
        input   wire    [31:0]  dbgdata, aludata, memdata, divdata, fpudata;
        output  reg             ov;
        output  reg     [31:0]  odata;

        always @(*)
        ov = (dbgv | memv) || (!clear && wb) && 
        	((aluv && !aluerr) || (memv && !memerr)
			||(divv && !diverr) || (fpuv && !fpuerr));

        always @(*)
        if (memv)
                odata = memdata;
        else if (divv | fpuv)
                odata = (divv ? divdata : fpudata);
        else if (dbgv)
                odata = dbgdata;
        else // if (aluv)
                odata = aludata;
endmodule

If you compare this to the actual logic above, you should see some strong similarities.

So let’s ask, how much logic does this use? A quick run through yosys yields the following result:

2.42. Printing statistics.

=== testcpu ===

   Number of wires:                117
   Number of wire bits:            303
   Number of public wires:          19
   Number of public wire bits:     205
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                131
     SB_LUT4                       131

Can we do better?

Sure we can! Let’s add in just a little bit of precursor logic. In this case, imagine we know we will have to write to the ALU on the clock after prealu is true, and likewise for the other values. This means we’ll need to create an index first from these pre* values, and so change our test case to:

module testcpui(i_clk, pred, prealu, premem, prediv, prefpu,
		dbgv, clear, wb, aluv, memv, divv, fpuv,
		dbgerr, aluerr, memerr, diverr, fpuerr,
		dbgdata, aludata, memdata, divdata, fpudata,
		ov, odata);
	input	wire		i_clk;
	input	wire		pred, prealu, premem, prediv, prefpu;
	input	wire		dbgv, clear, wb, aluv, memv, divv, fpuv;
	input	wire		dbgerr, aluerr, memerr, diverr, fpuerr;
	input	wire	[31:0]	dbgdata, aludata, memdata, divdata, fpudata;
	output	reg		ov;
	output	reg	[31:0]	odata;

	reg	[2:0]	index;

	initial index = 0;
	always @(posedge i_clk)
	if (pred)
		index <= 3'h0;
	else if (prealu)
		index <= 3'h1;
	else if (premem)
		index <= 3'h2;
	else if (prediv)
		index <= 3'h3;
	else // if (fpuv)
		index <= 3'h4;

Once we have this three bit index, we can now create logic based around a case statement to decode our results. As before ov will tell us when our output is valid and hence when we need to write to the register file,

	always @(*)
	case(index)
	3'h0: ov = (dbgv);
	3'h1: ov = (clear && !wb) && (aluv && !aluerr);
	3'h2: ov = (memv);
	3'h3: ov = (clear && !wb) && (divv && !diverr);
	3'h4: ov = (clear && !wb) && (fpuv && !fpuerr);
	default: ov = 0;
	endcase

whereas odata will be the value we write to the register file.

	always @(*)
	case(index)
	3'h0: odata = dbgdata;
	3'h1: odata = aludata;
	3'h2: odata = memdata;
	3'h3: odata = divdata;
	default: odata = fpudata;
	endcase

endmodule

What kind of logic usage might this garner?

Any guesses?

=== testcpui ===

   Number of wires:                103
   Number of wire bits:            291
   Number of public wires:          26
   Number of public wire bits:     214
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                113
     SB_DFFSR                        3
     SB_LUT4                       110

110 LUTs! That’s a savings of nearly 20%!

What about that SB_DFFSR line? That’s new. Those three SB_DFFSRs are the flip-flops holding the three bits of our index. In most all of my designs, I’ve been LUT starved, not flip-flops starved, so I’m not really worried about these.

Resource measurement is not just for iCE40s

Did you know you could measure resource usage for Xilinx chips, Intel chips, as well as ASIC gates using this approach with yosys?

If you wanted to do this for a Xilinx design, for example, all you’d need to change is the synth_ice40 line to a synth_xilinx line. Indeed, in my experience, you’ll also get a much faster result using yosys than using any of the Vendor tools–Vivado, ISE, or even Quartus. Feel free to check me out on this.

$ yosys
yosys> read -sv test.v
yosys> synth_xilinx -top testcpu
# ... 
=== testcpu ===

   Number of wires:                116
   Number of wire bits:            302
   Number of public wires:          19
   Number of public wire bits:     205
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                130
     LUT3                           32
     LUT5                            1
     LUT6                           65
     MUXF7                          32

So our basic design would require 65 6-input LUTs, one 5-input LUT, and 32 3-input LUTs. The 32 MUXF7s are needed to combine 64 of those 6-input LUTs together in order to make 32 7-input LUTs.

If you now swap design approaches to use the index, the LUT usage drops dramatically:

=== testcpui ===

   Number of wires:                 67
   Number of wire bits:            257
   Number of public wires:          26
   Number of public wire bits:     214
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 79
     FDRE                            3
     LUT2                            5
     LUT3                            1
     LUT4                            1
     LUT5                           16
     LUT6                           52
     MUXF7                           1

If you figure that the MUXF7s are really there for free, then we dropped our logic needs from 98 6-input LUTs down to 78–again this is about a 20% improvement or so.

Ever had to try to answer the question, “how complex is your design?” and the questioner wants your answer in gates? We can do that too! In this case, just run synth, a generic synthesis routine, followed by abc -g cmos2, to map your design to traditional CMOS gates, and then stat to report the results.

$ yosys
yosys> read -sv test.v
yosys> synth
yosys> abc -g cmos2
yosys> stat
=== testcpu ===

   Number of wires:                729
   Number of wire bits:            915
   Number of public wires:          19
   Number of public wire bits:     205
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                433
     $_NAND_                       293
     $_NOR_                         70
     $_NOT_                         70

That is, our pre-optimized design will require 293 NAND gates, 70 NOR gates, and another 70 NOT gates.

Similarly, after our optimization, you’d get:

=== testcpui ===

   Number of wires:                856
   Number of wire bits:           1046
   Number of public wires:          26
   Number of public wire bits:     214
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                396
     $_DFF_P_                        3
     $_NAND_                       238
     $_NOR_                         80
     $_NOT_                         75

If we don’t count the flip-flops, then you can see that we’ve dropped from 433 gates down to 393. This time the savings isn’t nearly as much, but 10% fewer gates translates to area in the ASIC business, and lower area translates to lower cost.

Conclusion

Perhaps you’ve picked up on a couple of points from this discussion. First, if you are concerned about minimum area and minimum logic, then it’s important to know a bit about what’s happening under the hood. Second, that yosys dump command was what tipped me off to what was going on under the hood. Yosys also supports a show command for those who would rather view this information graphically. Third, you can compare various design techniques in seconds when using yosys. It’s really quite a valuable lesson.

I’ve just rewritten how AutoFPGA handles its address decoding as a result. While I’m still working through some of the pains of that rewrite, it turns out that it wasn’t all that hard to do.

Adjusting how the ZipCPU determines whether or not to write back, and which value to write back to the register file is likely to be next. Perhaps you might want to consider at this point, what formal properties could be used in a switch like this, and how easy (or hard) would it be to convince yourself that the updated code still worked as before?

I should also mention, there is another competing theory out there espoused by the major vendors. This theory is that you should build your design, count up the gates you need, and buy your chip. If your algorithm doesn’t fit in the chip, then you need to buy a bigger chip and possibly even build a bigger board. This also applies to much of the vendor supplied IP. The major vendors have no incentive to build low-logic designs, if their low logic designs would keep you from buying more of their product.

Think about it.

Oh, one more thing, remember that bug that sent me looking into the yosys dump in the first place? Yeah, that one. It was my own fault. The problem was just obscured by the for loop I was using, and the fact that for loops in Verilog don’t work like they do in C++.