Run length encoding an AXI stream
When I am forced to debug a design in actual hardware, I like to use my own internal logic analyzer. It wasn’t hard to build, and my version has a feature not often found in others: run-length encoding (RLE). I call this internal logic analyzer with RLE my “compressed scope”.
The scheme itself, as illustrated in Fig. 1, is quite simple: 31 bits of data come in, 32 bits go out. When a value first comes in, it gets sent to a holding register, then copied to the output on the next clock cycle after having a zero bit prepended as the MSB of the output word. Any subsequent copies of that same value coming in are then accumulated into a counter, and when a different value enters the holding register the counter is forwarded to the output stage with a one bit prepended to the output word. These two output words, shown in Fig. 1 on the right, form the simple basis for this encoding.
While the scheme is fairly simple, it’s kept me out of trouble a couple of times. Once, I was building an I2C controller to the wrong spec. The compressed scope helped me discover what wasn’t working. In another example, I was working with a GPS receiver and wanted to know how the one-pulse-per-second (PPS) signal and the GPS serial port were related. Specifically, when GPS time was transmitted across the serial port, how easy would it be to confuse which PPS it was related to? Being able to generate a trace, looking like Fig. 2 and crossing several seconds, was very useful. In both cases, it was very valuable to be able to record a signal for seconds or even minutes on end to then later analyze what happened within the design.
Some time later, I got serious about verifying my compressed scope implementation and came across a couple issues that needed dealing with along the way. First, I modified the scheme so that it wouldn’t turn on the run-length compression until the capture memory was full. That way, the scope would be primed and ready for its first capture all the sooner. Then, I adjusted the RLE scheme again so that it wouldn’t apply the encoding scheme to the sample that triggered the scope. That meant that I could always back up from the end of the recording to know exactly what was happening at the time of the trigger.
That left me ready for the next problem: when Xilinx’s Memory Interface Generator (MIG) generated a DDR3 core for my Nexys Video board that wasn’t working, I pulled out my compressed scope again to see what was going on. I was able to quickly determine that the MIG core was generating AXI RVALID signals without prior AXI ARVALID requests. Not knowing how to fix what appeared to obviously be a problem with Xilinx’s MIG core, I wrote asking for help.
I was first told that I must be crossing 4kB page boundaries within some AXI burst request. Not possible, I said, since my AXI master wasn’t making burst requests.
I was then told that these RVALID responses must have somehow gotten buried in the MIG system somewhere and so, somehow, my own logic was at fault for not remembering that it had made a prior request. I went back and double checked the formal proof of my design. It checked out. I checked the formal proofs of everything upstream. Those checked out as well. I added a counter to my design–counting the number of ARVALIDs minus the RVALIDs. Sure enough, I was getting an extra RVALID without a request.
But how could I explain that to the Xilinx representative?
In this case, I again turned to my compressed scope, and this time captured EVERYTHING from the AXI interactions–all of the ARVALID and RVALID signals, all of the AWVALID, WVALID, and BVALID signals, everything I thought would be relevant that would fit within 31-bits. When I went back to Xilinx’s tech support the next time, I had a four second recording in just a couple kB of block RAM that captured every transaction from chip startup to the error, to when I stopped the analyzer to look at the trace.
Sure enough, there was an extra RVALID.
My contact at Xilinx had no idea what was going on. I suppose I shouldn’t complain. I wasn’t purchasing paid tech support. I am very grateful that they took the time to even look into my problem, but the reality was that 1) no one else was complaining about their designs being broken in a similar manner, and so 2) I was on my own to try to figure it out. If you recall the story from when I shared it before, you’ll remember the problem was that I had misconfigured Xilinx’s memory controller for the wrong DDR3 SDRAM chip. My point here is simply that the compressed logic analyzer’s capability to collect more than 400 million clock ticks of data in only a few kB of block RAM was key to diagnosing and zeroing in on the problem.
More recently, I came back to revisit my Wishbone scope project and give it an AXI capability. First I created an AXI-lite version of the basic scope. Aside from the challenge of creating an AXI-lite processing pipeline that was more than a single clock deep, that design was fairly unremarkable. Then, as I was starting to figure out how to manage the full AXI interface with bursts, I decided to try building a logic analyzer that wrote its capture to an external memory. I called the result a MEMSCOPE. It made for a fun project, although to be honest I’m not sure what I’d do with a logic analyzer capable of recording over a GB of data.
This week, I came back to that MEMSCOPE design and thought I might add my run-length encoding scheme to the front of it. The result forms the basis for today’s discussion below.
Design Concept
Since my intent with this design was to build an AXI based design, I thought I’d go all the way and even build an AXI-stream based run-length encoder. It should take 31 bits in, and produce a 32-bit RLE stream out.
Well, that was the basic idea, but as always not everything fits in an AXI stream. For example, a logic analyzer can’t tolerate backpressure: any backpressure would corrupt the incoming stream. I also wanted two extra signals: one to indicate whether this word should be encoded in the first place, such as during the initial run-up when memory is being filled for the first time, and a second to indicate whether this value should be associated with the trigger for the scope or not. These didn’t fit the AXI stream model either, since they were asynchronous to the stream itself–but I still found the AXI stream model a useful place to start from.
Internally, the design broke apart nicely into four separate stages.
There would be the initial skidbuffer stage, followed by a check to see if this new word was the same as the last one. That decision would then feed into the run-length counter proper, and the result of that would be our outgoing AXI stream. I then assigned each of these stages a prefix, so that the registers associated with each of the stages could clearly be identified: S_AXIS_* would describe the input, skd_* the outputs from the skidbuffer, mid_* the results from the same-word detection, run_* the actual run-length encoder, and M_AXIS_* the output.
Let’s discuss each of these stages briefly in turn from a basic design standpoint. Well, not quite each of them, since we’ve already discussed the basic skidbuffer before. Therefore, let’s start with the same-word detection. This one is pretty easy: if an output is available from the skidbuffer, compare it to the last output from this stage and mark the value as mid_same if the two match.
Here, I’ve used the skd_* prefix to describe the values coming from the skidbuffer, the mid_* prefix to describe the values in this same-word detection stage, and the run_* prefix to describe the run-length detection stage that follows.
Ideally, this same-word detection would only need to check whether the new data matched the last data. There are a couple of reasons why that’s not quite good enough. It might be, for example, that we’ve just been reset and so no data has ever come through here yet. Therefore, we need to gate this comparison against a valid signal somehow. While it makes the most sense to gate it against mid_valid, since that flag caveats mid_data having the correct value in it, that’s not good enough by itself, since this stage will get flushed as soon as the next stage allows it. Therefore, we’ll need to gate against both mid_valid and the run_valid signal from the following stage.
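Hedged into Verilog, that detection might look something like the following sketch, using the stage prefixes described above. (Here, skd_valid is my assumed name for the skidbuffer’s output valid signal.)

```verilog
// Same-word detection: a sketch, not necessarily the core verbatim
always @(posedge S_AXI_ACLK)
if (skd_valid && skd_ready)
begin
	mid_data <= skd_data;
	// Only meaningful if a prior value exists in this stage or the next
	mid_same <= (mid_valid || run_valid) && (mid_data == skd_data);
end
```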
The run-length encoding stage is a bit more interesting. This is the one clock cycle where all the work gets done. It’s also a stage that, once filled, will never empty: it will always keep the last value sent within it. When a new value is provided, it will either kick that last value out to the output, or accumulate its count of repeats.
Let’s see how we might build this.
Our logic would focus on whenever new data comes in. We’ll continue to use the run_ready flag to indicate that we’re ready to move data from the mid_* stage to the run_* stage.
The first step will be to set this stage as valid any time we move data into it, and then to copy the data that moves into it at that time.
That’s the easy part.
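As a sketch, ignoring the trigger handling until the detailed section below:

```verilog
always @(posedge S_AXI_ACLK)
if (mid_valid && run_ready)
begin
	run_valid <= 1'b1;	// Once set, this only clears on reset
	run_data  <= mid_data;
end
```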
The next part is where things get a bit more fun. Here’s where we do our actual run-length encoding. If a new value comes from the mid_* stage that’s the same as the last one, that is with mid_same set, then we’ll start a new run and start counting repeats. If we are already in a run when a new value comes in with mid_same set, then we’ll simply increment the length of that run; otherwise we just clear the run length in order to start counting. This has the pleasant effect of making the counter zeros-based. Two consecutive and identical elements would then get encoded as (clear MSB, data), followed by (set MSB, 0). Three consecutive identical elements would be encoded by (clear MSB, data), followed by (set MSB, 1), and so forth.
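A sketch of that counting logic, under the same naming assumptions:

```verilog
always @(posedge S_AXI_ACLK)
if (mid_valid && run_ready)
begin
	// We're in a run whenever the new value repeats the last one
	run_active <= mid_same;
	if (mid_same && run_active)
		// Already in a run: count one more repeat
		run_length <= run_length + 1;
	else
		// Otherwise start over -- hence the zeros-based count
		run_length <= 0;
end
```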
That brings us to the last stage, the one with the M_AXIS_* prefix. We’ll want to send a value out whenever there’s a value in the run_* stage AND we’re not accumulating that value into a run.
Setting TDATA is even easier, since we don’t need to worry about resets. Here, we’ll set TDATA to either the data from the run_* stage, if there’s no run active, or otherwise the length of the run if a run is active.
That leaves two key pieces of logic that we’ve skipped: run_ready and skd_ready, the two handshaking flags that’ll tell us when we can step these stages forward. These two values are very much key to the success of the algorithm, and they are complex enough that we can’t really gloss over them here.
Internal to our algorithm, all of the ready signals are combinatorial. That means they’ll be simplest at the end of the pipeline, so that’s where we’ll start and work our way forwards.
We can accept data into the run_* stage as long as there’s new data coming from the same-word detector and the following stage, the M_AXIS_* stage, is ready to accept data.
Only that’s too restrictive. We can also accept values into the RLE stage even if the output is stalled, but only if those values will be going into our run-length accumulator.
As it turns out, that’s the difficult signal. The skd_ready signal is much simpler. Basically, we can move data into the mid_* stage if either there’s nothing in the mid_* stage, or if the mid_* stage’s data is flowing into the run_* stage. Indeed, this is the common way to propagate ready values upstream when there are no exceptions to the general rule–as there were with run_ready.
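Put together, a first sketch of these two flags might look as follows; the overflow exception to run_ready will be added in the detailed section below.

```verilog
always @(*)
begin
	// Ready whenever the output stage can accept another word ...
	run_ready = !M_AXIS_TVALID || M_AXIS_TREADY;
	// ... or when the new value would only lengthen the current run
	if (run_active && mid_same)
		run_ready = 1'b1;
	// Never step this stage without a new value entering it
	if (!mid_valid)
		run_ready = 1'b0;

	// The common upstream form: the stage is empty, or emptying
	skd_ready = !mid_valid || run_ready;
end
```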
There you have the basics of the algorithm. There are some other key details we’ll need to work out when we get into the weeds below. For example, you don’t want to compress the trigger event–even if it is the same as all the events around it. Similarly, you don’t want to compress when you are initially filling memory, since you want to fill that memory as fast as possible and since the logic analyzer can’t trigger until the memory has at least filled once–lest you get confused later by uninitialized data.
There’s also one critical feature we’ll have to deal with below as well: how shall a counter overflow be handled? You don’t want to drop data no matter what happens. Therefore, once the run-length counter reaches its maximum value, we’ll need to output that value no matter what comes next.
Other than those three exceptions–triggers, priming the capture, and overflow handling–that’s all there is to the algorithm. Indeed, run-length encoding is one of the simplest compression schemes you can implement in hardware: you can operate at full speed, and there’s no real need for any complicated tables to maintain.
Detailed Implementation
Let’s now move from our broad overview to a more detailed explanation of how this run-length encoder works.
The first difference between our broad overview and reality that we’ll have to deal with is the pair of i_encode and i_trigger flags. The first flag, i_encode, enables the run-length encoding. This will be set to zero externally until the capture memory is first filled. The second, i_trigger, is the trigger for the internal logic analyzer. It needs to be kept synchronized with the data. Moreover, if the trigger is ever active then the run-length encoding also needs to be disabled.
The challenge associated with these two flags is that they may arrive asynchronous to the data handshake. That means we’ll need to remember if we get triggered between sample values, and so apply the trigger to the next data sample. Likewise, if we’re ever told not to run-length encode a piece of data, then we’ll need to turn off the encoding on either the sample presented on that clock cycle or the sample following if no sample is valid on that cycle.
The first step in handling this is a set of values I call sticky_* values. These handle keeping track of what should happen to the next value–before that sample value has arrived. They’ll take place before the skidbuffer, and just remember what happens between accepted samples.
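A sketch of what these might look like; the exact names (sticky_trigger, sticky_nencode) are mine for illustration, not necessarily the core’s:

```verilog
initial	{ sticky_trigger, sticky_nencode } = 2'b00;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	{ sticky_trigger, sticky_nencode } <= 2'b00;
else if (S_AXIS_TVALID && S_AXIS_TREADY)
	// The flags apply to the sample just accepted, so start over
	{ sticky_trigger, sticky_nencode } <= 2'b00;
else begin
	// No sample accepted: remember the request for the next one
	if (i_trigger)
		sticky_trigger <= 1'b1;
	if (!i_encode)
		sticky_nencode <= 1'b1;
end
```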
The next step is to go into the skidbuffer. This would be completely straightforward, except … the skidbuffer assumes that incoming data won’t change while TVALID && !TREADY, and our sticky_* values might change during this time. This leaves us with a couple of possibilities. We could either adjust the skidbuffer’s formal properties so that we don’t make this assumption within it, or we can handle these signals separately.
I chose to handle them separately. Therefore, here’s the skidbuffer’s instantiation without those signals, and we’ll have to deal with them next.
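From memory of the skidbuffer core’s port list, that instantiation might look roughly like the sketch below; check the actual core for the exact parameter and port names.

```verilog
skidbuffer #(
	.DW(31)		// 31 bits of sample data
) skd (
	.i_clk(S_AXI_ACLK), .i_reset(!S_AXI_ARESETN),
	.i_valid(S_AXIS_TVALID), .o_ready(S_AXIS_TREADY),
		.i_data(S_AXIS_TDATA),
	.o_valid(skd_valid), .i_ready(skd_ready),
		.o_data(skd_data)
);
```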
That means I’ll need to implement a quick skidbuffer by hand here. Therefore, when any value is accepted, we’ll copy it to our (local) skidbuffer registers.
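Something like this sketch, with r_trigger and r_encode as the local r_* registers:

```verilog
// Capture the sideband flags whenever a sample is accepted
always @(posedge S_AXI_ACLK)
if (S_AXIS_TVALID && S_AXIS_TREADY)
begin
	r_trigger <= i_trigger || sticky_trigger;
	r_encode  <= i_encode  && !sticky_nencode;
end
```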
The main skidbuffer will drop S_AXIS_TREADY on any stall, so we can use that as our indication of whether a value got stuck in the buffer or not. If nothing is stuck in the buffer, we just use the current incoming values; otherwise we’ll use the r_* values we just set above.
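In sketch form:

```verilog
always @(*)
if (!S_AXIS_TREADY)
	// A value is stuck in the buffer: use the captured flags
	{ skd_trigger, skd_encode } = { r_trigger, r_encode };
else
	// Otherwise use the (possibly sticky) incoming flags
	{ skd_trigger, skd_encode } = { i_trigger || sticky_trigger,
					i_encode && !sticky_nencode };
```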
That’s it for the skidbuffer stage shown in Fig. 4 above. Now, onto the same word detection stage.
You should recognize the first part of this from our preliminary design above. The biggest differences are the new mid_trigger and r_triggered signals, as well as our handling of the skd_encode and skd_trigger with respect to same-word detection. Of the two, the trigger signal is easiest to explain: we just pass it from the skidbuffer to the output of the same-word stage. We’ll also use an r_triggered signal to keep us from ever processing a second trigger signal.
From there we make our exceptions to the same-word detection algorithm. If ever the encode flag is false, we’ll need to mark this value as different so it won’t get folded into any run-length encoding. Similarly, if ever skd_trigger is true for this sample, and we haven’t seen the trigger before, then we’ll also keep this value from being folded into the run-length encoding. Finally, when this value moves forward we’ll clear the valid and trigger signals–that way we don’t count these values more than once. Likewise, on any reset we’ll clear those signals as well as the r_triggered signal.
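Pulling the whole stage together, a sketch of how this might look:

```verilog
always @(posedge S_AXI_ACLK)
begin
	if (skd_valid && skd_ready)
	begin
		mid_valid <= 1'b1;
		mid_data  <= skd_data;
		mid_same  <= (mid_valid || run_valid)
				&& (mid_data == skd_data);

		// The exceptions: never fold an unencoded sample, or the
		// (first) triggering sample, into a run
		if (!skd_encode || (skd_trigger && !r_triggered))
			mid_same <= 1'b0;

		mid_trigger <= skd_trigger && !r_triggered;
		if (skd_trigger)
			r_triggered <= 1'b1;
	end else if (run_ready)
		// Once the value moves forward, clear valid and trigger
		{ mid_valid, mid_trigger } <= 2'b00;

	// The trailing reset check, discussed below
	if (!S_AXI_ARESETN)
		{ mid_valid, mid_trigger, r_triggered } <= 3'b000;
end
```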
Perhaps you may have noticed that this form of a reset is kind of rare for me. Normally, when I write a clocked logic block I will place the reset test first, and only then the non-reset logic following. That form wouldn’t work here–primarily because I’ve left the mid_same and mid_data signals in this always block. Had I placed the reset first, that would have burdened the logic for setting these signals such that they would only be set if the reset wasn’t active.
That leads to a more philosophical discussion of how logic should be broken into blocks. My general rule is that if two or more signals share the same signaling structure, then they can share the same block. This rule generally follows from my focus on minimizing area, and so I dislike adding more logic to a design than necessary. In this case, that might refer to the mid_data signal as an example. Otherwise, if the control logic is different between signals, then I will separate their logic blocks within a design. This usually leads to separating any logic for signals requiring resets into separate blocks from the other associated logic.
In this case, the trailing reset check works just as well for only those signals that need to be reset here. While I know others who are dogmatic about using this reset form, I’m still somewhat flexible in how I build my logic. Still, if you haven’t read Olof Kindgren’s article on the topic, it’s worth a read to understand where others might be coming from on this issue.
Let’s move on to the run_* section. We’ll start with the simple stuff: when we’re ready to accept a value from the mid_* stage, we’ll set the run_valid signal and copy the data and trigger signals into this stage.
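This is the same sketch as before, now with the trigger along for the ride:

```verilog
initial	{ run_valid, run_trigger } = 2'b00;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	{ run_valid, run_trigger } <= 2'b00;
else if (mid_valid && run_ready)
begin
	run_valid   <= 1'b1;
	run_data    <= mid_data;
	run_trigger <= mid_trigger;
end
```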
These new values will either start or continue a run if there is both 1) a new data value, and 2) that new value is the same as the last one we just processed. Similarly, if we have already started a run, then we’ll increment its length now.
Don’t overlook the offset here: run_active will get set first, on the same clock that run_length is still getting set to zero. Only on the next value, the third in any run, will run_length start accumulating.
Perhaps a figure might help, so let me reference Fig. 5 below.
In this trace, you’ll see that on any first value of a new run, such as the yellow run in the trace, mid_same will be low. When that value first repeats, mid_same will be set but run_active will be low. On the second repeat, run_active will be set and run_length will be zero. From there it starts accumulating as new data is given to it.
The next trick is the run_overflow signal. We skipped this signal earlier. I added it to the basic run-length encoder processing in order to keep the run-length counter from overflowing. We’ll set this value to true any time the run length will overflow if one more is added to it. I suppose that’s technically different from overflow, but run_length_counter_will_overflow seemed a bit long for a signal name.
At any rate, this run_overflow signal will be true if ever our run_length counter is all ones. That means we can compare it against all ones but the last–a negative two if you will, and spare a bit in our comparison. We’ll clear this flag if we either get a new value that doesn’t increase our run length, or if we are ever forced to output a value at the maximum run length. I’ll also clear just about all of these signals on reset.
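A sketch of how this flag might be registered:

```verilog
initial	run_overflow = 1'b0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	run_overflow <= 1'b0;
else if (mid_valid && run_ready)
	// Will the counter be all ones following this step?  Compare
	// against all ones but the last -- a negative two, if you will
	run_overflow <= mid_same && run_active
			&& (&run_length[30:1]) && !run_length[0];
```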
That brings us to the ready handshaking signal gating the entrance to the run_* stage. This is one of the more critical pieces in this algorithm. If we get it wrong we might either drop values from our pipeline or accidentally introduce phantom values into it.
We’ll start with the clearest reason for run_ready to be true: we can set run_ready if the next stage is ready for another output. Even if the M_AXIS_* stage is stalled, however, we might still be able to accept a new value into the run_* stage if it were to only increase our run length counter. Hence, we can set run_ready if we are in the middle of a run, run_active, and the next element continues this run, mid_same. The exception to this rule is if adding another element to the run would overflow our run-length counter.
Finally, we don’t want the run_* stage to step at all if there isn’t a new valid value to be placed into it. Remember how the run_* stage works: it starts out idle, but after receiving its first value it then remains valid for all time with the next output value within it. This value gets “stuck” in the run_* stage until we know the end of whatever run might be within it. That’s the time when the run value (or counter) will get kicked out by a new value from the mid_* stage. It’s for this reason that we need to make certain that the run_* stage never moves forward unless there’s also a value entering from the mid_* stage.
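All of which might be sketched as:

```verilog
always @(*)
begin
	// The clearest case: the output stage can accept another word
	run_ready = !M_AXIS_TVALID || M_AXIS_TREADY;

	// Even when the output is stalled, we can still accumulate a
	// repeat into the run counter -- unless doing so would overflow it
	if (run_active && mid_same && !run_overflow)
		run_ready = 1'b1;

	// Never step this stage without a new value entering it
	if (!mid_valid)
		run_ready = 1'b0;
end
```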
That leads us to the M_AXIS_* stage–the output stage in our processing pipeline. We’ll need to set two values here: the M_AXIS_TVALID value, indicating that we have a value to send out, and the actual value, M_AXIS_TDATA. (This design doesn’t make use of any of the other AXI-stream signals, such as TLAST, TSTRB, TKEEP, or TID.)
We can start defining the M_AXIS_TVALID signal with the logic that would be used for any AXI-based *VALID signal. This part is kind of a fill-in-the-blank logic form.
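From memory, the form looks something like this, with the blank still left to be filled in:

```verilog
initial	M_AXIS_TVALID = 1'b0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	M_AXIS_TVALID <= 1'b0;
else if (!M_AXIS_TVALID || M_AXIS_TREADY)
	M_AXIS_TVALID <= 1'b0;	// <-- the blank: is there a new output?
```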
While this form might seem quite straightforward, do beware that it is still possible to mess it up–as Xilinx did in their AXI EthernetLite controller. In that controller, they set their design to wait on AXI_RREADY without also checking for !AXI_RVALID. The result was that, when the controller was subjected to a formal check, SymbiYosys found a way to hang the core by simply holding AXI_RREADY low–such as any out-of-order AXI interconnect might do.
That just leaves us with the blank in this logic form to fill in with our M_AXIS_TVALID logic. In this case we’ll default to zero, sending nothing. After that, we can use the run_ready signal to help us know when to move forward. We’ll place a value into the M_AXIS_* stage if there’s both a valid value in the run_* stage and if run_ready is true so that it’s moving forward. While that’s a good start, however, it isn’t enough.
From the run_* stage, we only move an output forward on two conditions. First, we always move forward if there’s a new value entering the run_* stage that would overflow the counter. Second, we’ll always move forward on a new or different value. Such a value would end any ongoing run. We’ll also move forward on the first item of any run–since we always need to place the data value before the subsequent run length itself.
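Putting those conditions together, the filled-in form might look like:

```verilog
initial	M_AXIS_TVALID = 1'b0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	M_AXIS_TVALID <= 1'b0;
else if (!M_AXIS_TVALID || M_AXIS_TREADY)
	// Send a word whenever the run_* stage steps forward with
	// something that can't be folded into the current run
	M_AXIS_TVALID <= run_valid && run_ready
		&& (run_overflow || !mid_same || !run_active);
```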
This also introduces an unfortunate consequence into our logic analyzer application. As you may recall, the logic analyzer will record data until some programmable time after the trigger. Nothing controls what the oldest value in memory will be at that time. It might be a data value, or it might be a run encoding for … a never-to-be-recovered value. Worse, it might also be a series of run overflow encodings ending only with the value at the time of the trigger. While that’s a potential reality of this particular implementation, the good news is that it hasn’t (yet) seriously impacted any analysis I’ve needed to do. Why not? In most of my examples, the run-length encoding is good enough that there’s always more than enough data left for any necessary analysis.
You can also think of it this way: with a 31-bit counter, it requires roughly 2^31 elements to fill a run. At a (rough) 100MHz clock, that’s about 21 seconds of data. Filling up an 8kB buffer with nothing but run indications would therefore require about half a day of data at this rate. While it might happen, it’s definitely not likely.
After TVALID, we’ll set TDATA any time !TVALID || TREADY. Again, this follows from the basic form of setting any AXI data value associated with any AXI stream. Valids follow the form above; data follows the form below.
What value shall we set our result to? Well, if we are ending a run, then we’ll want to output the fact of the run together with the run length. Otherwise, we’ll just output the data we’ve been holding on to.
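In sketch form:

```verilog
always @(posedge S_AXI_ACLK)
if (!M_AXIS_TVALID || M_AXIS_TREADY)
begin
	if (run_active)
		// Ending a run: send the (zeros-based) run length
		M_AXIS_TDATA <= { 1'b1, run_length };
	else
		// Otherwise, send the data we've been holding on to
		M_AXIS_TDATA <= { 1'b0, run_data };
end
```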
That leaves only one last loose end to tie up before we complete our design, and that’s the outgoing trigger for the logic analyzer subcore. Remember, we’ve needed to keep this associated with the data that arrives concurrently with it. Therefore, as we’ve moved data through our pipeline, we’ve kept the trigger indication together with the data in each and every stage. Here, we just forward the trigger to the output when it leaves the run_* stage.
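Perhaps something like the following, where o_trigger is my hypothetical name for the outgoing trigger port:

```verilog
initial	o_trigger = 1'b0;	// hypothetical output name
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	o_trigger <= 1'b0;
else if (!M_AXIS_TVALID || M_AXIS_TREADY)
	// The trigger leaves together with its (uncompressed) data word
	o_trigger <= run_valid && run_ready && run_trigger;
```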
What about double checking that the trigger is only ever output for a data word, never a compressed word? Or guaranteeing that this outgoing trigger is only true when M_AXIS_TVALID is also true? For that, we’ll turn to formal methods and double check that we haven’t damaged our algorithm by simplifying it this far.
Indeed, I should point out that even while writing this blog article I’ve done a lot of simplifying. The result you see above uses only an estimated 130 Xilinx series-7 LUTs. I’ve also leaned heavily on the formal proof to do this. Why not? A full proof of this core takes only about 20 seconds. That makes it easy to make changes and adjust any logic.
Formal Properties
When I first built this design, I’ll admit I stared at the blank formal property section for a while just wondering what sort of properties I should use. I eventually came up with a couple of ideas for verifying this core that I haven’t discussed much on this blog. They’ve been very useful. These are:
- Interface properties for AXI-stream
- Counting
- Counting special values
- Negative logic checking
There are probably better names for these concepts, but these are the best names I have for them so far. Yes, we’ve discussed the AXI stream properties before, and we’ve also counted transactions before when building bus property sets. I suppose, therefore, that these two concepts might not really be all that new to my blog. However, I am pretty sure that I haven’t yet discussed either the special value or negative logic checking ideas before.
Interface properties
While I’ve written an AXI stream interface property file, it’s really overkill for most AXI stream designs. For today’s purpose, we only really need two checks: First, following a reset, all VALID’s need to be cleared. Second, any time the channel is stalled the transaction details are not allowed to change.
For the incoming AXI stream, this amounts to a couple of assumptions.
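Using the usual f_past_valid idiom, those assumptions might be sketched as:

```verilog
reg	f_past_valid;

initial	f_past_valid = 1'b0;
always @(posedge S_AXI_ACLK)
	f_past_valid <= 1'b1;

always @(posedge S_AXI_ACLK)
if (!f_past_valid || $past(!S_AXI_ARESETN))
	// Following any reset, VALID must be low
	assume(!S_AXIS_TVALID);
else if ($past(S_AXIS_TVALID && !S_AXIS_TREADY))
begin
	// When stalled, nothing about the request may change
	assume(S_AXIS_TVALID);
	assume($stable(S_AXIS_TDATA));
end
```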
The properties are the same for the outgoing stream–only they are now expressed as assertions instead of assumptions.
My AXI stream property file has many more properties–overkill for most problems. Things like minimum and maximum packet sizes, properties on the maximum time something can be stalled, or even properties forcing TKEEP and TSTRB to match. Most cores don’t need this, and so sadly my AXI stream property file hasn’t gotten much testing.
Counting
The idea for counting is really simple: let’s count the number of items that enter into this design, and then verify that it matches the number of items output. We did this for our Wishbone property file, as well as when we built our AXI-lite property file. In both cases, we used a count of outstanding transactions as part of a check that no acknowledgment would be returned without a prior request.
We’ll do roughly the same thing here. The first step will be to count the number of items that are outstanding–that is, the number of items that have come into this core but haven’t yet been output. The logic should look very similar to the logic for calculating the fill of a FIFO. Notice the reset logic and the case statement–keys to doing both.
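Here’s a sketch of what such a counter might look like. The names (f_outstanding, f_consumed) and the run accounting are my assumptions: I’ve assumed each output word accounts for either one data item or, when its MSB is set, for count-plus-one repeats of the prior item.

```verilog
reg	[33:0]	f_outstanding;
reg	[31:0]	f_consumed;

// How many input items does the word in the output register represent?
always @(*)
if (M_AXIS_TDATA[31])
	f_consumed = M_AXIS_TDATA[30:0] + 1;	// a run word
else
	f_consumed = 1;				// a plain data word

initial	f_outstanding = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	f_outstanding <= 0;
else case({ S_AXIS_TVALID && S_AXIS_TREADY,
		M_AXIS_TVALID && M_AXIS_TREADY })
2'b10: f_outstanding <= f_outstanding + 1;
2'b01: f_outstanding <= f_outstanding - f_consumed;
2'b11: f_outstanding <= f_outstanding + 1 - f_consumed;
default: begin end
endcase
```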
What rules might we make of this counter? We can start with the obvious one: we can’t place an item into the output unless there’s been a corresponding input. The same counts for runs on the output–you need a corresponding number of values on the input before you can output a run.
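As an assertion, that contract might read:

```verilog
// No output may leave without a corresponding number of inputs
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID)
	assert(f_outstanding >= f_consumed);
```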
While that makes for a great contract, it will never pass induction. The induction engine will always find a way to start the design in an inconsistent state, and hence break these properties. Over time, I’ve learned that this tends to happen with any inequality assertion–like the one above. If you want to pass induction, therefore, you’ll need to recount how many items are in the pipeline and then compare that to our outstanding number above–using an equality assertion.
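Under the same accounting assumptions, that recount might be sketched as:

```verilog
reg	[33:0]	f_count;

always @(*)
begin
	f_count = 0;
	if (!S_AXIS_TREADY)	// a value is stuck in the skidbuffer
		f_count = f_count + 1;
	if (mid_valid)
		f_count = f_count + 1;
	if (run_valid)		// the held value, plus any counted repeats
		f_count = f_count + (run_active ? run_length + 1 : 1);
	if (M_AXIS_TVALID)
		f_count = f_count + f_consumed;
end

always @(*)
if (S_AXI_ARESETN)
	assert(f_outstanding == f_count);
```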
For good measure, I thought I’d add another property to this check: if the design is ever outputting a run, then there should be a new piece of data in the run_* stage–rather than a second run building. Only, I didn’t think about the fact that two runs in a row might take place if the first one overflowed–at least not until SymbiYosys caught that bug. Therefore, the check (now) includes an exception for overflow.
I don’t know about you, but just counting items within the processing pipeline didn’t really convince me that my own design worked. It’s really not enough that the right amount of data is passing through the pipeline, what I really want to know is whether or not the right data is passing through the pipeline. For that, I needed another type of check.
Counting Special data
So, I decided to count specific items within the pipeline as a measure of whether or not the right data is passing through it properly. Here’s the idea: if for every possible data value going through this logic, the number in always matched the number output, then I can be sure that the right data is getting sent out.
The first step in this check is to let the solver pick an arbitrary data value.
Every time I explain this concept to a student I’m reminded of a magician performing a card trick: “Pick a card. Any card.” The same sort of thing is true here. The solver is challenged to pick a data value–any data value, and I intend to prove the design works no matter which value the solver picks.
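In SymbiYosys, letting the solver pick that card is just the anyconst attribute:

```verilog
// Pick a card -- any card
(* anyconst *) reg [30:0] f_special_data;
```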
For practical reasons, I also needed to keep track of whether or not any output run, in the M_AXIS_* stage, corresponded to this special value. That’s because when the output data value contains a run, there’s no data associated with the run value–so it’s impossible to tell any more whether it’s the special data value or not. Therefore, anytime M_AXIS_* could accept a new value, I used a formal-only register, f_special_tdata, to keep track of whether or not that new value was associated with the f_special_data value.
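Something along these lines:

```verilog
reg	f_special_tdata;

initial	f_special_tdata = 1'b0;
always @(posedge S_AXI_ACLK)
if (!M_AXIS_TVALID || M_AXIS_TREADY)
	// Does the word (data or run) entering the output stage
	// describe our special value?
	f_special_tdata <= run_valid && run_ready
				&& (run_data == f_special_data);
```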
Just to make certain I had this done right, a quick assertion could verify that if the output wasn’t encoding a run, f_special_tdata truly encoded whether or not the outgoing data was our special value.
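Sketched as:

```verilog
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID && !M_AXIS_TDATA[31])
	assert(f_special_tdata
		== (M_AXIS_TDATA[30:0] == f_special_data));
```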
Remember the comment about equalities versus inequalities? The same can be said for implications versus if-and-only-if implications. By using an == sign in the test above, I can check both halves of the implication: first, if f_special_tdata is true, and then if it isn’t. Checking both is always a good idea–if it is appropriate to your problem.
With that little bit of logic under my belt, I can now count how many times this special value, f_special_data, has entered my pipeline minus the number of times it has left. As before, this counter requires a reset and a case statement.
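Reusing the f_consumed accounting from above, a sketch:

```verilog
reg	[33:0]	f_special_count;

initial	f_special_count = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	f_special_count <= 0;
else case({ S_AXIS_TVALID && S_AXIS_TREADY
			&& (S_AXIS_TDATA == f_special_data),
		M_AXIS_TVALID && M_AXIS_TREADY && f_special_tdata })
2'b10: f_special_count <= f_special_count + 1;
2'b01: f_special_count <= f_special_count - f_consumed;
2'b11: f_special_count <= f_special_count + 1 - f_consumed;
default: begin end
endcase
```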
I can now repeat the same contract checks as I did before: If a piece of data is being output, there should be enough counts of that data item within our pipeline to support that output.
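That is:

```verilog
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID && f_special_tdata)
	assert(f_special_count >= f_consumed);
```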
As before, this isn’t enough to handle induction. Inequalities will rarely make it through an induction check. Therefore, let’s count every time this value shows up in our pipeline, and compare that combinatorial count to the registered one above.
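The recount follows the same pattern, and the same accounting assumptions, as before:

```verilog
reg	[33:0]	f_special_recount;

always @(*)
begin
	f_special_recount = 0;
	if (!S_AXIS_TREADY && (skd_data == f_special_data))
		f_special_recount = f_special_recount + 1;
	if (mid_valid && (mid_data == f_special_data))
		f_special_recount = f_special_recount + 1;
	if (run_valid && (run_data == f_special_data))
		f_special_recount = f_special_recount
			+ (run_active ? run_length + 1 : 1);
	if (M_AXIS_TVALID && f_special_tdata)
		f_special_recount = f_special_recount + f_consumed;
end

always @(*)
if (S_AXI_ARESETN)
	assert(f_special_count == f_special_recount);
```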
Other Induction checks
Before getting to the negative logic checks, I should point out that I also have a lot of simple and basic checks on the various registers from within the core. In general, you’ll want to pin down any register within your design using an assertion based upon its surrounding logic. That way the induction engine doesn’t surprise you.
There’s another reason for doing this as well. If you pin everything down sufficiently, then you’ll know quickly if some update in the future breaks the design. So let’s run through a couple of these quick checks.
We’ll start with the run_* stage. This stage always holds on to a value, only replacing it when there’s something replacing it. Therefore, if run_valid ever becomes true, it should remain true for all time. Likewise, if ever the mid_* stage became valid, then run_valid should become high on the next clock, assuming it wasn’t high already, and then remain high for the rest of time. That also means that you can’t ever get to an output without going through the run_* stage and leaving run_valid true behind you.
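Hedged into assertions:

```verilog
// Once valid, the run_* stage stays valid forever
always @(posedge S_AXI_ACLK)
if (f_past_valid && S_AXI_ARESETN && $past(S_AXI_ARESETN)
		&& $past(run_valid || mid_valid))
	assert(run_valid);

// Nothing reaches the output without passing through the run_* stage
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID)
	assert(run_valid);
```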
Looking over mid_same, it should only ever be true if the data in mid_data matches the current run_data. Likewise, if there’s nothing in the mid_* stage, then mid_data should match what’s in the next stage.
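That is, roughly:

```verilog
always @(*)
if (S_AXI_ARESETN && mid_same)
	assert(mid_data == run_data);

always @(*)
if (S_AXI_ARESETN && run_valid && !mid_valid)
	assert(mid_data == run_data);
```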
Let’s now turn to the run_active signal. This signal should be low if there’s nothing (yet) in the run_* stage. Further, anytime run_active is low, the run_length should also be zero.
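Or, as assertions:

```verilog
always @(*)
if (S_AXI_ARESETN && !run_valid)
	assert(!run_active);

always @(*)
if (S_AXI_ARESETN && !run_active)
	assert(run_length == 0);
```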
Our overflow pending flag is specific: it should only ever be high if the run_length is about to overflow. This will only be true, as proved above, if run_valid && run_active are also true–so I’ve removed those last two conditions from this check.
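Which might reduce to:

```verilog
always @(*)
if (S_AXI_ARESETN && run_overflow)
	assert(&run_length);	// the counter must be all ones
```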
The trigger in the middle section can only ever be true if the mid section has a valid item within it, if that item is not marked as being the same as the previous item, and if the r_triggered flag is marked to indicate a trigger signal has been seen and processed. The same can be said of the trigger signal in the run_* stage, and again of our final output trigger.
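Sketched, using the hypothetical o_trigger name from above:

```verilog
always @(*)
if (S_AXI_ARESETN && mid_trigger)
	assert(mid_valid && !mid_same && r_triggered);

always @(*)
if (S_AXI_ARESETN && run_trigger)
	assert(run_valid && r_triggered);

// The trigger may only ever leave with a valid, uncompressed data word
always @(*)
if (S_AXI_ARESETN && o_trigger)
	assert(M_AXIS_TVALID && !M_AXIS_TDATA[31]);
```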
Now that that’s out of the way, let’s look at the negative logic check. That one was fun.
Negative Logic Checking
Here’s the idea behind the negative logic check: If a particular value never enters into the design, then it should never be output. It’s pretty simple, though it can also be pretty useful.
The first step is to decide whether or not we want to do this check in the first place.
Here’s the problem that necessitates this f_never_check flag: making an undue assumption about the input might somehow allow the proof to pass vacuously elsewhere. Therefore, if there’s a possibility the proof might fail without this check, then we want to keep that possibility open. For that reason we’ll gate our assumptions and assertions below based upon this f_never_check signal.
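Again, anyconst does the job here: the solver itself decides whether the check applies.

```verilog
(* anyconst *) reg f_never_check;
```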
The next step is to let the solver pick an arbitrary value that we’ll never see. I chose to call this value f_never_data.
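As before:

```verilog
(* anyconst *) reg [30:0] f_never_data;
```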
Here’s how the check works: First, we’ll assume that this data never enters into our design.
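Sketched as:

```verilog
always @(*)
if (f_never_check && S_AXIS_TVALID)
	assume(S_AXIS_TDATA != f_never_data);
```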
Our goal will be to prove that this item never comes out of the design at any time, so here’s that assertion.
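Something like:

```verilog
always @(*)
if (f_never_check && S_AXI_ARESETN
		&& M_AXIS_TVALID && !M_AXIS_TDATA[31])
	assert(M_AXIS_TDATA[30:0] != f_never_data);
```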
Now, to pass induction, let’s make sure we never have that value internally anywhere. If the skidbuffer is producing a valid output, then we need to check that it’s not in the skidbuffer.
Likewise, if there are items in either the mid_* stage or the run_* stage, then those stages shouldn’t contain this value at all.
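All three internal checks might be sketched together as:

```verilog
always @(*)
if (f_never_check && S_AXI_ARESETN)
begin
	if (!S_AXIS_TREADY)	// the skidbuffer holds a value
		assert(skd_data != f_never_data);
	if (mid_valid)
		assert(mid_data != f_never_data);
	if (run_valid)
		assert(run_data != f_never_data);
end
```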
Together, these formal properties gave me a strong confidence that this run-length encoder core would never do the wrong thing.
Cover Checks
Unfortunately, that confidence was misplaced.
While I really like assertion based formal property verification, I’ve been caught more than once with a resulting design that didn’t work. Indeed, that was the case many times with this design as I was building it–the proof above would pass, but the design wouldn’t have worked.
I caught several of these last bugs using some cover() checks. That’s also one of the reasons why I require a cover check before I will sign off on any of my own designs.
Here was my idea: I wanted to verify that this design could indeed generate compressed data. Therefore, I wanted to see an output from the core that showed a series of compressed data values. I chose to use an index for this, forcing the design to produce alternating run-length encoded and non-runlength encoded data, as it ran through a series of outputs.
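A sketch of that check: step cvr_index forward each time the output matches the next expected word, then cover the final index. The exact values below are my illustration of the idea, matched to the trace discussed next.

```verilog
reg	[2:0]	cvr_index;

initial	cvr_index = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	cvr_index <= 0;
else if (M_AXIS_TVALID && M_AXIS_TREADY)
case(cvr_index)
0: if (M_AXIS_TDATA == { 1'b0, 31'd0 })	// the data value zero
	cvr_index <= 1;
1: if (M_AXIS_TDATA == { 1'b1, 31'd1 })	// a run of two, encoded as one
	cvr_index <= 2;
2: if (M_AXIS_TDATA == { 1'b0, 31'd2 })	// the data value two
	cvr_index <= 3;
3: if (M_AXIS_TDATA == { 1'b1, 31'd3 })	// a run of four, encoded as three
	cvr_index <= 4;
default: begin end
endcase

always @(*)
	cover(cvr_index == 4);
```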
That said, the logic I’ve shared above was the conclusion of getting this cover check (and some others) to pass. In other words, it works now.
So, how does it look? Using this cover statement, I was able to generate the trace shown in Fig. 6 below.
Let’s walk through how this works. The first data value, 0, can be immediately output. After two more 0’s are entered, a run of 2 is output. This is encoded as one less, allowing a run of zero to encode a repeat of the last value–as we’ve discussed above. Then we have a data value of 2–as forced by the cvr_index check above. This is followed by an additional four 2’s, and so a run of 4 (encoded as a 3) is output. We then have seven 4’s. This results in an output of 4 (the first data element) followed by a run of 6 (encoded as a 5). The last value, 6, is then received, followed by an arbitrary (different) data value having the trigger set. Sure enough, these two values appear in the output as desired.
That’s what I wanted!
Conclusion
Building this simple run-length encoder turned into a fun afternoon project! Even better, I learned a lot when verifying it. From a blank slate, I think I came up with some very useful properties for knowing whether it works or not. I also think I can say confidently that the outgoing encoding will be a valid representation of the input. Further, I can also confidently say that the resulting stream will be AXI stream protocol compliant. The only thing I’m not quite so certain of is how well the core will handle either the trigger input or the run-length encoding enable input. Those two checks might require a bit more work.
While this run length encoder looks like it will work quite well, the entire compressed memory scope design, shown in Fig. 7 on the left, is (sadly) not finished yet. Yes, it’s posted, but if details matter (and they often do), then I’m going to need to go back and pay attention to how this compression scheme is integrated into the rest of the design. It’s not well integrated at present. As I see it, there are three problems I’ll need to come back and pay some more attention to.
- Since the internal memory scope subcore handles the AXI-lite bus, only the MEMSCOPE subcore currently knows about any request for a manual trigger. That means the manual trigger will be applied downstream of our run-length encoder. If this happens, the manual trigger might take place in the middle of a run, violating our rule that the trigger must only happen on a data element and never on a run-length coded element.
- The scope’s semantics require that it be primed before it can be triggered. Priming involves filling all of the scope’s memory with valid data. That way, when one reads out the memory later, there won’t be any confusion over which memory values have been initialized with good data and which have not. Sadly, the wrapper shown in Fig. 7 doesn’t (yet) know exactly when the MEMSCOPE within it has been primed, and so it might not activate the run-length encoder until a buffer (or two) after all memory has been filled.
- The MEMSCOPE subcore includes AXI stream overflow detection. This check circuit looks for any violations of the AXI stream protocol as a way of knowing whether it can keep up with the data source. The problem is that we’ve just formally verified that this run-length encoding core will never violate the AXI stream protocol. Any potential overflows will be upstream of this core–where there’s no overflow detection. This renders the overflow detection circuit irrelevant.
Perhaps I shouldn’t be so down on myself–the original compressed (run-length encoded) scope didn’t support these features at first either. They were added as an afterthought long after that core had been used successfully many times over.
Until then, I’ll just note the problems with the compressed memory scope wrapper in its file header. That way I’ll remember these issues later when I come back to update the wrapper.
Who can number the clouds in wisdom? or who can stay the bottles of heaven (Job 38:37)