Run length encoding an AXI stream
When I am forced to debug a design in actual hardware, I like to use my own internal logic analyzer. It wasn’t hard to build, and my version has a feature not often found in others: run-length encoding (RLE). I call this internal logic analyzer with RLE my “compressed scope”.
The scheme itself, as illustrated in Fig. 1, is quite simple: 31 bits of data come in, 32 bits go out. When a value first comes in, it gets sent to a holding register, then copied to the output on the next clock cycle after having a zero bit prepended as the MSB of the output word. Any subsequent copies of that same value coming in are then accumulated into a counter, and when a different value enters the holding register the counter is forwarded to the output stage with a one bit prepended to the output word. These two output words, shown in Fig. 1 on the right, form the simple basis for this encoding.
While the scheme is fairly simple, it’s kept me out of trouble a couple of times. Once, I was building an I2C controller to the wrong spec. The compressed scope helped me discover what wasn’t working. In another example, I was working with a GPS receiver and wanted to know how the one-pulse-per-second (PPS) signal and the GPS serial port were related. Specifically, when GPS time was transmitted across the serial port, how easy would it be to confuse which PPS it was related to? Being able to generate a trace, looking like Fig. 2 and crossing several seconds, was very useful. In both cases, it was very valuable to be able to record a signal for seconds or even minutes on end to then later analyze what happened within the design.
Some time later, I got serious about verifying my compressed scope implementation and came across a couple issues that needed dealing with along the way. First, I modified the scheme so that it wouldn’t turn on the run-length compression until the capture memory was full. That way, the scope would be primed and ready for its first capture all the sooner. Then, I adjusted the RLE scheme again so that it wouldn’t apply the encoding scheme to the sample that triggered the scope. That meant that I could always back up from the end of the recording to know exactly what was happening at the time of the trigger.
That left me ready for the next problem: when Xilinx’s Memory Interface Generator (MIG) generated a DDR3 core for my Nexys Video board that wasn’t working, I pulled out my compressed scope again to see what was going on. I was able to quickly determine that the MIG core was generating AXI RVALID signals without prior AXI ARVALID requests. Not knowing how to fix what appeared to obviously be a problem with Xilinx’s MIG core, I wrote asking for help.
I was first told that I must be crossing 4kB page boundaries within some AXI burst request. Not possible, I said, since my AXI master wasn’t making burst requests.
I was then told that these RVALID responses must have somehow gotten buried in the MIG system somewhere and so, somehow, my own logic was at fault for not remembering that it had made a prior request. I went back and double checked the formal proof of my design. It checked out. I checked the formal proofs of everything upstream. Those checked out as well. I added a counter to my design–counting the number of ARVALIDs minus the RVALIDs. Sure enough, I was getting an extra RVALID without a request.
But how could I explain that to the Xilinx representative?
In this case, I again turned to my compressed scope, and this time captured EVERYTHING from the AXI interactions–all of the ARVALID and RVALID signals, all of the AWVALID, WVALID, and BVALID signals, everything I thought would be relevant that would fit within 31-bits. When I went back to Xilinx’s tech support the next time, I had a four second recording in just a couple kB of block RAM that captured every transaction from chip startup to the error, to when I stopped the analyzer to look at the trace.
Sure enough, there was an extra RVALID.
My contact at Xilinx had no idea what was going on. I suppose I shouldn’t complain. I wasn’t purchasing paid tech support. I am very grateful that they took the time to even look into my problem, but the reality was that 1) no one else was complaining about their designs being broken in a similar manner, and so 2) I was on my own to try to figure it out. If you recall the story from when I shared it before, you’ll remember the problem was that I had misconfigured Xilinx’s memory controller for the wrong DDR3 SDRAM chip. My point here is simply that the compressed logic analyzer’s capability to collect more than 400 million clock ticks of data in only a few kB of block RAM was key to diagnosing and zeroing in on the problem.
More recently, I came back to revisit my Wishbone scope project and give it an AXI capability. First I created an AXI-lite version of the basic scope. Aside from the challenge of creating an AXI-lite processing pipeline that was more than a single clock deep, that design was fairly unremarkable. Then, as I was starting to figure out how to manage the full AXI interface with bursts, I decided to try building a logic analyzer that wrote its capture to an external memory. I called the result a MEMSCOPE. It made for a fun project, although to be honest I’m not sure what I’d do with a logic analyzer capable of recording over a GB of data.
This week, I came back to that MEMSCOPE design and thought I might add my run-length encoding scheme to the front of it. The result forms the basis for today’s discussion below.
Design Concept
Since my intent with this design was to build an AXI based design, I thought I’d go all the way and even build an AXI-stream based run-length encoder. It should take 31 bits in, and produce a 32-bit RLE stream out.
Well, that was the basic idea, but as always not everything fits in an AXI stream. For example, a logic analyzer can’t tolerate backpressure: any backpressure would corrupt the incoming stream. I also wanted two extra signals: one to indicate whether this word should be encoded in the first place, such as during the initial run-up when memory is being filled for the first time, and a second to indicate whether this value should be associated with the trigger for the scope or not. These didn’t fit the AXI stream model either, since they were asynchronous to the stream itself–but I still found the AXI stream model a useful place to start from.
Internally, the design broke apart nicely into four separate stages.
There would be the initial skidbuffer stage, followed by a check to see if this new word was the same as the last one. That decision would then feed into the run-length counter proper, and the result of that would be our outgoing AXI stream. I then assigned each of these stages a prefix, so that the registers associated with each of the stages could clearly be identified: S_AXIS_* would describe the input, skd_* the outputs from the skidbuffer, mid_* the results from the same-word detection, run_* the actual run-length encoder, and M_AXIS_* the output.
Let’s discuss each of these stages briefly in turn from a basic design standpoint. Well, not quite each of them, since we’ve already discussed the basic skidbuffer before. Therefore, let’s start with the same-word detection. This one is pretty easy: if an output is available from the skidbuffer, compare it to the last output from this stage and mark the value as mid_same if the two match.
Here, I’ve used the skd_* prefix to describe the values coming from the skidbuffer, the mid_* prefix to describe the values in this same-word detection stage, and the run_* prefix to describe the run-length detection stage that follows.
Ideally, this same-word detection would only need to check whether the new data matched the last data. There are a couple of reasons why that’s not quite good enough. It might be, for example, that we’ve just been reset and so no data has ever come through here yet. Therefore, we need to gate this comparison against a valid signal somehow. While it makes the most sense to gate it against mid_valid, since that flag caveats mid_data having the correct value in it, that’s not good enough by itself, since this stage will get flushed as soon as the next stage allows it. Therefore, we’ll need to gate against both mid_valid and the run_valid signal from the following stage.
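Hedged into Verilog, that detection might look something like the following sketch, using the stage prefixes described above. (Here, skd_valid is my assumed name for the skidbuffer’s output valid signal.)

```verilog
// Same-word detection: a sketch, not necessarily the core verbatim
always @(posedge S_AXI_ACLK)
if (skd_valid && skd_ready)
begin
	mid_data <= skd_data;
	// Only meaningful if a prior value exists in this stage or the next
	mid_same <= (mid_valid || run_valid) && (mid_data == skd_data);
end
```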
The run-length encoding stage is a bit more interesting. This is the one clock cycle where all the work gets done. It’s also a stage that, once filled, will never empty: it will always keep the last value sent within it. When a new value is provided, it will either kick that last value out to the output, or accumulate its count of repeats.
Let’s see how we might build this.
Our logic would focus on whenever new data comes in. We’ll continue to use the run_ready flag to indicate that we’re ready to move data from the mid_* stage to the run_* stage.
The first step will be to set this stage as valid any time we move data into it, and then to copy the data that moves into it at that time.
That’s the easy part.
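As a sketch, ignoring the trigger handling until the detailed section below:

```verilog
always @(posedge S_AXI_ACLK)
if (mid_valid && run_ready)
begin
	run_valid <= 1'b1;	// Once set, this only clears on reset
	run_data  <= mid_data;
end
```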
The next part is where things get a bit more fun. Here’s where we do our actual run-length encoding. If a new value comes from the mid_* stage that’s the same as the last one, that is with mid_same set, then we’ll start a new run and start counting repeats. If we are already in a run when a new value comes in with mid_same set, then we’ll simply increment the length of that run; otherwise we just clear the run length in order to start counting. This has the pleasant effect of making the counter zeros-based. Two consecutive and identical elements would then get encoded as (clear MSB, data), followed by (set MSB, 0). Three consecutive identical elements would be encoded by (clear MSB, data), followed by (set MSB, 1), and so forth.
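A sketch of that counting logic, under the same naming assumptions:

```verilog
always @(posedge S_AXI_ACLK)
if (mid_valid && run_ready)
begin
	// We're in a run whenever the new value repeats the last one
	run_active <= mid_same;
	if (mid_same && run_active)
		// Already in a run: count one more repeat
		run_length <= run_length + 1;
	else
		// Otherwise start over -- hence the zeros-based count
		run_length <= 0;
end
```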
That brings us to the last stage, the one with the M_AXIS_* prefix. We’ll want to send a value out whenever there’s a value in the run_* stage AND we’re not accumulating that value into a run.
Setting TDATA is even easier, since we don’t need to worry about resets. Here, we’ll set TDATA to either the data from the run_* stage, if there’s no run active, or otherwise the length of the run if a run is active.
That leaves two key pieces of logic that we’ve skipped: run_ready and skd_ready, the two handshaking flags that’ll tell us when we can step these stages forward. These two values are very much key to the success of the algorithm, and they are complex enough that we can’t really gloss over them here.
Internal to our algorithm, all of the ready signals are combinatorial. That means they’ll be simplest at the end of the pipeline, so that’s where we’ll start and work our way forwards.
We can accept data into the run_* stage as long as there’s new data coming from the same-word detector and the following stage, the M_AXIS_* stage, is ready to accept data.
Only that’s too restrictive. We can also accept values into the RLE stage even if the output is stalled, but only if those values will be going into our run-length accumulator.
As it turns out, that’s the difficult signal. The skd_ready signal is much simpler. Basically, we can move data into the mid_* stage if either there’s nothing in the mid_* stage, or if the mid_* stage’s data is flowing into the run_* stage. Indeed, this is the common way to propagate ready values upstream when there are no exceptions to the general rule–as there were with run_ready.
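Put together, a first sketch of these two flags might look as follows; the overflow exception to run_ready will be added in the detailed section below.

```verilog
always @(*)
begin
	// Ready whenever the output stage can accept another word ...
	run_ready = !M_AXIS_TVALID || M_AXIS_TREADY;
	// ... or when the new value would only lengthen the current run
	if (run_active && mid_same)
		run_ready = 1'b1;
	// Never step this stage without a new value entering it
	if (!mid_valid)
		run_ready = 1'b0;

	// The common upstream form: the stage is empty, or emptying
	skd_ready = !mid_valid || run_ready;
end
```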
There you have the basics of the algorithm. There are some other key details we’ll need to work out when we get into the weeds below. For example, you don’t want to compress the trigger event–even if it is the same as all the events around it. Similarly, you don’t want to compress when you are initially filling memory, since you want to fill that memory as fast as possible and since the logic analyzer can’t trigger until the memory has at least filled once–lest you get confused later by uninitialized data.
There’s also one critical feature we’ll have to deal with below as well: how shall a counter overflow be handled? You don’t want to drop data no matter what happens. Therefore, once the run-length counter reaches its maximum value, we’ll need to output that value no matter what comes next.
Other than those three exceptions–triggers, priming the capture, and overflow handling–that’s all there is to the algorithm. Indeed, run-length encoding is one of the simplest compression schemes you can implement in hardware: you can operate at full speed, and there’s no real need for any complicated tables to maintain.
Detailed Implementation
Let’s now move from our broad overview to a more detailed explanation of how this run-length encoder works.
The first difference between our broad overview and reality that we’ll have to deal with is the pair of i_encode and i_trigger flags. The first flag, i_encode, enables the run-length encoding. This will be set to zero externally until the capture memory is first filled. The second, i_trigger, is the trigger for the internal logic analyzer. It needs to be kept synchronized with the data. Moreover, if the trigger is ever active then the run-length encoding also needs to be disabled.
The challenge associated with these two flags is that they may arrive asynchronous to the data handshake. That means we’ll need to remember if we get triggered between sample values, and so apply the trigger to the next data sample. Likewise, if we’re ever told not to run-length encode a piece of data, then we’ll need to turn off the encoding on either the sample presented on that clock cycle or the sample following if no sample is valid on that cycle.
The first step in handling this is a set of values I call sticky_* values. These handle keeping track of what should happen to the next value–before that sample value has arrived. They’ll take place before the skidbuffer, and just remember what happens between accepted samples.
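A sketch of what these might look like; the exact names (sticky_trigger, sticky_nencode) are mine for illustration, not necessarily the core’s:

```verilog
initial	{ sticky_trigger, sticky_nencode } = 2'b00;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	{ sticky_trigger, sticky_nencode } <= 2'b00;
else if (S_AXIS_TVALID && S_AXIS_TREADY)
	// The flags apply to the sample just accepted, so start over
	{ sticky_trigger, sticky_nencode } <= 2'b00;
else begin
	// No sample accepted: remember the request for the next one
	if (i_trigger)
		sticky_trigger <= 1'b1;
	if (!i_encode)
		sticky_nencode <= 1'b1;
end
```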
The next step is to go into the skidbuffer. This would be completely straightforward, except … the skidbuffer assumes that incoming data won’t change while TVALID && !TREADY, and our sticky_* values might change during this time. This leaves us with a couple of possibilities. We could either adjust the skidbuffer’s formal properties so that we don’t make this assumption within it, or we can handle these signals separately.
I chose to handle them separately. Therefore, here’s the skidbuffer’s instantiation without those signals, and we’ll have to deal with them next.
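From memory of the skidbuffer core’s port list, that instantiation might look roughly like the sketch below; check the actual core for the exact parameter and port names.

```verilog
skidbuffer #(
	.DW(31)		// 31 bits of sample data
) skd (
	.i_clk(S_AXI_ACLK), .i_reset(!S_AXI_ARESETN),
	.i_valid(S_AXIS_TVALID), .o_ready(S_AXIS_TREADY),
		.i_data(S_AXIS_TDATA),
	.o_valid(skd_valid), .i_ready(skd_ready),
		.o_data(skd_data)
);
```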
That means I’ll need to implement a quick skidbuffer by hand here. Therefore, when any value is accepted, we’ll copy it to our (local) skidbuffer registers.
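Something like this sketch, with r_trigger and r_encode as the local r_* registers:

```verilog
// Capture the sideband flags whenever a sample is accepted
always @(posedge S_AXI_ACLK)
if (S_AXIS_TVALID && S_AXIS_TREADY)
begin
	r_trigger <= i_trigger || sticky_trigger;
	r_encode  <= i_encode  && !sticky_nencode;
end
```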
The main skidbuffer will drop S_AXIS_TREADY on any stall, so we can use that as our indication of whether a value got stuck in the buffer or not. If nothing is stuck in the buffer, we just use the current incoming values; otherwise we’ll use the r_* values we just set above.
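In sketch form:

```verilog
always @(*)
if (!S_AXIS_TREADY)
	// A value is stuck in the buffer: use the captured flags
	{ skd_trigger, skd_encode } = { r_trigger, r_encode };
else
	// Otherwise use the (possibly sticky) incoming flags
	{ skd_trigger, skd_encode } = { i_trigger || sticky_trigger,
					i_encode && !sticky_nencode };
```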
That’s it for the skidbuffer stage shown in Fig. 4 above. Now, onto the same word detection stage.
You should recognize the first part of this from our preliminary design above. The biggest differences are the new mid_trigger and r_triggered signals, as well as our handling of the skd_encode and skd_trigger with respect to same-word detection. Of the two, the trigger signal is easiest to explain: we just pass it from the skidbuffer to the output of the same-word stage. We’ll also use an r_triggered signal to keep us from ever processing a second trigger signal.
From there we make our exceptions to the same-word detection algorithm. If ever the encode flag is false, we’ll need to mark this value as different so it won’t get folded into any run-length encoding. Similarly, if ever skd_trigger is true for this sample, and we haven’t seen the trigger before, then we’ll also keep this value from being folded into the run-length encoding. Finally, when this value moves forward we’ll clear the valid and trigger signals–that way we don’t count these values more than once. Likewise, on any reset we’ll clear those signals as well as the r_triggered signal.
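Pulling the whole stage together, a sketch of how this might look:

```verilog
always @(posedge S_AXI_ACLK)
begin
	if (skd_valid && skd_ready)
	begin
		mid_valid <= 1'b1;
		mid_data  <= skd_data;
		mid_same  <= (mid_valid || run_valid)
				&& (mid_data == skd_data);

		// The exceptions: never fold an unencoded sample, or the
		// (first) triggering sample, into a run
		if (!skd_encode || (skd_trigger && !r_triggered))
			mid_same <= 1'b0;

		mid_trigger <= skd_trigger && !r_triggered;
		if (skd_trigger)
			r_triggered <= 1'b1;
	end else if (run_ready)
		// Once the value moves forward, clear valid and trigger
		{ mid_valid, mid_trigger } <= 2'b00;

	// The trailing reset check, discussed below
	if (!S_AXI_ARESETN)
		{ mid_valid, mid_trigger, r_triggered } <= 3'b000;
end
```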
Perhaps you may have noticed that this form of a reset is kind of rare for me. Normally, when I write a clocked logic block I will place the reset test first, and only then the non-reset logic following. That form wouldn’t work here–primarily because I’ve left the mid_same and mid_data signals in this always block. Had I placed the reset first, that would have burdened the logic for setting these signals such that they would only be set if the reset wasn’t active.
That leads to a more philosophical discussion of how logic should be broken into blocks. My general rule is that if two or more signals share the same signaling structure, then they can share the same block. This rule generally follows from my focus on minimizing area, and so I dislike adding more logic to a design than necessary. In this case, that might refer to the mid_data signal as an example. Otherwise, if the control logic is different between signals, then I will separate their logic blocks within a design. This usually leads to separating any logic for signals requiring resets into separate blocks from the other associated logic.
In this case, the trailing reset check works just as well for only those signals that need to be reset here. While I know others who are dogmatic about using this reset form, I’m still somewhat flexible in how I build my logic. Still, if you haven’t read Olof Kindgren’s article on the topic, it’s worth a read to understand where others might be coming from on this issue.
Let’s move on to the run_* section. We’ll start with the simple stuff: when we’re ready to accept a value from the mid_* stage, we’ll set the run_valid signal and copy the data and trigger signals into this stage.
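This is the same sketch as before, now with the trigger along for the ride:

```verilog
initial	{ run_valid, run_trigger } = 2'b00;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	{ run_valid, run_trigger } <= 2'b00;
else if (mid_valid && run_ready)
begin
	run_valid   <= 1'b1;
	run_data    <= mid_data;
	run_trigger <= mid_trigger;
end
```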
These new values will either start or continue a run if there is both 1) a new data value, and 2) that new value is the same as the last one we just processed. Similarly, if we have already started a run, then we’ll increment its length now.
Don’t overlook the offset here: run_active will get set first, on the same clock that run_length is still getting set to zero. Only on the next value, the third in any run, will run_length start accumulating.
Perhaps a figure might help, so let me reference Fig. 5 below.
In this trace, you’ll see that on any first value of a new run, such as the yellow run in the trace, mid_same will be low. When that value first repeats, mid_same will be set but run_active will be low. On the second repeat, run_active will be set and run_length will be zero. From there it starts accumulating as new data is given to it.
The next trick is the run_overflow signal. We skipped this signal earlier. I added it to the basic run-length encoder processing in order to keep the run-length counter from overflowing. We’ll set this value to true any time the run length will overflow if one more is added to it. I suppose that’s technically different from overflow, but run_length_counter_will_overflow seemed a bit long for a signal name.
At any rate, this run_overflow signal will be true if ever our run_length counter is all ones. That means we can compare it against all ones but the last–a negative two if you will, and spare a bit in our comparison. We’ll clear this flag if we either get a new value that doesn’t increase our run length, or if we are ever forced to output a value at the maximum run length. I’ll also clear just about all of these signals on reset.
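A sketch of how this flag might be registered:

```verilog
initial	run_overflow = 1'b0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	run_overflow <= 1'b0;
else if (mid_valid && run_ready)
	// Will the counter be all ones following this step?  Compare
	// against all ones but the last -- a negative two, if you will
	run_overflow <= mid_same && run_active
			&& (&run_length[30:1]) && !run_length[0];
```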
That brings us to the ready handshaking signal gating the entrance to the run_* stage. This is one of the more critical pieces in this algorithm. If we get it wrong we might either drop values from our pipeline or accidentally introduce phantom values into it.
We’ll start with the clearest reason for run_ready to be true: we can set run_ready if the next stage is ready for another output. Even if the M_AXIS_* stage is stalled, however, we might still be able to accept a new value into the run_* stage if it were to only increase our run length counter. Hence, we can set run_ready if we are in the middle of a run, run_active, and the next element continues this run, mid_same. The exception to this rule is if adding another element to the run would overflow our run-length counter.
Finally, we don’t want the run_* stage to step at all if there isn’t a new valid value to be placed into it. Remember how the run_* stage works: it starts out idle, but after receiving its first value it then remains valid for all time with the next output value within it. This value gets “stuck” in the run_* stage until we know the end of whatever run might be within it. That’s the time when the run value (or counter) will get kicked out by a new value from the mid_* stage. It’s for this reason that we need to make certain that the run_* stage never moves forward unless there’s also a value entering from the mid_* stage.
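All of which might be sketched as:

```verilog
always @(*)
begin
	// The clearest case: the output stage can accept another word
	run_ready = !M_AXIS_TVALID || M_AXIS_TREADY;

	// Even when the output is stalled, we can still accumulate a
	// repeat into the run counter -- unless doing so would overflow it
	if (run_active && mid_same && !run_overflow)
		run_ready = 1'b1;

	// Never step this stage without a new value entering it
	if (!mid_valid)
		run_ready = 1'b0;
end
```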
That leads us to the M_AXIS_* stage–the output stage in our processing pipeline. We’ll need to set two values here: the M_AXIS_TVALID value, indicating that we have a value to send out, and the actual value, M_AXIS_TDATA. (This design doesn’t make use of any of the other AXI-stream signals, such as TLAST, TSTRB, TKEEP, or TID.)
We can start defining the M_AXIS_TVALID signal with the logic that would be used for any AXI-based *VALID signal. This part is kind of a fill-in-the-blank logic form.
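From memory, the form looks something like this, with the blank still left to be filled in:

```verilog
initial	M_AXIS_TVALID = 1'b0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	M_AXIS_TVALID <= 1'b0;
else if (!M_AXIS_TVALID || M_AXIS_TREADY)
	M_AXIS_TVALID <= 1'b0;	// <-- the blank: is there a new output?
```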
While this form might seem quite straightforward, do beware that it is still possible to mess it up–as Xilinx did in their AXI EthernetLite controller. In that controller, they set their design to wait on AXI_RREADY without also checking for !AXI_RVALID. The result was that, when the controller was subjected to a formal check, SymbiYosys found a way to hang the core by simply holding AXI_RREADY low–such as any out-of-order AXI interconnect might do.
That just leaves us with the blank in this logic form to fill in with our M_AXIS_TVALID logic. In this case we’ll default to zero, sending nothing. After that, we can use the run_ready signal to help us know when to move forward. We’ll place a value into the M_AXIS_* stage if there’s both a valid value in the run_* stage and if run_ready is true so that it’s moving forward. While that’s a good start, however, it isn’t enough.
From the run_* stage, we only move an output forward on two conditions. First, we always move forward if there’s a new value entering the run_* stage that would overflow the counter. Second, we’ll always move forward on a new or different value. Such a value would end any ongoing run. We’ll also move forward on the first item of any run–since we always need to place the data value before the subsequent run length itself.
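Putting those conditions together, the filled-in form might look like:

```verilog
initial	M_AXIS_TVALID = 1'b0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	M_AXIS_TVALID <= 1'b0;
else if (!M_AXIS_TVALID || M_AXIS_TREADY)
	// Send a word whenever the run_* stage steps forward with
	// something that can't be folded into the current run
	M_AXIS_TVALID <= run_valid && run_ready
		&& (run_overflow || !mid_same || !run_active);
```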
This also introduces an unfortunate consequence into our logic analyzer application. As you may recall, the logic analyzer will record data until some programmable time after the trigger. Nothing controls what the oldest value in memory will be at that time. It might be a data value, or it might be a run encoding for … a never-to-be-recovered value. Worse, it might also be a series of run overflow encodings ending only with the value at the time of the trigger. While that’s a potential reality of this particular implementation, the good news is that it hasn’t (yet) seriously impacted any analysis I’ve needed to do. Why not? In most of my examples, the run-length encoding is good enough that there’s always more than enough data left for any necessary analysis.
You can also think of it this way: with a 31-bit counter, it requires roughly 2^31 elements to fill a run. At a (rough) 100MHz clock, that’s about 21 seconds of data. Filling up an 8kB buffer with nothing but run indications would therefore require about half a day of data at this rate. While it might happen, it’s definitely not likely.
After TVALID, we’ll set TDATA any time !TVALID || TREADY. Again, this follows from the basic form of setting any AXI data value associated with any AXI stream. Valids follow the form above; data follows the form below.
What value shall we set our result to? Well, if we are ending a run, then we’ll want to output the fact of the run together with the run length. Otherwise, we’ll just output the data we’ve been holding on to.
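In sketch form:

```verilog
always @(posedge S_AXI_ACLK)
if (!M_AXIS_TVALID || M_AXIS_TREADY)
begin
	if (run_active)
		// Ending a run: send the (zeros-based) run length
		M_AXIS_TDATA <= { 1'b1, run_length };
	else
		// Otherwise, send the data we've been holding on to
		M_AXIS_TDATA <= { 1'b0, run_data };
end
```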
That leaves only one last loose end to tie up before we complete our design, and that’s the outgoing trigger for the logic analyzer subcore. Remember, we’ve needed to keep this associated with the data that arrives concurrently with it. Therefore, as we’ve moved data through our pipeline, we’ve kept the trigger indication together with the data in each and every stage. Here, we just forward the trigger to the output when it leaves the run_* stage.
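Perhaps something like the following, where o_trigger is my hypothetical name for the outgoing trigger port:

```verilog
initial	o_trigger = 1'b0;	// hypothetical output name
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	o_trigger <= 1'b0;
else if (!M_AXIS_TVALID || M_AXIS_TREADY)
	// The trigger leaves together with its (uncompressed) data word
	o_trigger <= run_valid && run_ready && run_trigger;
```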
What about double checking that the trigger is only ever output for a data word, never a compressed word? Or guaranteeing that this outgoing trigger is only true when M_AXIS_TVALID is also true? For that, we’ll turn to formal methods and double check that we haven’t damaged our algorithm by simplifying it this far.
Indeed, I should point out that even while writing this blog article I’ve done a lot of simplifying. The result you see above uses only an estimated 130 Xilinx series-7 LUTs. I’ve also leaned heavily on the formal proof to do this. Why not? A full proof of this core takes only about 20 seconds. That makes it easy to make changes and adjust any logic.
Formal Properties
When I first built this design, I’ll admit I stared at the blank formal property section for a while just wondering what sort of properties I should use. I eventually came up with a couple of ideas for verifying this core that I haven’t discussed much on this blog. They’ve been very useful. These are:
- Interface properties for AXI-stream
- Counting
- Counting special values
- Negative logic checking
There are probably better names for these concepts, but these are the best names I have for them so far. Yes, we’ve discussed the AXI stream properties before, and we’ve also counted transactions before when building bus property sets. I suppose, therefore, that these two concepts might not really be all that new to my blog. However, I am pretty sure that I haven’t yet discussed either the special value or negative logic checking ideas before.
Interface properties
While I’ve written an AXI stream interface property file, it’s really overkill for most AXI stream designs. For today’s purpose, we only really need two checks: First, following a reset, all VALID’s need to be cleared. Second, any time the channel is stalled the transaction details are not allowed to change.
For the incoming AXI stream, this amounts to a couple of assumptions.
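Using the usual f_past_valid idiom, those assumptions might be sketched as:

```verilog
reg	f_past_valid;

initial	f_past_valid = 1'b0;
always @(posedge S_AXI_ACLK)
	f_past_valid <= 1'b1;

always @(posedge S_AXI_ACLK)
if (!f_past_valid || $past(!S_AXI_ARESETN))
	// Following any reset, VALID must be low
	assume(!S_AXIS_TVALID);
else if ($past(S_AXIS_TVALID && !S_AXIS_TREADY))
begin
	// When stalled, nothing about the request may change
	assume(S_AXIS_TVALID);
	assume($stable(S_AXIS_TDATA));
end
```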
The properties are the same for the outgoing stream–only they are now expressed as assertions instead of assumptions.
My AXI stream property file has many more properties–overkill for most problems. Things like minimum and maximum packet sizes, properties on the maximum time something can be stalled, or even properties forcing TKEEP and TSTRB to match. Most cores don’t need this, and so sadly my AXI stream property file hasn’t gotten much testing.
Counting
The idea for counting is really simple: let’s count the number of items that enter into this design, and then verify that it matches the number of items output. We did this for our Wishbone property file, as well as when we built our AXI-lite property file. In both cases, we used a count of outstanding transactions as part of a check that no acknowledgment would be returned without a prior request.
We’ll do roughly the same thing here. The first step will be to count the number of items that are outstanding–that is, the number of items that have come into this core but haven’t yet been output. The logic should look very similar to the logic for calculating the fill of a FIFO. Notice the reset logic and the case statement–keys to doing both.
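Here’s a sketch of what such a counter might look like. The names (f_outstanding, f_consumed) and the run accounting are my assumptions: I’ve assumed each output word accounts for either one data item or, when its MSB is set, for count-plus-one repeats of the prior item.

```verilog
reg	[33:0]	f_outstanding;
reg	[31:0]	f_consumed;

// How many input items does the word in the output register represent?
always @(*)
if (M_AXIS_TDATA[31])
	f_consumed = M_AXIS_TDATA[30:0] + 1;	// a run word
else
	f_consumed = 1;				// a plain data word

initial	f_outstanding = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	f_outstanding <= 0;
else case({ S_AXIS_TVALID && S_AXIS_TREADY,
		M_AXIS_TVALID && M_AXIS_TREADY })
2'b10: f_outstanding <= f_outstanding + 1;
2'b01: f_outstanding <= f_outstanding - f_consumed;
2'b11: f_outstanding <= f_outstanding + 1 - f_consumed;
default: begin end
endcase
```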
What rules might we make of this counter? We can start with the obvious one: we can’t place an item into the output unless there’s been a corresponding input. The same counts for runs on the output–you need a corresponding number of values on the input before you can output a run.
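As an assertion, that contract might read:

```verilog
// No output may leave without a corresponding number of inputs
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID)
	assert(f_outstanding >= f_consumed);
```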
While that makes for a great contract, it will never pass induction. The induction engine will always find a way to start the design in an inconsistent state, and hence break these properties. Over time, I’ve learned that this tends to happen with any inequality assertion–like the one above. If you want to pass induction, therefore, you’ll need to recount how many items are in the pipeline and then compare that to our outstanding number above–using an equality assertion.
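Under the same accounting assumptions, that recount might be sketched as:

```verilog
reg	[33:0]	f_count;

always @(*)
begin
	f_count = 0;
	if (!S_AXIS_TREADY)	// a value is stuck in the skidbuffer
		f_count = f_count + 1;
	if (mid_valid)
		f_count = f_count + 1;
	if (run_valid)		// the held value, plus any counted repeats
		f_count = f_count + (run_active ? run_length + 1 : 1);
	if (M_AXIS_TVALID)
		f_count = f_count + f_consumed;
end

always @(*)
if (S_AXI_ARESETN)
	assert(f_outstanding == f_count);
```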
For good measure, I thought I’d add another property to this check: if the design is ever outputting a run, then there should be a new piece of data in the run_* stage–rather than a second run building. Only, I didn’t think about the fact that two runs in a row might take place if the first one overflowed–at least not until SymbiYosys caught that bug. Therefore, the check (now) includes an exception for overflow.
I don’t know about you, but just counting items within the processing pipeline didn’t really convince me that my own design worked. It’s really not enough that the right amount of data is passing through the pipeline, what I really want to know is whether or not the right data is passing through the pipeline. For that, I needed another type of check.
Counting Special data
So, I decided to count specific items within the pipeline as a measure of whether or not the right data is passing through it properly. Here’s the idea: if for every possible data value going through this logic, the number in always matched the number output, then I can be sure that the right data is getting sent out.
The first step in this check is to let the solver pick an arbitrary data value.
Every time I explain this concept to a student I’m reminded of a magician performing a card trick: “Pick a card. Any card.” The same sort of thing is true here. The solver is challenged to pick a data value–any data value, and I intend to prove the design works no matter which value the solver picks.
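In SymbiYosys, letting the solver pick that card is just the anyconst attribute:

```verilog
// Pick a card -- any card
(* anyconst *) reg [30:0] f_special_data;
```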
For practical reasons, I also needed to keep track of whether or not any output run, in the M_AXIS_* stage, corresponded to this special value. That’s because when the output data value contains a run, there’s no data associated with the run value–so it’s impossible to tell any more whether it’s the special data value or not. Therefore, anytime M_AXIS_* could accept a new value, I used a formal-only register, f_special_tdata, to keep track of whether or not that new value was associated with the f_special_data value.
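Something along these lines:

```verilog
reg	f_special_tdata;

initial	f_special_tdata = 1'b0;
always @(posedge S_AXI_ACLK)
if (!M_AXIS_TVALID || M_AXIS_TREADY)
	// Does the word (data or run) entering the output stage
	// describe our special value?
	f_special_tdata <= run_valid && run_ready
				&& (run_data == f_special_data);
```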
Just to make certain I had this done right, a quick assertion could verify that if the output wasn’t encoding a run, f_special_tdata truly encoded whether or not the outgoing data was our special value.
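Sketched as:

```verilog
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID && !M_AXIS_TDATA[31])
	assert(f_special_tdata
		== (M_AXIS_TDATA[30:0] == f_special_data));
```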
Remember the comment about equalities versus inequalities? The same can be said for implications versus if-and-only-if implications. By using an == sign in the test above, I can check both halves of the implication: first, if f_special_tdata is true, and then if it isn’t. Checking both is always a good idea–if it is appropriate to your problem.
With that little bit of logic under my belt, I can now count how many times this special value, f_special_data, has entered my pipeline minus the number of times it has left. As before, this counter requires a reset and a case statement.
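Reusing the f_consumed accounting from above, a sketch:

```verilog
reg	[33:0]	f_special_count;

initial	f_special_count = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	f_special_count <= 0;
else case({ S_AXIS_TVALID && S_AXIS_TREADY
			&& (S_AXIS_TDATA == f_special_data),
		M_AXIS_TVALID && M_AXIS_TREADY && f_special_tdata })
2'b10: f_special_count <= f_special_count + 1;
2'b01: f_special_count <= f_special_count - f_consumed;
2'b11: f_special_count <= f_special_count + 1 - f_consumed;
default: begin end
endcase
```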
I can now repeat the same contract checks as I did before: If a piece of data is being output, there should be enough counts of that data item within our pipeline to support that output.
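That is:

```verilog
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID && f_special_tdata)
	assert(f_special_count >= f_consumed);
```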
As before, this isn’t enough to handle induction. Inequalities will rarely make it through an induction check. Therefore, let’s count every time this value shows up in our pipeline, and compare that combinatorial count to the registered one above.
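The recount follows the same pattern, and the same accounting assumptions, as before:

```verilog
reg	[33:0]	f_special_recount;

always @(*)
begin
	f_special_recount = 0;
	if (!S_AXIS_TREADY && (skd_data == f_special_data))
		f_special_recount = f_special_recount + 1;
	if (mid_valid && (mid_data == f_special_data))
		f_special_recount = f_special_recount + 1;
	if (run_valid && (run_data == f_special_data))
		f_special_recount = f_special_recount
			+ (run_active ? run_length + 1 : 1);
	if (M_AXIS_TVALID && f_special_tdata)
		f_special_recount = f_special_recount + f_consumed;
end

always @(*)
if (S_AXI_ARESETN)
	assert(f_special_count == f_special_recount);
```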
Other Induction checks
Before getting to the negative logic checks, I should point out that I also have a lot of simple and basic checks on the various registers from within the core. In general, you’ll want to pin down any register within your design using an assertion based upon its surrounding logic. That way the induction engine doesn’t surprise you.
There’s another reason for doing this as well. If you pin everything down sufficiently, then you’ll know quickly if some update in the future breaks the design. So let’s run through a couple of these quick checks.
We’ll start with the run_* stage. This stage always holds on to a value, only replacing it when there’s something replacing it. Therefore, if run_valid ever becomes true, it should remain true for all time. Likewise, if ever the mid_* stage became valid, then run_valid should become high on the next clock, assuming it wasn’t high already, and then remain high for the rest of time. That also means that you can’t ever get to an output without going through the run_* stage and leaving run_valid true behind you.
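Hedged into assertions:

```verilog
// Once valid, the run_* stage stays valid forever
always @(posedge S_AXI_ACLK)
if (f_past_valid && S_AXI_ARESETN && $past(S_AXI_ARESETN)
		&& $past(run_valid || mid_valid))
	assert(run_valid);

// Nothing reaches the output without passing through the run_* stage
always @(*)
if (S_AXI_ARESETN && M_AXIS_TVALID)
	assert(run_valid);
```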
Looking over mid_same, it should only ever be true if the data in mid_data matches the current run_data. Likewise, if there’s nothing in the mid_* stage, then mid_data should match what’s in the next stage.
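That is, roughly:

```verilog
always @(*)
if (S_AXI_ARESETN && mid_same)
	assert(mid_data == run_data);

always @(*)
if (S_AXI_ARESETN && run_valid && !mid_valid)
	assert(mid_data == run_data);
```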
Let’s now turn to the run_active signal. This signal should be low if there’s nothing (yet) in the run_* stage. Further, anytime run_active is low, the run_length should also be zero.
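Or, as assertions:

```verilog
always @(*)
if (S_AXI_ARESETN && !run_valid)
	assert(!run_active);

always @(*)
if (S_AXI_ARESETN && !run_active)
	assert(run_length == 0);
```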
Our overflow pending flag is specific: it should only ever be high if the run_length is about to overflow. This will only be true, as proved above, if run_valid && run_active are also true–so I’ve removed those last two conditions from this check.
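Which might reduce to:

```verilog
always @(*)
if (S_AXI_ARESETN && run_overflow)
	assert(&run_length);	// the counter must be all ones
```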
The trigger in the middle section can only ever be true if the mid section has a valid item within it, if that item is not marked as being the same as the previous item, and if the r_triggered flag is marked to indicate a trigger signal has been seen and processed. The same can be said of the trigger signal in the run_* stage, and again of our final output trigger.
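Sketched, using the hypothetical o_trigger name from above:

```verilog
always @(*)
if (S_AXI_ARESETN && mid_trigger)
	assert(mid_valid && !mid_same && r_triggered);

always @(*)
if (S_AXI_ARESETN && run_trigger)
	assert(run_valid && r_triggered);

// The trigger may only ever leave with a valid, uncompressed data word
always @(*)
if (S_AXI_ARESETN && o_trigger)
	assert(M_AXIS_TVALID && !M_AXIS_TDATA[31]);
```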
Now that that’s out of the way, let’s look at the negative logic check. That one was fun.
Negative Logic Checking
Here’s the idea behind the negative logic check: If a particular value never enters into the design, then it should never be output. It’s pretty simple, though it can also be pretty useful.
The first step is to decide whether or not we want to do this check in the first place.
Here’s the problem that necessitates this f_never_check flag: making an undue assumption about the input might somehow allow the proof to pass vacuously elsewhere. Therefore, if there’s a possibility the proof might fail without this check, then we want to keep that possibility open. For that reason we’ll gate our assumptions and assertions below based upon this f_never_check signal.
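Again, anyconst does the job here: the solver itself decides whether the check applies.

```verilog
(* anyconst *) reg f_never_check;
```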
The next step is to let the solver pick an arbitrary value that we’ll never see. I chose to call this value f_never_data.
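As before:

```verilog
(* anyconst *) reg [30:0] f_never_data;
```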
Here’s how the check works: First, we’ll assume that this data never enters into our design.
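Sketched as:

```verilog
always @(*)
if (f_never_check && S_AXIS_TVALID)
	assume(S_AXIS_TDATA != f_never_data);
```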
Our goal will be to prove that this item never comes out of the design at any time, so here’s that assertion.
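Something like:

```verilog
always @(*)
if (f_never_check && S_AXI_ARESETN
		&& M_AXIS_TVALID && !M_AXIS_TDATA[31])
	assert(M_AXIS_TDATA[30:0] != f_never_data);
```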
Now, to pass induction, let’s make sure we never have that value internally anywhere. If the skidbuffer is producing a valid output, then we need to check that it’s not in the skidbuffer.
Likewise, if there are items in either the mid_* stage or the run_* stage, then those stages shouldn’t contain this value at all.
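All three internal checks might be sketched together as:

```verilog
always @(*)
if (f_never_check && S_AXI_ARESETN)
begin
	if (!S_AXIS_TREADY)	// the skidbuffer holds a value
		assert(skd_data != f_never_data);
	if (mid_valid)
		assert(mid_data != f_never_data);
	if (run_valid)
		assert(run_data != f_never_data);
end
```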
Together, these formal properties gave me a strong confidence that this run-length encoder core would never do the wrong thing.
Cover Checks
Unfortunately, that confidence was misplaced.
While I really like assertion based formal property verification, I’ve been caught more than once with a resulting design that didn’t work. Indeed, that was the case many times with this design as I was building it–the proof above would pass, but the design wouldn’t have worked.
I caught several of these last bugs using some cover() checks. That’s also one of the reasons why I require a cover check before I will sign off on any of my own designs.
Here was my idea: I wanted to verify that this design could indeed generate compressed data. Therefore, I wanted to see an output from the core that showed a series of compressed data values. I chose to use an index for this, forcing the design to produce alternating run-length encoded and non-runlength encoded data, as it ran through a series of outputs.
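A sketch of that check: step cvr_index forward each time the output matches the next expected word, then cover the final index. The exact values below are my illustration of the idea, matched to the trace discussed next.

```verilog
reg	[2:0]	cvr_index;

initial	cvr_index = 0;
always @(posedge S_AXI_ACLK)
if (!S_AXI_ARESETN)
	cvr_index <= 0;
else if (M_AXIS_TVALID && M_AXIS_TREADY)
case(cvr_index)
0: if (M_AXIS_TDATA == { 1'b0, 31'd0 })	// the data value zero
	cvr_index <= 1;
1: if (M_AXIS_TDATA == { 1'b1, 31'd1 })	// a run of two, encoded as one
	cvr_index <= 2;
2: if (M_AXIS_TDATA == { 1'b0, 31'd2 })	// the data value two
	cvr_index <= 3;
3: if (M_AXIS_TDATA == { 1'b1, 31'd3 })	// a run of four, encoded as three
	cvr_index <= 4;
default: begin end
endcase

always @(*)
	cover(cvr_index == 4);
```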
That said, the logic I’ve shared above was the conclusion of getting this cover check (and some others) to pass. In other words, it works now.
So, how does it look? Using this cover statement, I was able to generate the trace shown in Fig. 6 below.
Let’s walk through how this works. The first data value, 0, can be immediately output. After two more 0’s are entered, a run of 2 is output. This is encoded as one less, allowing a run of zero to encode a repeat of the last value–as we’ve discussed above. Then we have a data value of 2–as forced by the cvr_index check above. This is followed by an additional four 2’s, and so a run of 4 (encoded as a 3) is output. We then have seven 4’s. This results in an output of 4 (the first data element) followed by a run of 6 (encoded as a 5). The last value, 6, is then received, followed by an arbitrary (different) data value having the trigger set. Sure enough, these two values appear in the output as desired.
That’s what I wanted!
Conclusion
Building this simple run-length encoder turned into a fun afternoon project! Even better, I learned a lot when verifying it. From a blank slate, I think I came up with some very useful properties for knowing whether it works or not. I also think I can say confidently that the outgoing encoding will be a valid representation of the input. Further, I can also confidently say that the resulting stream will be AXI stream protocol compliant. The only thing I’m not quite so certain of is how well the core will handle either the trigger input or the run-length encoding enable input. Those two checks might require a bit more work.
While this run length encoder looks like it will work quite well, the entire compressed memory scope design, shown in Fig. 7 on the left, is (sadly) not finished yet. Yes, it’s posted, but if details matter (and they often do), then I’m going to need to go back and pay attention to how this compression scheme is integrated into the rest of the design. It’s not well integrated at present. As I see it, there are three problems I’ll need to come back and pay some more attention to.
- Since the internal memory scope subcore handles the AXI-lite bus, only the MEMSCOPE subcore currently knows about any request for a manual trigger. That means the manual trigger will be applied downstream of our run-length encoder. If this happens, the manual trigger might take place in the middle of a run, violating our rule that the trigger must only happen on a data element and never on a run-length coded element.
- The scope’s semantics require that it be primed before it can be triggered. Priming involves filling all of the scope’s memory with valid data. That way, when one reads out the memory later, there won’t be any confusion over which memory values have been initialized with good data and which have not. Sadly, the wrapper shown in Fig. 7 doesn’t (yet) know exactly when the MEMSCOPE within it has been primed, and so it might not activate the run-length encoder until a buffer (or two) after all memory has been filled.
- The MEMSCOPE subcore includes AXI stream overflow detection. This check circuit looks for any violations of the AXI stream protocol as a way of knowing whether it can keep up with the data source. The problem is that we’ve just formally verified that this run-length encoding core will never violate the AXI stream protocol. Any potential overflows will be upstream of this core–where there’s no overflow detection. This renders the overflow detection circuit irrelevant.
Perhaps I shouldn’t be so down on myself–the original compressed (run-length encoded) scope didn’t support these features at first either. They were added as an afterthought long after that core had been used successfully many times over.
Until then, I’ll just note the problems with the compressed memory scope wrapper in its file header. That way I’ll remember these issues later when I come back to update the wrapper.
Who can number the clouds in wisdom? or who can stay the bottles of heaven (Job 38:37)