VRoom! blog - bugs bugs bugs04 Apr 2023
The last couple of weeks we’ve been running and writing regression tests - found some interesting bugs ….
Our current full regression testing regime takes ~18 hours.
The hardest problem to track down was in one of the generic RISC-V compliance tests - the test ran OK and generated the correct output, but the actual output phase generated an extra new line between each line of output.
This turned out to be a trace cache related bug - each trace cache line includes a predicted next address, usually this is simply the address of the next instruction after the last instruction in a line, or if the last instruction is a branch then it’s the target of the branch. In trace cache lines that include subroutine calls and returns we cut the trace cache lines short so that the BTC’s call return stack can be kept up to date, in these cases the last instruction in the line is the call or return.
In this case the problem instruction was a relative subroutine call (not one done indirectly through a register) - this is a unique case - in our design there are two sorts of unconditional jump instructions: subroutine calls that go into the commit queue because they have to write a return register and normal unconditional branches that are effectively swallowed by the fetch/decode stage and never reach the commit queue (they don’t even get counted when we report the number of instructions executed, or IPC numbers - we’re actually slightly faster than we report because of this).
Why was this instruction different? well as we pass an instruction bundle through the machine we carry with every branch instruction the predicted address of its target, initially we just used this for triggering BTC misses, but when we implemented the trace cache, we also started using this to generate the trace cache’s predicted next address. It turns out that for relative unpredicted jump instructions the program counter unit was generating the predicted address incorrectly, prior to using it for the trace cache it was being ignored (because the branch unit doesn’t need to check if an unpredicted branch branched to the correct place). This was one of those cases where the failure occurred a few million clocks after the actual bug.
Found an interesting bug in instruction fusion - we weren’t handling 64-bit addw vs. add correctly when merging an add instruction with a subsequent branch (in this case a subroutine return). This was an easy bug to find and fix but reminds us that once you start fusing instructions testing starts to have an N squared sort of behavior - if we get serious about instruction fusing we’ll have to generate a lot more tests.
On previous CPUs we’ve worked on we’ve had someone generate enormous quantities of random instruction streams (especially good for stressing pipelines), as well as streams of instruction pairs, we’re probably going to eventually need tools like this.
A few months back we rewrote the load/store unit to add much more parallelism - it now processes everything in 2 clocks and handles 6 address translations and 4 each load/stores at once - one area we didn’t touch was the Virtual Memory Queue (VMQ) - this is a structure that queues TLB misses to the second level TLB cache and table walker, it has 12 entries and worst case it must be able to accept 6 TLB misses per clock. If the queue fills commit Q instructions much be able to back off and try again.
We had a stress test for this, but after the load/store unit rewrite it stopped forcing VMQ overflows, now we have a test that does that and we found new bugs caused by the rewrite, luckily we were able to find a bug fix with fewer/faster gates than the previous logic.
Going to spend some more time with the trace cache.
Next time: More trace stuff