17 May 2023
Introduction
We’ve been working on the Trace Cache - if you read back a couple of blog entries you
will remember that it was performing better than not having a trace cache - but suffered
from mispredicted branches (while equivalent non trace-cache cases didn’t).
Now we’ve upgraded the trace cache to include some local prediction history - it gives us another 10%
performance, bringing our Dhrystone number up to 11.3 DMips/MHz.
Trace Cache
Our original trace cache had N (N=64) lines of I (I=8) instructions. Each cache line has 1-8 instructions
and a ‘next instruction’ entry telling the PC controller where to go next. We terminate filling a line
when we find a subroutine call or return - because we want to keep the branch predictor’s call/return
predictor up to date and use its results on returns.
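For illustration, a software model of a single trace line might look something like this (a minimal C sketch - the names and field widths here are ours, not the actual RTL):

#include <stdint.h>
#include <stdbool.h>

#define TC_LINES 64                 /* N */
#define TC_INSTS 8                  /* I */

typedef struct {
    bool     valid;
    uint64_t tag;                   /* virtual address of the first instruction in the line */
    int      count;                 /* 1-8 valid decoded instructions */
    uint32_t decoded[TC_INSTS];     /* already-decoded instruction payloads (width illustrative) */
    uint64_t next_pc;               /* the 'next instruction' entry for the PC controller */
} trace_line_t;

trace_line_t trace_cache[TC_LINES];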
In essence each line already has some branch prediction capability: it predicts all the branches
in the line (a line could contain up to 8 branches), and predicts where the last instruction jumps to (or
what instruction it is followed by).
So the problem we’re trying to solve here is: what happens for a trace line that sometimes runs to completion and
sometimes takes a branch from the middle (or even the last instruction) of the line? What we need to predict is:
- Which of the first J of the I instructions in the line will be executed
- What the next instruction to execute will be after the Jth instruction is executed
Our normal (non-trace) branch prediction uses a dual predictor that predicts either from a global branch history
or from a simple bimodal predictor - we essentially already have the simple predictor in place in our trace line,
which predicts a next instruction address and that all the valid instructions in the line will be executed.
What we’ve done is add a local prediction history to each cache line - it’s M (initially M=5) bits, updated
on every fetch and every branch misprediction: 1 if the prediction should be taken and 0 if it wasn’t.
For every line we also have A (initially A=2) sets of <next instruction, num of instructions to issue> tuples; each set also has B (initially B=4) sets of history masks and 2-bit counters. Essentially, if the line hits in the cache and its local history matches one of the masks and the associated counter isn’t 0, then we
predict a non-default branch: its destination comes from the associated <next instruction> and only the associated
<num of instructions to issue> instructions are issued - otherwise the entire line and the default next instruction
are used.
History masks, counters and <next instruction, num of instructions to issue> tuples are allocated on
mispredicted branches - counters are incremented on correct predictions, and decremented on
mispredictions, and slowly decay if not used over time (as do trace cache entries themselves). If a <next instruction, num of instructions to issue> set runs out of history masks to allocate then another <next instruction, num of instructions to issue> tuple with the same results can be allocated.
Having multiple <next instruction, num of instructions to issue> tuples per line is intended to handle the case
of a line with multiple mispredicted branches in it - something like a decision tree in a case statement
or the like. All this stuff is parameterized - it may turn out after a lot of simulation that this
doesn’t happen much, and that we just need A=1 (one next address etc.) but a larger B (more masks)
and/or larger/smaller masks.
Results
On our Dhrystone benchmark we now get another 10% of performance above our previous results - we now see
11.31 DMips/MHz. We’re seeing an issue rate of 6.12 instructions per clock but
are executing 4.15 instructions per clock (IPC). This is with 3 ALUs; if we instantiate 4 ALUs performance
increases by a tiny fraction of a percent - by itself not really worth doing. Theoretically, if we could get
the execution rate to match the issue rate, we’d see ~16-17 DMips/MHz - another 50% more performance.
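(As a rough check on that estimate: scaling the current number by the issue-to-execute ratio gives 11.31 * (6.12 / 4.15) ≈ 16.7 DMips/MHz, which is where the ~16-17 figure comes from.)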
Why aren’t we seeing this? Largely it seems to be because of stalls in the memory subsystem, mostly
loads and stores waiting for stores that haven’t resolved their addresses yet. There are various ways
to potentially resolve this issue, mostly involving speculative loads - but those might cause cache activity and as a result
leave us open to Meltdown-like security issues. We’ve been pretty conservative in this area and probably
won’t return to it for a while.
Despite this, >4 IPC is pretty amazing - 4 IPC was our initial goal for carefully crafted media/etc
kernels; exceeding it with branchy, compiled C code is great!
Dhrystone has also been done to death here; it’s been useful because it’s helped to balance our
fetch/decode and execution stages. Next time we do serious benchmarking we’ll probably build
a much wider range of tests (we need to tweak the FP performance, for example).
What’s Next?
It’s time to work on vector ALUs - we have all the components in place and a core that
can feed them efficiently.
Next time: Vectors!
04 Apr 2023
Introduction
The last couple of weeks we’ve been running and writing regression tests - found some interesting bugs ….
Our current full regression testing regime takes ~18 hours.
Trace Cache
The hardest problem to track down was in one of the generic RISC-V compliance tests - the test ran
OK and computed the correct results, but the actual output phase generated an extra newline
between each line of output.
This turned out to be a trace cache related bug - each trace cache line includes a predicted next address;
usually this is simply the address of the next instruction after the last instruction in the line, or,
if the last instruction is a branch, the target of the branch. In trace cache lines that include
subroutine calls and returns we cut the line short so that the BTC’s call/return stack can be kept up to date; in these cases the last instruction in the line is the call or return.
In this case the problem instruction was a relative subroutine call (not one done
indirectly through a register) - this is a unique case - in our design there are two sorts of
unconditional jump instructions:
- subroutine calls, which go into the commit queue because they have to write a return register
- normal unconditional branches, which are effectively swallowed by the fetch/decode stage and
never reach the commit queue (they don’t even get counted when we report the number of instructions executed
or IPC numbers - we’re actually slightly faster than we report because of this)
Why was this instruction different? Well, as we pass an instruction bundle through the machine we carry with every branch instruction the predicted address of its target. Initially we just used this for triggering BTC misses,
but when we implemented the trace cache we also started using it to generate the trace cache’s predicted
next address. It turns out that for relative unpredicted jump instructions the program counter unit was generating
the predicted address incorrectly; prior to its use in the trace cache this had been ignored (because the
branch unit doesn’t need to check whether an unpredicted branch branched to the correct place). This was
one of those cases where the failure occurred a few million clocks after the actual bug.
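For what it’s worth, the correct value is straightforward - for a line ending in a pc-relative call the predicted next address is the call target rather than the usual fall-through address. A tiny illustrative sketch (not the actual PC-unit logic):

#include <stdint.h>
#include <stdbool.h>

/* Illustrative only: the predicted next address carried with the instruction
   that ends a trace line */
static uint64_t predicted_next(uint64_t pc, int len_bytes,
                               bool is_relative_call, int64_t call_offset)
{
    if (is_relative_call)
        return pc + (uint64_t)call_offset;   /* pc-relative call: pc + sign-extended offset */
    return pc + (uint64_t)len_bytes;         /* otherwise fall through (2 or 4 bytes) */
}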
Instruction Fusion
Found an interesting bug in instruction fusion - we weren’t handling 64-bit addw vs. add correctly when
merging an add instruction with a subsequent branch (in this case a subroutine return). This was an
easy bug to find and fix but reminds us that once you start fusing instructions testing starts to
have an N squared sort of behavior - if we get serious about instruction fusing we’ll have to generate
a lot more tests.
On previous CPUs we’ve worked on, we’ve had someone generate enormous quantities of random
instruction streams (especially good for stressing pipelines), as well as streams of instruction
pairs; we’re probably going to eventually need tools like this.
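For flavour, the core of such a tool can be very small - here’s a hypothetical sketch that just sprays random ALU instruction pairs as assembly (a real generator would need constraints on registers, memory and control flow):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: emit n random back-to-back ALU instruction pairs as assembly */
static void emit_random_pairs(int n)
{
    static const char *ops[] = { "add", "addw", "sub", "and", "or", "xor", "sll", "srl" };
    for (int i = 0; i < 2 * n; i++) {
        int op  = rand() % 8;
        int rd  = 1 + rand() % 31;           /* avoid x0 as a destination */
        int rs1 = rand() % 32;
        int rs2 = rand() % 32;
        printf("\t%s\tx%d, x%d, x%d\n", ops[op], rd, rs1, rs2);
    }
}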
VMQ overflow
A few months back we rewrote the load/store unit to add much more parallelism - it now processes everything
in 2 clocks and handles 6 address translations and 4 each of loads and stores at once. One area we didn’t touch
was the Virtual Memory Queue (VMQ) - this is a structure that queues TLB misses to the second-level
TLB cache and table walker; it has 12 entries and in the worst case it must be able to accept 6 TLB misses per
clock. If the queue fills, commitQ instructions must be able to back off and try again.
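As a toy model of that back-off behaviour (the 12-entry/6-miss numbers are from the text, the rest is invented):

#include <stdbool.h>

#define VMQ_ENTRIES        12
#define MAX_MISSES_PER_CLK 6

typedef struct {
    int occupancy;    /* entries waiting on the second-level TLB / table walker */
} vmq_t;

/* Returns true if this clock's TLB misses were all accepted; false means the
   commitQ instructions must back off and retry later (toy model only). */
static bool vmq_try_accept(vmq_t *q, int misses)
{
    if (misses < 0 || misses > MAX_MISSES_PER_CLK)
        return false;                          /* worst case is 6 misses per clock */
    if (q->occupancy + misses > VMQ_ENTRIES)
        return false;                          /* queue would overflow: back off */
    q->occupancy += misses;                    /* otherwise enqueue them */
    return true;
}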
We had a stress test for this, but after the load/store unit rewrite it stopped forcing VMQ overflows.
Now we have a test that does, and it found new bugs caused by the rewrite; luckily we were able to find a fix
with fewer/faster gates than the previous logic.
What’s Next?
Going to spend some more time with the trace cache.
Next time: More trace stuff
11 Mar 2023
Introduction
Last week we found some Dhrystone performance we’d left unreported from when we implemented
the B instructions, but for a couple of months now our performance has largely been limited
by the issue rate in the fetch/decode stages; in Dhrystone this is limited by the branchy nature of the instruction
stream.
This week we resurrected our broken trace cache and got a further 5% of performance (now at 10.3 DMips/MHz).
Trace Cache
Last year we spent quite a while building a simple trace cache for VRoom! The experiment
was successful in the sense that it actually worked, but not much more than that, because its performance sucked -
it never performed better than not having a trace cache.
So what is a trace cache in this context? Essentially it’s a virtually tagged cache of already-decoded
instructions recorded from the spot in the CPU where we retire instructions - in particular it’s a
recorded stream of post-branch-predictor instructions. Our trace cache is 8 instructions
wide and is capable of feeding a full 8 instructions per clock into the renamer.
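Conceptually the fill side works something like this - retired, already-decoded instructions are appended to the line being built until it’s full (or, as described in the entries above, we hit a call or return). A rough sketch reusing the trace_line_t structure sketched in the May entry above:

/* Conceptual fill logic only - append a retired instruction to the line being built */
static void tc_record(trace_line_t *fill, uint64_t pc, uint32_t decoded,
                      bool end_of_line, uint64_t next_pc)
{
    if (fill->count == 0)
        fill->tag = pc;                       /* the line is tagged by its first (virtual) pc */
    fill->decoded[fill->count++] = decoded;   /* post-branch-predictor, already decoded */
    if (fill->count == TC_INSTS || end_of_line) {   /* full line, or a call/return cut it short */
        fill->next_pc = next_pc;              /* predicted next address for the PC unit */
        fill->valid   = true;                 /* line complete - install it in the cache */
        /* ...write into trace_cache[] and start a new fill line... */
    }
}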

As mentioned above, our previous experiment at creating a simple trace cache sucked, mostly
because it interacted badly with the branch target cache (BTC) - when we use the BTC alone
Dhrystone is predicted perfectly; when we mix it with a trace cache we see mispredicted
branches and the associated performance drain.
Now that we’ve been improving the execution side of things
for a while, but not the issue side, it was worth revisiting the broken trace cache -
and to our surprise it now performs better than not having it. We see ~10.3 DMips/MHz, which is
~5% faster than last week’s numbers - even with the broken trace cache taking mispredicted branches.
Here’s a great example trace from the middle of Dhrystone:

Note that it starts with a mispredicted branch, but after that we see a bunch of mostly full feeds from
the trace cache - runs of 8 instructions all starting on the same clock.
In fact we measure an
issue rate of 6.2 instructions per clock over the benchmark (previously we were getting ~3.7).
Sadly though, many of those are being discarded by mispredicted branches (which flush all the instructions
after them in the commitQ when they resolve); our final IPC (instructions
per clock, which is measured after branch prediction - a count of actual instructions completed) is now at 3.77 (out of 8). Our design goal has
always been 4 on sustained streams; that we’re getting this close on a branchy benchmark is pretty
amazing!
What’s Next?
It’s pretty obvious that we need to fix the trace cache and up the issue rate - the next step is likely to be
to have the BTC learn to predict whether a trace cache hit should be followed or not.
Next time: More trace stuff
07 Mar 2023
Introduction
This week we’ve done some more experiments in macro op fusion, and found a surprising, unrelated
performance boost - ~12%, to 9.76 DMips/MHz - almost at 10!
Limits
A reminder from last time, we’ve set some limits on our macro op fusion experiments:
- we’re going to avoid creating new instructions that would require
adding more read/write register ports to ALUs
- we’re also going to avoid (this time around)
trying to deal with instructions where the second of a pair might fault and not be executed.
For example if we fuse ‘lui t0, hi(xx); ld t0, lo(xx)(t0)’ the load might take a page fault
and get restarted expecting that the side effect of the lui has already happened - it’s not that
this can’t be done, just that it’s hard and we’re performing a simple experiment.
Choices
Last time we merged return instructions with preceding add instructions, and constant loads with compares.
This time around we’ve added 4 more optimizations: merging lui/auipc instructions with add instructions, and lui instructions with load/store instructions. These are the ‘traditional’ candidates for fusion.
Planning ahead for this, we’d previously plumbed 32-bit constants everywhere through our design; up until now
they’ve either been high 20-bit lui/auipc values or low 12-bit signed constants. Now we’ve added logic to
combine these values - if we limit merging to constants that the decoder actually makes then it doesn’t
require a full 32-bit adder per renamer (7 of them for 8 renamers,
to merge the data between two adjacent units): instead the lower 12 bits just get passed through and the upper 20 bits
have an optional decrementer. We share this logic among the 4 newly merged instruction pairs.
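Numerically the trick works because the lo() part is a sign-extended 12-bit value: when it’s negative the combined constant is just the hi() part minus one in its upper bits, with the low 12 bits passed through unchanged. A quick C sketch of the idea (illustrative only, not the renamer logic):

#include <stdint.h>

/* Combine a lui/auipc hi20 value with a 12-bit lo immediate without a full
   32-bit add: the low 12 bits pass straight through and the upper 20 bits get
   an optional decrement when the lo part is negative once sign-extended.
   e.g. X = 0x12345FFC -> hi20 = 0x12346, lo12 = 0xFFC (-4):
        merge_const(0x12346, 0xFFC) == 0x12345FFC */
static uint32_t merge_const(uint32_t hi20, uint32_t lo12)
{
    uint32_t hi = hi20;
    if (lo12 & 0x800)            /* lo12 sign bit set: sign-extension borrows 1 */
        hi -= 1;                 /* the optional decrementer */
    return (hi << 12) | (lo12 & 0xfff);
}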
The 4 new instruction pairs we’re supporting are:
1)
lui a, #hi(X)
add a, a, #lo(X)
becomes:
nop
lui a, #X
2)
auipc a, #hi(X)
add a, a, #lo(X)
becomes:
nop
auipc a, #X
3)
lui a, #hi(X)
ld.x y, #lo(X)(a)
becomes:
lui a, #hi(X)
ld.x y, #X(x0)
4)
lui a, #hi(X)
st.x y, #lo(X)(a)
becomes:
lui a, #hi(X)
st.x y, #X(x0)
Some notes:
- as per last time nops get swallowed in the commit stage
- the new auipc created in 2) gets the pc of the original auipc
- load/store instructions still perform the lui (so that exceptions can be restarted; this doesn’t
break the second limitation above), but they can run a clock
earlier because they don’t have to wait for the lui - we may remove this limitation in the future
Like previous optimizations this doesn’t affect Dhrystone numbers at all, particularly because
Dhrystone doesn’t use any of these in its core. Nothing surprising here …. however ….
However this week we also did some ‘maintenance’. We’d been using an old library for our benchmarks,
but a few weeks back we switched to supporting the B and K RISC-V architectural extensions; the B
extension in particular contains instructions designed to help with string processing, and strcmp
is a core part of Dhrystone. This week we upgraded our benchmarks to use the new libraries, and we also optimized
them for VRoom! (mostly this involved aligning code to 16-byte boundaries).
The results have been better than we expected - a 12% increase in Dhrystone numbers, to 9.76 DMips/MHz - a
better number than we’d previously assumed possible without a trace cache. We’re
now seeing an IPC of 3.58 (max 8) and an issue rate of 3.66 (out of 8). In the future, with a functioning
trace cache (and an issue rate close to 8), we can predict that the maximum performance could be
in the ~20 DMips/MHz range (assuming that
our execution-side improvements keep up; at this point they’re already a bit ahead).
So excellent news - not from recent changes, but essentially due to updates from a month or so ago whose effects hadn’t been fully counted yet.
Next time: More misc extensions
05 Mar 2023
Introduction
Spent some time this week experimenting with macro op fusion. Many of the
arguments around the RISC-V ISA have been addressed with “you can solve that
with macro op fusion” - the idea that you can take adjacent instructions
in an instruction stream and merge them into a single instruction
that can’t by itself be expressed directly in the instruction set.
We’ve always planned that VRoom! would support fusion that made sense in the
renamer stage (post decode) - today’s work is a trial experiment fusing some
common sequences just to get a feel for how hard they are to implement.
Limits
First some limits:
- we’re going to avoid creating new instructions that would require
adding more read/write register ports to ALUs
- we’re also going to avoid (this time around)
trying to deal with instructions where the second of a pair might fault and not be executed.
For example if we fuse ‘lui t0, hi(xx); ld t0, lo(xx)(t0)’ the load might take a page fault
and get restarted expecting that the side effect of the lui has already happened - it’s not that
this can’t be done, just that it’s hard and we’re performing a simple experiment.
Choices
We decided to choose 3 simple instruction pairs. These occur reasonably often in instruction streams,
and because they involve fused branches and ALU operations they work particularly well with
our change a while back to merge our ALU and branch units. The downside is that previously
it was possible to build ALUs which shared the adder used for normal addition operations
with the magnitude comparator used for branches; now we need to be able to do both at once.
The first choice is to merge an add-immediate (or add to x0) instruction with an indirect branch that doesn’t
write an output register - this catches a whole bunch of common code that happens at the end
of a subroutine:
add sp, sp, 40
ret
li a0, 0
ret
mv a0, s0
ret
The one limitation is that the destination of the add must not be the indirect register used by the jump/ret.
The second and third choices marry a load-immediate-to-register instruction (an add immediate to x0) with a conditional
branch comparing that register with another, into one instruction:
li a0, #5
beq a0, a1, target
li a0, #5
beq a1, a0, target
becomes:
li a0, #5;beq #5, a1, target
li a0, #5;beq a1, #5, target
This mostly means the branch instruction runs one clock earlier, which makes no difference in
performance if it
is correctly predicted (a predicted branch just checks that it did the right thing and
quietly continues on), but reduces mispredicted branch times by effectively 1 clock.
All 3 optimizations free up an ALU slot for another instruction.
Implementation
We perform this optimization in our renamer - remember we have 8 renamers renaming up to 8 decoded
instructions from the previous clock. We add logic to each renamer unit to recognize the types of
addition operations we want to recognize for the first instruction, and pass that information to the
next renamer (the first renamer gets told “no instructions match”).
Each renamer also recognizes the second instruction of the pair and gates that with the match from the
previous renamer in the chain. Care is taken to avoid logic paths that run through the full
layer of renamers.
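A rough software analogue of that chain (the flags and names here are made up, but the structure - each renamer combining its own decode with a match flag from its neighbour - is as described above; the real logic has more conditions and is careful about path lengths):

#include <stdbool.h>

typedef struct {
    bool is_fusable_add;     /* add/li/mv form that may be merged with what follows */
    bool is_fusable_branch;  /* ret/indirect jump or compare-and-branch per the rules above */
    bool fused_with_prev;    /* this op becomes the merged add+branch */
    bool becomes_nop;        /* the preceding add no longer writes a register */
} rename_slot_t;

/* One pass over the 8 renamers: renamer 0 is told "no instructions match" */
static void fuse_pass(rename_slot_t slot[8])
{
    bool prev_is_add = false;
    for (int i = 0; i < 8; i++) {
        slot[i].fused_with_prev = prev_is_add && slot[i].is_fusable_branch;
        if (slot[i].fused_with_prev)
            slot[i - 1].becomes_nop = true;   /* nullify the add; commit marks it done */
        prev_is_add = slot[i].is_fusable_add;
    }
}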
If a renamer matches a branch (and a previous add) then it modifies its ‘control’ field (the portion
that describes what sort of operation is being done), tags itself as now creating an output register
(the addition’s output register), and copies the immediate portion of the addition
along with its source register (which in the ret case it fetches as rs2).
If a renamer discovers that it’s feeding the add of a fused pair then it marks itself as not creating
an output (it doesn’t write a register); this effectively makes it a no-op - the commit stage
will mark the instruction as done and it will never be scheduled to an ALU. The renamer logic
already contains a layer that packs down decoded instructions; if we used that to remove this instruction
it would create circular logic that wouldn’t be synthesisable. If we removed it in the next stage
(by adding a ‘pack’ layer to the logic that feeds the commitQ) it would be very expensive and would
break the renaming (which renames to future commitQ locations) - so instead we create a no-op
entry (but see below).
The actual implementation in the ALU is pretty simple: we have to allow for an r2+immed addition
and for immediates to be fed into the branch comparator.
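Behaviourally the fused op just does both halves at once - something like this sketch (not the RTL):

#include <stdint.h>
#include <stdbool.h>

/* Behavioural sketch only: the fused op performs an rs2+immediate addition
   while the branch comparator can take an immediate in place of a register */
static void fused_alu_branch(int64_t rs1, int64_t rs2,
                             int64_t add_imm, int64_t cmp_imm, bool cmp_uses_imm,
                             int64_t *add_result, bool *cmp_equal)
{
    *add_result = rs2 + add_imm;                          /* the r2+immed addition */
    *cmp_equal  = rs1 == (cmp_uses_imm ? cmp_imm : rs2);  /* immediate fed to the comparator */
}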
Further optimization
You likely noticed that we can’t optimize add/branch pairs that cross instruction bundle boundaries (the boundaries in the instruction stream that our 4-8 instruction decoder decodes from). Once we implement
a trace cache, instruction streams will be presented to the renamer again each time they are replayed;
if such a split instruction pair is presented again here it can be re-optimized.
If we’re running a trace cache we can also purge no-op instructions from the cached instruction stream
as it’s filled.
Here’s a simple example of an add/ret optimization, before the return (in black)

and after:

Notice that the add (at 0x36d4) is nullified; all that green in the instructions around there
means the ALUs are busy, so removing one helps. Those blue dotted lines coming down out of the
add instruction (dependencies on the new SP) now come out of the branch (return).
So we know it’s working - how much faster is it on our favorite Dhrystone benchmark? Like last time it makes no
difference at all - at the moment our performance on that benchmark is completely limited by the issue
rate from the instruction fetch/decoders.
We actually don’t expect these optimizations to help Dhrystone much anyway, as all its branches are
predicted, and the big advantage of executing branches early is that it speeds up mispredictions. However
it will also speed up instruction streams that are ALU-bound, as it frees up a slot.
Any design is a balance between the decode rate and the
rate at which the decoded instructions are executed; that balance is going to change with different
sets of code.
Essentially our current bunch of changes, and last week’s, are building pressure on the execution side
of that trade-off. We can’t easily speed up the decoder, but we will return to our trace cache design,
which should be able to issue 8 instructions/clock - and to support that we’ll need more execution capability.
Next time: More misc extensions