VRoom! blog - Adding Branch Prediction to the Trace Cache

Introduction

We’ve been working on the Trace Cache - if you read back a couple of blog entries you will remember that it was performing better than not having a trace cache, but suffered from mispredicted branches (while the equivalent non-trace-cache cases didn’t).

Now we’ve upgraded the trace cache to include some local prediction history - it gives us another 10% performance, bringing our Dhrystone number up to 11.3 DMips/MHz.

Trace Cache

Our original trace cache had N (N=64) lines of I (I=8) instructions. Each cache line has 1-8 instructions and a ‘next instruction’ entry telling the PC controller where to go next. We terminate filling a line when we find a subroutine call or return - because we want to keep the branch predictor’s call/return predictor up to date and use its results on returns.
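
As a rough mental model (the real implementation is RTL - the structure and names below are invented purely for illustration), a trace cache line looks something like this:

#include <stdbool.h>
#include <stdint.h>

#define TC_LINES 64           /* N - number of trace cache lines        */
#define TC_WIDTH  8           /* I - max decoded instructions per line  */

typedef struct { uint32_t bits; } decoded_insn_t;   /* stand-in for a decoded op */

/* One trace cache line: up to I already-decoded instructions plus a
 * 'next instruction' address for the PC controller.  Lines are cut short
 * at calls/returns so the BTC's call/return predictor stays up to date. */
typedef struct {
    bool           valid;
    uint64_t       start_pc;          /* virtual tag - where the line begins   */
    int            count;             /* 1..TC_WIDTH instructions in the line  */
    decoded_insn_t insn[TC_WIDTH];
    uint64_t       next_pc;           /* default next instruction to fetch     */
    bool           ends_in_call_ret;  /* line was terminated at a call/return  */
} trace_line_t;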

In essence each line already has some branch prediction capability: it predicts all the branches in the line (a line could contain up to 8 branches), and predicts where the last instruction jumps to (or what instruction it is followed by).

So the problem we’re trying to solve here is: for each trace line that sometimes runs to completion and sometimes takes a branch from the middle (or even the last instruction) of the line, what happens? What we need to predict is:

  • Which of the first J of the I instructions in the line will be executed
  • What will be the next instruction to execute after the Jth instruction is executed

Our normal (non-trace) branch prediction uses a dual predictor that predicts from either a global branch history or a simple bimodal predictor - we essentially already have the simple predictor in place in our trace line: it predicts a next instruction address, and that all the valid instructions in the line will be executed.

What we’ve done is add a local prediction history to each cache line: it’s M (initially M=5) bits, updated on every fetch and every branch misprediction - a 1 if the branch should have been taken and a 0 if it wasn’t.

For every line we also have A (initially A=2) sets of <next instruction, num of instructions to issue> tuples; each set also has B (initially B=4) sets of history masks and 2-bit counters. Essentially, if the line hits in the cache, and its local history matches one of the masks whose associated counter isn’t 0, then we predict a non-default branch: its destination comes from the associated <next instruction> and only the associated <num of instructions to issue> instructions are issued - otherwise the entire line and the default next instruction are used.
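
A minimal sketch of the lookup side, using the A/B/M parameters above (this is our own C model, not the actual hardware - the structure and names are assumptions):

#include <stdbool.h>
#include <stdint.h>

#define A 2     /* <next instruction, issue count> tuples per line  */
#define B 4     /* history masks + 2-bit counters per tuple         */
#define M 5     /* bits of local history per line                   */

typedef struct {
    bool     in_use;
    uint64_t next_pc;        /* alternate 'next instruction'               */
    int      issue_count;    /* how many of the line's instructions issue  */
    bool     mask_valid[B];
    uint8_t  mask[B];        /* M-bit local history patterns               */
    uint8_t  counter[B];     /* 2-bit saturating confidence counters       */
} alt_target_t;

typedef struct {
    uint8_t      history;    /* M-bit local history for this line */
    alt_target_t alt[A];
} line_pred_t;

/* On a trace cache hit: if the local history matches an allocated mask and
 * its counter is non-zero, predict the associated non-default target and
 * issue only part of the line; otherwise issue the whole line and follow
 * the line's default next address. */
static void tc_predict(const line_pred_t *p,
                       uint64_t default_next_pc, int line_count,
                       uint64_t *next_pc, int *issue_count)
{
    for (int a = 0; a < A; a++) {
        if (!p->alt[a].in_use)
            continue;
        for (int b = 0; b < B; b++) {
            if (p->alt[a].mask_valid[b] &&
                p->alt[a].mask[b] == p->history &&
                p->alt[a].counter[b] != 0) {
                *next_pc     = p->alt[a].next_pc;
                *issue_count = p->alt[a].issue_count;
                return;
            }
        }
    }
    *next_pc     = default_next_pc;   /* default: the whole line */
    *issue_count = line_count;
}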

History masks, counters and <next instruction, num of instructions to issue> tuples are allocated on mispredicted branches - counters are incremented on correct predictions, decremented on mispredictions, and slowly decay if not used over time (as do trace cache entries themselves). If a <next instruction, num of instructions to issue> tuple runs out of history masks to allocate then another tuple with the same results can be allocated.
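
Continuing the same toy model from the sketch above, the update side might look like this - strengthen a counter on a correct prediction, weaken it on a misprediction, and allocate a new mask (or a new tuple with the same target) when a misprediction had no matching entry. Decay of idle entries is left out for brevity:

/* Claim a history-mask slot under a tuple with the same target, or a free
 * tuple - victim selection and decay are omitted from this sketch. */
static void tc_allocate(line_pred_t *p, uint64_t next_pc, int issue_count)
{
    for (int a = 0; a < A; a++) {
        alt_target_t *t = &p->alt[a];
        if (t->in_use && t->next_pc != next_pc)
            continue;                         /* tuple belongs to another target */
        for (int b = 0; b < B; b++) {
            if (!t->mask_valid[b]) {
                t->in_use        = true;
                t->next_pc       = next_pc;
                t->issue_count   = issue_count;
                t->mask[b]       = p->history;
                t->counter[b]    = 1;
                t->mask_valid[b] = true;
                return;
            }
        }
    }
}

/* Called when the line's behaviour resolves.  pred_a/pred_b identify the
 * entry that made the prediction (-1 if the default was used). */
static void tc_update(line_pred_t *p, int pred_a, int pred_b, bool correct,
                      bool took_nondefault,
                      uint64_t actual_next_pc, int actual_issue_count)
{
    if (pred_a >= 0) {
        uint8_t *c = &p->alt[pred_a].counter[pred_b];
        if (correct) { if (*c < 3) (*c)++; }   /* strengthen */
        else         { if (*c > 0) (*c)--; }   /* weaken     */
    }
    if (!correct && took_nondefault)
        tc_allocate(p, actual_next_pc, actual_issue_count);

    /* shift the M-bit local history on every fetch/resolution:
     * 1 = a non-default branch out of the line was taken */
    p->history = (uint8_t)(((p->history << 1) | (took_nondefault ? 1u : 0u))
                           & ((1u << M) - 1));
}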

Having multiple <next instruction, num of instructions to issue> tuples per line is intended to handle the case of a line with multiple mispredicted branches in it - something like a decision tree in a case statement or the like. All this stuff is parameterized - it may turn out after a lot of simulation that this doesn’t happen much and that we just need A=1 (one next address/etc.) but a larger B (more masks) and/or larger/smaller masks.

Results

On our Dhrystone benchmark we now get 10% more performance than our previous results - 11.31 DMips/MHz. We’re seeing an issue rate of 6.12 instructions per clock but are executing 4.15 instructions per clock (IPC). This is with 3 ALUs; if we instantiate 4 ALUs performance increases by a tiny fraction of a percent, by itself not really worth doing. Theoretically, if we could get the execution rate to match the issue rate we’d see ~16-17 DMips/MHz - another 50% more performance.

Why aren’t we seeing this? Largely it seems to be because of stalls in the memory subsystem, mostly loads and stores waiting for stores that haven’t resolved their addresses yet. There are various ways to potentially resolve this, mostly speculative loads that might cause cache activity and as a result leave us open to Meltdown-like security issues - we’ve been pretty conservative in this area and probably won’t return to it for a while.

Despite this, >4 IPC is pretty amazing: 4 IPC was our initial goal for carefully crafted media/etc. kernels, so exceeding it with branchy, compiled C code is great!

Dhrystone has also been done to death here; it’s been useful because it’s helped us balance our fetch/decode and execution stages. Next time we do serious benchmarking we’ll probably build a much wider range of tests (we need to tweak the FP performance, for example).

What’s Next?

It’s time to work on the vector ALUs - we have all the components in place and a core that can feed them efficiently.

Next time: Vectors!

VRoom! blog - bugs bugs bugs

Introduction

The last couple of weeks we’ve been running and writing regression tests - found some interesting bugs ….

Our current full regression testing regime takes ~18 hours.

Trace Cache

The hardest problem to track down was in one of the generic RISC-V compliance tests - the test ran OK and generated the correct output, but the actual output phase emitted an extra newline between each line of output.

This turned out to be a trace cache related bug - each trace cache line includes a predicted next address; usually this is simply the address of the next instruction after the last instruction in the line, or, if the last instruction is a branch, the target of the branch. In trace cache lines that include subroutine calls and returns we cut the lines short so that the BTC’s call/return stack can be kept up to date; in these cases the last instruction in the line is the call or return.

In this case the problem instruction was a relative subroutine call (not one done indirectly through a register) - a unique case: in our design there are two sorts of unconditional jump instructions - subroutine calls, which go into the commit queue because they have to write a return register, and normal unconditional branches, which are effectively swallowed by the fetch/decode stage and never reach the commit queue (they don’t even get counted when we report the number of instructions executed or IPC numbers - we’re actually slightly faster than we report because of this).

Why was this instruction different? Well, as we pass an instruction bundle through the machine we carry with every branch instruction the predicted address of its target. Initially we just used this for triggering BTC misses, but when we implemented the trace cache we also started using it to generate the trace cache’s predicted next address. It turns out that for relative unpredicted jump instructions the program counter unit was generating the predicted address incorrectly; prior to its use in the trace cache it was being ignored (because the branch unit doesn’t need to check whether an unpredicted branch branched to the correct place). This was one of those cases where the failure occurred a few million clocks after the actual bug.
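
For what it’s worth, the ‘predicted’ address of a pc-relative jal isn’t really a prediction at all - it’s fully determined at decode time. A quick standalone sketch of the value the trace cache now relies on (standard RISC-V J-type decoding, not VRoom!’s actual decoder):

#include <stdint.h>

/* Target of a pc-relative jal: pc plus the sign-extended J-type immediate. */
static uint64_t jal_target(uint64_t pc, uint32_t inst)
{
    uint64_t imm = (((uint64_t)inst >> 31) & 0x1)   << 20   /* imm[20]    */
                 | (((uint64_t)inst >> 21) & 0x3ff) << 1    /* imm[10:1]  */
                 | (((uint64_t)inst >> 20) & 0x1)   << 11   /* imm[11]    */
                 | (((uint64_t)inst >> 12) & 0xff)  << 12;  /* imm[19:12] */
    if (imm & (1ull << 20))
        imm |= ~((1ull << 21) - 1);                          /* sign-extend */
    return pc + imm;
}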

Instruction Fusion

Found an interesting bug in instruction fusion - we weren’t handling 64-bit addw vs. add correctly when merging an add instruction with a subsequent branch (in this case a subroutine return). This was an easy bug to find and fix, but it reminds us that once you start fusing instructions, testing starts to have an N-squared sort of behavior - if we get serious about instruction fusion we’ll have to generate a lot more tests.
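
For reference (this is just the architectural semantics, not our RTL): on RV64 add is a full 64-bit addition, while addw adds the low 32 bits and sign-extends the 32-bit result - exactly the distinction the fusion logic has to preserve:

#include <stdint.h>

static uint64_t rv64_add(uint64_t a, uint64_t b)
{
    return a + b;                                      /* full 64-bit addition */
}

static uint64_t rv64_addw(uint64_t a, uint64_t b)
{
    int32_t w = (int32_t)((uint32_t)a + (uint32_t)b);  /* 32-bit wrap          */
    return (uint64_t)(int64_t)w;                       /* sign-extend to 64    */
}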

On previous CPUs we’ve worked on, we’ve had someone generate enormous quantities of random instruction streams (especially good for stressing pipelines) as well as streams of instruction pairs; we’re probably going to need tools like this eventually.

VMQ overflow

A few months back we rewrote the load/store unit to add much more parallelism - it now processes everything in 2 clocks and handles 6 address translations and 4 each of loads and stores at once. One area we didn’t touch was the Virtual Memory Queue (VMQ) - this is a structure that queues TLB misses to the second-level TLB cache and table walker. It has 12 entries, and worst case it must be able to accept 6 TLB misses per clock. If the queue fills, commitQ instructions must be able to back off and try again.
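
As a toy model of the constraint (names and structure are ours, not the RTL): the queue has 12 entries, up to 6 misses can arrive in a single clock, and anything that can’t be accepted has to back off and retry from the commitQ:

#include <stdbool.h>

#define VMQ_ENTRIES          12
#define MAX_MISSES_PER_CLOCK  6

typedef struct { int occupancy; } vmq_t;

/* Returns true if a TLB miss was accepted this clock; a false return means
 * the requesting commitQ instruction must back off and try again later. */
static bool vmq_try_push(vmq_t *q)
{
    if (q->occupancy >= VMQ_ENTRIES)
        return false;
    q->occupancy++;
    return true;
}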

We had a stress test for this, but after the load/store unit rewrite it stopped forcing VMQ overflows. Now we have a test that does, and it found new bugs caused by the rewrite - luckily we were able to find a fix with fewer/faster gates than the previous logic.

What’s Next?

Going to spend some more time with the trace cache.

Next time: More trace stuff

VRoom! blog - Trace Cache Redux

Introduction

Last week we found some Dhrystone performance we’d left unreported from when we implemented the B instructions, but largely, for a couple of months now, our performance has been limited by the issue rate in the fetch/decode stages; in Dhrystone this is limited by the branchy nature of the instruction stream.

This week we resurrected our broken trace cache and got a further 5% performance (now at 10.3 DMips/MHz).

Trace Cache

Last year we spent quite a while building a simple trace cache for VRoom! The experiment was successful in the sense that it actually worked, but not otherwise, because its performance sucked - it never performed better than not having a trace cache.

So what is a trace cache in this context? Essentially it’s a virtually tagged cache of already-decoded instructions, recorded from the spot in the CPU where we retire instructions - in particular it’s a recorded stream of post-branch-predictor instructions. Our trace cache is 8 instructions wide and is capable of feeding a full 8 instructions per clock into the renamer.

placeholder

As mentioned above, our previous experiments with a simple trace cache sucked - mostly because it interacted badly with the branch target cache (BTC): when we use the BTC alone Dhrystone is predicted perfectly, but when we mix it with a trace cache we see mispredicted branches and the associated performance drain.

Performance

We’ve been improving the execution side of things for a while, but not the issue side, so it was worth revisiting the broken trace cache - to our surprise it now performs better than not having it. We see ~10.3 DMips/MHz, which is ~5% faster than last week’s numbers - even with the broken trace cache taking mispredicted branches.

Here’s a great example trace from the middle of Dhrystone:

placeholder

Note that it starts with a mispredicted branch, but after that we see a bunch of mostly full feeds from the trace cache - runs of 8 instructions all starting on the same clock.

In fact we measure an issue rate of 6.2 instructions per clock over the benchmark (previously we were getting ~3.7). Sadly, though, many of those instructions are being discarded by mispredicted branches (which flush all the instructions after them in the commitQ when they resolve). Our final IPC (instructions per clock - measured after branch prediction, a count of actual instructions completed) is now at 3.77 (out of 8). Our design goal has always been 4 on sustained streams, so getting this close on a branchy benchmark is pretty amazing!

What’s Next?

It’s pretty obvious that we need to fix the trace cache and raise the issue rate - the next step is likely to have the BTC learn to predict whether a trace cache hit should be followed or not.

Next time: More trace stuff

VRoom! blog - More Experiments in Macro Op Fusion

Introduction

This week we’ve done some more experiments in macro op fusion, and found an unrelated, surprising performance boost - ~12%, to 9.76 DMips/MHz - almost at 10!

Limits

A reminder from last time, we’ve set some limits on our macro op fusion experiments:

  • we’re going to avoid creating new instructions that would require adding more read/write register ports to ALUs
  • we’re also going to avoid (this time around) trying to deal with instructions where the second of a pair might fault and not be executed. For example, if we fuse ‘lui t0, hi(xx);ld t0, lo(xx)(t0)’ the load might take a page fault and get restarted expecting that the side effect of the lui has already happened - it’s not that this can’t be done, just that it’s hard and we’re performing a simple experiment

Choices

Last time we merged return instructions and preceding add instructions, and constant loads and compares.

This time around we’ve added 4 more optimizations: merging lui/auipc instructions with add instructions, and lui instructions with load/store instructions. These are the ‘traditional’ candidates for fusion.

Planning ahead for this, we’d previously plumbed 32-bit constants throughout our design; up until now they’ve either been high 20-bit lui/auipc values or low 12-bit signed constants. Now we’ve added logic to combine these values - if we limit merging to constants that the decoder actually makes then it doesn’t require a full 32-bit adder per renamer (7 of them for 8 renamers, to merge the data between two adjacent units); instead the lower 12 bits just get passed through and the upper 20 bits have an optional decrementer. We share this logic for the 4 newly merged instructions.
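
As a worked example of why only an optional decrementer is needed (our own illustration, assuming the usual %hi/%lo split the compiler emits): %lo is used sign-extended, so %hi is rounded up whenever bit 11 of the constant is set, and recombining just undoes that rounding:

#include <stdint.h>

/* hi20 = the 20-bit field from lui/auipc, lo12 = the 12-bit field from the
 * add/load/store.  The merged constant is (hi20 << 12) + sext(lo12), which
 * is the low 12 bits passed through plus the upper 20 bits optionally
 * decremented - no full-width adder required. */
static uint32_t merge_hi_lo(uint32_t hi20, uint32_t lo12)
{
    uint32_t upper = hi20 & 0xfffff;
    if (lo12 & 0x800)          /* low half is negative once sign-extended */
        upper -= 1;            /* optional decrement, no carry chain      */
    return (upper << 12) | (lo12 & 0xfff);
}

For example, X = 0x12345fff splits into %hi = 0x12346 and %lo = 0xfff (i.e. -1); merging decrements the upper part back to 0x12345 and passes the low 12 bits straight through.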

The 4 new instruction pairs we’re supporting are:

1)

lui	a, #hi(X)
add	a, a, #lo(X)

becomes:

nop
lui	a, #X

2)

auipc	a, #hi(X)
add	a, a, #lo(X)

becomes:

nop
auipc	a, #X 

3)

lui	a, #hi(X)
ld.x	y, #lo(X)(a)

becomes:

lui	a, #hi(X)
ld.x	y, #X(x0)

4)

lui	a, #hi(X)
st.x	y, #lo(X)(a)

becomes:

lui	a, #hi(X)
st.x	y, #X(x0)

Some notes:

  • as per last time nops get swallowed in the commit stage
  • the new auipc created in 2) keeps the pc of the original auipc
  • load/store instructions still perform the lui (so that exceptions can be restarted - it doesn’t break the second limitation above), but they can run a clock earlier because they don’t have to wait for the lui - we may remove this limitation in the future

Performance

Like previous optimizations, this doesn’t affect the Dhrystone numbers at all, mostly because Dhrystone doesn’t use any of these sequences in its core. Nothing surprising here …. however ….

More Performance

However, this week we also did some ‘maintenance’. We’d been using an old library for our benchmarks, but a few weeks back we switched to supporting the B and K RISCV architectural extensions; the B extension in particular contains instructions designed to help with string processing, and strcmp is a core part of Dhrystone. This week we upgraded to the new libraries and also optimized them for VRoom! (mostly this involved aligning code to 16-byte boundaries).

The results have been better than we expected - a 12% increase in Dhrystone numbers, to 9.76 DMips/MHz - a better number than we’d previously assumed possible without a trace cache. We’re now seeing an IPC of 3.58 (max 8) and an issue rate of 3.66 (out of 8). In the future, with a functioning trace cache (and an issue rate close to 8), we can predict that the maximum performance could be in the ~20 DMips/MHz range (assuming that our execution-side improvements keep up - at this point they’re already ahead a bit).

So excellent news - not from recent changes, but essentially due to updates from a month or so ago that hadn’t been fully counted yet.

Next time: More misc extensions

VRoom! blog - Experiments in Macro Op Fusion

Introduction

Spent some time this week experimenting with macro op fusion. Many of the arguments around the RISCV ISA have been addressed with “you can solve that with macro op fusion” - the idea that you can take adjacent instructions in an instruction stream and merge them into a single instruction that by itself can’t be expressed directly in the instruction set.

We’ve always planned that VRoom! would support fusion that made sense in the renamer stage (post decode) - today’s work is a trial experiment fusing some common sequences just to get a feel for how hard they are to implement.

Limits

First some limits:

  • we’re going to avoid creating new instructions that would require adding more read/write register ports to ALUs
  • we’re also going to avoid (this time around) trying to deal with instructions where the second of a pair might fault and not be executed. For example, if we fuse ‘lui t0, hi(xx);ld t0, lo(xx)(t0)’ the load might take a page fault and get restarted expecting that the side effect of the lui has already happened - it’s not that this can’t be done, just that it’s hard and we’re performing a simple experiment

Choices

We decided to choose 3 simple instruction pairs. These occur reasonably often in instruction streams, and because they involve fused branches and ALU operations they work particularly well with our change a while back to merge our ALU and branch units. The downside is that previously it was possible to build ALUs that shared the adder used for normal addition operations and the magnitude comparator used for branches; now we need to be able to do both at once.

The first choice is to merge an add immediate (or an add to x0) instruction with an indirect branch that doesn’t write an output register; this catches a whole bunch of common code that happens at the end of a subroutine:

add	sp, sp, 40
ret

li	a0, 0
ret

mv	a0, s0
ret

The one limitation is that the destination of the add must not be the indirect register used by the jump/ret.

The second and third choices marry a load immediate to a register (an add immediate to x0) with a conditional branch comparing that register with another, into one instruction:

li	a0, #5
beq	a0, a1, target

li	a0, #5
beq	a1, a0, target

becomes:

li      a0, #5;beq	#5, a1, target

li      a0, #5;beq	a1, #5, target

This mostly means the branch instruction runs one clock earlier, which makes no difference in performance if it is correctly predicted (a predicted branch just checks that it did the right thing and quietly continues on), but reduces mispredicted branch times by effectively 1 clock.

All 3 optimizations free up an ALU slot for another instruction.

Implementation

We perform this optimization in our renamer - remember we have 8 renamers renaming up to 8 decoded instructions from the previous clock. We add logic to each renamer unit to recognize the types of addition operations we want to recognize for the first instruction, and pass that information to the next renamer (the first renamer gets told “no instructions match”).

Each renamer also recognizes the second instruction of the pair and gates that with the match from the previous renamer in the chain. Care is taken to make logic paths that do not go through the full layer of renamers.

If a renamer matches a branch (and a previous add) then it modifies its ‘control’ field (the portion that describes what sort of operation is being done), tags itself as now creating an output register (taking over the addition’s output register), and copies the immediate portion of the addition and the addition’s source register (which, in the ret case, it carries as rs2).

If a renamer discovers that it’s feeding an add to a fused pair then it marks itself as not creating an output (it doesn’t write a register); this effectively makes it a no-op - the commit stage will mark the instruction as done and it will never be scheduled to an ALU. The renamer logic already contains a layer that packs down decoded instructions; if we used that to remove this instruction it would create circular logic that wouldn’t be synthesizable. If we removed it in the next stage (by adding a ‘pack’ layer to the logic that feeds the commitQ) it would be very expensive and would break the renaming (which renames to future commitQ locations) - so instead we create a no-op entry (but see below).
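
To make the data flow concrete, here’s a rough software model of the handshake (again, ours, not the Verilog - field names are invented): each renamer reports “my instruction is a fusable add” to the next one; a renamer that holds a matching ret and sees that signal takes over the add’s register write and immediate, and the add is left as a no-op for the commit stage to retire:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    is_add_imm;       /* add rd, rs1, #imm (covers li/mv forms)     */
    bool    is_ret;           /* indirect branch that writes no register    */
    uint8_t rd, rs1, rs2;
    int32_t imm;
    bool    writes_rd;        /* cleared on the add once it has been fused  */
    bool    fused_with_prev;  /* the branch now also performs the addition  */
} ren_insn_t;

/* Sequential model of what the 8 renamers do in parallel in one clock:
 * renamer 0 is told "no instructions match"; each later renamer gates its
 * own match against the signal from the previous one. */
static void rename_fuse(ren_insn_t insn[], int n)
{
    bool prev_is_add = false;
    for (int i = 0; i < n; i++) {
        bool this_is_add = insn[i].is_add_imm;
        if (prev_is_add && insn[i].is_ret &&
            insn[i - 1].rd != insn[i].rs1) {      /* add must not write the  */
                                                  /* branch's index register */
            insn[i].fused_with_prev = true;
            insn[i].rd        = insn[i - 1].rd;   /* branch takes the write  */
            insn[i].rs2       = insn[i - 1].rs1;  /* add's source, as rs2    */
            insn[i].imm       = insn[i - 1].imm;
            insn[i].writes_rd = true;
            insn[i - 1].writes_rd = false;        /* add becomes a no-op     */
        }
        prev_is_add = this_is_add;
    }
}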

The actual implementation in the ALU is pretty simple: we have to allow for an rs2+immediate addition and for immediates to be fed into the branch comparator.

Further optimization

You likely noticed that we can’t optimize add/branch pairs that cross instruction bundle boundaries (the boundaries in the instruction stream that our 4-8 instruction decoder decodes from). Once we implement a trace cache, instruction streams will be presented to the renamer again each time they are replayed; if such a split instruction pair is presented to it again, it can be re-optimized.

If we’re running a trace cache we can also purge no-op instructions from the cached instruction stream as it’s filled.

Performance

Here’s a simple example of an add/ret optimization, before the return (in black)

placeholder

and after:

placeholder

Notice that the add (at 0x36d4) is nullified. All that green in the instructions around there means that the ALUs are busy, so removing one helps. Those blue dotted lines coming down out of the add instruction (dependencies on the new SP) now come out of the branch (return).

So we know it’s working - how much faster is it on our favorite Dhrystone benchmark? Like last time it makes no difference at all - at the moment our performance on that benchmark is completely limited by the issue rate from the instruction fetch/decoders.

We actually don’t expect these optimizations to help Dhrystone much anyway, as all its branches are predicted, and the big advantage of executing branches early is that it speeds up mispredictions. However it will also speed up instruction streams that are ALU bound, as it frees up a slot.

Any design is a balance between the decode rate and the rate at which the decoded instructions are executed, and that balance is going to change with different sets of code. Essentially our current bunch of changes, and last week’s, are building pressure on the execution side of that trade-off; we can’t easily speed up the decoder, but we will return to our trace cache design, which should be able to issue 8 instructions/clock - to support that we’ll need more execution capability.

Next time: More misc extensions