VRoom! Blog - Branch Target Cache [BTC] (part 1) Predicting Multiple Branches per Clock
07 Nov 2021

This is going to be the first of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. I’m going to start with issues around the Branch Target Cache (BTC). Until recently our BTC has been pretty broken - this was a good thing as it forced our core to exercise its partial pipeline shootdown logic a lot - experience tells us that this is where one finds the hardest bugs… Now that it’s fixed and working let’s talk a bit about how it works.
Background
First some background - modern high-end CPUs speculatively execute instructions across conditional branches. To get high performance, a CPU needs to correctly predict a large proportion of its branches - predicting a branch wrongly can cost 5-10 clocks (and it’s actually worse in multi-issue CPUs like ours).
There’s a whole lot of research around the design of BTCs and predictors; we currently use a combined predictor (see McFarling’s “Combining Branch Predictors” [1]). It uses a pair of different predictors, plus a third that chooses between them:
- the first is a bimodal predictor, which tries to predict branches that mostly go one way; it uses a hash of the program counter (PC) to index into a prediction table
- the second is a global history predictor; it uses a hash of the PC and a history of recent branches to index a second prediction table - this is better at predicting branches that follow some contextual pattern, and
- finally a chooser uses a table indexed by a hash of the PC to predict which of the first two predictors is likely to be right for a particular branch.
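As a concrete illustration, here’s a minimal software model of McFarling’s scheme at instruction granularity (the bundle twist comes later in the post). This is a sketch only - the table sizes, hash functions, and counter widths are our own illustrative assumptions, not VRoom!’s actual parameters:

```python
# Minimal model of a McFarling-style combined predictor: three tables
# of 2-bit saturating counters. Sizes and hashes are illustrative.
TABLE_BITS = 12
MASK = (1 << TABLE_BITS) - 1

bimodal = [2] * (1 << TABLE_BITS)   # per-PC bias
gshare  = [2] * (1 << TABLE_BITS)   # PC ^ global-history pattern
chooser = [2] * (1 << TABLE_BITS)   # which predictor to trust
ghist = 0                           # global history register

def predict(pc):
    b = bimodal[pc & MASK] >= 2
    g = gshare[(pc ^ ghist) & MASK] >= 2
    return g if chooser[pc & MASK] >= 2 else b   # >=2: trust gshare

def _sat(ctr, up):
    return min(ctr + 1, 3) if up else max(ctr - 1, 0)

def update(pc, taken):
    global ghist
    b_idx, g_idx = pc & MASK, (pc ^ ghist) & MASK
    b_ok = (bimodal[b_idx] >= 2) == taken
    g_ok = (gshare[g_idx] >= 2) == taken
    if b_ok != g_ok:                 # train chooser only on disagreement
        chooser[pc & MASK] = _sat(chooser[pc & MASK], g_ok)
    bimodal[b_idx] = _sat(bimodal[b_idx], taken)
    gshare[g_idx]  = _sat(gshare[g_idx], taken)
    ghist = ((ghist << 1) | int(taken)) & MASK
```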
There’s nothing particularly special about using this design; these days it’s relatively conservative - McFarling suggests it should predict 97% of branches correctly. We’ll talk more about how the tables are updated in the next blog post.
Bundles versus instructions
On a more traditional single-instruction-per-clock system, prediction would be applied to every instruction in order. Our system though is different - we decode up to 8 instructions per clock, reading a 128-bit (16-byte) naturally aligned ‘decode-bundle’ on every clock. With the RISC-V C (compressed) extension, instructions can be 16 or 32 bits long and are aligned on 16-bit boundaries, so each of our decode-bundles can contain from 4 to 8 instructions. Depending on where we enter the bundle, and on whether a branch causes us to leave it early, we may actually end up decoding between 1 and 8 instructions per clock.
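To make the bundle arithmetic concrete: an instruction is 32 bits long when the low two bits of its first 16-bit parcel are 11, and 16 bits otherwise, so counting the instructions in a bundle looks something like the sketch below (real hardware decodes every possible start point in parallel rather than walking serially):

```python
def count_instructions(bundle, start_offset=0):
    """Walk a 16-byte decode-bundle from a 16-bit-aligned byte offset.
    A parcel whose low two bits are 0b11 starts a 32-bit instruction;
    anything else is a 16-bit (compressed) instruction."""
    off, n = start_offset, 0
    while off < 16:
        parcel = int.from_bytes(bundle[off:off + 2], 'little')
        off += 4 if (parcel & 0b11) == 0b11 else 2
        n += 1
    return n

assert count_instructions(bytes([0x01, 0x00] * 8)) == 8              # 8 x c.nop
assert count_instructions(bytes([0x13, 0x00, 0x00, 0x00] * 4)) == 4  # 4 x nop
```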
There’s an architectural rule of thumb that (very!) roughly one instruction in five is a branch - this means there’s a pretty good chance that most of our bundles contain a branch. It also means that decoding bigger bundles (more instructions per clock) may not buy us a lot more performance; in fact our 4-8 may well be a sweet spot for this sort of parallelism.
So for our system, the thing we’re trying to predict is the behavior of decode-bundles rather than that of individual instructions. It’s entirely possible that there’s more than one branch instruction in a decode-bundle - in fact there could be up to 8. Luckily, all we have to predict is the first branch that’s taken in each bundle, not every branch in the bundle.
For every decode-bundle that we’re currently reading from the icache we want to predict:
- is there a branch in the decode-bundle currently being fetched?
- what is that branch’s offset within the bundle? (so we don’t decode the instructions after it)
- will it be taken?
- which decode-bundle to predict next - either a branch target or the next decode-bundle in memory.
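Conceptually, then, each BTC lookup hands the fetch stage a record something like the following (the shape and field names here are our illustration, not the actual RTL):

```python
from dataclasses import dataclass

@dataclass
class BundlePrediction:
    hit: bool           # does the BTC know anything about this bundle?
    taken: bool         # is some branch in the bundle predicted taken?
    branch_offset: int  # 16-bit-parcel index of that branch in the bundle
    next_pc: int        # predicted fetch address for the next clock

def next_fetch_pc(pred, pc):
    # Fall through to the next naturally aligned 16-byte bundle unless
    # the BTC predicts a taken branch out of this one.
    return pred.next_pc if pred.hit and pred.taken else (pc & ~0xF) + 16
```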
Our fetch/decode system looks like this (red lines are clock pipeline stages):
During every clock we use the PC to fetch a decode-bundle from the icache. We want to use this PC to predict the PC to fetch from in the next clock, but we won’t know anything about the contents of the data being fetched until one clock later (during instruction decode - between the lower two red lines in the diagram above).
Our instruction decoder is capable of doing some work to help us make sense of the instruction stream. It runs in two modes, starting from the instruction at the offset into the bundle given by the fetch PC’s lower bits; what it does depends on whether the BTC gave us a prediction:
- unpredicted mode - it stops decoding at the first unconditional branch, or the first backwards-jumping conditional branch
- predicted mode - given a bit mask with one predicted-branch bit set, it stops at that offset if the instruction there is a branch, or at any earlier unconditional branch
The decoders also return the branch destination for any non-indirect branch, and the offset (if any) of the branch that was taken.
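In software the two stopping rules might look like this sketch (the per-instruction flags are invented names for information the decoders already work out):

```python
from dataclasses import dataclass

@dataclass
class DecodedInsn:
    unconditional: bool  # j/jal etc - always leaves the bundle
    cond_branch: bool    # beqz/bnez etc
    backwards: bool      # target address is below the branch's PC

def stop_offset(insns, predicted_offset=None):
    """Return the parcel offset we stop decoding at, or None if the
    whole bundle falls through. `insns` is a list of (offset, insn)
    pairs starting from the fetch PC's offset into the bundle."""
    for off, insn in insns:
        if insn.unconditional:
            return off                     # both modes stop here
        if predicted_offset is None:
            # unpredicted mode: statically take a backwards conditional
            # branch (it's probably a loop)
            if insn.cond_branch and insn.backwards:
                return off
        elif off == predicted_offset and insn.cond_branch:
            return off                     # predicted mode: stop as told
    return None
```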
In unpredicted mode we can discover an initial prediction for a decode-bundle and enter it into the BTC; however it takes 2 clocks (cache fetch and decode time) before we can choose the next decode-bundle - we depend on the BTC to help us avoid this delay. We call its output a ‘micro-prediction’ because a miss costs us a single clock, rather than the 5+ clocks that a misprediction resolved in an ALU costs.
At the end of a decode clock we can check whether the decoder’s branch destination matches the prediction we made a clock earlier of what the next decode-bundle’s address should be. If the address we’ve just fetched doesn’t match the currently decoding bundle’s output, we take a micro-prediction miss and refetch. This can happen when a bundle is first accessed, due to aliasing in the BTC, or if code is being changed in memory on the fly. We also check, and update, the prediction of which instruction in the decode-bundle branched.
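The timing is easiest to see in a toy model of the two-stage fetch/decode loop (the names and structure are ours; `btc_next` and `true_next` stand in for the BTC and the decoder’s computed targets):

```python
def run_fetch(btc_next, true_next, start_pc, n_clocks):
    """Each clock we fetch the bundle at `pc` while decoding the bundle
    fetched the previous clock. Decode reveals the true next address;
    a mismatch squashes the in-flight fetch - a one-clock bubble."""
    pc, decoding, done = start_pc, None, []
    for _ in range(n_clocks):
        if decoding is not None:
            done.append(decoding)                  # decode completes
            actual = true_next[decoding]
            if actual != pc:                       # micro-prediction miss
                btc_next[decoding] = actual        # train the BTC
                pc, decoding = actual, None        # squash and refetch
                continue
        decoding = pc                              # bundle entering decode
        pc = btc_next.get(pc, (pc & ~0xF) + 16)    # BTC hit or fall-through
    return done

# A one-bundle loop that always branches back to itself: with a cold
# BTC the first iteration pays a one-clock micro-miss, after which
# bundles stream at one per clock.
print(run_fetch({}, {0x70: 0x70}, 0x70, 6))   # [0x70, 0x70, 0x70, 0x70]
```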
Global History
Our BTC uses a global history predictor - traditionally this keeps a bitmask of the N most recent branches, recording whether each was taken or not - but we’re not predicting instructions, we’re predicting decode-bundles.
Consider the following piece of code - it’s made up of 2 decode-bundles and loops 50 times, taking the branch at 7a: on every second iteration and finally taking the branch at 74: when it’s finished:
```
p3:
  70:   03200513    li      a0,50          <- bundle 1
  74:   c901        beqz    a0,84 <p3+0x14>
  76:   00157593    andi    a1,a0,1
  7a:   e199        bnez    a1,80 <p3+0x10>
  7c:   157d        addi    a0,a0,-1
  7e:   bfdd        j       74 <p3+0x4>
  80:   157d        addi    a0,a0,-1       <- bundle 2
  82:   bfcd        j       74 <p3+0x4>
  84:   8082        ret
```
The first bundle contains 3 branches - and no matter what happens, one of those branches is taken every time around the loop. As a source of global history, “always branches” is pretty useless for predicting which of the three will be taken next time - we need something in the global history vector to help distinguish these different branches.
We need some information telling us which branch was taken each time we decoded the first bundle (it’s either the conditional one at 7a: or the unconditional one at 7e:, except at the end of the loop when it’s the one at 74:) - remember that the branch predictor predicts:
- whether a branch is taken at all
- branch destination address
- branch offset within its source bundle
That last piece tells us which branch in the bundle to take.
Instead of using just a single bit of taken/not-taken history per branch instruction, our implementation uses 4 bits for each decode-bundle (a ‘taken’ bit plus an index into the bundle of the branch that was taken - the same “branch offset within its source bundle” that the branch predictor tries to predict). This makes for largish history vectors, but we can cheaply hash them down to something smaller to match the sizes of the global tables.
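So where a classic global history shifts in one taken/not-taken bit per branch, ours shifts in four bits per decode-bundle; roughly like this sketch (the history length and folding hash are illustrative assumptions):

```python
HIST_BUNDLES = 8                 # bundles of history kept (illustrative)
HIST_MASK = (1 << (4 * HIST_BUNDLES)) - 1

def update_history(ghist, taken, branch_offset):
    """Shift 4 bits per bundle into the global history: a 'taken' bit
    plus the 3-bit parcel offset of the taken branch (a 16-byte bundle
    has 8 possible 16-bit parcel offsets)."""
    nibble = (int(taken) << 3) | (branch_offset & 0b111)
    return ((ghist << 4) | nibble) & HIST_MASK

def gshare_index(pc, ghist, table_bits=12):
    """Fold the wide history down and hash it with the PC to index the
    global prediction table (the exact hash is an assumption)."""
    folded, h = 0, ghist
    while h:
        folded ^= h & ((1 << table_bits) - 1)
        h >>= table_bits
    return ((pc >> 4) ^ folded) & ((1 << table_bits) - 1)
```

Applied to the loop above: a taken branch at 7a: (parcel offset 5) shifts in 0b1101, one at 7e: (offset 7) shifts in 0b1111, and the final exit at 74: (offset 2) shifts in 0b1010 - so the alternating iterations leave a distinctive pattern in the history where plain taken bits would just be a string of 1s.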
To Recap
The important ideas here:
- we decode large bundles of many instructions every clock
- we predict bundles not instructions
- our BTC’s global history includes the bundle offsets of the sources of branches taken
Next time: BTC (Part 2) Living in a Speculative World
[1] McFarling, Scott (June 1993). “Combining Branch Predictors”. Digital Western Research Laboratory (WRL) Technical Note TN-36.