VRoom! blog - Branch Target Cache [BTC] (part 2): Living in a Speculative World

This is the second of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. This week we’re going to talk about how we can create speculative entries in the Branch Target Cache (BTC). A quick reminder of some of what we learned in the previous blog.

System Architecture

Let’s have a look at an overview of our system architecture:

[Figure: VRoom! system architecture overview]

Instructions are fetched from the instruction cache (I$1), decoded, renamed to commit registers, and entered into the commitQ. Once in the commitQ, instructions can be executed in any order; even after they have executed they can still be discarded by a trap or a preceding mispredicted branch.
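
As a rough mental model, here is a minimal C sketch of the state a commitQ entry might carry - the field names are hypothetical, not VRoom!'s actual RTL. Note that each entry remembers which decode-bundle it came from; we'll rely on that later.

#include <stdint.h>

/* hypothetical sketch of one commitQ entry */
typedef struct {
    uint64_t pc;         /* instruction address                      */
    uint16_t bundle;     /* decode-bundle this instruction came from */
    uint8_t  rd;         /* renamed commit-register destination      */
    uint8_t  done;       /* has executed (possibly out of order)     */
    uint8_t  discarded;  /* killed by a trap or mispredicted branch  */
} commitq_entry_t;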

At the end of the commitQ are committed instructions: when they and all the instructions before them are done, all branches are correctly predicted, and there are no pending traps, the CPU will retire up to 8 committed instructions per clock.

BTC Operations

Our BTC contains tables that are accessed using hashes constructed from the program counter (PC) and a global history. In a traditional system the BTC tables and global history would be updated on every clock. In VRoom! we can have many instructions outstanding, and often they are speculative instructions that may never be committed - when a branch instruction is processed by the branch ALU and discovered to have been mispredicted, it can trigger the discard of subsequent speculative instructions (including other branch instructions); branch instructions can also be processed out of order.

Processing branch instructions matters because we want to use branch mispredictions to update the BTC tables - if we predicted that a branch would be taken and it wasn't, we need to back out and change the values we had previously set the tables to.

A simple solution (and the one that we originally tried) was to wait until the commit stage of the CPU and then only use instructions that have been committed (and branches that have been resolved) to update the BTC tables. It works, but it’s very slow to update (remember that a big version of our CPU can have ~150 instructions in flight at once). Consider something like this:

clear:
1:	sd	x0, 0(a0)
	add	a0, a0, 8
	add	a1, a1, -1
	bnez	a1, 1b
	ret

The core of this is a single decode-bundle that always loops back on itself and contains 4 instructions. The CPU may push 150/4 ≈ 37 bundles - 37 times around the loop - before we can start updating the BTC. This is a particular problem for global-history predictors, which to be useful really need to be updated on every prediction.

A better solution

Let’s look at how our BTC works:

[Figure: BTC lookup - PC and global history hashed into the 3 prediction tables]

The PC and the global history (a bit vector) are hashed (just some xors and bit swapping) into indexes into the 3 BTC tables, which are then looked up in parallel. The combined output is used to choose whether the bimodal or the global-history predictor is best for a particular bundle.
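
Here's a minimal C sketch of that lookup, assuming a classic combined predictor with 2-bit saturating counters - the real hash functions, table sizes and counter formats in VRoom! are different:

#include <stdint.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t bimodal_tab[TABLE_SIZE]; /* 2-bit counters: taken if >= 2 */
static uint8_t global_tab[TABLE_SIZE];
static uint8_t chooser_tab[TABLE_SIZE]; /* picks bimodal vs global       */

/* stand-in hash: the real one is some xors and bit swapping */
static uint32_t hash(uint64_t pc, uint32_t ghist)
{
    return (uint32_t)((pc >> 1) ^ (pc >> 13) ^ ghist) & (TABLE_SIZE - 1u);
}

/* predict one bundle: returns 1 for taken */
int predict(uint64_t pc, uint32_t ghist)
{
    int bimodal = bimodal_tab[hash(pc, 0)] >= 2;    /* PC only      */
    int global  = global_tab[hash(pc, ghist)] >= 2; /* PC + history */
    return (chooser_tab[hash(pc, ghist)] >= 2) ? global : bimodal;
}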

Our solution to the above problems is to place a 'pending prediction' queue in front of the normal BTC tables. Each entry contains the information about one bundle prediction, including the PC used for it and the global history at the point it was created. New predictions are made by searching the queue from most recent to oldest for each of the 3 hashes (individually and independently), comparing them with the hashes generated for each entry; if no match is found then the value from the main tables is used.
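
Continuing the sketch above (the queue depth, field names and circular-buffer layout are all assumptions), the lookup for the bimodal index might go like this - the global and chooser indexes are searched the same way, each independently:

#define QLEN 32                        /* pending-queue depth: a guess  */

typedef struct {
    uint32_t bi_idx, gl_idx, ch_idx;   /* the 3 hashes at predict time  */
    uint8_t  bi_val, gl_val, ch_val;   /* counter values for this entry */
    uint32_t ghist;                    /* global history when created   */
    uint8_t  committed;                /* whole bundle has committed    */
} pending_t;

static pending_t queue[QLEN];
static int head;                       /* position of the oldest entry  */
static int count;                      /* number of entries in use      */

static uint8_t lookup_bimodal(uint32_t idx)
{
    /* walk from the newest pending entry back to the oldest */
    for (int n = count - 1; n >= 0; n--) {
        const pending_t *e = &queue[(head + n) % QLEN];
        if (e->bi_idx == idx)
            return e->bi_val;          /* newest speculative value wins */
    }
    return bimodal_tab[idx];           /* no match: use the main table  */
}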

[Figure: BTC pending prediction queue in front of the main BTC tables]

The per-bundle BTC pending prediction queue acts much like the main per-instruction commitQ - to tie the two together, each instruction in the commitQ carries a reference to the decode-bundle it came from.

Actions

If a branch instruction in the commitQ is discovered to be mispredicted, all the instructions following it are discarded. At the same time, entries in the BTC pending queue that are newer than the bundle corresponding to the mispredicted instruction are also discarded, and the queue entry that does correspond to the mispredicted branch has its data updated. That entry's copy of the global history, along with the new branch outcome, is then used to reload the global history vector used for the next BTC prediction.
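
In terms of the running sketch, recovery might look like this ('pos' - the mispredicted bundle's offset from the oldest queue entry - and the single-bit history update are assumptions; the chooser update is omitted for brevity):

/* 2-bit saturating counter update */
static uint8_t sat(uint8_t c, int taken)
{
    return taken ? (c < 3 ? c + 1 : 3) : (c > 0 ? c - 1 : 0);
}

static uint32_t live_ghist;            /* history used for the next prediction */

void mispredict(int pos, int taken)
{
    count = pos + 1;                   /* discard every newer entry  */
    pending_t *e = &queue[(head + pos) % QLEN];
    e->bi_val = sat(e->bi_val, taken); /* correct the saved counters */
    e->gl_val = sat(e->gl_val, taken);
    live_ghist = (e->ghist << 1) | (uint32_t)taken; /* reload history */
}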

Finally, each BTC pending queue entry has a 'committed' bit - it gets set when all the instructions in the corresponding bundle reach the committed state in the commitQ, which means none of them can now be mispredicted (any misprediction has already been dealt with). On every clock, if the oldest entry in the BTC pending queue has its committed bit set, its data is copied back into the main BTC tables.
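
And the write-back, once per clock, again in terms of the same sketch:

/* retire the oldest pending entry into the main tables once it has committed */
void writeback_one(void)
{
    pending_t *e = &queue[head];
    if (count == 0 || !e->committed)
        return;
    bimodal_tab[e->bi_idx] = e->bi_val;
    global_tab[e->gl_idx]  = e->gl_val;
    chooser_tab[e->ch_idx] = e->ch_val;
    head = (head + 1) % QLEN;          /* pop it off the queue */
    count--;
}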

To Recap

The important ideas here:

- predictions are made against the newest matching entry in the pending queue, so the predictor is effectively updated on every prediction rather than ~150 instructions later
- each pending entry carries the hashes and global history it was made with, so a misprediction can discard the newer entries and restore the history vector exactly
- only entries whose whole bundle has committed are copied back into the main BTC tables, so speculated-away branches never pollute them

Next time: BTC (Part 3) Predicting Subroutine Returns