Vroom! blog - Core VRoom! Architecture

Introduction

This week we’re going to try and explain as simply as possible how our core architecture works. Here’s an overview of the system:

placeholder

The core structure is a first-in-first-out queue called the ‘commitQ’. Instruction-bundles are inserted in-order at one end, and removed in-order at the other end once they have been committed.

While in the commitQ instructions can be executed in any order (with some limits like memory aliasing for load/store instructions). At any time, if a branch instruction is discovered to have been wrongly predicted, the instructions following it in the queue will be discarded.

A note on terminology - previously I’ve referred to ‘decode-bundles’, which are a bunch of bytes fetched from the i-cache and fed to the decoders. Here we talk about ‘instruction-bundles’, which are bunches of data passed around through the CPU representing an instruction and what it can do - data in an instruction-bundle includes its PC, its input/output registers, immediate constants, which ALU type it needs to be fed to, what operation will be performed on it there, etc.

You can think of the decoders as taking a decode-bundle and producing 1-8 instruction-bundles per clock to feed to the renamer.

Registers and Register Renaming

We have a split register file:

placeholder

On the right we have the 31 architectural integer registers and the 32 floating point registers. If the commitQ is empty (after a trap for example) then they will contain the exact state of the machine.

The commit registers are each statically bound to one commitQ entry.

When an instruction-bundle leaves the decode stage it contains the architectural register numbers of its source and destination registers. The renamer pipe stage assigns a commitQ entry (and therefore a commit register) to each instruction and tags its output register to be that commit register.

The renamer also keeps a table of the latest commit register that each architectural register is bound to (if any) - it uses this table to point each source register to either a particular commit register or an architectural register.
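
To make the renaming step concrete, here is a minimal Python sketch of the idea. The names, data shapes and the software model itself are illustrative only (the real design is hardware, expressed in RTL), but it shows sources being pointed at either a commit register or an architectural register, and the destination being tagged with the assigned commitQ entry:

	from dataclasses import dataclass, field

	@dataclass
	class Insn:
	    src_regs: list                  # architectural source register numbers
	    dst_reg: int = 0                # architectural destination register (0 = x0 / none)
	    sources: list = field(default_factory=list)
	    dest: tuple = None

	class Renamer:
	    def __init__(self):
	        # architectural register -> commitQ entry currently producing it
	        self.binding = {}

	    def rename(self, insn, cq_index):
	        # each source points at either a commit register or an architectural one
	        insn.sources = [("commit", self.binding[r]) if r in self.binding
	                        else ("arch", r) for r in insn.src_regs]
	        if insn.dst_reg:                        # x0 is never renamed
	            insn.dest = ("commit", cq_index)    # output goes to this entry's commit register
	            self.binding[insn.dst_reg] = cq_index
	        return insn

For example, rename(Insn([1, 2], 3), 17) would point the instruction’s sources at whatever currently produces x1 and x2, and bind x3 to commit register 17 so that later instructions read it from there.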

Execution

Once an instruction-bundle is in the commitQ it can’t be assigned an ALU until each of its source registers is either an architectural register or a commit register that has reached the ‘completed’ state (ie its value is either already in the commit register, or is being written to it and can be bypassed). Once a commitQ entry is in this state it goes into contention to be assigned an ALU and will execute ASAP.

Once a commitQ entry has been executed its result goes into its associated commit register, and the commitQ entry waits until it can be committed.

Each commitQ entry continually watches the state of its source registers; if one of them moves from the commitQ to the architectural registers it updates the source register it will use when executing.
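
Continuing the same toy model (using the same illustrative data shapes as the rename sketch above, not the real RTL), the issue check and the source update look roughly like this:

	def ready_to_issue(entry, commitq):
	    # every source must be an architectural register or a completed commit register
	    for kind, idx in entry.sources:
	        if kind == "commit" and not commitq[idx].completed:
	            return False
	    return True

	def on_retire(commitq, retired_index, arch_reg):
	    # the retired entry's commit register has been copied to arch_reg;
	    # any consumer still in the queue re-points that source at the architectural file
	    for entry in commitq:
	        entry.sources = [("arch", arch_reg) if s == ("commit", retired_index) else s
	                         for s in entry.sources]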

Completion

On every clock the CPU looks at the first 8 entries in the commitQ; it takes the first N contiguous entries that are completed, marks them ‘committed’ and removes them from the queue.

Committing an entry causes its associated commit register to be written back into the architectural registers (if multiple commit entries write the same architectural register only the last one succeeds). It also releases any store instructions in the load-store unit’s write buffer to be actually written to caches or IO space.
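
A rough software model of that commit step (again just an illustration, with made-up field names rather than the real design) might look like:

	def commit_step(commitq, arch_regs, window=8):
	    # take the leading contiguous run of completed entries, up to `window` of them
	    n = 0
	    while n < min(window, len(commitq)) and commitq[n].completed:
	        n += 1
	    for entry in commitq[:n]:
	        if entry.arch_dst:                       # write back; the last writer of a register wins
	            arch_regs[entry.arch_dst] = entry.result
	        entry.release_stores()                   # buffered stores may now drain to cache / IO
	    del commitq[:n]
	    return n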

Speculative Misses

As mentioned above when a branch instruction hits a branch ALU and is discovered to have been mispredicted - either a conditional branch where the condition was mispredicted, or an indirect branch where the destination was mispredicted - then the instructions in the commitQ after the mispredicted instruction are discarded and the instruction fetcher starts filling again from that new address.

While this is happening the renamer must update its state to reflect the newly truncated commitQ.
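
In the same toy model, the misprediction path amounts to truncating the queue, dropping rename bindings that pointed into the discarded tail, and redirecting fetch (fetcher.redirect here is a made-up hook, not a real interface):

	def mispredict(commitq, renamer, branch_index, new_pc, fetcher):
	    del commitq[branch_index + 1:]               # discard everything after the branch
	    renamer.binding = {r: idx for r, idx in renamer.binding.items()
	                       if idx <= branch_index}   # forget bindings to discarded entries
	    fetcher.redirect(new_pc)                     # start fetching down the correct path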

Branch instructions effectively act as barriers in the commitQ: until they are resolved into the completed state (ie their speculative state has been resolved) instructions after them cannot be committed - which means that their commit registers can’t be written into the architectural registers, and data waiting to be stored into cache or main memory can’t be written.

However, instructions after an unresolved branch can still be executed (out of order) using the results of other speculative operations taken from their commit registers. Speculative store instructions can write data into the load-store unit’s write queue, and speculative loads can also snoop that data from the write queue and from the cache - I’ll talk more about this in another blog post.

Also branches can be executed out of order resulting in smaller/later portions of the commitQ being discarded.

Traps and Interrupts

Finally - the CSR ALU (containing the CSR registers and the trap and interrupt logic) is special - only an instruction at the very beginning of the commitQ can enter the CSR unit. That effectively means only one CSR/trap instruction can be committed in that clock (actually the next clock); this acts as a synchronising mechanism.

Load/store traps (MMU/PMAP/etc) are generated by converting a load/store tagged instruction in the commitQ into a trap instruction.

Fetch traps (again MMU/PMAP but also invalid instructions) are injected into the commitQ by the instruction fetcher and decoder which then stalls.

Interrupts are treated much the same as fetch traps: they’re injected into the instruction stream, which then stops fetching.

When any of these hit the CSR unit all instructions prior to them have been executed and committed. They then trigger a full commitQ flush, throwing away all subsequent instructions in the queue, and tell the fetcher to start fetching from a new address.

Some other CSR ALU operations also generate commitQ flushes: some of the MMU flush operations, loading page tables, fence.i etc. We also optimise interrupts that become pending when a register is written to unmask them, treating them synchronously (so that we don’t end up executing a few instructions after changing the state).

To Recap

The important ideas here:

  • we have two register files: one for speculative results, one for the architectural registers
  • our commitQ is an in-order list of instructions to execute
  • each commitQ entry has an associated output register for its speculative result
  • commitQ instructions are executed in any order
  • however instructions are committed/retired in chunks, in order
  • at that point their results are copied from the speculative commit registers to the architectural registers
  • badly speculated branches cause the parts of the commitQ after them to be discarded

Next time: Memory Layout

VRoom! blog - Branch Target Cache [BTC] (part 3) Managing a speculative subroutine call stack

This is the third of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. This week a shorter update, we’re going to talk about how we can create speculative entries in the Branch Target Cache (BTC) call-return stack. A quick reminder of some of what we learned in the previous blog.

  • we decode large bundles of many instructions every clock
  • we predict bundles not instructions
  • we maintain a queue of pending uncommitted BTC predictions
  • each entry overrides the ones after it and the main tables

Introduction

Last time we talked about how we manage speculated branch target cache (BTC) entries. In short, we have a queue of speculated BTC updates in front of the normal BTC tables; those entries are discarded or updated when a speculated branch is found to have been mispredicted, and retired into the main tables when a corresponding branch (or rather bundle of instructions) is committed.

placeholder

Subroutine call stack

In addition to the standard BTC tables we also have a per-access-mode call-return stack. RISC-V defines for us the instructions that should be used to infer these (and which should not). In the BTC we provide a 32-entry stack (smaller for M-mode) of the last 32 subroutine calls so that we can predict the return address when decoding ret instructions (if we get it wrong it’s not the end of the world, it will be corrected when the ret instruction hits the branch ALU).

Initially we thought we’d just tag branches with a pointer to the top-of-stack, and wind it back when we got a misprediction; then we found a case which turned out to have a particularly heavy impact on performance:


	.....
	call	a
r1:	call	b
r2:	....

a:	bnez	a0, 1f		< branch mispredicted
	ret
1:	....

b:	....

Essentially what happens here is that we speculatively fetch the following instruction stream:

	Instruction		call stack
	===========		==========
				

	call	a		r1 -
	bnez 	a0, 1f		r1 -	< remembers TOS
	ret			-	< everything from here on is 
	call	b		r2 -	  mispredicted
	....				

Sometime later the bnez is resolved in the branch ALU and discovered to have been mispredicted. But by that point the return address ‘r1’ on the call stack has been popped and replaced with ‘r2’; even though we got the top of stack (TOS) correct, its contents are wrong, and can’t be undone when we discover the misprediction. This means that at some later time when we finally do return from subroutine ‘a’ the prediction will be to ‘r2’ rather than ‘r1’ - this in turn will result in another misprediction when that return instruction hits the branch ALU, which is exactly what we were trying to avoid.

Solution

Our solution is to take a similar approach to the one we took for the BTC tables and put a queue of prediction history in front of the call/return stack; in this case it’s a queue of push and pop actions. Instead of being indexed by hashes of the PC/global history, it’s simply the TOS address that’s used. Top of stack always returns the latest push that matches the current TOS (and the top of the actual backing stack otherwise).

Like the BTC speculative queue mentioned in the previous blog entry, this queue is managed by the same bundle tag we attach to instructions in the commitQ - when a mispredicted branch instruction is discovered the return stack speculative queue is truncated, removing entries that correspond to the discarded instruction bundles, and the TOS is updated to the oldest retained entry.

Similarly, when the last instruction in an instruction bundle is committed the related call-return stack speculative queue entries are tagged ‘committed’, and then written back in order into the call-return stack.
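
Here’s a small Python model of that arrangement. It’s only a sketch under some simplifying assumptions (bundle tags are monotonically increasing integers, one push or pop per call/return, and my own field names), but it shows the push/pop queue, the TOS-matching lookup, the truncation on a misprediction, and the in-order write-back on commit:

	from collections import deque

	class SpecReturnStack:
	    def __init__(self, depth=32):
	        self.stack = [0] * depth          # committed backing stack
	        self.tos = 0                      # committed top-of-stack index
	        self.spec = deque()               # speculative (tag, op, tos_after, addr) actions

	    def _spec_tos(self):
	        return self.spec[-1][2] if self.spec else self.tos

	    def push(self, tag, ret_addr):        # a predicted call
	        tos_after = (self._spec_tos() + 1) % len(self.stack)
	        self.spec.append((tag, "push", tos_after, ret_addr))

	    def pop(self, tag):                   # a predicted return: predict, then record the pop
	        addr = self.predict_return()
	        tos_after = (self._spec_tos() - 1) % len(self.stack)
	        self.spec.append((tag, "pop", tos_after, None))
	        return addr

	    def predict_return(self):
	        # the latest speculative push matching the current TOS wins, else the backing stack
	        tos = self._spec_tos()
	        for tag, op, tos_after, addr in reversed(self.spec):
	            if op == "push" and tos_after == tos:
	                return addr
	        return self.stack[tos]

	    def flush_after(self, last_good_tag): # misprediction: drop actions from discarded bundles
	        while self.spec and self.spec[-1][0] > last_good_tag:
	            self.spec.pop()

	    def commit(self, tag):                # bundle committed: retire its actions in order
	        while self.spec and self.spec[0][0] <= tag:
	            _, op, tos_after, addr = self.spec.popleft()
	            self.tos = tos_after
	            if op == "push":
	                self.stack[tos_after] = addr

In the example above, the flush would discard the speculative pop and the push of ‘r2’, so a later return from subroutine ‘a’ still predicts ‘r1’.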

To Recap

The important ideas here:

  • we have call-return stacks to predict the address of return instructions
  • we maintain a queue of speculative push/pop transitions
  • each entry overrides the ones after it and the main tables
  • the queue is flushed on a misprediction
  • committed data is pushed into the main tables

Next time: How our main pipe works - an overview

VRoom! blog - Branch Target Cache [BTC] (part 2) Living in a Speculative World

This is the second of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. This week we’re going to talk about how we can create speculative entries in the Branch Target Cache (BTC). A quick reminder of some of what we learned in the previous blog.

  • we decode large bundles of many instructions every clock
  • we predict bundles not instructions

System Architecture

Let’s have a look at an overview of our system architecture:

placeholder

Instructions are fetched from the instruction cache (I$1), decoded, renamed to commit registers and entered into the commitQ. Once in the commitQ instructions can be executed in any order and even after they are executed they can be discarded by a trap or a preceding mispredicted branch.

At the end of the commitQ are committed instructions: when they and all the instructions before them are done, all branches are correctly predicted, and there are no pending traps, the CPU will retire up to 8 committed instructions per clock.

BTC Operations

Our BTC contains tables that are accessed using hashes constructed from the program counter (PC) and a global history. On a traditional system the BTC tables and global history would be updated on every clock. In VRoom! we can have many instructions outstanding, often speculative instructions that may never be committed - when a branch instruction is processed by the branch ALU and is discovered to have been speculated incorrectly it can trigger the discard of subsequent speculative instructions (including other branch instructions); branch instructions can also be processed out of order.

Processing branch instructions is important: we want to use branch mispredictions to update the BTC tables - if we assumed that a branch would be taken and it wasn’t, we need to back out and change the value that we had previously set them to.

A simple solution (and the one that we originally tried) was to wait until the commit stage of the CPU and then only use instructions that have been committed (and branches that have been resolved) to update the BTC tables. It works, but it’s very slow to update (remember that a big version of our CPU can have ~150 instructions in flight at once). Consider something like this:

clear:
1:	sd	x0, (a0)
	add	a0, a0, 8
	add	a1, a1, -1
	bnez	a1, 1b
	ret

The core of this is a single decode-bundle that always loops and contains 4 instructions. The CPU may push 150/4 = 30+ bundles - 30 times around the loop - before we can start updating the BTC. This is particularly a problem with global history predictors, which to be useful really need to be updated on every prediction.

A better solution

Let’s look at how our BTC works:

placeholder

The PC and the global history (a bit vector) are hashed (just some xors and bit swapping) into indexes that are used to look up the 3 BTC tables. The combined output is used to choose whether the bimodal or global history predictor is best for a particular bundle.

Our solution to the above problems is to create a ‘pending prediction’ queue in front of the normal BTC tables. Each entry contains the information about a bundle prediction, including the PC used for it and the global history at the point that it was created. Newer predictions are performed by looking into the queue from most recent to oldest for each of the 3 hashes (individually and independently), comparing them with the hashes of the PC and global history recorded in each entry; if no match is found then the value from the main tables is used.
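
As a rough illustration (structure and names simplified, not the actual hardware), the lookup might be modelled like this - each pending entry records the three hashes it was created with and the values it predicted:

	def btc_lookup(pc, ghist, pending, tables, hashers):
	    # hashers: per-table hash functions of (pc, global history)
	    result = {}
	    for which in ("bimodal", "global", "combined"):
	        h = hashers[which](pc, ghist)
	        for entry in reversed(pending):           # newest speculative entry first
	            if entry.hashes[which] == h:
	                result[which] = entry.values[which]
	                break
	        else:
	            result[which] = tables[which][h]      # fall back to the main table
	    # the combined predictor chooses which of the other two to believe
	    return result["global"] if result["combined"] else result["bimodal"]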

placeholder

The per-bundle BTC pending prediction queue acts similarly to the main per-instruction commitQ - in this case each instruction in the commitQ carries a reference to the decode-bundle it came from.

Actions

If a branch instruction in the commitQ is discovered to be mispredicted, all the instructions following it are discarded. At the same time entries in the BTC prediction queue that are newer than the bundle corresponding to the mispredicted instruction are also discarded, and the prediction queue entry that does correspond to the mispredicted branch has its data updated. Finally, that prediction queue entry’s copy of the global history, along with the new branch information, is used to reload the global history vector used for the next BTC prediction.

Finally, each BTC pending queue entry has a ‘committed’ bit - it gets set when all the instructions in the corresponding bundle hit the committed state in the commitQ, which means none of them will be mispredicted (or any that were have already been resolved). On every clock, if the oldest entry in the BTC pending queue has its committed bit set then its data is copied back into the main BTC tables.
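
Under the same simplified model as the lookup sketch above, retiring the oldest committed entry back into the main tables is just:

	def btc_retire(pending, tables):
	    # once the oldest pending entry's bundle has fully committed,
	    # copy its values into the main tables and drop it from the queue
	    if pending and pending[0].committed:
	        entry = pending.popleft()
	        for which in ("bimodal", "global", "combined"):
	            tables[which][entry.hashes[which]] = entry.values[which]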

To Recap

The important ideas here:

  • we maintain a queue of pending uncommitted BTC predictions
  • each entry overrides the ones after it and the main tables
  • the queue is flushed and the top entry updated on a misprediction
  • committed data is pushed into the main tables

Next time: BTC (Part 3) Predicting Subroutine Returns

VRoom! Blog - Branch Target Cache [BTC] (part 1) Predicting Multiple Branches per Clock

This is going to be the first of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. I’m going to start with issues around the Branch Target Caches (BTC). Until recently our BTC has been pretty broken - this was a good thing as it forced our core to exercise its partial pipeline shootdown logic a lot - experience tells us that this is where one finds the hardest bugs ….. Now that it’s fixed and working let’s talk a bit about how it works.

Background

First some background - modern high-end CPUs speculatively execute instructions across conditional branches. To get high performance, CPUs need to be able to correctly predict a large proportion of branches - predicting a branch wrongly can cost 5-10 clocks (it’s actually worse in multi-issue CPUs like ours).

There’s a whole lot of research around the design of BTCs and predictors; we currently use a combined predictor (see McFarling’s “Combining Branch Predictors” [1]). It uses a pair of different predictors plus a chooser (a rough sketch in code follows the list below):

  • the first is a bimodal predictor, which tries to predict branches that mostly go one way; it uses a hash of the program counter (PC) to index into a prediction table
  • the second is a global history predictor, it uses a hash of the PC and a history of recent branches to index a second prediction table, this is better for predicting branches that have some contextual pattern, and
  • finally a combined predictor uses a table indexed by a hash of the PC to predict which of the first two predictors is best for a particular branch.
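
Here’s a toy Python version of that combined scheme - table sizes, hashes and counter widths here are placeholders rather than VRoom!’s actual parameters, but it shows how the chooser arbitrates between the two predictors:

	SIZE = 4096                      # placeholder table size

	bimodal = [1] * SIZE             # 2-bit saturating counters, >=2 means "predict taken"
	gshare  = [1] * SIZE
	chooser = [1] * SIZE             # >=2 means "trust the global-history predictor"
	history = 0                      # global branch history register

	def bump(table, i, up):          # saturating 2-bit counter update
	    table[i] = min(3, table[i] + 1) if up else max(0, table[i] - 1)

	def predict(pc):
	    bi = bimodal[pc % SIZE] >= 2
	    gl = gshare[(pc ^ history) % SIZE] >= 2
	    return gl if chooser[pc % SIZE] >= 2 else bi

	def update(pc, taken):
	    global history
	    bi = bimodal[pc % SIZE] >= 2
	    gl = gshare[(pc ^ history) % SIZE] >= 2
	    if bi != gl:                 # train the chooser toward whichever predictor was right
	        bump(chooser, pc % SIZE, gl == taken)
	    bump(bimodal, pc % SIZE, taken)
	    bump(gshare, (pc ^ history) % SIZE, taken)
	    history = ((history << 1) | int(taken)) & (SIZE - 1)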

placeholder

There’s nothing particularly special about using this design, these days it’s relatively conservative - McFarling suggests it should predict 97% of branches correctly. We’ll talk more about how the tables are updated in the next blog post.

Bundles versus instructions

On a more traditional single-instruction system, prediction would be applied to every instruction in order. Our system though is different - we decode up to 8 instructions per clock - we read a 128-bit (16-byte) naturally aligned ‘decode-bundle’ on every clock. With the RISC-V C (compressed) instruction extension, instructions can be 16 or 32 bits long and are aligned on 16-bit boundaries. So each of our decode-bundles can contain from 4-8 instructions. Depending on where we enter the bundle and whether a branch causes us to leave early we may actually end up decoding between 1 and 8 instructions per clock.

There’s an architectural rule of thumb that (very!) roughly one instruction in five is a branch - this means that there’s a pretty good chance that most of our bundles contain a branch. It also means that decoding bigger bundles (more instructions per clock) may not give us a lot more performance; in fact our 4-8 may well be a sweet spot for this sort of parallelism.

So for our system the thing that we are trying to predict is the behavior of decode-bundles rather than that of individual instructions. It’s entirely possible that there is more than one branch instruction in a decode-bundle - in fact there could be up to 8 - luckily all we have to predict is the first branch that’s taken in each bundle, not every branch in the bundle.

For every decode-bundle that we’re currently reading from the icache we want to predict (see the sketch after this list):

  • is there a branch in the decode-bundle currently being fetched?
  • what is the instruction’s offset within the bundle? (so we don’t decode subsequent instructions)
  • will it be taken?
  • which decode-bundle to predict next - either a branch target or the next decode-bundle in memory.
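
One way to picture the output of a bundle prediction is as a small record per fetched decode-bundle (the field names are mine, not the RTL’s):

	from dataclasses import dataclass

	@dataclass
	class BundlePrediction:
	    has_branch: bool     # is a taken branch predicted in this decode-bundle?
	    offset: int          # 16-bit-unit offset of that branch within the bundle
	    taken: bool          # is it predicted taken?
	    next_pc: int         # predicted branch target, or the next sequential bundle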

Our fetch/decode system looks like this (red lines are clock pipeline stages):

placeholder

During every clock we use the PC to fetch a decode-bundle from the icache; we want to use this PC to predict the PC to use in the next clock, but we won’t know anything about the contents of the data being fetched until one clock later (during the instruction decode - between the lower two red lines in the diagram above).

Our instruction decoder is capable of doing some work to help us make sense of the instruction stream. It runs in two modes, starting from the instruction at the offset into the bundle given by the fetch PC’s lower bits; what it does depends on whether the BTC gave us a prediction:

  • unpredicted mode - it stops decoding at the first unconditional or backwards jumping conditional branch
  • predicted mode - with a bit mask containing one predicted branch bit it will stop at that offset if it’s a branch, or it will stop at any earlier unconditional branch

The decoders also return the branch destination for any non-indirect branch, and the offset (if any) of the branch that was taken.

In unpredicted mode we can discover an initial prediction for a decode-bundle and enter it into the BTC; however it takes 2 clocks (cache fetch and decode time) before we can choose the next decode-bundle - we depend on the BTC to help us avoid this. We call this a ‘micro-prediction’ because a miss costs us a single clock rather than the 5+ clocks that a misprediction resolved in an ALU costs.

At the end of a decode clock we can check whether the decoder’s branch destination matches the prediction we had made a clock before of what the next decode-bundle’s address should be. If the PC address we’ve just fetched doesn’t match the currently decoding bundle’s output we take a micro-prediction miss and refetch it. This can happen when a bundle is first accessed, due to aliasing in the BTC, and if code is being changed in memory on the fly. We also check and update the prediction of which instruction in the decode-bundle branched.

Global History

Our BTC uses a global history predictor - traditionally this contains a bitmask of the N most recent branches, whether they were taken or not - but we’re not predicting instructions, we’re predicting decode-bundles.

Consider the following piece of code - it’s made up of 2 decode-bundles; it loops 50 times, taking the branch at 7a: on every second loop and finally taking the branch at 74: when it’s finished:

p3:
  70:   03200513                li      a0,50		<- bundle 1
  74:   c901                    beqz    a0,84 <p3+0x14>
  76:   00157593                andi    a1,a0,1
  7a:   e199                    bnez    a1,80 <p3+0x10>
  7c:   157d                    addi    a0,a0,-1
  7e:   bfdd                    j       74 <p3+0x4>

  80:   157d                    addi    a0,a0,-1	<- bundle 2
  82:   bfcd                    j       74 <p3+0x4>
  84:   8082                    ret

The first bundle contains 3 branches - no matter what happens, one of those branches is taken every time around the loop. As a source of global history “always branches” is pretty useless for creating a history to predict which of the three to take next time, because we need something in the global history vector to help distinguish these different branches.

We need some information to tell us which of the branches is taken each time we decode the first bundle (it’s either the conditional one at 7a:, or the unconditional one at 7e:, except at the end of the loop when it’s 74:) - remember that the branch predictor predicts:

  • whether a branch is taken at all
  • branch destination address
  • branch offset within its source bundle

That last piece tells us which branch in the bundle to take.

Instead of just using a single bit of taken/not-taken history per branch instruction, our implementation uses 4 bits for each decode-bundle (a ‘taken’ bit and an index into the bundle of the branch that was taken - the same “branch offset within its source bundle” that the branch predictor tries to predict) - this makes for largish history vectors but we can cheaply hash them into something smaller to match the sizes of the global tables.
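
As an illustration of how such a per-bundle history might be maintained and folded down (the widths and the folding hash here are invented for the example, not VRoom!’s actual ones):

	HIST_BITS = 64                   # width of the wide per-bundle history vector (illustrative)

	def update_history(history, taken, branch_offset):
	    # shift in 4 bits per decode-bundle: a 'taken' bit plus the 3-bit offset
	    # of the branch that was taken within the bundle (a zero nibble if none taken)
	    nibble = (0x8 | (branch_offset & 0x7)) if taken else 0
	    return ((history << 4) | nibble) & ((1 << HIST_BITS) - 1)

	def fold_history(history, pc, index_bits=12):
	    # cheaply fold the wide history and the PC down to a table index
	    h = history ^ (pc >> 1)
	    index = 0
	    while h:
	        index ^= h & ((1 << index_bits) - 1)
	        h >>= index_bits
	    return index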

To Recap

The important ideas here:

  • we decode large bundles of many instructions every clock
  • we predict bundles not instructions
  • our BTC’s global history includes the bundle offsets of the sources of branches taken

Next time: BTC (Part 2) Living in a Speculative World

[1] McFarling, Scott (June 1993). “Combining Branch Predictors” (PDF). Digital Western Research Lab (WRL) Technical Report, TN-36.

Blog - Introducing Vroom!

placeholder

Executive Summary

  • Very high end RISC-V implementation – goal cloud server class
  • Out of order, super scalar, speculative
  • RV64-IMAFDCHB(V)
  • Up to 8 IPC (instructions per clock) peak, goal ~4 average on ALU heavy work
  • 2-way simultaneous multithreading capable
  • Multi-core
  • Early (low) Dhrystone numbers: ~3.6 DMips/MHz - still a work in progress. Goal ~4-5
  • Currently boots Linux on an AWS-FPGA instance
  • GPL3 – dual licensing possible

Downloads

VRoom! is hosted with GitHub. Head to the GitHub repository for downloads.

Licensing

VRoom! is currently licensed GPL3. We recognize that for many reasons one cannot practically build a large GPL3’d chip design - VRoom! is also available to be commercially licensed.