VRoom! blog - Branch Target Cache [BTC] (part 3) Managing a speculative subroutine call stack

20 Nov 2021

This is the third of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. This week a shorter update, we’re going to talk about how we can create speculative entries in the Branch Target Cache (BTC) call-return stack. A quick reminder of some of what we learned in the previous blog.

we decode large bundles of many instructions every clock
we predict bundles not instructions
we maintain a queue of pending uncommitted BTC predictions
each entry overrides the ones after it and the main tables

Introduction

Last time we talked about how we manage speculated branch target cache (BTC) entries, in short we have a queue of speculated BTC updates in front of the normal BTC tables, those entries are discarded or updated when a speculated branch is found to have been mispredicted, and retired into the main tables when a corresponding branch (or rather bundle of instructions) is committed.

placeholder

Subroutine call stack

In addition to the standard BTC tables we also have a per-access mode call-return stack, RISC-V defines for us the instructions that should be used to infer these (and which not to), in the BTC we provide a 32-entry stack (smaller for M-mode) of the last 32 subroutine calls so we can predict the return address when decoding ret instructions (if we get it wrong it’s not the end of the world, it will be corrected when the ret instruction hits the branch ALU)

Initially we thought we’d just tag branches with a pointer to the top-of stack, and wind it back when we got a misprediction, then we found out a case which turned out to have a particularly heavy impact on performance:


	.....
	call	a
r1:	call	b
r2	....

a:	bnez	a0, 1f		< branch mispredicted
	ret
1:	....

b:	....

essentially what happens here is that we speculatively fetch the following instruction stream:

	Instruction		call stack
	===========		==========
				

	call	a		r1 -
	bnez 	a0, 1f		r1 -	< remembers TOS
	ret			-	< everything from here on is 
	call	b		r2 -	  mispredicted
	....

Sometime later the bnez is resolved in the branch ALU and discovered to have been mispredicted. But by that point the return address ‘r1’ on the call stack has been popped and replaced with ‘r2’, even though we got the top of stack (TOS) correct, its contents are wrong, and can’t be undone when we discover the misprediction. This means that at some later time when we finally do return from subroutine ‘a’ the prediction will be to ‘r2’ rather than ‘r1’ - this in turn will result in another misprediction when that return instruction hits the branch ALU, which is exactly what were were trying to avoid.

Solution

Our solution is to take a similar approach to the one we took for the BTC tables and put a queue of prediction history in front of the call/return stack, in this case it’s a queue of push and pop actions. Instead of being indexed by hashes of the PC/global history, it’s simply the TOS address that’s used instead. Top of stack always returns the latest push that matches the current TOS (and the top of the actual backing stack otherwise).

Like BTC speculative queue mentioned in the previous blog entry this queue is managed by the same bundle tag we attach to instructions in the commitQ - when a mispredicted branch instruction is discovered then the return stack speculative queue is truncated removing entries that correspond to the discarded instruction bundles and the TOS updated to the oldest retained entry.

Similarly when the last instruction in an instruction bundle is committed then related call-return stack speculative queue entries are tagged ‘committed’, and then written back in order into the call-return stack.

To Recap

The important ideas here:

we have call-return stacks to predict the address of return instructions
we maintain a queue of speculative push/pop transitions
each entry overrides the ones after it and the main tables
the queue is flushed on a misprediction
committed data is pushed into the main tables

Next time: How our main pipe works - an overview

VRoom!

VRoom! blog - Branch Target Cache [BTC] (part 3) Managing a speculative subroutine call stack

Introduction

Subroutine call stack

Solution

To Recap

Related Posts

VRoom! blog - Vector ALU Patterns 19 Jun 2023

VRoom! blog - Early Vector Processor Architecture 05 Jun 2023

VRoom! blog - Adding Branch Prediction to the Trace Cache 17 May 2023