VROOM!

A new high-end RISC-V implementation

Paul Campbell - October 2021

paul@taniwha.com @moonbaseotago


(Minor changes Nov 6 2021, March 2022, Feb 2023)

(C) Copyright Moonbase Otago 2021-2023

All rights reserved

Executive summary

  • Goal: Very high end RISC-V implementation – cloud server class
  • Out of order, super scalar, speculative
  • RV64-IMAFDCHBK(V)
  • Up to 8 IPC (instructions per clock) peak, goal ~4 average on ALU heavy work (already exceeded)
  • 2-way simultaneous multithreading capable
  • Multi-core
  • Currently boots Linux on an AWS-FPGA instance
  • Current Dhrystone numbers: ~11.3 DMips/MHz - still a work in progress.
  • GPL3 – dual licensing possible

Making something big and fast ...

  • Our goal is >4 IPC (average) with >97% branch prediction accuracy and lots of cache; deep out-of-order pipelines and speculative execution manage cache and branch miss latency
  • General long term goal is a high end server class CPU, 5GHz+, multithreaded, 100+ instructions in flight at any one time

Architectural Overview

1-3 Fetch and decoder

  • 4 32-bit instructions
  • Or 8 16-bit instructions
  • Or a mix
  • Some instructions swallowed (no-ops, jumps)

Branch Target cache

  • Separate BTCs for each access mode, user mode flushed on MMU table switch
  • 32 entry call/return stack
  • Combined global history/bimodal branch predictors
  • Support for speculative branches/calls/returns
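
As a rough illustration of how a combined global-history/bimodal predictor of this kind works, here's a minimal C sketch (table sizes, indexing and the chooser policy are generic textbook choices, not VROOM's actual design):

    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES 4096                  /* illustrative table size */

    static uint8_t  bimodal[ENTRIES];     /* 2-bit counters indexed by PC            */
    static uint8_t  gshare[ENTRIES];      /* 2-bit counters indexed by PC ^ history  */
    static uint8_t  chooser[ENTRIES];     /* 2-bit counters: trust gshare or bimodal */
    static uint32_t ghr;                  /* global branch history register          */

    static unsigned idx(uint64_t pc)  { return (pc >> 1) & (ENTRIES - 1); }
    static unsigned gidx(uint64_t pc) { return ((pc >> 1) ^ ghr) & (ENTRIES - 1); }
    static void bump(uint8_t *c, bool up) { if (up) { if (*c < 3) (*c)++; } else if (*c) (*c)--; }

    /* Predict: true means "taken" */
    bool predict(uint64_t pc)
    {
        bool b = bimodal[idx(pc)] >= 2;
        bool g = gshare[gidx(pc)] >= 2;
        return (chooser[idx(pc)] >= 2) ? g : b;
    }

    /* Train with the resolved outcome (a real pipeline would use the
     * history captured at prediction time) */
    void update(uint64_t pc, bool taken)
    {
        bool b = bimodal[idx(pc)] >= 2;
        bool g = gshare[gidx(pc)] >= 2;
        if (b != g)                            /* only train the chooser on disagreement */
            bump(&chooser[idx(pc)], g == taken);
        bump(&bimodal[idx(pc)], taken);
        bump(&gshare[gidx(pc)], taken);
        ghr = (ghr << 1) | (taken ? 1u : 0u);  /* shift the outcome into global history */
    }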

Instruction Bundles

Decoded instructions are passed between stages in bundles containing:

  • Functional unit type
  • Command information (add/sub load/store etc)
  • Source and dest registers (and renamed source registers)
  • Immediate constant
  • PC
  • Branch target

Eventually we'll do some instruction combining using this information; the best place may be at entry to the I$0 trace cache, or possibly at the rename stage
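
For illustration, a C struct version of such a bundle might look like this (field names and widths are assumptions for the sketch, not VROOM's actual encoding):

    #include <stdint.h>

    typedef enum { U_ALU, U_SHIFT, U_MULDIV, U_FP, U_LOADSTORE, U_CSR } unit_t;

    typedef struct {
        unit_t   unit;           /* functional unit type                            */
        uint16_t op;             /* command information (add/sub, load/store, ...)  */
        uint8_t  rs1, rs2, rd;   /* architectural source and dest registers         */
        uint16_t prs1, prs2;     /* renamed sources (commit register or arch reg)   */
        int64_t  imm;            /* immediate constant                              */
        uint64_t pc;             /* PC of the instruction                           */
        uint64_t branch_target;  /* branch target                                   */
    } bundle_t;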

Registers

  • We use a combined register file
  • Commit registers hold instructions' results and are either eventually written to real registers or abandoned; there is one commit register for every commitQ entry
  • Once a commitQ entry is committed its value is transferred to an architectural register
  • Commit registers are shared between integer and FP regs

4 Renaming Stage

  • Packs instruction bundles
  • Renames source registers to pick up speculative results from commit registers; a scoreboard keeps track of where the latest version of each architectural register will be stored
  • Keeps track of state when we do speculative misses
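
Continuing the bundle_t sketch above, the renaming idea can be modelled behaviorally like this (C, illustrative; the real logic is Verilog and these details are assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define NREGS 32
    #define NCQ   64                     /* commitQ entries - illustrative */

    /* Scoreboard: where will the newest value of each architectural register come from? */
    static struct {
        bool    speculative;             /* still only in a commit register?     */
        uint8_t cq_entry;                /* which commitQ entry will produce it  */
    } scoreboard[NREGS];

    /* Rename one instruction that will occupy commitQ slot 'cq'.
     * Sources with an in-flight producer are pointed at its commit register
     * (flagged with 0x100 here); otherwise they read the architectural register. */
    void rename(bundle_t *b, uint8_t cq)
    {
        b->prs1 = scoreboard[b->rs1].speculative ? (uint16_t)(0x100 | scoreboard[b->rs1].cq_entry) : b->rs1;
        b->prs2 = scoreboard[b->rs2].speculative ? (uint16_t)(0x100 | scoreboard[b->rs2].cq_entry) : b->rs2;
        if (b->rd != 0) {                /* x0 is never renamed */
            scoreboard[b->rd].speculative = true;
            scoreboard[b->rd].cq_entry    = cq;
        }
    }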

5+ Commit Queue

  • Circular queue of pending instructions
  • At some point they are assigned ALUs
  • When near the end they are committed (currently last 8 can be committed per clock)
  • A resolved mispredicted branch or a trap can cause a partial or full commitQ flush
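
Continuing the sketches above, the commit end of the queue behaves roughly like this (C, behavioral and illustrative; fields and widths are assumptions):

    #define COMMIT_WIDTH 8

    typedef struct {
        bool     valid;          /* slot occupied                                   */
        bool     srcs_ready;     /* sources will be available by the next clock     */
        bool     issued;         /* handed to a functional unit                     */
        bool     done;           /* result written to this entry's commit register  */
        unit_t   unit;           /* which kind of functional unit it needs          */
        uint8_t  rd;             /* architectural destination (0 = none)            */
        uint64_t value;          /* the entry's commit register                     */
    } cq_entry_t;

    static cq_entry_t cq[NCQ];
    static unsigned   head;      /* oldest entry - the commit end                   */
    static uint64_t   arch_reg[NREGS];

    /* One clock of the commit stage: retire up to 8 finished entries, in order */
    void commit_clock(void)
    {
        for (int i = 0; i < COMMIT_WIDTH; i++) {
            cq_entry_t *e = &cq[head];
            if (!e->valid || !e->done)       /* oldest not finished yet: stop, stay in order */
                break;
            if (e->rd != 0)
                arch_reg[e->rd] = e->value;  /* commit register -> architectural register */
            e->valid = false;
            head = (head + 1) % NCQ;
        }
    }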

ALUs (functional units)

  • 3 arithmetic (add/sub/and/or/xor/etc)
  • 1 shift
  • 1 multiply/divide
  • >=1 FP
  • 3 branch [now merged into the arithmetic units]
  • 1 CSR/TRAP/privileged
  • 1 Load/Store (4 address/4 load/4 store per clock)
  • Each commitQ entry is tagged for one of these

ALUs (functional units) 2

Inputs to a functional unit can be:

  • Results of one or two register reads
  • An immediate constant
  • PC of the instruction

An instruction will not trigger execution until all its input registers are available.
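
In the C sketches above, operand selection for a functional unit could be modelled like this (illustrative; the 0x100 flag is just the convention the rename() sketch used):

    /* Read one renamed source: either a speculative result sitting in a
     * commitQ entry's commit register, or a committed architectural register.
     * Immediates and the PC come straight out of the bundle. */
    uint64_t read_source(uint16_t prs)
    {
        if (prs & 0x100)
            return cq[prs & 0xff].value;   /* speculative result (commit register) */
        return arch_reg[prs];              /* architectural register               */
    }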

Schedulers

  • Each type of functional unit has its own scheduler; the schedulers are independent
  • Each scheduler looks for instructions ready to execute (i.e. whose source registers will have been calculated by the next clock)
  • Schedules the N instructions ready to run closest to the commit end of the commitQ
  • The load/store scheduler won't re-order stores past stores, or loads past stores (but will reorder loads) – once instructions are scheduled (and virtual-to-physical translation has been done) further reordering can happen
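
A behavioral sketch of the pick logic, continuing the C model above (the real scheduler does this with combinational logic in a single clock; this is only an illustration):

    /* Pick up to 'n' ready instructions for functional-unit type 'u',
     * scanning from the commit end so the oldest ready instructions win. */
    int schedule(unit_t u, int n, uint8_t *picked)
    {
        int found = 0;
        unsigned slot = head;
        for (unsigned i = 0; i < NCQ && found < n; i++, slot = (slot + 1) % NCQ) {
            cq_entry_t *e = &cq[slot];
            if (e->valid && !e->issued && e->unit == u && e->srcs_ready) {
                e->issued = true;
                picked[found++] = (uint8_t)slot;
            }
        }
        return found;
    }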

6-7-8 schedulers

  • Basic ALU flow looks like this
  • Heavily pipelined (input to reg write can be bypassed to output of reg read)

Load/Store/Fence Unit

Note: this area is under active development, check out the blog for more up to date information
  • Single Unit
  • Address stage handles 6 virtual-to-physical TLB lookups in parallel
  • Can handle 4 concurrent loads and 4 concurrent stores
  • Loads can run in 1 clock if in cache (or snooped from the storeQ)
  • Also 1 clock speculatively if in the storeQ
  • Stores/fences go into the storeQ and are executed in order, but only once their commitQ instructions have committed; they are abandoned if a speculative store is discarded
  • Loads go into the storeQ if they miss in cache, are fenced, or are blocked by a pending access to the same cache line
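
A rough behavioral sketch of that load path (C, illustrative: the helper functions are hypothetical stand-ins, the real unit performs these checks in parallel, and storeQ age/ordering checks are elided):

    #include <stdint.h>
    #include <stdbool.h>

    #define STQ_ENTRIES 16

    typedef struct {
        bool     valid, is_store;
        uint64_t paddr;              /* physical address after TLB lookup */
        uint64_t data;
        unsigned size;               /* access size in bytes              */
    } stq_entry_t;

    static stq_entry_t stq[STQ_ENTRIES];

    /* Hypothetical stand-ins, not VROOM functions */
    bool dcache_lookup(uint64_t paddr, unsigned size, uint64_t *out);
    void storeq_insert_load(uint64_t paddr, unsigned size);

    /* Returns true if the load completes this clock (value in *out) */
    bool load(uint64_t paddr, unsigned size, uint64_t *out)
    {
        /* 1. snoop the storeQ for a pending store to the same address */
        for (int i = 0; i < STQ_ENTRIES; i++) {
            stq_entry_t *s = &stq[i];
            if (s->valid && s->is_store && s->paddr == paddr && s->size >= size) {
                *out = s->data;      /* 1-clock speculative hit from the storeQ */
                return true;
            }
        }
        /* 2. try the D$1 */
        if (dcache_lookup(paddr, size, out))
            return true;             /* 1-clock hit in cache */
        /* 3. miss (or fenced/blocked): park the load in the storeQ */
        storeq_insert_load(paddr, size);
        return false;
    }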

Load/Store Unit

Virtual Memory

  • Separate instruction and data 32 entry fully associative L1 TLBs
  • Shared L2 TLB and table walker – 4 way associative 128+ entries
  • Small cache of page data (to avoid upper page table refetches, takes part in cache coherency protocol)
  • Table walker shares the instruction fetch port to the cache fabric (both are read only – I$1 cache coherency ports can have up to 8 concurrent transactions running at the same time)
  • 16-bit unified (between HARTs) address space ID, or 15-bit unique (per HART) one

Note: 'HART' is a RISC-V term that refers to a unit of execution - for example, one hardware thread of a multithreaded core
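
For illustration, a fully associative L1 TLB with random replacement behaves roughly like this (C sketch; the 32-entry size matches the bullet above, everything else is an assumption):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>

    #define L1_TLB_ENTRIES 32

    typedef struct {
        bool     valid;
        uint16_t asid;               /* address space ID                  */
        uint64_t vpn, ppn;           /* virtual / physical page numbers   */
        uint8_t  perms;              /* R/W/X/U bits from the page table  */
    } tlb_entry_t;

    static tlb_entry_t l1_tlb[L1_TLB_ENTRIES];

    /* Lookup; on a miss the shared L2 TLB / table walker would take over */
    bool tlb_lookup(uint16_t asid, uint64_t vpn, uint64_t *ppn, uint8_t *perms)
    {
        for (int i = 0; i < L1_TLB_ENTRIES; i++) {  /* all entries compared in parallel in hardware */
            tlb_entry_t *e = &l1_tlb[i];
            if (e->valid && e->asid == asid && e->vpn == vpn) {
                *ppn = e->ppn;
                *perms = e->perms;
                return true;
            }
        }
        return false;
    }

    /* Refill after a table walk - random replacement keeps the victim unpredictable */
    void tlb_refill(uint16_t asid, uint64_t vpn, uint64_t ppn, uint8_t perms)
    {
        tlb_entry_t *e = &l1_tlb[rand() % L1_TLB_ENTRIES];
        e->valid = true; e->asid = asid; e->vpn = vpn; e->ppn = ppn; e->perms = perms;
    }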

Branch Unit

  • Conditional branches - compare two register values; if their relationship is not as predicted, force a partial commitQ flush (of everything after the branch instruction) and a new PC
  • Subroutine calls – write PC+2/4 to a register as output
  • Indirect branches (if not as predicted forces a partial commitQ flush and a new PC)
  • Note: This unit has been merged into the integer ALU allowing us to resolve multiple branches/clock.
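
A minimal sketch of conditional-branch resolution in the same behavioral style (C; the helpers are hypothetical stand-ins for the commitQ/fetch control logic):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical stand-ins */
    void flush_commitq_after(unsigned cq_slot);  /* discard younger commitQ entries */
    void redirect_fetch(uint64_t new_pc);        /* restart instruction fetch       */

    /* Resolve a conditional branch held in commitQ slot 'cq_slot' */
    void resolve_branch(unsigned cq_slot, int64_t rs1, int64_t rs2,
                        bool predicted_taken, uint64_t target, uint64_t next_pc)
    {
        bool taken = (rs1 == rs2);               /* e.g. BEQ; other compares are analogous */
        if (taken != predicted_taken) {          /* misprediction: partial flush + new PC  */
            flush_commitq_after(cq_slot);
            redirect_fetch(taken ? target : next_pc);
        }
    }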

CSR Unit

  • Creates most system CSRs (some are in other units FP/Vect)
  • IRET, syscall instructions
  • Interrupts and traps – equivalent to a branch with system state change
  • Load/stores that fail get converted into traps in the commitQ
  • Interrupts and fetch/decode traps get forced into the instruction stream, instruction fetch then stops
  • Always handled at last spot in commitQ – traps can flush subsequent instructions

Performance

Still a work in progress. Observed in the current implementation:

  • Peak 8 instructions decoded per clock
  • Peak 8 committed per clock
  • 5 clock branch misprediction penalty (often less or zero depending on what’s in the pipeline - mispredictions caught deep in the pipeline can be resolved at effectively 0 cost)

Theoretical (one HART):

  • Max 88 instructions in flight (104 if you count pending stores)
  • (currently) 8 concurrent 512bit cache line fetches per L1 cache

Benchmarks

Now that Linux boots we're starting to run Dhrystone in Linux user mode on the real hardware, and also in machine mode on the simulator. The Xilinx-based hardware runs with a much smaller BTC - all numbers are at 25MHz

Source                     Dhrystone/sec    DMips    DMips/MHz
Hardware (very old now)    142314           80.9     3.23
Simulator                  496800           282.7    11.3
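
(For reference, DMips here follows the usual convention of Dhrystones/sec divided by 1757, the VAX 11/780 baseline: 496800 / 1757 ≈ 282.7, and dividing by the 25MHz clock gives the ≈11.3 DMips/MHz above.)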

With the BTC largely functional we can see in architectural traces that we're now predicting all the branches correctly. The simulator shows that its larger BTC is definitely a plus; it's also now running with the new, more parallel load/store units and the combined ALU/branch units - we haven't built an FPGA version of that yet. Dhrystone has been a useful proxy for performance for a while but it's rapidly losing its usefulness.

Putting it all together

Multithreading

A little further ...

This is what we have working today (without the L2)

Die Size

Building Systems

Current System

  • Written in Verilog, some parts auto-generated from C
  • Currently being tested on an AWS FPGA instance - boots Linux
  • Xilinx VU9P Ultrascale (which is really 3 dies)
  • Cut down to fit: 2 load 1 store units, 2 ALUs, 1 mult, 1 shifter, 32 entry commitQ, 8 entry storeQ, 32k I$1, 32k D$1, no L2, small BTC, max 56 instructions in flight
  • Uses our cache coherency fabric, AWS’s DRAM controller
  • Runs at 25MHz (largely for faster synthesis/routing times)
  • Software support for serial and minimal hard drive, no networking yet

AWS FPGA architecture

The CPU is instantiated in a VU9P FPGA along with AWS's support logic (top 1/3 of the RHS two dies on the previous page). In our case we use the provided DRAM controller and PCIE interfaces to allow the host Linux CPU access to a small extra UART, a 'fake' disk controller, and a reset controller.

On the host CPU a small user space program talks to registers in memory mapped PCIE space to provide a console (through the UART) and access to a 'disk' image in a file.

Where are we up to?

  • Still very much a work in progress
  • Most of the design is in place
  • PLIC/CLIC/CLNT
  • uart/faux disk/timers
  • Coherent caching fabric
  • Boots Linux on AWS FPGA instance
  • Coded for multithreading (very much untested) and multiple CPUs (also untested) – like cache/BTC/queue sizes these are simple build options

Eventual Goal

Much of the design is parameterized, we can change stuff easily, here’s a back of the envelope sketch of our goal:

  • 1-5GHz (will likely involve adding ~2 pipe stages to the above description)
  • 2 HARTs/CPU (ie multithreaded)
  • CommitQ 64 entries - per HART
  • StoreQ 64 entries - shared
  • I$0 64+ entries of 8 instruction bundles - per HART
  • I$1 64kbytes - shared
  • D$1 64kbytes - shared
  • Combined L2 2-4MB - shared
  • This means ~192 max instructions in flight
  • 3 integer ALUs - shared
  • 1 shifter - shared
  • 4 load 4 store (per clock) load store unit - shared
  • 1 or 2 multipliers - shared
  • 1 or 2 FPUs - shared
  • 1 vector unit - shared
  • PLIC/CLIC/CLNT - shared
  • Bit/crypto extensions (mostly in shifter) - shared

Meltdown/Spectre etc

We're not perfect (yet - still a work in progress), but we do implement the following mitigations:

  • Separate BTCs between M/S/U operating modes
  • BTC flushed on VM switch
  • No speculative fetches to L1/2 caches until they pass VM access
  • Fully associative TLB L1 with random replacement
  • Wide D$1/I$1 way-ness (currently 32-way, which also allows for large L1 caches with parallel TLB lookup); combined with random way replacement this muddies any signal an attacker is receiving
  • Optional D$1 random replacement

Research

One of the shorter-term goals has been to get a system working well enough that we can run benchmarks, enabling us to optimize things like:

  • Cache sizes
  • BTC size and architecture
  • commitQ size
  • storeQ size
  • Test a multithreaded system

We’re at a point now though where the size of the AWS FPGA instances may limit what we can test at large scale

Next steps

Planned work:

  • I$0 trace cache
  • Expand LS unit to 3/2 from 2/1 load/store
  • Rewrite cache coherency fabric with L2
  • Spend some time on timing. We've purposely avoided spending too much time on low-level timing – the current FPGA is big and slow, and nets that cross between dies mess with any hope of representative timing – but it's worth hunting down particularly bad paths; we expect to re-pipe the final design by a couple of pipe stages to get to the GHz range, so some early warning would be useful
  • B – bit manipulation – tested
  • K – cryptography, both NIST and ShangMi extensions – tested
  • H - Virtualization – about 50% done
  • V - Vector Unit (waiting for FP)
  • Debug
  • Crypto

I$0 Trace cache

This is probably the most interesting enhancement we can make to the current system: it ups the issue rate in inner loops to a fixed 8 bundles/clock no matter what size the original instructions were

  • Virtually tagged
  • Contains instruction bundles recorded from the commit stage of the commitQ
  • This is a great place to do instruction combining (timing wise)
  • Bundles issue directly to the renamer saving a few clocks in the pipeline

Licensing

Once it’s usable by others:

  • GPL 3
  • Dual licensing available – looking for partners to actually build one