VROOM!
A new high-end RISC-V implementation
Paul Campbell - October 2021
taniwha@gmail.com @moonbaseotago
(Minor changes Nov 6 2021, March 2022, Feb 2023)
(C) Copyright Moonbase Otago 2021-2023
All rights reserved
Executive summary
- Goal: Very high end RISC-V implementation – cloud server class
- Out of order, super scalar, speculative
- RV64-IMAFDCHBK(V)
- Up to 8 IPC (instructions per clock) peak, goal ~4 average on ALU heavy work (already exceeded)
- 2-way simultaneous multithreading capable
- Multi-core
- Currently boots Linux on an AWS-FPGA instance
- Current Dhrystone numbers: ~11.3 DMips/MHz - still a work in progress.
- GPL3 – dual licensing possible
Making something big and fast ...
- Our goal is >4 IPC (average) with >97% branch prediction and lots of cache, deep out-of-order pipelines and speculative execution for managing cache and branch miss latency
- General long term goal is a high end server class CPU, 5GHz+, multithreaded, 100+ instructions in flight at any one time
Architectural Overview
1-3 Fetch and decoder
- 4 32-bit instructions
- Or 8 16-bit instructions
- Or a mix
- Some instructions swallowed (no-ops, jumps)
Branch Target cache
- Separate BTCs for each access mode, user mode flushed on MMU table switch
- 32 entry call/return stack
- Combined global history/bimodal branch predictors
- Support for speculative branches/calls/returns
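To make the combined-predictor idea concrete, here is a minimal C sketch of a bimodal table plus a global-history (gshare-style) table with a per-entry chooser that learns which of the two to trust. The table sizes, history width and update policy are illustrative assumptions, not the actual BTC parameters.

    /* Illustrative combined predictor: a bimodal table indexed by PC and a
       global-history (gshare-style) table, with a 2-bit chooser per entry. */
    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES 4096                 /* power of two, 2-bit counters */
    static uint8_t bimodal[ENTRIES];
    static uint8_t gshare[ENTRIES];
    static uint8_t chooser[ENTRIES];     /* >=2 means "prefer gshare" */
    static uint16_t ghist;               /* global taken/not-taken history */

    static bool    taken(uint8_t c)        { return c >= 2; }
    static uint8_t bump(uint8_t c, bool t) { return t ? (c < 3 ? c + 1 : 3)
                                                      : (c > 0 ? c - 1 : 0); }

    bool predict(uint64_t pc) {
        uint32_t bi = (uint32_t)(pc >> 1) % ENTRIES;   /* RVC: PCs are 2-byte aligned */
        uint32_t gi = ((uint32_t)(pc >> 1) ^ ghist) % ENTRIES;
        return taken(chooser[bi] >= 2 ? gshare[gi] : bimodal[bi]);
    }

    void update(uint64_t pc, bool was_taken) {
        uint32_t bi = (uint32_t)(pc >> 1) % ENTRIES;
        uint32_t gi = ((uint32_t)(pc >> 1) ^ ghist) % ENTRIES;
        bool b_ok = taken(bimodal[bi]) == was_taken;
        bool g_ok = taken(gshare[gi])  == was_taken;
        if (b_ok != g_ok)                      /* train the chooser toward the winner */
            chooser[bi] = bump(chooser[bi], g_ok);
        bimodal[bi] = bump(bimodal[bi], was_taken);
        gshare[gi]  = bump(gshare[gi],  was_taken);
        ghist = (uint16_t)((ghist << 1) | (was_taken ? 1 : 0));
    }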
Instruction Bundles
Decoded instructions are passed between stages in bundles containing:
- Functional unit type
- Command information (add/sub load/store etc)
- Source and dest registers (and renamed source registers)
- Immediate constant
- PC
- Branch target
Eventually we'll do some instruction combining using this information; the best place may be at entry to the I$0 trace cache, or possibly at the rename stage. A rough model of an instruction bundle is sketched below.
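As a rough illustration of what travels between stages, a bundle could be modelled as the C struct below; the field names and widths are assumptions based on the list above, not the real Verilog.

    /* Rough C model of one decoded-instruction bundle; field names and
       widths are illustrative assumptions following the list above. */
    #include <stdint.h>

    enum unit_type { UNIT_ALU, UNIT_SHIFT, UNIT_MULDIV, UNIT_FP,
                     UNIT_CSR, UNIT_LOADSTORE };

    struct bundle {
        enum unit_type unit;          /* functional unit type                  */
        uint16_t       op;            /* command info: add/sub, load/store ... */
        uint8_t        rs1, rs2, rd;  /* architectural source/dest registers   */
        uint16_t       prs1, prs2;    /* renamed sources (commit registers)    */
        int64_t        imm;           /* immediate constant                    */
        uint64_t       pc;            /* PC of the instruction                 */
        uint64_t       target;        /* predicted branch target, if any       */
    };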
Registers
- We use a combined register file
- Commit registers hold instructions' results and are either eventually written to real registers or abandoned; there is one commit register for every commitQ entry
- Once a commitQ entry is committed, its value is transferred to an architectural register
- Commit registers are shared between integer and FP regs
4 Renaming Stage
- Packs instruction bundles
- Renames source registers to pick up speculative results from commit registers; a scoreboard keeps track of where the latest version of each architectural register will be stored (sketched below)
- Keeps track of state when we do speculative misses
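A minimal sketch of the renaming idea, assuming a simple producer map: for each architectural register the scoreboard records which commitQ entry (commit register) will hold its newest value, and sources are renamed from that map. The snapshots needed to recover from speculative misses are omitted.

    /* Sketch of renaming with a scoreboard: producer[r] records which commitQ
       entry (commit register) will hold the newest value of architectural
       register r, or NO_RENAME if the architectural value is already current.
       Only the 32 integer registers are shown. */
    #include <stdint.h>

    #define NREGS     32
    #define NO_RENAME 0xffff

    struct scoreboard { uint16_t producer[NREGS]; };

    void scoreboard_reset(struct scoreboard *sb) {
        for (int i = 0; i < NREGS; i++) sb->producer[i] = NO_RENAME;
    }

    /* rename an instruction reading rs1/rs2 and writing rd,
       allocated commitQ entry 'cq' */
    void rename(struct scoreboard *sb, unsigned rs1, unsigned rs2, unsigned rd,
                uint16_t cq, uint16_t *prs1, uint16_t *prs2) {
        *prs1 = rs1 ? sb->producer[rs1] : NO_RENAME;
        *prs2 = rs2 ? sb->producer[rs2] : NO_RENAME;
        if (rd)                           /* x0 is never renamed */
            sb->producer[rd] = cq;        /* newest value will live in commit reg cq */
    }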
5+ Commit Queue
- Circular queue of pending instructions
- At some point they are assigned ALUs
- When near the end they are committed (currently the last 8 can be committed per clock)
- A resolved mispredicted branch or a trap can cause a partial or full commitQ flush
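A toy C model of the behaviour described above: a circular buffer whose head commits up to 8 completed entries per clock, and whose tail is cut back when a mispredict is resolved. The sizes and fields are placeholders, not the real commitQ state.

    /* Toy model of the commitQ as a circular buffer. */
    #include <stdint.h>
    #include <stdbool.h>

    #define CQ_SIZE 64u                      /* power of two */

    struct cq_entry { bool done; };          /* result has been computed */

    struct commitq {
        struct cq_entry e[CQ_SIZE];
        uint32_t head;                       /* oldest entry, next to commit   */
        uint32_t tail;                       /* next free slot (monotonic idx) */
    };

    /* commit up to 8 completed entries, in order, per clock */
    int commit_one_clock(struct commitq *q) {
        int n = 0;
        while (n < 8 && q->head != q->tail && q->e[q->head % CQ_SIZE].done) {
            /* here the commit register's value becomes architectural */
            q->head++;
            n++;
        }
        return n;
    }

    /* a branch at monotonic index 'bad' resolved as mispredicted:
       abandon every younger entry */
    void flush_after(struct commitq *q, uint32_t bad) {
        q->tail = bad + 1;
    }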
ALUs (functional units)
- 3 arithmetic (add/sub/and/or/xor/etc)
- 1 shift
- 1 multiply/divide
- >=1 FP
- 3 branch [now merged into the arithmetic units]
- 1 CSR/TRAP/privileged
- 1 Load/Store (4 address/4 load/4 store per clock)
- Each commitQ entry is tagged for one of these
ALUs (functional units) 2
Inputs to a functional unit can be:
- Result of one or two register reads
- An immediate constant
- PC of the instruction
An instruction will not trigger execution until all its input registers are available.
Schedulers
- Each type of functional unit has a scheduler, they are independent
- Each scheduler looks for instructions ready to execute (i.e. those whose source registers will have been calculated by the next clock)
- It schedules the N ready instructions closest to the commit end of the commitQ (see the sketch below)
- The load/store scheduler won't re-order stores past stores, or loads past stores (but will reorder loads) – once instructions are scheduled (and virtual-to-physical translation has been done) further reordering can happen
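The sketch below shows the scheduling idea for one functional-unit type, assuming a simplified commitQ representation: scan from the commit end and pick the oldest ready, un-issued instructions of that type.

    /* One scheduler's pick, sketched: index 0 is the oldest (nearest commit)
       entry; choose the n_alus oldest ready instructions of this unit type. */
    #include <stdbool.h>

    struct slot {
        bool valid, issued;
        int  unit;                 /* functional-unit type of this entry        */
        bool src_ready;            /* all renamed sources available next clock  */
    };

    int schedule(struct slot *q, int qlen, int unit, int n_alus, int *picked) {
        int n = 0;
        for (int i = 0; i < qlen && n < n_alus; i++) {
            if (q[i].valid && !q[i].issued && q[i].unit == unit && q[i].src_ready) {
                picked[n++] = i;   /* instructions closest to commit win */
                q[i].issued = true;
            }
        }
        return n;                  /* how many were dispatched this clock */
    }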
6-7-8 schedulers
- Basic ALU flow looks like this
- Heavily pipelined (the input to the register write can be bypassed to the output of the register read – see the sketch below)
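The bypass mentioned above can be illustrated with a one-line C helper, assuming a tag-matched write port; the real datapath does this per read port in hardware.

    /* The register-read bypass in miniature: if the value being written this
       clock is the one we are about to read, use the write-port data instead
       of the register-file output.  Tags here stand for commit registers. */
    #include <stdint.h>
    #include <stdbool.h>

    uint64_t read_with_bypass(uint64_t rf_out,    /* register-file read data     */
                              bool     wr_en,     /* a write happens this clock  */
                              uint16_t wr_tag,    /* commit reg being written    */
                              uint64_t wr_data,
                              uint16_t rd_tag) {  /* commit reg being read       */
        return (wr_en && wr_tag == rd_tag) ? wr_data : rf_out;
    }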
Load/Store/Fence Unit
Note: this area is under active development, check out the blog for more up to date information
- Single Unit
- Address stage handles 6 virtual-to-physical TLB lookups in parallel
- Can handle 4 concurrent loads and 4 concurrent stores
- Loads can run in 1 clock if in cache (or snooped from storeQ)
- Also 1 clock speculatively if in storeQ
- Stores and fences go in the storeQ and are executed in order, but only once their commitQ instructions are committed; they are abandoned if a speculative store is discarded
- Loads go in the storeQ if they miss in cache, are fenced, or are blocked by a pending access to the same cache line
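As an illustration of the storeQ snooping mentioned above, here is a simplified C sketch in which a load searches the store queue for the newest pending store to the same address and forwards its data; partial overlaps, access sizes and fence ordering are ignored.

    /* Load snooping the storeQ, simplified: forward from the newest pending
       store to the same address (a speculative 1-clock hit), otherwise fall
       through to the D$. */
    #include <stdint.h>
    #include <stdbool.h>

    struct stq_entry { bool valid; uint64_t addr; uint64_t data; };

    /* stq[0..n-1] is held oldest to newest */
    bool storeq_forward(const struct stq_entry *stq, int n,
                        uint64_t load_addr, uint64_t *data_out) {
        for (int i = n - 1; i >= 0; i--) {         /* newest matching store wins */
            if (stq[i].valid && stq[i].addr == load_addr) {
                *data_out = stq[i].data;
                return true;
            }
        }
        return false;                              /* go to the data cache */
    }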
Load/Store Unit
Virtual Memory
- Separate instruction and data 32 entry fully associative L1 TLBs
- Shared L2 TLB and table walker – 4 way associative 128+ entries
- Small cache of page data (to avoid upper page table refetches, takes part in cache coherency protocol)
- Table walker shares the instruction fetch port to the cache fabric (both are read only – I$1 cache coherency ports can have up to 8 concurrent transactions running at the same time)
- 16-bit unified (between HARTs) address space ID, or 15-bit unique (per HART) one
Note: 'HART' is a RISC-V term that refers to a unit of execution - for example one portion of a multithreaded core
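A simplified model of the fully associative L1 TLB lookup, with the ASID matching implied above; in hardware all 32 entries are compared in parallel, here it is just a loop, and the field layout is an assumption.

    /* Simplified fully associative L1 TLB lookup with a global bit that
       matches any address-space ID. */
    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 32

    struct tlb_entry {
        bool     valid;
        bool     global;          /* matches every ASID   */
        uint16_t asid;
        uint64_t vpn;             /* virtual page number  */
        uint64_t ppn;             /* physical page number */
    };

    bool tlb_lookup(const struct tlb_entry *tlb, uint64_t vpn, uint16_t asid,
                    uint64_t *ppn_out) {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn &&
                (tlb[i].global || tlb[i].asid == asid)) {
                *ppn_out = tlb[i].ppn;
                return true;                       /* hit */
            }
        }
        return false;              /* miss: consult the L2 TLB / table walker */
    }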
Branch Unit
- Conditional branches - compares 2 register values, if their relationship is not as predicted, forces a partial commitQ (after the branch instruction) flush and a new PC
- Subroutine calls – writes PC+2/4 to register as output
- Indirect branches (if not as predicted forces a partial commitQ flush and a new PC)
- Note: This unit has been merged into the integer ALU allowing us to resolve multiple branches/clock.
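A small sketch of conditional-branch resolution using BLT as the example: compare the two register values, and if the actual direction differs from the prediction, signal a partial commitQ flush with the corrected PC. The flush plumbing is reduced to a returned struct.

    /* Conditional-branch resolution, BLT as the example op. */
    #include <stdint.h>
    #include <stdbool.h>

    struct branch_result {
        bool     mispredict;       /* flush everything younger than this entry */
        uint64_t next_pc;          /* where fetch should continue              */
    };

    struct branch_result resolve_blt(int64_t rs1, int64_t rs2,
                                     bool predicted_taken,
                                     uint64_t pc, uint64_t target) {
        bool taken = rs1 < rs2;                      /* signed compare for BLT */
        struct branch_result r;
        r.mispredict = (taken != predicted_taken);
        r.next_pc    = taken ? target : pc + 4;      /* pc + 2 for a compressed branch */
        return r;
    }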
CSR Unit
- Creates most system CSRs (some are in other units FP/Vect)
- IRET, syscall instructions
- Interrupts and traps – equivalent to a branch with system state change
- Load/stores that fail get converted into traps in the commitQ
- Interrupts and fetch/decode traps get forced into the instruction stream; instruction fetch then stops
- Always handled at last spot in commitQ – traps can flush subsequent instructions
Performance
Still a work in progress. Observed in the current implementation:
- Peak 8 instructions decoded per clock
- Peak 8 committed per clock
- 5 clock branch misprediction penalty (often less or zero depending on what’s in the pipeline - mispredictions caught deep in the pipeline can be resolved at effectively 0 cost)
Theoretical (one HART):
- Max 88 instructions in flight (104 if you count pending stores)
- (currently) 8 concurrent 512bit cache line fetches per L1 cache
Benchmarks
Now that Linux boots we're starting to run Dhrystone in Linux user mode on the real hardware, and also in machine mode on the simulator. The Xilinx-based hardware runs with a much smaller BTC; all numbers are at 25MHz.
Source                  | Dhrystone/sec | DMips | DMips/MHz
Hardware (very old now) |        142314 |  80.9 |      3.23
Simulator               |        496800 | 282.7 |      11.3
With the BTC largely functional we can see in architectural traces that we're now predicting all the branches correctly, and the simulator shows that its larger BTC is definitely a plus. It's also now running with the new, more parallel load/store unit and the combined ALU/branch units; we haven't built an FPGA version of that yet. Dhrystone has been a useful proxy for performance for a while, but it's rapidly losing its usefulness.
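For reference, the DMips numbers above follow the usual Dhrystone convention: Dhrystones/sec divided by 1757 (the VAX 11/780 baseline), and DMips/MHz divides that by the clock. The snippet below reproduces the simulator row.

    /* Dhrystone bookkeeping:
       DMips = Dhrystones/sec / 1757,  DMips/MHz = DMips / clock(MHz) */
    #include <stdio.h>

    int main(void) {
        double dhrystones_per_sec = 496800.0;    /* simulator row of the table */
        double mhz   = 25.0;
        double dmips = dhrystones_per_sec / 1757.0;
        printf("DMips = %.1f  DMips/MHz = %.2f\n", dmips, dmips / mhz);
        return 0;                  /* approximately the 282.7 / 11.3 in the table */
    }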
Putting it all together
Multithreading
A little further ...
This is what we have working today (without the L2)
Die Size
Building Systems
Current System
- Written in Verilog, some parts auto-generated from C
- Currently being tested on an AWS FPGA instance - boots Linux
- Xilinx VU9P Ultrascale (which is really 3 dies)
- Cut down to fit: 2 load 1 store units, 2 ALUs, 1 mult, 1 shifter, 32 entry commitQ, 8 entry storeQ, 32k I$1, 32k D$1, no L2, small BTC, max 56 instructions in flight
- Uses our cache coherency fabric, AWS’s DRAM controller
- Runs at 25MHz (largely for faster synthesis/routing times)
- Software support for serial and minimal hard drive, no networking yet
AWS FPGA architecture
The CPU is instantiated in a VU9P FPGA along with AWS's support logic (top 1/3 of the RHS two dies on the previous page).
In our case we use the provided DRAM controller and PCIE interfaces to allow the host Linux CPU access to
a small extra UART, a 'fake' disk controller, and a reset controller.
On the host CPU a small user space program talks to registers in memory mapped
PCIE space to provide a console (through the UART) and access to a 'disk' image
in a file.
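The sketch below illustrates that host-side idea only in outline: mmap() the FPGA's PCIe BAR via sysfs and poll a UART receive register to drive a console. The PCI device path and the register offsets are made-up placeholders, not the real AWS/VROOM register map.

    /* Host-side console sketch: map a PCIe BAR through sysfs and busy-poll a
       UART receive register.  Device path and offsets are placeholders. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define BAR_SIZE       0x10000
    #define UART_RX_DATA   0x0000     /* placeholder register offsets */
    #define UART_RX_VALID  0x0004

    int main(void) {
        int fd = open("/sys/bus/pci/devices/0000:00:1d.0/resource0", O_RDWR);
        if (fd < 0) return 1;
        volatile uint32_t *bar = mmap(NULL, BAR_SIZE, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0);
        if ((void *)bar == MAP_FAILED) return 1;
        for (;;) {                                 /* console RX poll loop */
            if (bar[UART_RX_VALID / 4]) {
                putchar((int)(bar[UART_RX_DATA / 4] & 0xff));
                fflush(stdout);
            }
        }
    }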
Where are we up to?
- Still very much a work in progress
- Most of the design is in place
- PLIC/CLIC/CLNT
- uart/faux disk/timers
- Coherent caching fabric
- Boots Linux on AWS FPGA instance
- Coded for multithreading (not yet tested) and multiple CPUs (also not tested) – like cache/BTC/queue sizes these are simple build options
Eventual Goal
Much of the design is parameterized so we can change things easily; here's a back-of-the-envelope sketch of our goal:
- 1-5GHz (will likely involve adding ~2 pipe stages to the above description)
- 2 HARTs/CPU (ie multithreaded)
- CommitQ 64 entries - per HART
- StoreQ 64 entries - shared
- I$0 64+ entries of 8 instruction bundles - per HART
- I$1 64kbytes - shared
- D$1 64kbytes - shared
- Combined L2 2-4MB - shared
- This means ~192 max instructions in flight
- 3 integer ALUs - shared
- 1 shifter - shared
- 4 load 4 store (per clock) load store unit - shared
- 1 or 2 multipliers - shared
- 1 or 2 FPUs - shared
- 1 vector unit - shared
- PLIC/CLIC/CLNT - shared
- Bit/crypto extensions (mostly in shifter) - shared
Meltdown/Spectre etc
We're not perfect (yet - still a work in progress), but we do already implement the following mitigations:
- Separate BTCs between M/S/U operating modes
- BTC flushed on VM switch
- No speculative fetches to L1/2 caches until they pass VM access
- Fully associative TLB L1 with random replacement
- Wide D$1/I$1 way-ness (currently 32-way, which also allows for large L1 caches with parallel TLB lookup); combined with random way replacement this muddies any signal an attacker is receiving
- Optional D$1 random replacement (sketched below)
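Random way replacement can be as simple as an LFSR choosing the victim way, as sketched below; the width, taps and seed here are arbitrary illustrative choices, not the actual implementation.

    /* Random way replacement via a 16-bit maximal-length Fibonacci LFSR
       (taps 16,14,13,11); the low bits pick the victim way. */
    #include <stdint.h>

    #define WAYS 32

    static uint16_t lfsr = 0xACE1u;        /* any non-zero seed */

    unsigned random_victim_way(void) {
        uint16_t bit = (uint16_t)(((lfsr >> 0) ^ (lfsr >> 2) ^
                                   (lfsr >> 3) ^ (lfsr >> 5)) & 1u);
        lfsr = (uint16_t)((lfsr >> 1) | (bit << 15));
        return lfsr % WAYS;
    }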
Research
One of the shorter-term goals has been to get a system working well enough that we can run benchmarks, letting us optimize things like:
- Cache sizes
- BTC size and architecture
- commitQ size
- storeQ size
- Test a multithreaded system
We're at a point now, though, where the size of the AWS FPGA instances may limit what we can test at large scale.
Next steps
Planned work:
- I$0 trace cache
- Expand LS unit to 3/2 from 2/1 load/store
- Rewrite cache coherency fabric with L2
- Spend some time on timing. We've purposely avoided spending too much time on low-level timing – the current FPGA is big and slow, and nets that cross between dies mess with any hope of representative timing – but it's worth hunting down particularly bad paths now. We expect to re-pipeline the final design by a couple of pipe stages to get to the GHz range, so some early warning would be useful
- B – bit manipulation – tested
- K – cryptography, both NIST and ShangMi extensions – tested
- H - Virtualization – about 50% done
- V - Vector Unit (waiting for FP)
- Debug
- Crypto
I$0 Trace cache
This is probably the most interesting enhancement we can make to the current system: it would raise the issue rate in inner loops to a fixed 8 bundles/clock no matter what size the original instructions were
- Virtually tagged
- Contains instruction bundles recorded from the commit stage of the commitQ
- This is a great place to do instruction combining (timing wise)
- Bundles issue directly to the renamer saving a few clocks in the pipeline
Licensing
Once it’s usable by others:
- GPL 3
- Dual licensing available – looking for partners to actually build one