VROOM!

A new high-end RISC-V implementation

Paul Campbell - October 2021

taniwha@gmail.com @moonbaseotago


(Minor changes Nov 6 2021, March 2022, Feb 2023)

(C) Copyright Moonbase Otago 2021-2023

All rights reserved

Executive summary

  • Goal: Very high end RISC-V implementation – cloud server class
  • Out of order, super scalar, speculative
  • RV64-IMAFDCHBK(V)
  • Up to 8 IPC (instructions per clock) peak, goal ~4 average on ALU heavy work (already exceeded)
  • 2-way simultaneous multithreading capable
  • Multi-core
  • Currently boots Linux on an AWS-FPGA instance
  • Current Dhrystone numbers: ~11.3 DMips/MHz - still a work in progress.
  • GPL3 – dual licensing possible

Making something big and fast ...

  • Our goal is >4 IPC (average), with >97% branch prediction, lots of cache, deep out-of-order pipelines, and speculative execution to manage cache-miss and branch-miss latency
  • General long term goal is a high end server class CPU, 5GHz+, multithreaded, 100+ instructions in flight at any one time

Architectural Overview

1-3 Fetch and decoder

  • 4 32-bit instructions
  • Or 8 16-bit instructions
  • Or a mix
  • Some instructions are swallowed (no-ops, jumps)

Branch Target cache

  • Separate BTCs for each access mode, user mode flushed on MMU table switch
  • 32 entry call/return stack
  • Combined global history/bimodal branch predictors
  • Support for speculative branches/calls/returns
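
For flavour, here is a minimal Verilog sketch of the bimodal half of such a predictor - a table of 2-bit saturating counters indexed by PC bits. It is purely illustrative (module and signal names are invented, not VROOM's actual code); a real combined predictor would also fold global history into the index and choose between the two predictions.

    // Hypothetical 2-bit saturating-counter (bimodal) predictor sketch -
    // illustrates the general technique only, not the VROOM implementation.
    module bimodal_predictor #(parameter INDEX_BITS = 10) (
        input         clk,
        input  [63:0] fetch_pc,       // PC being predicted
        output        predict_taken,  // 1 = predict taken
        input         update_valid,   // a branch resolved this clock
        input  [63:0] update_pc,
        input         update_taken    // actual direction
    );
        reg [1:0] counters [0:(1<<INDEX_BITS)-1];  // 00/01 = not taken, 10/11 = taken

        wire [INDEX_BITS-1:0] fetch_index  = fetch_pc[INDEX_BITS:1];   // skip bit 0 (2-byte aligned)
        wire [INDEX_BITS-1:0] update_index = update_pc[INDEX_BITS:1];

        assign predict_taken = counters[fetch_index][1];

        always @(posedge clk) begin
            if (update_valid) begin
                if (update_taken && counters[update_index] != 2'b11)
                    counters[update_index] <= counters[update_index] + 2'b01;
                else if (!update_taken && counters[update_index] != 2'b00)
                    counters[update_index] <= counters[update_index] - 2'b01;
            end
        end
    endmodule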

Instruction Bundles

Decoded instructions are passed between stages in bundles containing:

  • Functional unit type
  • Command information (add/sub, load/store, etc.)
  • Source and dest registers (and renamed source registers)
  • Immediate constant
  • PC
  • Branch target

Eventually we'll do some instruction combining using this information (the best place may be at entry to the I$0 trace cache, or possibly at the rename stage)
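
A minimal SystemVerilog sketch of what such a bundle might look like - field names and widths are invented for illustration, not VROOM's actual encoding:

    // Hypothetical decoded-instruction bundle as a packed struct.
    typedef struct packed {
        logic [3:0]  unit_type;      // functional unit class (ALU/shift/mul-div/FP/load-store/CSR)
        logic [7:0]  command;        // operation within that unit (add/sub, load/store size, etc.)
        logic [4:0]  rs1;            // architectural source registers
        logic [4:0]  rs2;
        logic [4:0]  rd;             // architectural destination register
        logic [6:0]  rs1_renamed;    // commitQ entries holding the speculative source values
        logic [6:0]  rs2_renamed;
        logic [31:0] immediate;      // immediate constant
        logic [63:0] pc;             // PC of the instruction
        logic [63:0] branch_target;  // decoded/predicted branch target
    } decoded_bundle_t;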

Registers

  • We use a combined register file
  • Commit registers hold instructions' results and are either eventually written to real registers or abandoned – one commit register for every commitQ entry
  • Once a commitQ entry is committed its value is transferred to an architectural register
  • Commit registers are shared between integer and FP regs

4 Renaming Stage

  • Packs instruction bundles
  • Renames source registers to pick up speculative results from commit registers; a scoreboard keeps track of where the latest version of each architectural register will be stored
  • Keeps track of state so that we can recover when speculation misses
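
As an illustration of the scoreboard idea, here is a hypothetical single-destination-per-clock sketch (the real VROOM renamer handles whole bundles): each architectural register remembers whether its newest value will come from a commitQ entry or from the architectural file.

    // Hypothetical rename-scoreboard sketch - names and widths are illustrative.
    module rename_scoreboard #(parameter NCOMMIT_LOG2 = 6) (
        input                          clk,
        input                          reset,
        input                          flush,          // speculation failed: forget all mappings
        // rename one destination per clock in this sketch
        input                          alloc_valid,
        input       [4:0]              alloc_rd,       // destination register being renamed
        input       [NCOMMIT_LOG2-1:0] alloc_entry,    // commitQ entry that will produce its value
        // source lookup for a later instruction
        input       [4:0]              rs1,
        output                         rs1_in_commitq, // 0: read the architectural register file
        output      [NCOMMIT_LOG2-1:0] rs1_entry       // 1: read this commit register instead
    );
        reg                    speculative [0:31];
        reg [NCOMMIT_LOG2-1:0] entry       [0:31];
        integer i;

        assign rs1_in_commitq = speculative[rs1];
        assign rs1_entry      = entry[rs1];

        always @(posedge clk) begin
            if (reset || flush) begin
                for (i = 0; i < 32; i = i + 1)
                    speculative[i] <= 1'b0;
            end else if (alloc_valid && alloc_rd != 5'd0) begin  // x0 is never renamed
                speculative[alloc_rd] <= 1'b1;
                entry[alloc_rd]       <= alloc_entry;
            end
        end
    endmodule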

5+ Commit Queue

  • Circular queue of pending instructions
  • At some point they are assigned ALUs
  • When near the end they are committed (currently the last 8 can be committed per clock)
  • A resolved mispredicted branch or a trap can cause a partial or full commitQ flush
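
A minimal sketch of the circular bookkeeping involved (hypothetical names and widths; it ignores occupancy checks and all the per-entry state the real queue carries):

    // Hypothetical commitQ pointer sketch: allocate at the tail, retire up to
    // 8 completed entries per clock at the head.
    module commitq_pointers #(parameter N = 64, LOG2N = 6) (
        input                  clk,
        input                  reset,
        input      [3:0]       n_alloc,     // entries allocated at the tail this clock (0-8)
        input      [N-1:0]     entry_done,  // per-entry "finished executing" flags
        output reg [LOG2N-1:0] head,        // oldest entry, next to commit
        output reg [LOG2N-1:0] tail         // next free entry
    );
        // how many of the 8 oldest entries can retire this clock, in order
        // (a real queue would also check occupancy, traps, etc.)
        reg [3:0] n_commit;
        integer i;
        always @(*) begin
            n_commit = 4'd0;
            for (i = 0; i < 8; i = i + 1)
                if (entry_done[(head + i) % N] && n_commit == i)
                    n_commit = n_commit + 4'd1;
        end

        always @(posedge clk) begin
            if (reset) begin
                head <= {LOG2N{1'b0}};
                tail <= {LOG2N{1'b0}};
            end else begin
                head <= head + n_commit;  // pointers wrap naturally since N is a power of two
                tail <= tail + n_alloc;
            end
        end
    endmodule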

ALUs (functional units)

  • 3 arithmetic (add/sub/and/or/xor/etc)
  • 1 shift
  • 1 multiply/divide
  • >=1 FP
  • 3 branch [now merged into the arithmetic units]
  • 1 CSR/TRAP/privileged
  • 1 Load/Store (4 address/4 load/4 store per clock)
  • Each commitQ entry is tagged for one of these

ALUs (functional units) 2

Inputs to a functional unit can be:

  • Result of one or two register reads
  • An immediate constant
  • PC of the instruction

An instruction will not trigger execution until all its input registers are available.

Schedulers

  • Each type of functional unit has its own scheduler; they are independent
  • It looks for instructions ready to execute (i.e. whose source registers will have been calculated by the next clock)
  • Schedules the N ready-to-run instructions closest to the commit end of the commitQ
  • The load/store scheduler won't reorder stores past stores, or loads past stores (but will reorder loads); once scheduled (and after virtual-to-physical translation) further reordering can happen
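
Conceptually each scheduler is an age-ordered picker: given a "ready" bit per commitQ entry (all of its source registers will be available by the next clock), grant the N oldest ready entries. The sketch below is hypothetical and written as a serial loop purely for clarity - real hardware would use parallel prefix logic.

    // Hypothetical age-ordered picker: bit 0 of 'ready' is assumed to be the
    // oldest entry (closest to the commit end of the commitQ).
    module age_ordered_picker #(parameter NENTRIES = 64, NALU = 3) (
        input      [NENTRIES-1:0] ready,  // entry's source operands available next clock
        output reg [NENTRIES-1:0] grant   // at most NALU bits set, favouring the oldest
    );
        integer i, granted;
        always @(*) begin
            grant   = {NENTRIES{1'b0}};
            granted = 0;
            for (i = 0; i < NENTRIES; i = i + 1)
                if (ready[i] && granted < NALU) begin
                    grant[i] = 1'b1;
                    granted  = granted + 1;
                end
        end
    endmodule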

6-7-8 schedulers

  • Basic ALU flow looks like this
  • Heavily pipelined (input to reg write can be bypassed to output of reg read)
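
The bypass mentioned above is conceptually just a mux on the operand path; a minimal hypothetical sketch:

    // Hypothetical operand bypass: if the value being written back this clock
    // is the one a younger instruction is waiting for, forward it directly
    // instead of waiting to read it from the register file next clock.
    module operand_bypass #(parameter TAGW = 7) (
        input  [63:0]     regfile_value, // value read from the commit/architectural registers
        input  [TAGW-1:0] source_tag,    // commitQ entry this operand comes from
        input             wb_valid,      // a result is being written back this clock
        input  [TAGW-1:0] wb_tag,        // which commitQ entry produced it
        input  [63:0]     wb_value,
        output [63:0]     operand        // value actually fed to the functional unit
    );
        assign operand = (wb_valid && wb_tag == source_tag) ? wb_value : regfile_value;
    endmodule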

Load/Store/Fence Unit

Note: this area is under active development, check out the blog for more up to date information
  • Single Unit
  • Address stage handles 6 virtual-to-physical TLB lookups in parallel
  • Can handle 4 concurrent loads and 4 concurrent stores
  • Loads can run in 1 clock if in cache (or snooped from the storeQ)
  • Also 1 clock speculatively if in the storeQ
  • Stores/fences go into the storeQ and are executed in order, but only once their commitQ instructions are committed; they are abandoned if a speculative store is discarded
  • Loads go into the storeQ if they miss in cache, are fenced, or are blocked by a pending access to the same cache line
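
The storeQ snooping mentioned above amounts to a CAM lookup of the load address against pending stores, with the newest matching store winning. A simplified, hypothetical SystemVerilog sketch (assumes aligned same-size accesses and an age-ordered queue):

    module storeq_snoop #(parameter NSQ = 8) (
        input  logic        sq_valid [0:NSQ-1],  // pending (not yet written) stores
        input  logic [63:0] sq_addr  [0:NSQ-1],
        input  logic [63:0] sq_data  [0:NSQ-1],
        input  logic [63:0] load_addr,
        output logic        hit,                 // load can be satisfied from the storeQ
        output logic [63:0] load_data
    );
        integer i;
        always_comb begin
            hit       = 1'b0;
            load_data = 64'b0;
            // entry 0 is oldest; a later match overrides an earlier one,
            // so the newest matching store wins
            for (i = 0; i < NSQ; i = i + 1)
                if (sq_valid[i] && sq_addr[i] == load_addr) begin
                    hit       = 1'b1;
                    load_data = sq_data[i];
                end
        end
    endmodule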

Load/Store Unit

Virtual Memory

  • Separate 32-entry fully associative instruction and data L1 TLBs
  • Shared L2 TLB and table walker – 4-way associative, 128+ entries
  • Small cache of page data (to avoid upper page table refetches, takes part in cache coherency protocol)
  • Table walker shares the instruction fetch port to the cache fabric (both are read only – I$1 cache coherency ports can have up to 8 concurrent transactions running at the same time)
  • 16-bit unified (between HARTs) address space ID, or 15-bit unique (per HART) one

Note: 'HART' is a RISC-V term that refers to a unit of execution - for example one portion of a multithreaded core
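
A fully associative L1 TLB lookup is conceptually a parallel compare across all 32 entries. The sketch below is hypothetical and simplified (4K pages only, no superpages, global pages, or permission bits), using Sv39-style field widths:

    module l1_tlb_lookup #(parameter NENTRIES = 32) (
        input  logic        valid [0:NENTRIES-1],
        input  logic [15:0] asid  [0:NENTRIES-1],
        input  logic [26:0] vpn   [0:NENTRIES-1],  // Sv39 virtual page number (VA bits 38:12)
        input  logic [43:0] ppn   [0:NENTRIES-1],  // physical page number
        input  logic [26:0] lookup_vpn,
        input  logic [15:0] lookup_asid,
        output logic        hit,
        output logic [43:0] hit_ppn
    );
        integer i;
        always_comb begin
            hit     = 1'b0;
            hit_ppn = 44'b0;
            // every entry is compared in parallel in hardware
            for (i = 0; i < NENTRIES; i = i + 1)
                if (valid[i] && vpn[i] == lookup_vpn && asid[i] == lookup_asid) begin
                    hit     = 1'b1;
                    hit_ppn = ppn[i];
                end
        end
    endmodule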

Branch Unit

  • Conditional branches – compare two register values; if their relationship is not as predicted, force a partial commitQ flush (after the branch instruction) and a new PC
  • Subroutine calls – write PC+2/4 to a register as output
  • Indirect branches – if the target is not as predicted, force a partial commitQ flush and a new PC
  • Note: This unit has been merged into the integer ALUs, allowing us to resolve multiple branches per clock.

CSR Unit

  • Implements most system CSRs (some live in other units: FP/vector)
  • IRET, syscall instructions
  • Interrupts and traps – equivalent to a branch with system state change
  • Loads/stores that fail get converted into traps in the commitQ
  • Interrupts and fetch/decode traps get forced into the instruction stream; instruction fetch then stops
  • Always handled at last spot in commitQ – traps can flush subsequent instructions

Performance

Still a work in progress. Observed in the current implementation:

  • Peak 8 instructions decoded per clock
  • Peak 8 committed per clock
  • 5 clock branch misprediction penalty (often less, or zero, depending on what's in the pipeline – mispredictions caught deep in the pipeline can be resolved at effectively no cost)

Theoretical (one HART):

  • Max 88 instructions in flight (104 if you count pending stores)
  • (currently) 8 concurrent 512bit cache line fetches per L1 cache

Benchmarks

Now that Linux boots we're starting to run Dhrystone in Linux user mode on the real hardware, and also in machine mode on the simulator. The Xilinx-based hardware runs with a much smaller BTC; all numbers are at 25MHz.

  Source                     Dhrystone/sec    DMips    DMips/MHz
  Hardware (very old now)    142314           80.9     3.23
  Simulator                  496800           282.7    11.3

With the BTC largely functional we can now see in architectural traces that we're predicting all the branches correctly. The simulator shows that its larger BTC is definitely a plus; it's also now running with the new, more parallel load/store unit and the combined ALU/branch units - we haven't built an FPGA version of that yet. Dhrystone has been a useful proxy for performance for a while, but it's rapidly losing its usefulness.
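
For reference, the DMips column is Dhrystones/sec divided by 1757 (the VAX 11/780 reference score), and DMips/MHz divides that by the 25MHz clock: e.g. 496800 / 1757 ≈ 282.7 DMips, and 282.7 / 25 ≈ 11.3 DMips/MHz.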

Putting it all together

Multithreading

A little further ...

This is what we have working today (without the L2)

Die Size

Building Systems

Current System

  • Written in Verilog, some parts auto-generated from C
  • Currently being tested on an AWS FPGA instance - boots Linux
  • Xilinx VU9P Ultrascale (which is really 3 dies)
  • Cut down to fit: 2 load 1 store units, 2 ALUs, 1 mult, 1 shifter, 32 entry commitQ, 8 entry storeQ, 32k I$1, 32k D$1, no L2, small BTC, max 56 instructions in flight
  • Uses our cache coherency fabric, AWS’s DRAM controller
  • Runs at 25MHz (largely for faster synthesis/routing times)
  • Software support for serial and minimal hard drive, no networking yet

AWS FPGA architecture

The CPU is instantiated in a VU9P FPGA along with AWS's support logic (the top 1/3 of the right-hand two dies on the previous page). In our case we use the provided DRAM controller and PCIe interfaces to allow the host Linux CPU access to a small extra UART, a 'fake' disk controller, and a reset controller.

On the host CPU a small user-space program talks to registers in memory-mapped PCIe space to provide a console (through the UART) and access to a 'disk' image in a file.

Where are we up to?

  • Still very much a work in progress
  • Most of the design is in place
  • PLIC/CLIC/CLNT
  • uart/faux disk/timers
  • Coherent caching fabric
  • Boots Linux on AWS FPGA instance
  • Coded for multithreading (very much untested) and multiple CPUs (also untested) – like the cache/BTC/queue sizes these are simple build options

Eventual Goal

Much of the design is parameterized so we can change things easily; here's a back-of-the-envelope sketch of our goal:

  • 1-5GHz (will likely involve adding ~2 pipe stages to the above description)
  • 2 HARTs/CPU (ie multithreaded)
  • CommitQ 64 entries - per HART
  • StoreQ 64 entries - shared
  • I$0 64+ entries of 8 instruction bundles - per HART
  • I$1 64kbytes - shared
  • D$1 64kbytes - shared
  • Combined L2 2-4Mb - shared
  • This means ~192 max instructions in flight
  • 3 integer ALUs - shared
  • 1 shifter - shared
  • 4 load 4 store (per clock) load store unit - shared
  • 1 or 2 multipliers - shared
  • 1 or 2 FPUs - shared
  • 1 vector unit - shared
  • PLIC/CLIC/CLNT - shared
  • Bit/crypto extensions (mostly in shifter) - shared

Meltdown/Spectre etc

We're not perfect yet (still a work in progress), but we do have the following mitigations:

  • Separate BTCs between M/S/U operating modes
  • BTC flushed on VM switch
  • No speculative fetches to L1/2 caches until they pass VM access
  • Fully associative TLB L1 with random replacement
  • Wide D$1/I$1 associativity (currently 32-way – this also allows for large L1 caches with parallel TLB lookup); combined with random way replacement this muddies any signal an attacker is receiving
  • Optional D$1 random replacement

Research

One of the shorter-term goals has been to get a system working well enough that we can run benchmarks, enabling us to optimize things like:

  • Cache sizes
  • BTC size and architecture
  • commitQ size
  • storeQ size
  • Test a multithreaded system

We're at a point now, though, where the size of the AWS FPGA instances may limit what we can test at large scale.

Next steps

Planned work:

  • I$0 trace cache
  • Expand LS unit to 3/2 from 2/1 load/store
  • Rewrite cache coherency fabric with L2
  • Spend some time on timing. We've purposely avoided spending too much time on low-level timing – the current FPGA is big and slow, and nets that cross between dies mess with any hope of representative timing – but it's worth hunting down particularly bad paths; we expect to re-pipeline the final design by a couple of extra pipe stages to get to the GHz range, so some early warning would be useful
  • B – bit manipulation – tested
  • K – cryptography, both NIST and ShangMi extensions – tested
  • H - Virtualization – about 50% done
  • V - Vector Unit (waiting for FP)
  • Debug
  • Crypto

I$0 Trace cache

This is probably the most interesting enhancement we can make to the current system: it raises the issue rate in inner loops to a fixed 8 bundles/clock no matter what size the original instructions were

  • Virtually tagged
  • Contains instruction bundles recorded from the commit stage of the commitQ
  • This is a great place to do instruction combining (timing wise)
  • Bundles issue directly to the renamer saving a few clocks in the pipeline

Licensing

Once it’s usable by others:

  • GPL 3
  • Dual licensing available – looking for partners to actually build one