16 Jan 2022
Booting Linux isn’t going to work without some form of virtual memory, RISC-V has
a well defined spec for VM, implementation isn’t hard - page tables are well defined,
there’s nothing particularly unusual or surprising there
We have separate Instruction and Data level one TLBs, they’re fully associative which means that
we’re not practically limited to power of two sizes (though currently they have 32 entries each),
each entry contains a mapping between an ASID and upper bits of a virtual address and a physical
address - we support the various sizes of pages (both in tags and data).
A fully associative TLB with random replacement makes it harder for Meltdown/Spectre sorts of
attacks to use TLB replacement to attack systems. On VROOM memory accesses that miss in the TLB never result in
data cache changes.
Since the ALUs, and in particular the memory and fetch units, are shared between HARTs (CPUs) in the same
simultaneous multi-threaded core then so are the TLBs shared. We need a way to distinguish mappings between HARTs - this implementation
is simple, we reserve a portion of the ASID which is forced to a unique value for each core - each HART thinks it has N bits of ASID, in reality there are N+1. There’s also
a system flag that we can optionally set that lets SMT HARTs share the full sized ASID (provided the software understands
that ASIDs are system rather than CPU global).
Currently we use the old trick of doing L1 TLB lookup with the upper bits of a virtual address while using
the lower bits in parallel to do the indexing of the L1 data/instruction caches - large modern cache line sizes
mean that you have to go many ways/sets to get large data and instruction caches - this also helps
with Meltdown/Spectre/etc mitigation.
I’m currently redesigning the whole memory datapath unit to split TLB lookup and data cache access
into separate cycles - mostly to expose more parallelism during scheduling - more about that in
a later posting once it’s all working.
TLB misses result in stalled instructions in the commitQ - there’s a small queue of pending TLB
lookups in the memory unit and 2 pending lookups in the fetch unit - they feed the table walker
state machine which starts by dipping into the L2 TLB cache - currently it’s a 4-way 256 entry (so
1k entries total) set associative cache shared between the instruction and data TLBs - TLB data found here is fed directly to the L1 TLBs (a satisfied L1 miss takes 4
If a request also misses in the L2 TLB cache the table walker state machine starts walking page table trees.
cache lines of TLB data are fetched into a local read-only cache which contains a small number of entries,
essentially enough for 1 line for each level in the page hierarchy, and 2 for the lowest level, repeated
for instruction TLB and the data TLB (then repeated again for multiple HARTs).
After initial filling most table walks hit in this cache. This cache is slightly integrated into the data L1 I-cache, they share a
read-only access port into the cache coherency fabric, and both can be invalidated by data cache snoops shootdowns.
TLB invalidation is triggered by executing a TLB flush instruction - these instructions let the instructions
before them in the commitQ execute before they themselves are executed.
At this point they trigger a
commitQ flush (tossing any speculative instructions executed with the old VM mappings). At the same time
they trigger L1 and L2 TLB flushes. Note: there is no need to invalidate the TLB walker’s small data cache
as it will have been invalidated (shot down) by the cache coherency protocols if any page tables were
changed in the process.
Next time: (Once I get it working) Data memory accesses
19 Dec 2021
I started doing VLSI design in the early ’90s building graphics
accelerators at 2um and later in the decade building CPUs at 1.8u-0.5u - gates
and pins were expensive - we once bet the company on the viability of a 208
pin plastic package, something that paid off magnificently.
I started this project with the vague idea of “what happens if I
throw a lot of gates at it?” - my original planned development platform was a
board off of Aliexpress a lot like this one an Xylinx Kinetix 420 based board with embedded
DRAM - my plan was to wire an SD card socket to it and used the internal USB to serial
hardware to boot linux.
When I first compiled up VRoom! for it I had a slight problem …. the design was 50% too big! oops ….
So I went around looking for bigger targets ….
AWS’s F Instances
AWS provides an FPGA based service based on Xilinx’s Ultrascale VU9Ps - these are much larger
FPGAs, just what the doctor ordered. The F instances seem to be aimed at high
frequency traders - we’re small fry in comparison. They are based on a PCIE board in an
Intel based CPU - we assume they actually put 4 boards into each CPU and sell the
minimum as a virtual instance with 8 virtual cores and a physical FPGA board. This
is the F1 instance that we use for development.
The AWS F instances each include a VU9P, 4 sets of DDR4 (we use one), and several PCIEs
to the host CPU (we use one simple one).
The VU9P is an interesting beast it seems to actually be a sandwich of 3 dies, with
~1500 wires between the middle die and the upper die and ~1500 wires between the middle die and the lower die. These wires make the thing possible but they’re also a problem,
firstly they are potentially a scarce routing resource (not for us yet), and secondly
they are slow - for speed Xilinx recommend that one carefully pipe all the die crossings (ie a flop on either side). We’ve
decided to not do that, as it would mean messing with a design where we’re actually
trying to carefully debug the actual pipelining for a real CPU. Rather than have this
reality intrude on our design we’ve had to reduce our target clock from 100MHz to 25MHz
currently most critical paths have several die crossings.
The above image shows a recent layout plot - you can see the three dies. The upper 25%
on the center and right hand side dies is AWS’s “shell” a built in interface to the
host CPU and one DDR4 controller. There is now a smaller shell available which we
may switch to soon that removes a whole lot of functionality that
we don’t need, and gives us ~10% more gates (but sadly in the wrong places).
Development is done on an AWS FPGA development instance, you don’t need to pay
for a Vivado license if you do your development on AWS. The actual build environment,
documentation (and the shell) is available on github.
Build time is very variable, our big problem is congestion (timing issues come
from that) and builds can take any time from 10-20 hours and don’t always succeed.
We’ve had to drop the sizes of out I/D caches by 50% and our BTC by 8x to make this work.
AWS don’t want us to trash their hardware so after you’ve completed building a new
design you have to submit it for testing - they do a bunch of things presumably looking for over current and over temp issues (we haven’t had one rejected yet). This
takes about an hour.
F1 instances cost ~US$1.50/hour to use, the build systems about US$1/hour.
AWS provide their “shell”, a fixed, pre-laid out interface - you can see it in
this block diagram as “AWS support”:
The shell provides AXI interfaces looking inwards (our DRAM controller is a
master, our disk/UART are clients).
We’ve added a simple UART, dumb fake disk controller (really just a 256-byte
FIFO) to the shell, and a register that allows us to reset the VROOM!. These
simple devices are made visible on the PCIE as registers and mapped into a
user space linux app running on the host Intel CPU.
The VROOM! is instanced with a minimal number of connections (the above devices,
DRAM and clocks). It is essentially the same top level we use in verilog simulation
(from chip.sv down, one CPU and one set of IO devices).
The Linux user space software is a thread that maps the PCIE register
space and then pulls the CPU out of reset. It then loops reading from
the UART and ‘disk’ registers, if it finds a disk request it provides
data from one of two sources (an OpenSBI/Uboot image or a linux file
system image), if it finds incoming uart data it wakes a text display thread
to print it on the console. A third thread reads data from the console and
puts it into the uart to send when space is available.
We haven’t included the AWS interface code in the current VROOM! source
release, partly because we don’t want to confuse it with the real thing
we’re trying to build. But also because it’s mostly embedded in code
that is AWS’s IP (the shell API is one big verilog module that one needs
to embed one’s own IP) - there are no secrets on our part, we’re happy
to share if anyone is interested.
Next time: (probably after Xmas) Virtual Memory
05 Dec 2021
A short post this week about physical memory layout and a little bit about booting. I’ll talk more
about virtual memory another time
Physical memory layout
We currently use a 56-bit physical address, this is the address used with the MMU disabled or after a
virtual address has been translated.
Addresses with bit 55 (the most significant bit) set to 0 are treated as cacheable memory space - instruction
and data accesses go through separate L1 caches but the cache coherency protocol makes sure that they see the
same data. Cache lines are 512 bits.
Everything in this space is cache coherent, once the system is booted it’s the only place that code can
be executed from. Because it’s always coherent, the fence.i instruction is a simple thing for us, all it has to do is to
wait for the writes to drain from the store queue and then flush the commitQ (fences go in the commitQ so all
that happens when it reaches the end of the queue), subsequent instruction fetches
pull live modified data from the data cache into the instruction cache.
At the moment we have either a fake memory emulator on the simulator, or connect to DRAM on the AWS FPGA
implementation - both implementations back onto the coherent MOESI bus
Within this space main memory (DRAM) starts at address 0x00000000. What happens if you access memory
outside of the area where there’s real DRAM is undefined, with the proviso that it wont lock up
(writes probably get lost or wrap around, reads may return undefined data - it’s up to the real
memory controller to choose how it behaves - on a big system the memory controller may be on another die).
Addresses with bit 55 set are handled differently for data accesses and instruction accesses:
- instruction accesses got to a boot ROM at a known fixed address, the MMU never generates these addresses
- data accesses go to a shared 64-bit IO bus (can be accessed through the MMU)
The current data IO bus contains:
- a ROM containing a device tree image
- a uart
- multiple interrupt controllers PLIC/CLNT/CLIC
- a fake disk controller
Currently reset causes the CPU to jump to the start of code compiled into hardware in the special instruction space.
Currently that code:
- checks a GPIO pin (reads from IO space)
- if it’s set it jumps to address 0 (this is for the simulator where main memory can be side loaded)
- if it’s clear we read a boot image from the fake disk controller (connected to test fixture code on
the simulator, and to linux user space on the AWS FPGA world) into main memory, then jump to it
(currently we load uboot/OpenSBI)
Longer term we plan to put the two L1 caches into a mode at reset where the memory controller is disabled
and the data cache allocates lines on write, the instruction cache will use the existing cache coherency protocol
to pull shared cache lines from the data cache. The on-chip bootstrap will copy data into L1 from an SD/eMMC, validate it
(a crypto signature check), and jump into it. This code will initialize the DRAM controller, run it through
its startup conditioning, initialize the rest of the devices, take the cache controller out of
its reset mode and then finally load OpenSBi/UEFI/uboot/etc into DRAM.
Next time: Building on AWS
28 Nov 2021
This week we’re going to try and explain as simply as possible how our core architecture works. Here’s an
overview of the system:
The core structure is a first-in-first out queue called the ‘commitQ’. Instruction-bundles
are inserted in-order at one end, and removed in-order at the other
end once they have been committed.
While in the commitQ instructions can be executed in any order (with some limits like memory
aliasing for load/store instructions). At any time, if a branch instruction is discovered to have
been wrongly predicted, the instructions following it in the queue will be discarded.
A note on terminology - previously I’ve referred to ‘decode-bundle’s which are a bunch of bytes fetched from the i-cache
and fed to the decoders. Here we talk about ‘instruction-bundle’s which are a bunch of data being passed around
through the CPU representing an instruction and what it can do - data in an instruction-bundle includes its PC,
its input/output registers, immediate constants, which ALU type it needs to be fed to, what operation will be
performed on it there, etc.
You can think of the decoders as taking a decode-bundle and producing 1-8 instruction-bundles
per clock to feed to the renamer.
Registers and Register Renaming
We have a split register file:
On the right we have the 31 architectural integer register and the 32 floating point registers. If the commitQ is empty
(after a trap for example) then they will contain the exact state of the machine.
The commit registers are each staticly bound to one commitQ entry.
When an instruction-bundle leaves the decode stage it contains the architectural register numbers of its source
and destination registers, the renamer pipe stage assigns a commitQ entry (and therefore a commit register) to
each instruction and tags its output register to be that commit register.
The renamer also keeps a table of the latest
commit register that each architectural register is bound to (if any) - it uses these tables to point each source
register to either a particular commit register or an architectural register.
Once an instruction-bundle is in the commitQ it can’t be assigned an ALU until each of its source registers
is either an architectural register or it’s a commit register that has reached the ‘completed’ state (ie it’s
either in a commit register, or is being written to one and can be bypassed). Once a commitQ entry is in this state
it goes into contention to be assigned an ALU and will execute ASAP.
Once a commitQ entry has been executed its result goes into its associated commit register, and the commitQ
entry waits until it can be committed
Each commitQ entry is continually watching the state of it’s source registers, if one of them moves from the commitQ
to the architectural registers it updates the source register it will use when executing.
In every clock the CPU looks at the first 8 entries in the commitQ, it takes the first N contiguous entries
that are completed, marks them ‘committed’ and removes them from the queue.
Committing an entry causes
its associated commit register to be written back into the architectural registers (if multiple commit entries
write the same architectural register only the last one succeeds). It also releases any store instructions
in the load-store unit’s write buffer to be actually written to caches or IO space.
As mentioned above when a branch instruction hits a branch ALU and is discovered to have been mispredicted - either a
conditional branch where the condition was mispredicted, or an indirect branch where the destination was mispredicted - then
the instructions in the commitQ after the mispredicted instruction are discarded and the instruction fetcher
starts filling again from that new address.
While this is happening the renamer must update its state, to reflect the new truncated commitQ.
Branch instructions effectively act as barriers in the commitQ, until they are resolved into the completed
state (ie their speculative state has been resolved) instructions after them cannot be committed - which means that
their commit registers can’t be written into the architectural registers, and data waiting to be stored into cache or main memory
can’t be written.
However instructions after an unresolved branch can still be executed (out of order) using the results of
other speculative operations from their commit registers. Speculative store instructions can
write data into the load-store unit’s write
queue, and speculative loads can also snoop that data from the write queue and from the cache - I’ll
talk more about this in another blog post.
Also branches can be executed out of order resulting in smaller/later portions of the commitQ being discarded.
Traps and Interrupts
Finally - the CSR ALU (containing the CSR registers and the trap and interrupt logic) is special - only
an instruction at the very beginning of the commitQ can enter the CSR - that effectively means only one CSR/trap instruction
can be committed in that clock (actually the next clock) this acts as a synchonising mechanism.
Load/store traps (MMU/PMAP/etc) are generated by converting a load/store tagged instruction in the commitQ into a trap
Fetch traps (again MMU/PMAP but also invalid instructions) are injected into the commitQ by the instruction fetcher
and decoder which then stalls.
Interrupts are treated much the same as fetch traps, they’re injected into the instruction stream which then stops
When any of these hit the CSR unit all instructions prior to them have been executed and committed. They then trigger
a full commitQ flush, throwing away all subsequent instructions in the queue.
They then tell the fetcher to start fetching from a new address.
Some other CSR ALU operations also generate commitQ flushes: some of the MMU flush operations, loading page tables, fence.i etc. We also optimise interrupts that occur when you write a register to unmask interrupts and make them synchronous (so that we don’t end up executing a few instructions after changing the state).
The important ideas here:
- we have two register files one for speculative results, one for the architectural registers
- our commitQ is an in-order list of instructions to execute
- each commitQ entry has an associated output register for its speculative result
- commitQ instructions are executed in any order
- however instructions are committed/retired in chunks, in order
- at that point their results are copied from the speculative commit registers to the architectural registers
- badly speculated branches cause the parts of the commit Q after them to be discarded
Next time: Memory Layout
20 Nov 2021
This is the third of an occasional series of articles on the VRoom!/RVoom RISC-V
CPU. This week a shorter update, we’re going to talk about how we can create speculative entries
in the Branch Target Cache (BTC) call-return stack. A quick reminder of some of what we learned in the previous blog.
- we decode large bundles of many instructions every clock
- we predict bundles not instructions
- we maintain a queue of pending uncommitted BTC predictions
- each entry overrides the ones after it and the main tables
Last time we talked about how we manage speculated branch target cache (BTC) entries, in short
we have a queue of speculated BTC updates in front of the normal BTC tables, those entries are
discarded or updated when a speculated branch is found to have been mispredicted, and retired into the
main tables when a corresponding branch (or rather bundle of instructions) is committed.
Subroutine call stack
In addition to the standard BTC tables we also have a per-access mode call-return stack,
RISC-V defines for us the instructions that should be used to infer these (and which not to),
in the BTC we provide a 32-entry stack (smaller for M-mode) of the last 32 subroutine calls
so we can predict the return address when decoding ret instructions (if we get it wrong it’s
not the end of the world, it will be corrected when the ret instruction hits the branch ALU)
Initially we thought we’d just tag branches with a pointer to the top-of stack, and wind it back when we got
a misprediction, then we found out a case which turned out to have a particularly heavy impact on performance:
r1: call b
a: bnez a0, 1f < branch mispredicted
essentially what happens here is that we speculatively fetch the following instruction stream:
Instruction call stack
call a r1 -
bnez a0, 1f r1 - < remembers TOS
ret - < everything from here on is
call b r2 - mispredicted
Sometime later the bnez is resolved in the branch ALU and discovered to have been mispredicted. But by that point
the return address ‘r1’ on the call stack has been popped and replaced with ‘r2’, even though we got the top of stack (TOS)
correct, its contents are wrong, and can’t be undone when we discover the misprediction.
This means that at some later time when we finally do return from subroutine ‘a’
the prediction will be to ‘r2’ rather than ‘r1’ - this in turn will result in another misprediction when that return
instruction hits the branch ALU, which is exactly what were were trying to avoid.
Our solution is to take a similar approach to the one we took for the BTC tables and put a queue of prediction
history in front of the call/return stack, in this case it’s a queue of push and pop
actions. Instead of being indexed by hashes of the PC/global history, it’s simply the TOS address that’s used instead.
Top of stack always returns the latest push that matches the current TOS (and the top of the actual backing stack otherwise).
Like BTC speculative queue mentioned in the previous blog entry this queue is managed by the same bundle tag we
attach to instructions in the commitQ - when a mispredicted branch instruction is discovered then the
return stack speculative queue is truncated removing entries that correspond to the discarded instruction bundles
and the TOS updated to the oldest retained entry.
Similarly when the last instruction in an instruction bundle is committed then related call-return stack
speculative queue entries are
tagged ‘committed’, and then written back in order into the call-return stack.
The important ideas here:
- we have call-return stacks to predict the address of return instructions
- we maintain a queue of speculative push/pop transitions
- each entry overrides the ones after it and the main tables
- the queue is flushed on a misprediction
- committed data is pushed into the main tables
Next time: How our main pipe works - an overview