19 Dec 2021
Introduction
I started doing VLSI design in the early ’90s building graphics
accelerators at 2um and later in the decade building CPUs at 1.8u-0.5u - gates
and pins were expensive - we once bet the company on the viability of a 208
pin plastic package, something that paid off magnificently.
I started this project with the vague idea of “what happens if I
throw a lot of gates at it?” - my original planned development platform was a
board off of Aliexpress a lot like this one an Xylinx Kinetix 420 based board with embedded
DRAM - my plan was to wire an SD card socket to it and used the internal USB to serial
hardware to boot linux.
When I first compiled up VRoom! for it I had a slight problem …. the design was 50% too big! oops ….
So I went around looking for bigger targets ….
AWS’s FPGA Instances
AWS provides an FPGA based service based on Xilinx’s Ultrascale VU9Ps - these are much larger
FPGAs, just what the doctor ordered. The F instances seem to be aimed at high
frequency traders - we’re small fry in comparison. They are based on a PCIE board in an
Intel based CPU - we assume they actually put 4 boards into each CPU and sell the
minimum as a virtual instance with 8 virtual cores and a physical FPGA board. This
is the F1 instance that we use for development.
The AWS F instances each include a VU9P, 4 sets of DDR4 (we use one), and several PCIEs
to the host CPU (we use one simple one).
The VU9P is an interesting beast it seems to actually be a sandwich of 3 dies, with
~1500 wires between the middle die and the upper die and ~1500 wires between the middle die and the lower die. These wires make the thing possible but they’re also a problem,
firstly they are potentially a scarce routing resource (not for us yet), and secondly
they are slow - for speed Xilinx recommend that one carefully pipe all the die crossings (ie a flop on either side). We’ve
decided to not do that, as it would mean messing with a design where we’re actually
trying to carefully debug the actual pipelining for a real CPU. Rather than have this
reality intrude on our design we’ve had to reduce our target clock from 100MHz to 25MHz
currently most critical paths have several die crossings.

The above image shows a recent layout plot - you can see the three dies. The upper 25%
on the center and right hand side dies is AWS’s “shell” a built in interface to the
host CPU and one DDR4 controller. There is now a smaller shell available which we
may switch to soon that removes a whole lot of functionality that
we don’t need, and gives us ~10% more gates (but sadly in the wrong places).
Development is done on an AWS FPGA development instance, you don’t need to pay
for a Vivado license if you do your development on AWS. The actual build environment,
documentation (and the shell) is available on github.
Build time is very variable, our big problem is congestion (timing issues come
from that) and builds can take any time from 10-20 hours and don’t always succeed.
We’ve had to drop the sizes of out I/D caches by 50% and our BTC by 8x to make this work.
AWS don’t want us to trash their hardware so after you’ve completed building a new
design you have to submit it for testing - they do a bunch of things presumably looking for over current and over temp issues (we haven’t had one rejected yet). This
takes about an hour.
F1 instances cost ~US$1.50/hour to use, the build systems about US$1/hour.
Chip Architecture
AWS provide their “shell”, a fixed, pre-laid out interface - you can see it in
this block diagram as “AWS support”:

The shell provides AXI interfaces looking inwards (our DRAM controller is a
master, our disk/UART are clients).
We’ve added a simple UART, dumb fake disk controller (really just a 256-byte
FIFO) to the shell, and a register that allows us to reset the VROOM!. These
simple devices are made visible on the PCIE as registers and mapped into a
user space linux app running on the host Intel CPU.
The VROOM! is instanced with a minimal number of connections (the above devices,
DRAM and clocks). It is essentially the same top level we use in verilog simulation
(from chip.sv down, one CPU and one set of IO devices).
Software
The Linux user space software is a thread that maps the PCIE register
space and then pulls the CPU out of reset. It then loops reading from
the UART and ‘disk’ registers, if it finds a disk request it provides
data from one of two sources (an OpenSBI/Uboot image or a linux file
system image), if it finds incoming uart data it wakes a text display thread
to print it on the console. A third thread reads data from the console and
puts it into the uart to send when space is available.
Software release
We haven’t included the AWS interface code in the current VROOM! source
release, partly because we don’t want to confuse it with the real thing
we’re trying to build. But also because it’s mostly embedded in code
that is AWS’s IP (the shell API is one big verilog module that one needs
to embed one’s own IP) - there are no secrets on our part, we’re happy
to share if anyone is interested.
Next time: (probably after Xmas) Virtual Memory
05 Dec 2021
Introduction
A short post this week about physical memory layout and a little bit about booting. I’ll talk more
about virtual memory another time
Physical memory layout
We currently use a 56-bit physical address, this is the address used with the MMU disabled or after a
virtual address has been translated.
Addresses with bit 55 (the most significant bit) set to 0 are treated as cacheable memory space - instruction
and data accesses go through separate L1 caches but the cache coherency protocol makes sure that they see the
same data. Cache lines are 512 bits.
Everything in this space is cache coherent, once the system is booted it’s the only place that code can
be executed from. Because it’s always coherent, the fence.i instruction is a simple thing for us, all it has to do is to
wait for the writes to drain from the store queue and then flush the commitQ (fences go in the commitQ so all
that happens when it reaches the end of the queue), subsequent instruction fetches
pull live modified data from the data cache into the instruction cache.
At the moment we have either a fake memory emulator on the simulator, or connect to DRAM on the AWS FPGA
implementation - both implementations back onto the coherent MOESI bus
Within this space main memory (DRAM) starts at address 0x00000000. What happens if you access memory
outside of the area where there’s real DRAM is undefined, with the proviso that it wont lock up
(writes probably get lost or wrap around, reads may return undefined data - it’s up to the real
memory controller to choose how it behaves - on a big system the memory controller may be on another die).
Addresses with bit 55 set are handled differently for data accesses and instruction accesses:
- instruction accesses got to a boot ROM at a known fixed address, the MMU never generates these addresses
- data accesses go to a shared 64-bit IO bus (can be accessed through the MMU)
The current data IO bus contains:
- a ROM containing a device tree image
- timers
- a uart
- gpio
- multiple interrupt controllers PLIC/CLNT/CLIC
- a fake disk controller
Booting
Currently reset causes the CPU to jump to the start of code compiled into hardware in the special instruction space.
Currently that code:
- checks a GPIO pin (reads from IO space)
- if it’s set it jumps to address 0 (this is for the simulator where main memory can be side loaded)
- if it’s clear we read a boot image from the fake disk controller (connected to test fixture code on
the simulator, and to linux user space on the AWS FPGA world) into main memory, then jump to it
(currently we load uboot/OpenSBI)
Longer term we plan to put the two L1 caches into a mode at reset where the memory controller is disabled
and the data cache allocates lines on write, the instruction cache will use the existing cache coherency protocol
to pull shared cache lines from the data cache. The on-chip bootstrap will copy data into L1 from an SD/eMMC, validate it
(a crypto signature check), and jump into it. This code will initialize the DRAM controller, run it through
its startup conditioning, initialize the rest of the devices, take the cache controller out of
its reset mode and then finally load OpenSBi/UEFI/uboot/etc into DRAM.
Next time: Building on AWS
28 Nov 2021
Introduction
This week we’re going to try and explain as simply as possible how our core architecture works. Here’s an
overview of the system:

The core structure is a first-in-first out queue called the ‘commitQ’. Instruction-bundles
are inserted in-order at one end, and removed in-order at the other
end once they have been committed.
While in the commitQ instructions can be executed in any order (with some limits like memory
aliasing for load/store instructions). At any time, if a branch instruction is discovered to have
been wrongly predicted, the instructions following it in the queue will be discarded.
A note on terminology - previously I’ve referred to ‘decode-bundle’s which are a bunch of bytes fetched from the i-cache
and fed to the decoders. Here we talk about ‘instruction-bundle’s which are a bunch of data being passed around
through the CPU representing an instruction and what it can do - data in an instruction-bundle includes its PC,
its input/output registers, immediate constants, which ALU type it needs to be fed to, what operation will be
performed on it there, etc.
You can think of the decoders as taking a decode-bundle and producing 1-8 instruction-bundles
per clock to feed to the renamer.
Registers and Register Renaming
We have a split register file:

On the right we have the 31 architectural integer register and the 32 floating point registers. If the commitQ is empty
(after a trap for example) then they will contain the exact state of the machine.
The commit registers are each staticly bound to one commitQ entry.
When an instruction-bundle leaves the decode stage it contains the architectural register numbers of its source
and destination registers, the renamer pipe stage assigns a commitQ entry (and therefore a commit register) to
each instruction and tags its output register to be that commit register.
The renamer also keeps a table of the latest
commit register that each architectural register is bound to (if any) - it uses these tables to point each source
register to either a particular commit register or an architectural register.
Execution
Once an instruction-bundle is in the commitQ it can’t be assigned an ALU until each of its source registers
is either an architectural register or it’s a commit register that has reached the ‘completed’ state (ie it’s
either in a commit register, or is being written to one and can be bypassed). Once a commitQ entry is in this state
it goes into contention to be assigned an ALU and will execute ASAP.
Once a commitQ entry has been executed its result goes into its associated commit register, and the commitQ
entry waits until it can be committed
Each commitQ entry is continually watching the state of it’s source registers, if one of them moves from the commitQ
to the architectural registers it updates the source register it will use when executing.
Completion
In every clock the CPU looks at the first 8 entries in the commitQ, it takes the first N contiguous entries
that are completed, marks them ‘committed’ and removes them from the queue.
Committing an entry causes
its associated commit register to be written back into the architectural registers (if multiple commit entries
write the same architectural register only the last one succeeds). It also releases any store instructions
in the load-store unit’s write buffer to be actually written to caches or IO space.
Speculative Misses
As mentioned above when a branch instruction hits a branch ALU and is discovered to have been mispredicted - either a
conditional branch where the condition was mispredicted, or an indirect branch where the destination was mispredicted - then
the instructions in the commitQ after the mispredicted instruction are discarded and the instruction fetcher
starts filling again from that new address.
While this is happening the renamer must update its state, to reflect the new truncated commitQ.
Branch instructions effectively act as barriers in the commitQ, until they are resolved into the completed
state (ie their speculative state has been resolved) instructions after them cannot be committed - which means that
their commit registers can’t be written into the architectural registers, and data waiting to be stored into cache or main memory
can’t be written.
However instructions after an unresolved branch can still be executed (out of order) using the results of
other speculative operations from their commit registers. Speculative store instructions can
write data into the load-store unit’s write
queue, and speculative loads can also snoop that data from the write queue and from the cache - I’ll
talk more about this in another blog post.
Also branches can be executed out of order resulting in smaller/later portions of the commitQ being discarded.
Traps and Interrupts
Finally - the CSR ALU (containing the CSR registers and the trap and interrupt logic) is special - only
an instruction at the very beginning of the commitQ can enter the CSR - that effectively means only one CSR/trap instruction
can be committed in that clock (actually the next clock) this acts as a synchonising mechanism.
Load/store traps (MMU/PMAP/etc) are generated by converting a load/store tagged instruction in the commitQ into a trap
instruction.
Fetch traps (again MMU/PMAP but also invalid instructions) are injected into the commitQ by the instruction fetcher
and decoder which then stalls.
Interrupts are treated much the same as fetch traps, they’re injected into the instruction stream which then stops
fetching.
When any of these hit the CSR unit all instructions prior to them have been executed and committed. They then trigger
a full commitQ flush, throwing away all subsequent instructions in the queue.
They then tell the fetcher to start fetching from a new address.
Some other CSR ALU operations also generate commitQ flushes: some of the MMU flush operations, loading page tables, fence.i etc. We also optimise interrupts that occur when you write a register to unmask interrupts and make them synchronous (so that we don’t end up executing a few instructions after changing the state).
To Recap
The important ideas here:
- we have two register files one for speculative results, one for the architectural registers
- our commitQ is an in-order list of instructions to execute
- each commitQ entry has an associated output register for its speculative result
- commitQ instructions are executed in any order
- however instructions are committed/retired in chunks, in order
- at that point their results are copied from the speculative commit registers to the architectural registers
- badly speculated branches cause the parts of the commit Q after them to be discarded
Next time: Memory Layout
20 Nov 2021
This is the third of an occasional series of articles on the VRoom!/RVoom RISC-V
CPU. This week a shorter update, we’re going to talk about how we can create speculative entries
in the Branch Target Cache (BTC) call-return stack. A quick reminder of some of what we learned in the previous blog.
- we decode large bundles of many instructions every clock
- we predict bundles not instructions
- we maintain a queue of pending uncommitted BTC predictions
- each entry overrides the ones after it and the main tables
Introduction
Last time we talked about how we manage speculated branch target cache (BTC) entries, in short
we have a queue of speculated BTC updates in front of the normal BTC tables, those entries are
discarded or updated when a speculated branch is found to have been mispredicted, and retired into the
main tables when a corresponding branch (or rather bundle of instructions) is committed.

Subroutine call stack
In addition to the standard BTC tables we also have a per-access mode call-return stack,
RISC-V defines for us the instructions that should be used to infer these (and which not to),
in the BTC we provide a 32-entry stack (smaller for M-mode) of the last 32 subroutine calls
so we can predict the return address when decoding ret instructions (if we get it wrong it’s
not the end of the world, it will be corrected when the ret instruction hits the branch ALU)

Initially we thought we’d just tag branches with a pointer to the top-of stack, and wind it back when we got
a misprediction, then we found out a case which turned out to have a particularly heavy impact on performance:
.....
call a
r1: call b
r2 ....
a: bnez a0, 1f < branch mispredicted
ret
1: ....
b: ....
essentially what happens here is that we speculatively fetch the following instruction stream:
Instruction call stack
=========== ==========
call a r1 -
bnez a0, 1f r1 - < remembers TOS
ret - < everything from here on is
call b r2 - mispredicted
....
Sometime later the bnez is resolved in the branch ALU and discovered to have been mispredicted. But by that point
the return address ‘r1’ on the call stack has been popped and replaced with ‘r2’, even though we got the top of stack (TOS)
correct, its contents are wrong, and can’t be undone when we discover the misprediction.
This means that at some later time when we finally do return from subroutine ‘a’
the prediction will be to ‘r2’ rather than ‘r1’ - this in turn will result in another misprediction when that return
instruction hits the branch ALU, which is exactly what were were trying to avoid.
Solution
Our solution is to take a similar approach to the one we took for the BTC tables and put a queue of prediction
history in front of the call/return stack, in this case it’s a queue of push and pop
actions. Instead of being indexed by hashes of the PC/global history, it’s simply the TOS address that’s used instead.
Top of stack always returns the latest push that matches the current TOS (and the top of the actual backing stack otherwise).

Like BTC speculative queue mentioned in the previous blog entry this queue is managed by the same bundle tag we
attach to instructions in the commitQ - when a mispredicted branch instruction is discovered then the
return stack speculative queue is truncated removing entries that correspond to the discarded instruction bundles
and the TOS updated to the oldest retained entry.
Similarly when the last instruction in an instruction bundle is committed then related call-return stack
speculative queue entries are
tagged ‘committed’, and then written back in order into the call-return stack.
To Recap
The important ideas here:
- we have call-return stacks to predict the address of return instructions
- we maintain a queue of speculative push/pop transitions
- each entry overrides the ones after it and the main tables
- the queue is flushed on a misprediction
- committed data is pushed into the main tables
Next time: How our main pipe works - an overview
14 Nov 2021
This is the second of an occasional series of articles on the VRoom!/RVoom RISC-V
CPU. This week we’re going to talk about how we can create speculative entries
in the Branch Target Cache (BTC). A quick reminder of some of what we learned in the previous blog.
- we decode large bundles of many instructions every clock
- we predict bundles not instructions
System Architecture
Let’s have a look at an overview of our system architecture:

Instructions are fetched from the instruction cache (I$1), decoded, renamed to commit registers and entered
into the commitQ. Once in the commitQ instructions can be executed in any order and even after they are
executed they can be discarded by a trap or a preceding mispredicted branch.
At the end of the commitQ are committed instructions, when they and all instructions before them are done, all
branches are correctly predicted, there are no pending traps, then the CPU will retire up to the last 8
committed instructions per clock.
BTC Operations
Our BTC contains tables that are accessed using hashes constructed from the program counter (PC) and a
global history. On a traditional system the BTC tables and global history would be updated on every
clock, In VROOM! we can have many instructions outstanding, often they are speculative instructions
that may never be committed - when a branch instruction is processed by the branch ALU and is discovered to have been
speculated incorrectly it can trigger
the discard of subsequent speculative instructions (including other branch instructions), branch instructions can
also be processed out of order.
Processing branch instructions is important, we want to use branch mis-predictions to update the BTC tables - if
we assumed that a branch would be taken and it wasn’t we need to back out and change the value that we had
previously set them too.
A simple solution (and the one that we originally tried) was to wait until the commit stage of the CPU
and then only use instructions that have been committed (and branches that have been resolved) to update the BTC tables.
It works, but it’s very slow to update (remember that a big version of our CPU can have ~150 instructions in flight at
once). Consider something like this:
clear:
1: std x0, (a0)
add a0, a0, 8
add a1, a1, -1
bnez a1, 1b
ret
The core of this is a single decode-bundle that always loops and contains 4 instruction in its core.
The CPU may push 150/4 = 30+ bundles, 30 times around the loop, before we can start updating the
BTC - this is particularly a problem with global history predictors which to be useful really to be updated
on every prediction.
A better solution
Let’s look at how our BTC works:

The PC and the global history (a bit vector) are hashed (just some xors and bit swapping) into indexes into the 3
BTC tables, they’re used to look up the 3 tables. The combined output is used to choose whether the bimodal or
global history predictor is best for a particular bundle.
Our solution to the above problems is to create a ‘pending prediction’ queue in front of the normal BTC tables. Each
entry contains the information about a bundle prediction. Including the PC used for it and the global history at the
point that it was created.
Newer predictions are performed by looking into the queue from
most recent to oldest for each of the 3 hashes (individually and independently) comparing them with the hashes of the
PC and global history generated in each entry, if no match is found then the value from the main tables is used.

The per-bundle BTC pending prediction queue acts similarly to the main per-instruction commitQ - in this case
each instruction in the commitQ carries a reference to the decode-bundle it came from.
Actions
If a branch instruction in the commitQ is discovered to be mispredicted
all the instructions following it are discarded. At the same time entries in the BTC
prediction queue that are newer than the bundle corresponding to the mispredicted instruction are also discarded, and the
prediction queue entry that does correspond to the mispredicted branch has its data updated. Finally that prediction queue’s
copy of the global history along with the new branch information is used to reload the global history
vector used for the next BTC prediction.
Finally each BTC pending queue entry has a ‘committed’ bit - it gets set when all the instructions in the corresponding
bundle hit the commit state in the commitQ, that means none of them will be mispredicted, or already have been.
On every clock if the oldest entry in the BTC pending queue has its committed bit set then its data is copied back into the main BTC tables.
To Recap
The important ideas here:
- we maintain a queue of pending uncommitted BTC predictions
- each entry overrides the ones after it and the main tables
- the queue is flushed and the top entry updated on a misprediction
- committed data is pushed into the main tables
Next time: BTC (Part 3) Predicting Subroutine Returns