VRoom! blog: Building on AWS

Introduction

I started doing VLSI design in the early ’90s building graphics accelerators at 2um and later in the decade building CPUs at 1.8u-0.5u - gates and pins were expensive - we once bet the company on the viability of a 208 pin plastic package, something that paid off magnificently.

I started this project with the vague idea of “what happens if I throw a lot of gates at it?” - my original planned development platform was a board off of Aliexpress a lot like this one, a Xilinx Kintex-420 based board with embedded DRAM - my plan was to wire an SD card socket to it and use the internal USB-to-serial hardware to boot Linux.

When I first compiled up VRoom! for it I had a slight problem …. the design was 50% too big! oops ….

So I went around looking for bigger targets ….

AWS’s FPGA Instances

AWS provides an FPGA based service built around Xilinx’s UltraScale+ VU9Ps - these are much larger FPGAs, just what the doctor ordered. The F instances seem to be aimed at high frequency traders - we’re small fry in comparison. They are based on a PCIE board plugged into an Intel-based host - we assume they actually put 4 boards into each host and sell the minimum as a virtual instance with 8 virtual cores and one physical FPGA board. This is the F1 instance that we use for development.

The AWS F instances each include a VU9P, 4 sets of DDR4 (we use one), and several PCIE interfaces to the host CPU (we use one simple one).

The VU9P is an interesting beast: it seems to actually be a sandwich of 3 dies, with ~1500 wires between the middle die and the upper die and ~1500 wires between the middle die and the lower die. These wires make the thing possible but they’re also a problem: firstly they are potentially a scarce routing resource (not for us yet), and secondly they are slow - for speed Xilinx recommend that one carefully pipe all the die crossings (ie a flop on either side). We’ve decided not to do that, as it would mean messing with a design where we’re actually trying to carefully debug the actual pipelining for a real CPU. Rather than have this reality intrude on our design we’ve had to reduce our target clock from 100MHz to 25MHz; currently most critical paths have several die crossings.

placeholder

The above image shows a recent layout plot - you can see the three dies. The upper 25% on the center and right-hand dies is AWS’s “shell”, a built-in interface to the host CPU and one DDR4 controller. There is now a smaller shell available which we may switch to soon that removes a whole lot of functionality that we don’t need, and gives us ~10% more gates (but sadly in the wrong places).

Development is done on an AWS FPGA development instance; you don’t need to pay for a Vivado license if you do your development on AWS. The actual build environment, documentation (and the shell) are available on github.

Build time is very variable; our big problem is congestion (the timing issues come from that) and builds can take anywhere from 10-20 hours and don’t always succeed. We’ve had to drop the sizes of our I/D caches by 50% and our BTC by 8x to make this work.

AWS don’t want us to trash their hardware, so after you’ve completed building a new design you have to submit it for testing - they do a bunch of things, presumably looking for over-current and over-temperature issues (we haven’t had one rejected yet). This takes about an hour.

F1 instances cost ~US$1.50/hour to use, the build systems about US$1/hour.

Chip Architecture

AWS provide their “shell”, a fixed, pre-laid out interface - you can see it in this block diagram as “AWS support”:

placeholder

The shell provides AXI interfaces looking inwards (our DRAM controller is a master, our disk/UART are clients).

We’ve added a simple UART and a dumb fake disk controller (really just a 256-byte FIFO) to the shell, along with a register that allows us to reset the VROOM!. These simple devices are made visible on the PCIE as registers and mapped into a user-space Linux app running on the host Intel CPU.

The VROOM! is instanced with a minimal number of connections (the above devices, DRAM and clocks). It is essentially the same top level we use in verilog simulation (from chip.sv down, one CPU and one set of IO devices).

Software

The Linux user space software is a thread that maps the PCIE register space and then pulls the CPU out of reset. It then loops reading from the UART and ‘disk’ registers: if it finds a disk request it provides data from one of two sources (an OpenSBI/U-Boot image or a Linux file system image); if it finds incoming UART data it wakes a text display thread to print it on the console. A third thread reads data from the console and puts it into the UART to send when space is available.
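
As a rough sketch of that loop - the register names and offsets below are made up purely for illustration, and the real code goes through AWS's PCIE BAR mappings rather than a bare pointer:

/* Minimal sketch of the host-side poll loop, assuming a memory-mapped
 * register block; REG_* offsets are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define REG_RESET    0x00   /* hypothetical: write 0 to release the CPU's reset */
#define REG_UART_RX  0x08   /* hypothetical: bit 8 = valid, bits 7:0 = character */
#define REG_DISK_REQ 0x10   /* hypothetical: non-zero = block being requested */

static volatile uint32_t *regs;   /* points at the mapped PCIE register BAR */

static void cpu_poll_loop(void)
{
    regs[REG_RESET / 4] = 0;                  /* pull the VROOM! out of reset */
    for (;;) {
        uint32_t u = regs[REG_UART_RX / 4];
        if (u & 0x100)                        /* UART has a character for us */
            putchar(u & 0xff);                /* real code wakes the display thread */

        uint32_t d = regs[REG_DISK_REQ / 4];
        if (d != 0) {
            /* feed the requested 256-byte chunk from the OpenSBI/U-Boot image
               or the Linux file system image back through the disk registers */
        }
    }
}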

Software release

We haven’t included the AWS interface code in the current VROOM! source release, partly because we don’t want to confuse it with the real thing we’re trying to build, but also because it’s mostly embedded in code that is AWS’s IP (the shell API is one big verilog module into which one needs to embed one’s own IP) - there are no secrets on our part; we’re happy to share if anyone is interested.

Next time: (probably after Xmas) Virtual Memory

VRoom! blog: Memory Layout

Introduction

A short post this week about physical memory layout and a little bit about booting. I’ll talk more about virtual memory another time.

Physical memory layout

We currently use a 56-bit physical address; this is the address used with the MMU disabled, or after a virtual address has been translated.

Addresses with bit 55 (the most significant bit) set to 0 are treated as cacheable memory space - instruction and data accesses go through separate L1 caches but the cache coherency protocol makes sure that they see the same data. Cache lines are 512 bits.

Everything in this space is cache coherent, and once the system is booted it’s the only place that code can be executed from. Because it’s always coherent, the fence.i instruction is a simple thing for us: all it has to do is wait for the writes to drain from the store queue and then flush the commitQ (fences go in the commitQ, so all of that happens when the fence reaches the end of the queue); subsequent instruction fetches pull live modified data from the data cache into the instruction cache.

At the moment we either use a fake memory emulator in the simulator, or connect to DRAM in the AWS FPGA implementation - both implementations back onto the coherent MOESI bus.

Within this space main memory (DRAM) starts at address 0x00000000. What happens if you access memory outside of the area where there’s real DRAM is undefined, with the proviso that it won’t lock up (writes probably get lost or wrap around, reads may return undefined data - it’s up to the real memory controller to choose how it behaves - on a big system the memory controller may be on another die).

Addresses with bit 55 set are handled differently for data accesses and instruction accesses:

  • instruction accesses go to a boot ROM at a known fixed address; the MMU never generates these addresses
  • data accesses go to a shared 64-bit IO bus (can be accessed through the MMU)

The current data IO bus contains:

  • a ROM containing a device tree image
  • timers
  • a uart
  • gpio
  • multiple interrupt controllers (PLIC/CLINT/CLIC)
  • a fake disk controller
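
Putting the layout above together, the physical address decode amounts to a couple of checks - here is a small C model of the logic (not the actual verilog):

#include <stdint.h>
#include <stdbool.h>

#define PA_IO_BIT (1ULL << 55)   /* bit 55 of the 56-bit physical address */

typedef enum { SPACE_CACHED_DRAM, SPACE_BOOT_ROM, SPACE_IO_BUS } pa_space_t;

/* bit 55 clear = coherent cacheable memory (DRAM from address 0 up);
 * bit 55 set   = boot ROM for instruction fetches, 64-bit IO bus for data */
static pa_space_t decode_pa(uint64_t pa, bool is_instruction_fetch)
{
    if ((pa & PA_IO_BIT) == 0)
        return SPACE_CACHED_DRAM;
    return is_instruction_fetch ? SPACE_BOOT_ROM : SPACE_IO_BUS;
}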

Booting

Currently reset causes the CPU to jump to the start of code compiled into hardware in the special instruction space. That code (sketched in C below):

  • checks a GPIO pin (reads from IO space)
  • if it’s set it jumps to address 0 (this is for the simulator, where main memory can be side loaded)
  • if it’s clear it reads a boot image from the fake disk controller (connected to test fixture code in the simulator, and to Linux user space in the AWS FPGA world) into main memory, then jumps to it (currently we load uboot/OpenSBI)
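
Rendered as C (purely illustrative - the real thing is a small piece of RISC-V code baked into the special instruction space, and the helper names here are hypothetical stand-ins):

/* Illustrative C rendering of the reset-vector logic. */
extern int  boot_gpio_pin(void);                        /* hypothetical IO-space read */
extern void load_boot_image_from_fake_disk(void *dst);  /* hypothetical */
extern void jump_to(void *addr);                        /* hypothetical */

#define MAIN_MEMORY ((void *)0x0)   /* DRAM starts at physical address 0 */

void reset_entry(void)
{
    if (boot_gpio_pin()) {
        jump_to(MAIN_MEMORY);        /* simulator: memory was side loaded */
    } else {
        load_boot_image_from_fake_disk(MAIN_MEMORY);   /* uboot/OpenSBI */
        jump_to(MAIN_MEMORY);
    }
}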

Longer term we plan to put the two L1 caches into a mode at reset where the memory controller is disabled and the data cache allocates lines on write; the instruction cache will use the existing cache coherency protocol to pull shared cache lines from the data cache. The on-chip bootstrap will copy data into L1 from an SD/eMMC, validate it (a crypto signature check), and jump into it. This code will initialize the DRAM controller, run it through its startup conditioning, initialize the rest of the devices, take the cache controller out of its reset mode and then finally load OpenSBI/UEFI/uboot/etc into DRAM.

Next time: Building on AWS

VRoom! blog: Core VRoom! Architecture

Introduction

This week we’re going to try and explain as simply as possible how our core architecture works. Here’s an overview of the system:

placeholder

The core structure is a first-in first-out queue called the ‘commitQ’. Instruction-bundles are inserted in-order at one end, and removed in-order at the other end once they have been committed.

While in the commitQ instructions can be executed in any order (with some limits like memory aliasing for load/store instructions). At any time, if a branch instruction is discovered to have been wrongly predicted, the instructions following it in the queue will be discarded.

A note on terminology - previously I’ve referred to ‘decode-bundles’, which are a bunch of bytes fetched from the i-cache and fed to the decoders. Here we talk about ‘instruction-bundles’, which are the bunches of data passed around through the CPU representing an instruction and what it can do - data in an instruction-bundle includes its PC, its input/output registers, immediate constants, which ALU type it needs to be fed to, what operation will be performed on it there, etc.

You can think of the decoders as taking a decode-bundle and producing 1-8 instruction-bundles per clock to feed to the renamer.
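
As a mental model, an instruction-bundle can be pictured as a record like the one below - the field names are ours, purely for illustration; in the real design this is just a wide bundle of verilog signals:

#include <stdint.h>
#include <stdbool.h>

/* Illustrative C model of an instruction-bundle flowing from decode,
 * through the renamer, into the commitQ. */
typedef struct {
    uint64_t pc;          /* address of the instruction */
    uint8_t  rs1, rs2;    /* source registers (architectural numbers at decode) */
    uint8_t  rd;          /* destination register */
    int64_t  immed;       /* immediate constant, if any */
    uint8_t  unit;        /* which ALU type it needs (branch, load/store, ...) */
    uint8_t  op;          /* the operation that unit will perform */
    bool     writes_rd;   /* does it produce a result register? */
} instruction_bundle_t;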

Registers and Register Renaming

We have a split register file:

placeholder

On the right we have the 31 architectural integer registers and the 32 floating point registers. If the commitQ is empty (after a trap for example) then they will contain the exact state of the machine.

The commit registers are each statically bound to one commitQ entry.

When an instruction-bundle leaves the decode stage it contains the architectural register numbers of its source and destination registers, the renamer pipe stage assigns a commitQ entry (and therefore a commit register) to each instruction and tags its output register to be that commit register.

The renamer also keeps a table of the latest commit register that each architectural register is bound to (if any) - it uses this table to point each source register at either a particular commit register or an architectural register.
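
Here is a much-simplified C model of that step (the real renamer does this for up to 8 instructions per clock and tracks the integer and FP registers separately):

#include <stdint.h>
#include <stdbool.h>

#define NREGS 32   /* architectural integer registers (x0 is never really renamed) */

/* For each architectural register: the latest commit register
 * (commitQ entry) it is bound to, if any. */
typedef struct { bool bound; uint8_t commit_reg; } rename_entry_t;
static rename_entry_t rename_table[NREGS];

/* a renamed source: either a commit register or an architectural register */
typedef struct { bool is_commit; uint8_t reg; } renamed_src_t;

static renamed_src_t rename_src(uint8_t arch_reg)
{
    renamed_src_t s;
    s.is_commit = rename_table[arch_reg].bound;
    s.reg = s.is_commit ? rename_table[arch_reg].commit_reg : arch_reg;
    return s;
}

/* the instruction's output register becomes the commit register that is
 * statically bound to its newly assigned commitQ entry */
static void rename_dst(uint8_t arch_reg, uint8_t commitq_entry)
{
    rename_table[arch_reg].bound = true;
    rename_table[arch_reg].commit_reg = commitq_entry;   /* newest binding wins */
}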

Execution

Once an instruction-bundle is in the commitQ it can’t be assigned an ALU until each of its source registers is either an architectural register or a commit register that has reached the ‘completed’ state (ie the value is either already in the commit register, or is being written to it and can be bypassed). Once a commitQ entry is in this state it goes into contention to be assigned an ALU and will execute ASAP.

Once a commitQ entry has been executed its result goes into its associated commit register, and the commitQ entry waits until it can be committed.

Each commitQ entry continually watches the state of its source registers; if one of them moves from the commitQ to the architectural registers it updates the source register it will use when executing.

Completion

On every clock the CPU looks at the first 8 entries in the commitQ; it takes the first N contiguous entries that are completed, marks them ‘committed’ and removes them from the queue.

Committing an entry causes its associated commit register to be written back into the architectural registers (if multiple commit entries write the same architectural register only the last one succeeds). It also releases any store instructions in the load-store unit’s write buffer to be actually written to caches or IO space.
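
A C model of the per-clock commit scan looks something like this (the real hardware examines the 8 entries in parallel rather than in a loop):

#include <stdbool.h>

#define COMMIT_WINDOW 8   /* entries examined per clock */

/* Illustrative model of one commitQ entry's commit-relevant state */
typedef struct { bool completed; bool committed; } cq_entry_t;

/* Take the first N contiguous completed entries at the head of the queue
 * (oldest first), mark them committed, and report how many to remove. */
static int commit_scan(cq_entry_t *head, int valid_entries)
{
    int n = 0;
    while (n < COMMIT_WINDOW && n < valid_entries && head[n].completed) {
        head[n].committed = true;   /* commit register copies to the architectural file, */
        n++;                        /* pending stores are released to be written */
    }
    return n;
}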

Speculative Misses

As mentioned above when a branch instruction hits a branch ALU and is discovered to have been mispredicted - either a conditional branch where the condition was mispredicted, or an indirect branch where the destination was mispredicted - then the instructions in the commitQ after the mispredicted instruction are discarded and the instruction fetcher starts filling again from that new address.

While this is happening the renamer must update its state, to reflect the new truncated commitQ.

Branch instructions effectively act as barriers in the commitQ, until they are resolved into the completed state (ie their speculative state has been resolved) instructions after them cannot be committed - which means that their commit registers can’t be written into the architectural registers, and data waiting to be stored into cache or main memory can’t be written.

However instructions after an unresolved branch can still be executed (out of order) using the results of other speculative operations from their commit registers. Speculative store instructions can write data into the load-store unit’s write queue, and speculative loads can also snoop that data from the write queue and from the cache - I’ll talk more about this in another blog post.

Also branches can be executed out of order resulting in smaller/later portions of the commitQ being discarded.

Traps and Interrupts

Finally - the CSR ALU (containing the CSR registers and the trap and interrupt logic) is special: only an instruction at the very beginning of the commitQ can enter the CSR ALU - that effectively means only one CSR/trap instruction can be committed in that clock (actually the next clock); this acts as a synchronising mechanism.

Load/store traps (MMU/PMAP/etc) are generated by converting a load/store tagged instruction in the commitQ into a trap instruction.

Fetch traps (again MMU/PMAP but also invalid instructions) are injected into the commitQ by the instruction fetcher and decoder which then stalls.

Interrupts are treated much the same as fetch traps, they’re injected into the instruction stream which then stops fetching.

When any of these hit the CSR unit all instructions prior to them have been executed and committed. They trigger a full commitQ flush, throwing away all subsequent instructions in the queue, and then tell the fetcher to start fetching from a new address.

Some other CSR ALU operations also generate commitQ flushes: some of the MMU flush operations, loading page tables, fence.i etc. We also optimise interrupts that occur when you write a register to unmask interrupts and make them synchronous (so that we don’t end up executing a few instructions after changing the state).

To Recap

The important ideas here:

  • we have two register files: one for speculative results, one for the architectural registers
  • our commitQ is an in-order list of instructions to execute
  • each commitQ entry has an associated output register for its speculative result
  • commitQ instructions are executed in any order
  • however instructions are committed/retired in chunks, in order
  • at that point their results are copied from the speculative commit registers to the architectural registers
  • badly speculated branches cause the parts of the commitQ after them to be discarded

Next time: Memory Layout

VRoom! blog: Branch Target Cache [BTC] (part 3) Managing a speculative subroutine call stack

This is the third of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. This week a shorter update, we’re going to talk about how we can create speculative entries in the Branch Target Cache (BTC) call-return stack. A quick reminder of some of what we learned in the previous blog.

  • we decode large bundles of many instructions every clock
  • we predict bundles not instructions
  • we maintain a queue of pending uncommitted BTC predictions
  • each entry overrides the ones after it and the main tables

Introduction

Last time we talked about how we manage speculated branch target cache (BTC) entries: in short, we have a queue of speculated BTC updates in front of the normal BTC tables; those entries are discarded or updated when a speculated branch is found to have been mispredicted, and retired into the main tables when a corresponding branch (or rather bundle of instructions) is committed.

placeholder

Subroutine call stack

In addition to the standard BTC tables we also have a per-access-mode call-return stack. RISC-V defines for us the instructions that should be used to infer these (and which should not); in the BTC we provide a 32-entry stack (smaller for M-mode) of the last 32 subroutine calls so we can predict the return address when decoding ret instructions (if we get it wrong it’s not the end of the world - it will be corrected when the ret instruction hits the branch ALU).

Initially we thought we’d just tag branches with a pointer to the top-of-stack, and wind it back when we got a misprediction; then we found a case which turned out to have a particularly heavy impact on performance:


	.....
	call	a
r1:	call	b
r2:	....

a:	bnez	a0, 1f		< branch mispredicted
	ret
1:	....

b:	....

essentially what happens here is that we speculatively fetch the following instruction stream:

	Instruction		call stack
	===========		==========
				

	call	a		r1 -
	bnez 	a0, 1f		r1 -	< remembers TOS
	ret			-	< everything from here on is 
	call	b		r2 -	  mispredicted
	....				

Sometime later the bnez is resolved in the branch ALU and discovered to have been mispredicted. But by that point the return address ‘r1’ on the call stack has been popped and replaced with ‘r2’; even though we got the top of stack (TOS) correct, its contents are wrong, and can’t be undone when we discover the misprediction. This means that at some later time, when we finally do return from subroutine ‘a’, the prediction will be to ‘r2’ rather than ‘r1’ - this in turn will result in another misprediction when that return instruction hits the branch ALU, which is exactly what we were trying to avoid.

Solution

Our solution is to take a similar approach to the one we took for the BTC tables and put a queue of prediction history in front of the call/return stack; in this case it’s a queue of push and pop actions. Instead of being indexed by hashes of the PC/global history, entries are simply indexed by the TOS address. A top-of-stack lookup returns the latest push that matches the current TOS (and the top of the actual backing stack otherwise).
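
A simplified C model of that lookup - the queue depth and the exact handling of pops here are our own assumptions, for illustration only:

#include <stdint.h>
#include <stdbool.h>

#define STACK_DEPTH 32   /* the committed call-return stack */
#define SPECQ_DEPTH 16   /* illustrative depth for the speculative queue */

static uint64_t ret_stack[STACK_DEPTH];   /* committed backing stack */

/* Speculative queue entry: a push recorded with the TOS index it was made at
 * (pops only move the speculative TOS; they don't overwrite anything). */
typedef struct { bool valid; uint8_t tos; uint64_t ret_addr; } spec_push_t;
static spec_push_t specq[SPECQ_DEPTH];    /* index 0 = oldest */
static int specq_count;

/* Predicted return address for the current speculative TOS: the newest
 * matching speculative push wins, otherwise fall back to the backing stack. */
static uint64_t predict_return(uint8_t cur_tos)
{
    for (int i = specq_count - 1; i >= 0; i--)
        if (specq[i].valid && specq[i].tos == cur_tos)
            return specq[i].ret_addr;
    return ret_stack[cur_tos];
}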

Like the BTC speculative queue mentioned in the previous blog entry, this queue is managed by the same bundle tag we attach to instructions in the commitQ - when a mispredicted branch instruction is discovered, the return-stack speculative queue is truncated, removing entries that correspond to the discarded instruction bundles, and the TOS is updated to the oldest retained entry.

Similarly, when the last instruction in an instruction bundle is committed, the related call-return stack speculative queue entries are tagged ‘committed’ and then written back in order into the call-return stack.

To Recap

The important ideas here:

  • we have call-return stacks to predict the address of return instructions
  • we maintain a queue of speculative push/pop transitions
  • each entry overrides the ones after it and the main tables
  • the queue is flushed on a misprediction
  • committed data is pushed into the main tables

Next time: How our main pipe works - an overview

VRoom! blog: Branch Target Cache [BTC] (part 2) Living in a Speculative World

This is the second of an occasional series of articles on the VRoom!/RVoom RISC-V CPU. This week we’re going to talk about how we can create speculative entries in the Branch Target Cache (BTC). A quick reminder of some of what we learned in the previous blog.

  • we decode large bundles of many instructions every clock
  • we predict bundles not instructions

System Architecture

Let’s have a look at an overview of our system architecture:

placeholder

Instructions are fetched from the instruction cache (I$1), decoded, renamed to commit registers and entered into the commitQ. Once in the commitQ instructions can be executed in any order and even after they are executed they can be discarded by a trap or a preceding mispredicted branch.

At the end of the commitQ are committed instructions: when they and all instructions before them are done, all branches are correctly predicted, and there are no pending traps, the CPU will retire up to 8 committed instructions per clock.

BTC Operations

Our BTC contains tables that are accessed using hashes constructed from the program counter (PC) and a global history. On a traditional system the BTC tables and global history would be updated on every clock. In VROOM! we can have many instructions outstanding, often speculative instructions that may never be committed - when a branch instruction is processed by the branch ALU and is discovered to have been speculated incorrectly it can trigger the discard of subsequent speculative instructions (including other branch instructions); branch instructions can also be processed out of order.

Processing branch instructions is important: we want to use branch mispredictions to update the BTC tables - if we assumed that a branch would be taken and it wasn’t, we need to back out and change the value that we had previously set them to.

A simple solution (and the one that we originally tried) was to wait until the commit stage of the CPU and then only use instructions that have been committed (and branches that have been resolved) to update the BTC tables. It works, but it’s very slow to update (remember that a big version of our CPU can have ~150 instructions in flight at once). Consider something like this:

clear:
1:	sd	x0, 0(a0)
	add	a0, a0, 8
	add	a1, a1, -1
	bnez	a1, 1b
	ret

The core of this is a single decode-bundle that always loops and contains 4 instructions in its core. The CPU may push 150/4 = 30+ bundles, 30 times around the loop, before we can start updating the BTC - this is particularly a problem with global history predictors, which to be useful really need to be updated on every prediction.

A better solution

Let’s look at how our BTC works:

placeholder

The PC and the global history (a bit vector) are hashed (just some xors and bit swapping) into indexes that are used to look up the 3 BTC tables. The combined output is used to choose whether the bimodal or the global history predictor is best for a particular bundle.

Our solution to the above problems is to create a ‘pending prediction’ queue in front of the normal BTC tables. Each entry contains the information about a bundle prediction, including the PC used for it and the global history at the point that it was created. Newer predictions are performed by looking into the queue from most recent to oldest for each of the 3 hashes (individually and independently), comparing them with the hashes of the PC and global history stored in each entry; if no match is found then the value from the main tables is used.
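
A simplified C model of the lookup - the table sizes, queue depth and exact per-entry contents are our own assumptions, for illustration only:

#include <stdint.h>

#define PENDQ_DEPTH 16     /* illustrative - not the real depth */
#define TABLE_SIZE  1024   /* illustrative - not the real table size */

/* One pending (uncommitted) bundle prediction: the 3 table indexes it was
 * made with and the values it will eventually write back to those tables. */
typedef struct {
    int      valid;
    uint32_t hash[3];      /* bimodal / global / choice indexes for this bundle */
    int8_t   counter[3];   /* the counters this prediction would update */
} pend_entry_t;

static pend_entry_t pendq[PENDQ_DEPTH];   /* index 0 = oldest */
static int pendq_count;
static int8_t btc_table[3][TABLE_SIZE];   /* the main BTC tables */

/* Look up table t at index 'hash': the newest matching pending entry
 * overrides the main table, checked independently for each of the 3 tables. */
static int8_t btc_lookup(int t, uint32_t hash)
{
    for (int i = pendq_count - 1; i >= 0; i--)
        if (pendq[i].valid && pendq[i].hash[t] == hash)
            return pendq[i].counter[t];
    return btc_table[t][hash % TABLE_SIZE];
}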

placeholder

The per-bundle BTC pending prediction queue acts similarly to the main per-instruction commitQ - in this case each instruction in the commitQ carries a reference to the decode-bundle it came from.

Actions

If a branch instruction in the commitQ is discovered to be mispredicted all the instructions following it are discarded. At the same time entries in the BTC prediction queue that are newer than the bundle corresponding to the mispredicted instruction are also discarded, and the prediction queue entry that does correspond to the mispredicted branch has its data updated. Finally that prediction queue entry’s copy of the global history, along with the new branch information, is used to reload the global history vector used for the next BTC prediction.

Finally, each BTC pending queue entry has a ‘committed’ bit - it gets set when all the instructions in the corresponding bundle hit the commit state in the commitQ, which means none of them will be mispredicted (or they already have been and that has been resolved). On every clock, if the oldest entry in the BTC pending queue has its committed bit set, its data is copied back into the main BTC tables.

To Recap

The important ideas here:

  • we maintain a queue of pending uncommitted BTC predictions
  • each entry overrides the ones after it and the main tables
  • the queue is flushed and the top entry updated on a misprediction
  • committed data is pushed into the main tables

Next time: BTC (Part 3) Predicting Subroutine Returns