VRoom! blog - Verilog changes, new performance numbers

Introduction

Not really an architectural posting, more about tooling - skip it if that’s not really your thing - plus some new performance numbers.

New Performance Numbers

I found the bug in the BTC - mispredicted 4-byte branches that crossed a 16-byte boundary (our instruction bundle size). This is now fixed and we have better Dhrystone numbers: 6.33 DMIPS/MHz at an IPC of 2.51 - better than I expected! We can still do better.

More System Verilog, Interfaces for the (mostly) win

Last week I posted a description of my recent major changes to the way that the load/store unit works; what I didn’t touch on is some pretty major changes to the way the actual Verilog code is written.

I’m an old-time verilog hack - I’ve even written a compiler (actually 2 of them; you can thank me for the * in the “always @(*)” construct). Back in the 90s, I tried to build a cloud biz (before ‘cloud’ was a word) selling time with it for those last couple of months when you need zillions of licenses and machines to run them on. Sadly we were probably ahead of our time, and we got killed by Enron, the “smartest guys in the room”, who made any Californian business that depended on the price of electricity impossible to budget for. I open sourced the compiler and it died a quiet death.

I started this project using Icarus Verilog, using a relatively simple System Verilog subset, making sure I could also build on Verilator once I started running larger sims.

I’m a big fan of Xs, the ‘undefined’ value in Verilog’s 4-value logic - in particular using them as clues to synthesis about when we don’t care, and using them during simulation to propagate errors when assumptions are not met (and to check that we really don’t care).

For example - a 1-hot Mux:

always @(*)
	casez (control) // synthesis full_case parallel_case
	4'b1000: out = in0;
	4'b0100: out = in1;
	4'b0010: out = in2;
	4'b0001: out = in3;
	default: out = 'bx;
	endcase

One could just use:

always @(*)
	casez (control) // synthesis full_case parallel_case
	4'b1???: out = in0;
	4'b?1??: out = in1;
	4'b??1?: out = in2;
	4'b???1: out = in3;
	endcase

Chances are synthesis will make the same gates - but in the first case simulation is more likely to blow up (and here I mean blow up ‘constructively’ - signalling an error) if more than 1 bit of control is set. “full_case parallel_case” is a potentially dangerous thing: it gives the synthesis tool permission to do something that might not be the same as what the verilog compiler does during simulation (if the assumption here that control is one-hot isn’t met); the “out = 'bx” gets us a warning, as the Xs propagate if our design is broken.

Xs are also good for figuring out what really needs to be reset, for propagating timing failures, etc etc.
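For example (a hypothetical sketch, not code from VRoom!, with invented signal names) - reset only the state you believe really matters and drive the rest to X; if that belief is wrong the Xs propagate and show up quickly in simulation:

reg		r_valid;
reg [63:0]	r_data;

always @(posedge clk)
if (reset) begin
	r_valid <= 0;		// architectural state - genuinely needs a reset
	r_data <= 'bx;		// assumed don't-care until r_valid is set
end else
if (strobe) begin
	r_valid <= 1;
	r_data <= in_data;	// anything that consumes r_data while r_valid is
end				// clear will see the Xs and get noticed in sim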

Sadly Verilator doesn’t support Xs, while Icarus Verilog does. Icarus Verilog also compiles a lot faster, making for far faster turnaround on small tests - I had got to the point where I would use Verilator for long tests and Icarus for actual debugging once I could reproduce a problem “in the small”.

One of the problems I’ve had with this project is that in simple Verilog there really wasn’t a way to pass an arbitrary number of instances of an interface to a module - for example the dcache needs L read ports, where L is the number of load ports - you can pass bit vectors of 1-bit structures, but not an array of n-bit structures like addresses, or an array of data results.

System Verilog provides a mechanism to do this via “interfaces” - and this time around I have embraced them with a passion - here’s an interface providing NLOAD async dcache lookups:

interface DCACHE_LOAD #(int  RV=64, NPHYS=56, NLOAD=2);     
	typedef struct packed {
		bit [NPHYS-1:$clog2(RV/8)]addr; // CPU read port
	} DCACHE_REQ;
	typedef struct packed {
		bit hit;
		bit hit_need_o;                                     
		bit [RV-1:0]data;
	} DCACHE_ACK;
	DCACHE_REQ req[0:NLOAD-1];
	DCACHE_ACK ack[0:NLOAD-1];
endinterface
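As a usage sketch (the module and port names below are invented - only the interface definition above comes from the real code), a single port connection now carries all NLOAD request/ack pairs:

module load_ports #(parameter RV=64, NPHYS=56, NLOAD=2)(
	input clk,
	input [NLOAD-1:0]load_valid,
	input [NPHYS-1:$clog2(RV/8)]load_addr[0:NLOAD-1],
	output [NLOAD-1:0]load_hit,
	output [RV-1:0]load_data[0:NLOAD-1],
	DCACHE_LOAD dc);	// all NLOAD dcache read ports in one connection

	genvar l;
	generate
	for (l = 0; l < NLOAD; l = l+1) begin:port
		assign dc.req[l].addr = load_addr[l];	// async lookup request
		assign load_hit[l] = load_valid[l]&dc.ack[l].hit;
		assign load_data[l] = dc.ack[l].data;
	end
	endgenerate
endmodule

At the top level a single instance - something like “DCACHE_LOAD #(.NLOAD(2)) dc();” - gets passed to both the dcache and a module like the one above.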

Sadly Icarus Verilog (currently) doesn’t support interfaces and I’ve had to remove it from our supported tools - there’s still stuff in the Makefile for its support (I live in hope :-). Even Verilator’s support is relatively minimal: it doesn’t support arrays within arrays, nor does it support unpacked structures (so debug in gtkwave is primitive).

I’m really going to miss 4-value simulation; I hope IV supports interfaces sometime - I wish I had the time to just do it and offer it up, but realistically one can really only do one Big Thing at a time.

What this does mean though is that longer term a lot of those places where I’m currently making on-the-fly Verilog from small C programs to get around the inability to pass arbitrary numbers of a thing to a module (the register file comes to mind), or using ifdefs to enable particular interfaces, will likely be replaced by interfaces. They won’t completely go away: I still need them to make things like arbitrary-width 1-hot muxes - and more generally for large repetitive things that could be ruined by fat-finger typos.

You can find the new load/store interfaces in “lstypes.si”.

Next time: Combining ALUs and Branch units

VRoom! blog - Memory Parallelism

Introduction

I’ve not posted in 2 months, mostly because I’ve been spending time redesigning the load/store unit - this posting is about those changes. This is a long post, but then I’ve been doing a lot of work.

The problem

Back when I was bringing up Linux on the system I spent a lot of time looking at low-level assembler traces, watching instructions flow through the CPU. It became obvious that one of the bottlenecks occurs when large numbers of store instructions appear together: they form an effective block to earlier scheduling of loads. In particular this happens at the beginning of subroutine calls when registers are being saved.

Ideally a subroutine should start loading data from memory as soon as possible after entry - in practice code can’t load a value into, say, s0 until after it has been saved on the stack - however we have an out-of-order CPU with register renaming: it can happily load s0 BEFORE it is saved, provided there’s no memory aliasing between the place the save is being done to and the place the load is being done from. Ideally we’d like to be able to push as many load instructions as possible ahead of all those register save instructions; we want to get any cache miss to main memory started as soon as possible, then we can save the registers into the storeQ and then into the cache while we wait.

The old design

The old memory design issued N loads and M stores in every clock (variously 2:1, 2:2, and 3:2) - loads and stores executed in one clock (either going into the store queue, or retiring if there was a cache/storeQ snoop hit).

We used an old trick: the L1 TLB is fully associative, and the L1 data cache is set associative - we can look up the TLB at the same time that we do the SRAM index portion (with address bits that are not translated by the VM system) and then compare the output of the TLB (the physical address) with the fetched data cache tags to choose a set. This gives us a lot of useful very low-level parallelism in the load/store unit - currently we run it in 1 clock (maybe not at 5GHz :-).
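Roughly (a sketch with invented names, assuming a 64-byte line, a 4k page and NSETS sets - not the real RTL) the parallel lookup looks like this - the SRAM index uses only untranslated bits, and the TLB’s physical page number then picks the set:

wire [5:0]r_index = vaddr[11:6];	// untranslated bits index the tag/data SRAMs
wire [55:12]r_ppn;			// physical page number from the L1 TLB (same clock)
wire [55:12]r_tag[0:NSETS-1];		// tags read from each set at r_index
wire [NSETS-1:0]r_tag_valid;
wire [NSETS-1:0]set_hit;

genvar s;
generate
for (s = 0; s < NSETS; s = s+1) begin:cmp
	assign set_hit[s] = r_tag_valid[s]&&(r_tag[s] == r_ppn);	// physical tag match picks the set
end
endgenerate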

These days there are some downsides to this design - essentially it’s a general computer architecture problem: page sizes have not been increasing while cache line sizes and cache sizes have. The data cache index is generated from bits of the page offset (the 12 LSBs of a virtual address) that are the same for a virtual and physical address - page sizes really haven’t changed since the early 80s, and RISC-V uses the same 4k/12-bits that have been used since then - but cache lines have gotten longer and caches have gotten larger. We use a 512-bit cache line (64 bytes) - that’s 6 bits of addressing - leaving 6 bits for indexing the data cache. This means that a direct-mapped cache (ie 1 set) can at most hold 64x64 bytes - ie 4K - so to get a 64k L1 cache we need 16 sets, and 128k needs 32 sets. One can’t have a large cache without large numbers of sets.
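Spelled out (a worked example of the arithmetic above, not code from the design):

localparam LINE = 64;				// 512-bit cache line = 64 bytes -> 6 offset bits
localparam PAGE = 4096;				// 4k RISC-V page -> only 12 untranslated address bits
localparam LINES_PER_SET = PAGE/LINE;		// 64 lines - just 6 bits left for the index
localparam SET_BYTES = LINES_PER_SET*LINE;	// 4K - the most a single set (way) can hold
localparam SETS_64K = 65536/SET_BYTES;		// 16 sets for a 64k L1
localparam SETS_128K = 131072/SET_BYTES;	// 32 sets for a 128k L1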

There are some advantages to having large numbers of sets, mostly around Meltdown/Spectre/etc: using random replacement and large numbers of sets muddies the water for those sorts of exploits. The downsides are essentially a 16:1 or 32:1 mux (and a larger fanout between the TLBs and the comparators) in a critical path.

placeholder

Getting back to the reason for this redesign - the big problem is that in order to safely reorder the execution of load and store instructions we need to be able to compare their physical addresses when we’re doing the scheduling (not their virtual addresses, as there might be aliasing) - and here we have a design where the TLB lookup and the physically tagged data cache lookup are deeply intertwined. So it has to go ….

There’s another issue - the storeQ - this implicitly orders stores (and loads waiting for stores, or for cache fills). It’s a queue, and embedded in its design is an ordering: transactions must be retired in order. When a store reaches the head of the queue AND its associated instruction has reached the commit state (ie there’s no chance that a branch misprediction or a trap will cause it not to be executed) it attempts to update the L1 cache; if it gets a miss it triggers a cache line fetch and a subsequent update. This means that we can’t reorder stores to disjoint addresses post commit - and it limits the number of cache line fetches we can have in parallel. The queue is nice in that it inherently embeds ordering of operations, but in reality most operations don’t require this. Again, this also needs to go ….

The new design

So here’s the new design - it got a lot bigger ….:

placeholder

First thing to note: it’s a 2-clock latency design - the first clock is triggered when the address register of a load or store is available and just performs the TLB lookup portion of the operation. The second clock is triggered when all the address conflicts have been ‘resolved’ (meaning that there are no instructions touching the same bytes, that must be done in instruction order, that haven’t yet been pushed into the storeQ) and, if the instruction is a store, the data to be stored is available from the pipe.

In the second clock load transactions either get resolved (because they hit in the dcache or are snooped from as-yet-uncommitted stores in the storeQ) or they are not. All store and fence transactions, IO transactions, loads that miss in the dcache/snoop, and loads that suffer hazards (for example a 4-byte load that discovers a single byte in that range waiting to be stored in the storeQ) - all of these cases get put into the storeQ.

Pending transactions

The minimum transaction time is now 2 clocks, but it can be many more - we’re processing A address transactions (TLB lookups) per clock - we’re simulating with 6, but will likely have to drop back to 4 on the FPGA system. Some transactions can’t be completed right away: some store transactions might still be waiting for the data to store to become available, some might be hung up because there’s an instruction ahead of them that needs to be executed (in this context ‘executed’ means a load that hits in the dcache or snoops in the storeQ, or any other instruction that has been pushed into the storeQ), or because there’s a preceding instruction that hasn’t been through the address unit (and therefore we can’t tell whether there might be a load/store dependency).

The Pending Transactions buffer is a list of transactions (one entry for every instruction entry in the commitQ) that have been through the address unit - each entry contains a physical address, a mask of the bytes it will read or write (8 bits on a 64-bit machine), fence/amo dependency information and a hazard bitmask. The hazard bitmask is similar to the ‘hazard’ structure that we used in the old storeQ (and in the new storeQ described below): essentially each entry contains a bitmask (1 bit for every other pending transaction), and each bit is set if that other pending transaction entry blocks this one.
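A pending-transaction entry might look something like this (a sketch: the field names are invented, NCOMMIT stands in for the number of commitQ/pending entries, and the real structure certainly differs in detail):

typedef struct packed {
	bit			valid;		// has been through the address unit
	bit			load;		// load or store
	bit [NPHYS-1:0]		addr;		// physical address from the TLB lookup
	bit [7:0]		mask;		// which of the 8 bytes it will read or write
	bit			fence, amo;	// fence/amo dependency information
	bit [NCOMMIT-1:0]	hazard;		// set bits = other pending entries that block this one
} PENDING;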

For example: a load transaction might be blocked by one or more stores to one or more of the bytes the load reads, and it might also be blocked by a fence or amoXXX instruction - it won’t leave the pending transaction buffer until all of these blocking instructions have also left - and it will likely find these hazards again (and maybe new ones) as it’s entered into the storeQ via the snoop operation.

Load/Store Units

The load/store units have a scheduler that looks at how many free storeQ entries there are and which transactions are ready (have 0 hazards), and chooses N load/store operations to perform. It prioritizes loads because in general we want to move loads before stores wherever possible, and because we also use the load unit to retire completed loads (and things like AMOXXX, LR and SC) - each load unit has a dedicated write port into the general register file.

The load/store scheduler bypasses the pending transaction buffer in the sense that it can pick up entries from the address unit that are about to be written - this is how we can do 2-clock operations.

The load unit starts a dcache access and a snoop operation into the storeQ, and then (there’s a rough sketch of this decision after the list) ….

  • I/O accesses go straight into the storeQ
  • if the snoop into the storeQ hits then it returns the newest data (effectively if there are multiple hits it’s the entry that has hazard bits that none of the others have) and the instruction completes
  • if the snoop hits but it returns a hazard (something we need to wait on to complete before we can proceed, like a fence, or a partially written location) we put the load into the storeQ
  • if there are no hazards and the dcache hits we return the dcache data
  • otherwise we put the entry into the storeQ
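Very roughly the priority above looks something like this (a sketch, signal and constant names invented - not the real scheduler):

localparam COMPLETE_SNOOP = 0,	// forward the newest matching store data
	   COMPLETE_CACHE = 1,	// return the dcache data
	   TO_STOREQ = 2;	// allocate a storeQ entry and wait there
reg [1:0]load_action;

always @(*)
casez ({is_io, snoop_hazard, snoop_data_hit, dcache_hit})
4'b1???: load_action = TO_STOREQ;	// I/O accesses go straight into the storeQ
4'b01??: load_action = TO_STOREQ;	// snoop found a hazard - wait it out in the storeQ
4'b001?: load_action = COMPLETE_SNOOP;	// snoop returned the newest uncommitted store data
4'b0001: load_action = COMPLETE_CACHE;	// no hazards and the dcache hit
default: load_action = TO_STOREQ;	// miss - the storeQ entry will trigger a line fill
endcase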

Stores and fences are similar - they always go into the storeQ, and we perform a simpler snoop just to discover hazard information.

The new store queue

As mentioned above the storeQ is no longer a simple queue of entries where stuff gets put in at one end and interesting stuff happens to entries that reach the other end (and are ready to be committed).

Now it’s more of a heap, with multiple internal ad-hoc queues being created dynamically on the fly. This is done as part of the snoop we already used for speculatively satisfying load transactions from as-yet-uncommitted stores, and the search for hazards (as described above) - we still do that, but now we search for a wider class of ‘hazards’ that also represent store-store ordering (making sure that stores to the same location occur in instruction order), load-store dependencies (loads from the same location as stores occur after the stores that are before them in instruction order) and store-load dependencies (stores don’t occur until after the loads that read the current data in a location have completed); we also use hazards to represent fence and amo-style blockages.

A ‘hazard’ is actually represented by a bitmask with one bit for each storeQ entry; a storeQ entry is created with the hazards detected when the storeQ is snooped at the load/store unit stage. Logic keeps track of which hazard bits are valid, clearing bits as the transactions ahead in the ad-hoc queues are retired - a storeQ entry is not allowed to proceed until all its hazards have been cleared.
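Per storeQ entry the bookkeeping is roughly (another sketch with invented names - the real logic also tracks the ad-hoc queue and cache line state):

reg [NSTOREQ-1:0]r_hazard;	// one bit per other storeQ entry that blocks this one

always @(posedge clk)
if (enter)
	r_hazard <= snoop_hazard & ~retiring;	// hazards found by the snoop on entry
else
	r_hazard <= r_hazard & ~retiring;	// clear bits as the blocking entries retire

assign ready = r_valid && ~|r_hazard;		// nothing left blocking us - free to proceed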

Store entries also keep track of activity in the cache lines they would access - the load/store unit has an implicit understanding with the underlying cache coherency fabric that it will never make multiple concurrent transaction requests for the same cache line. Now if one storeQ entry makes a request for a cache line the others must wait (but when it arrives they all get woken up). With this new storeQ we don’t have to wait for an entry to reach the head of the queue and we can now have many outstanding cache transactions (before it was just one store and a few loads).

Summary

We’re currently at the point where this new design passes the sanity tests. I’m sure it’s still buggy, and I need to spend some time writing architectural white-box tests that poke into the obviously hard-to-trigger corners.

It’s big! Lots of NxM stuff - but then this whole design was always based on asking the question “what if we throw gates at it?” - we have to do lots of comparisons to discover where the hazards are if we want to get real memory parallelism.

I’m currently simulating a 6:4:4:6 system - 6 address units, 4 load and 4 store units, and 6 write ports into the storeQ (so no more than 6 of the load/store units can be scheduled per clock - this area needs some work). 4 load units means 4 read ports on the dcache; there’s also still only 1 dcache write port, which limits write bandwidth and how fast the storeQ can drain - this also needs some work and will change in the future.

Performance

I’ve done a small amount of micro-benchmarking - close examination of code sequences shows that the area I was particularly targeting (write bursts at the beginning of subroutines due to register saves) performs well: the address unit fills to capacity (they’re all offsets from the same register, SP, and they’re ready to schedule almost right away), the store unit also fills, and the load units fill as soon as the commitQ starts presenting loads - at that point stores start sitting as pending transactions and loads pass them to be handled, which is also one of our goals.

The main downside is that our one-clock load has become a 2-clock load; places where we load an address from memory and then immediately load something from that address suffer.

I had thought I was done with Dhrystone, but it still proves to be useful as a small, easy-to-run test that exposes microarchitectural issues, so here are the results:

Dhrystone on the old design sat at 4.27 DMIPS/MHz - we recently switched to a clang compile because we expected it to expose more parallelism in the instruction stream; all numbers (before and after) here are running that same code image - and we’re now at 5.88, which meets our original architectural target (a hand-wavy number, “5+”, pulled out of thin air …) - not a bad increase.

Dhrystone is a very branchy test: we can issue up to 8 instructions per clock, but running Dhrystone we only get 3.61 - that limits how full our pipe can get and how much parallelism can likely happen - and on this new run we see an average of 2.33 instructions executed per clock (it can’t be larger than that 3.61 issue rate). This is still a useful benchmark: it’s worth spending time figuring out where those IPC are being lost - I’m pretty sure the 2-clock load is now taking some of them - but on the whole it’s been a great trade for an almost 40% performance increase. Note that the 3.61 issue rate vs the 2.33 IPC implies that the largest Dhrystone number we can reach with this code base, with tweakage, is effectively ~9.1 DMIPS/MHz (5.88 × 3.61/2.33).

Switching to clang seems to have exposed a BTC issue which needs to be fixed and which might push us to ~6.2ish (more hand waving here), but after that the next big Dhrystone boost will probably come from installing an L0 instruction trace cache that should push the 3.61 issue rate up to close to 8 and expose more parallelism to this new load/store architecture - that’s for another day.

Future Work

This has been a big change; it needs more testing, and I need to move it back onto the AWS/FPGA platform for more shaking out. That in itself will be a big job - I’ll probably build a 4:2:2:4 system, and even then it may not fit in a VU9P (anyone want to contribute a VU13/VU19 board to the cause?). That’s going to take a long time.

The new storeQ and the pending transactions block have evolved to be quite similar - I can’t help but feel there’s fat there that can be removed. The one main sticking point is data lifetimes: the pending transactions are tied 1-1 with commitQ entries (just located elsewhere because otherwise routing might be a nightmare) while storeQ entries can long outlive the commitQ entries that birth them (a storeQ entry can’t issue a dcache change, or the memory request to fill the dcache entry it wants to change, until after its associated commitQ entry has been committed and is about to be recycled).

I also want to start some more serious work on timing - not to make a real chip but to give me more feedback on my microarchitecture decisions (I’m sure some stuff will blow up in my face :-). To that end I’ve started investigating getting a build into OpenLane/Skyworks - again not with the intent of taping something out (it’s probably too big for any of the open source runs) but more as a reality check.

Next time: Verilog changes

VRoom! blog - Virtual Memory

Introduction

Booting Linux isn’t going to work without some form of virtual memory. RISC-V has a well-defined spec for VM, and implementation isn’t hard - page tables are well defined; there’s nothing particularly unusual or surprising there.

L1 TLBs

We have separate Instruction and Data level-one TLBs. They’re fully associative, which means that we’re not practically limited to power-of-two sizes (though currently they have 32 entries each); each entry contains a mapping between an ASID plus the upper bits of a virtual address, and a physical address - we support the various page sizes (in both tags and data).

A fully associative TLB with random replacement makes it harder for Meltdown/Spectre sorts of attacks to use TLB replacement to attack systems. On VROOM memory accesses that miss in the TLB never result in data cache changes.

Since the ALUs, and in particular the memory and fetch units, are shared between HARTs (CPUs) in the same simultaneous multi-threaded core, so are the TLBs. We need a way to distinguish mappings between HARTs - the implementation is simple: we reserve a portion of the ASID which is forced to a unique value for each HART - each HART thinks it has N bits of ASID, in reality there are N+1. There’s also a system flag that we can optionally set that lets SMT HARTs share the full-sized ASID (provided the software understands that ASIDs are system global rather than CPU global).
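In other words something like this (a sketch - signal names and widths invented, the real CSR plumbing differs):

localparam ASID = 15;		// illustration only: 16 software-visible ASID bits
wire		hart;		// which of the 2 SMT HARTs made the access
wire		share_asids;	// the system flag described above
wire [ASID:0]	satp_asid;	// the ASID this HART's software programmed

wire [ASID:0]	tlb_asid = share_asids ?
			satp_asid :			// HARTs cooperate: system-global ASIDs
			{hart, satp_asid[ASID-1:0]};	// otherwise the top bit is forced per HART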

Currently we use the old trick of doing the L1 TLB lookup with the upper bits of a virtual address while using the lower bits in parallel to do the indexing of the L1 data/instruction caches - large modern cache line sizes mean that you have to go to many ways/sets to get large data and instruction caches - this also helps with Meltdown/Spectre/etc mitigation.

I’m currently redesigning the whole memory datapath unit to split TLB lookup and data cache access into separate cycles - mostly to expose more parallelism during scheduling - more about that in a later posting once it’s all working.

L2 TLBs

TLB misses result in stalled instructions in the commitQ - there’s a small queue of pending TLB lookups in the memory unit and 2 pending lookups in the fetch unit - they feed the table walker state machine which starts by dipping into the L2 TLB cache - currently it’s a 4-way 256 entry (so 1k entries total) set associative cache shared between the instruction and data TLBs - TLB data found here is fed directly to the L1 TLBs (a satisfied L1 miss takes 4 clocks).

Table walking

If a request also misses in the L2 TLB cache the table walker state machine starts walking page table trees.

Full cache lines of TLB data are fetched into a local read-only cache which contains a small number of entries - essentially enough for 1 line for each level in the page table hierarchy, and 2 for the lowest level, repeated for the instruction TLB and the data TLB (then repeated again for multiple HARTs).

After initial filling most table walks hit in this cache. This cache is slightly integrated with the L1 I-cache: they share a read-only access port into the cache coherency fabric, and both can be invalidated by data cache snoop shootdowns.

TLB Invalidation

TLB invalidation is triggered by executing a TLB flush instruction - these instructions let the instructions before them in the commitQ execute before they themselves are executed.

At this point they trigger a commitQ flush (tossing any speculative instructions executed with the old VM mappings). At the same time they trigger L1 and L2 TLB flushes. Note: there is no need to invalidate the TLB walker’s small data cache as it will have been invalidated (shot down) by the cache coherency protocols if any page tables were changed in the process.

Next time: (Once I get it working) Data memory accesses

VRoom! blog - Building on AWS

Introduction

I started doing VLSI design in the early ’90s building graphics accelerators at 2um and later in the decade building CPUs at 1.8u-0.5u - gates and pins were expensive - we once bet the company on the viability of a 208 pin plastic package, something that paid off magnificently.

I started this project with the vague idea of “what happens if I throw a lot of gates at it?” - my original planned development platform was a board off of Aliexpress, a lot like this one: a Xilinx Kintex-420 based board with embedded DRAM - my plan was to wire an SD card socket to it and use the internal USB-to-serial hardware to boot Linux.

When I first compiled up VRoom! for it I had a slight problem …. the design was 50% too big! oops ….

So I went around looking for bigger targets ….

AWS’s FPGA Instances

AWS provides an FPGA service based on Xilinx’s UltraScale VU9Ps - these are much larger FPGAs, just what the doctor ordered. The F instances seem to be aimed at high-frequency traders - we’re small fry in comparison. They are based on a PCIe board in an Intel-based server - we assume they actually put 4 boards into each server and sell the minimum as a virtual instance with 8 virtual cores and one physical FPGA board. This is the F1 instance that we use for development.

The AWS F instances each include a VU9P, 4 sets of DDR4 (we use one), and several PCIEs to the host CPU (we use one simple one).

The VU9P is an interesting beast: it seems to actually be a sandwich of 3 dies, with ~1500 wires between the middle die and the upper die and ~1500 wires between the middle die and the lower die. These wires make the thing possible but they’re also a problem: firstly they are potentially a scarce routing resource (not for us yet), and secondly they are slow - for speed Xilinx recommend that one carefully pipe all the die crossings (ie a flop on either side). We’ve decided not to do that, as it would mean messing with a design where we’re trying to carefully debug the actual pipelining of a real CPU. Rather than have this reality intrude on our design we’ve had to reduce our target clock from 100MHz to 25MHz - currently most critical paths have several die crossings.

placeholder

The above image shows a recent layout plot - you can see the three dies. The upper 25% of the center and right-hand dies is AWS’s “shell”, a built-in interface to the host CPU and one DDR4 controller. There is now a smaller shell available, which we may switch to soon, that removes a whole lot of functionality we don’t need and gives us ~10% more gates (but sadly in the wrong places).

Development is done on an AWS FPGA development instance; you don’t need to pay for a Vivado license if you do your development on AWS. The actual build environment, documentation (and the shell) are available on GitHub.

Build time is very variable - our big problem is congestion (timing issues come from that) - and builds can take anywhere from 10-20 hours and don’t always succeed. We’ve had to drop the sizes of our I/D caches by 50% and our BTC by 8x to make this work.

AWS don’t want us to trash their hardware, so after you’ve completed building a new design you have to submit it for testing - they do a bunch of things, presumably looking for over-current and over-temperature issues (we haven’t had one rejected yet). This takes about an hour.

F1 instances cost ~US$1.50/hour to use, the build systems about US$1/hour.

Chip Architecture

AWS provide their “shell”, a fixed, pre-laid out interface - you can see it in this block diagram as “AWS support”:

placeholder

The shell provides AXI interfaces looking inwards (our DRAM controller is a master, our disk/UART are clients).

We’ve added a simple UART and a dumb fake disk controller (really just a 256-byte FIFO) to the shell, plus a register that allows us to reset the VROOM!. These simple devices are made visible on the PCIe as registers and mapped into a user-space Linux app running on the host Intel CPU.

The VROOM! is instanced with a minimal number of connections (the above devices, DRAM and clocks). It is essentially the same top level we use in verilog simulation (from chip.sv down, one CPU and one set of IO devices).

Software

The Linux user-space software is a thread that maps the PCIe register space and then pulls the CPU out of reset. It then loops reading from the UART and ‘disk’ registers: if it finds a disk request it provides data from one of two sources (an OpenSBI/Uboot image or a Linux file system image); if it finds incoming UART data it wakes a text display thread to print it on the console. A third thread reads data from the console and puts it into the UART to send when space is available.

Software release

We haven’t included the AWS interface code in the current VROOM! source release, partly because we don’t want to confuse it with the real thing we’re trying to build, but also because it’s mostly embedded in code that is AWS’s IP (the shell API is one big Verilog module into which one embeds one’s own IP) - there are no secrets on our part, and we’re happy to share if anyone is interested.

Next time: (probably after Xmas) Virtual Memory

VRoom! blog - Memory Layout

Introduction

A short post this week about physical memory layout and a little bit about booting. I’ll talk more about virtual memory another time.

Physical memory layout

We currently use a 56-bit physical address; this is the address used with the MMU disabled, or after a virtual address has been translated.

Addresses with bit 55 (the most significant bit) set to 0 are treated as cacheable memory space - instruction and data accesses go through separate L1 caches but the cache coherency protocol makes sure that they see the same data. Cache lines are 512 bits.

Everything in this space is cache coherent; once the system is booted it’s the only place that code can be executed from. Because it’s always coherent, the fence.i instruction is a simple thing for us: all it has to do is wait for the writes to drain from the store queue and then flush the commitQ (fences go in the commitQ, so all that happens when the fence reaches the end of the queue); subsequent instruction fetches pull live modified data from the data cache into the instruction cache.

At the moment we either have a fake memory emulator in the simulator or connect to DRAM on the AWS FPGA implementation - both implementations back onto the coherent MOESI bus.

Within this space main memory (DRAM) starts at address 0x00000000. What happens if you access memory outside of the area where there’s real DRAM is undefined, with the proviso that it won’t lock up (writes probably get lost or wrap around, reads may return undefined data - it’s up to the real memory controller to choose how it behaves; on a big system the memory controller may be on another die).

Addresses with bit 55 set are handled differently for data accesses and instruction accesses:

  • instruction accesses go to a boot ROM at a known fixed address; the MMU never generates these addresses
  • data accesses go to a shared 64-bit IO bus (can be accessed through the MMU)
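As a sketch of that top-level split (invented signal names, not the actual decode logic):

wire		is_fetch;	// instruction fetch rather than data access
wire [55:0]	paddr;		// 56-bit physical address

wire cacheable   = !paddr[55];			// coherent cached memory space
wire bootrom_sel = paddr[55] && is_fetch;	// instruction side: the boot ROM
wire io_sel      = paddr[55] && !is_fetch;	// data side: the 64-bit IO bus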

The current data IO bus contains:

  • a ROM containing a device tree image
  • timers
  • a uart
  • gpio
  • multiple interrupt controllers (PLIC/CLINT/CLIC)
  • a fake disk controller

Booting

Currently reset causes the CPU to jump to the start of code compiled into hardware in the special instruction space. That code:

  • checks a GPIO pin (reads from IO space)
  • if it’s set it jumps to address 0 (this is for the simulator where main memory can be side loaded)
  • if it’s clear we read a boot image from the fake disk controller (connected to test fixture code on the simulator, and to linux user space on the AWS FPGA world) into main memory, then jump to it (currently we load uboot/OpenSBI)

Longer term we plan to put the two L1 caches into a mode at reset where the memory controller is disabled and the data cache allocates lines on write; the instruction cache will use the existing cache coherency protocol to pull shared cache lines from the data cache. The on-chip bootstrap will copy data into the L1 from an SD/eMMC, validate it (a crypto signature check), and jump into it. This code will initialize the DRAM controller, run it through its startup conditioning, initialize the rest of the devices, take the cache controller out of its reset mode and then finally load OpenSBI/UEFI/uboot/etc into DRAM.

Next time: Building on AWS