12 Mar 2022
Introduction
Not really an architectural posting, more about tooling - skip it if that's not your thing. Also some new performance numbers.
I found the bug in the BTC - mispredicted 4-byte branches that crossed a 16-byte boundary (our
instruction bundle size). This is now fixed and we have better Dhrystone numbers: 6.33 DMIPS/MHz
at an IPC of 2.51 - better than I expected! We can still do better.
More System Verilog, Interfaces for the (mostly) win
Last week I posted a description of my recent major changes to the way that the load/store
unit works; what I didn't touch on is some pretty major changes to the way the actual Verilog code
is written.
I'm an old-time verilog hack, I've even written a compiler (actually 2 of them - you can
thank me for the * in the "always @(*)" construct). Back in the
90s I tried to build a cloud biz (before 'cloud' was a word) selling time with it, for those last
couple of months when you need zillions of licenses and machines to run them on. Sadly we were
probably ahead of our time, and we got killed by Enron, the "smartest guys in the room", who made
any Californian business that depended on the price of electricity impossible to budget for. I
open sourced the compiler and it died a quiet death.
I started this project using Icarus Verilog, using a relatively simple System Verilog subset,
making sure I could also build on Verilator once I started running larger sims.
I’m a big fan of Xs, verilog’s 4-value ‘undefined’ values - in particular using them as clues to
synthesis about when we don’t care and using them during simulation to propagate errors when
assumptions are not met (and to check that we really don’t care)
For example - a 1-hot Mux:
always @(*)
    casez (control) // synthesis full_case parallel_case
    4'b1000: out = in0;
    4'b0100: out = in1;
    4'b0010: out = in2;
    4'b0001: out = in3;
    default: out = 'bx;
    endcase
One could just use:
always @(*)
    casez (control) // synthesis full_case parallel_case
    4'b1???: out = in0;
    4'b?1??: out = in1;
    4'b??1?: out = in2;
    4'b???1: out = in3;
    endcase
Chances are synthesis will make the same gates - but in the first case simulation is more likely to blow up
(and here I mean blow up 'constructively' - signalling an error) if more than 1 bit of
control is set. "full_case parallel_case" is a potentially dangerous thing: it gives the synthesis tool
permission to do something that might not be the same as what the verilog compiler does during simulation
(if the assumption here that control is one-hot isn't met). The "out = 'bx" gets us a warning as the Xs
propagate if our design is broken.
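For instance, here's a toy run (a hedged sketch, not part of the VRoom! tree) that shows the constructive blow-up in a 4-state simulator like Icarus Verilog when the one-hot assumption is violated:

module tb;
    reg [3:0] control;
    reg [7:0] in0, in1, in2, in3, out;

    // the same 1-hot mux as above
    always @(*)
        casez (control) // synthesis full_case parallel_case
        4'b1000: out = in0;
        4'b0100: out = in1;
        4'b0010: out = in2;
        4'b0001: out = in3;
        default: out = 'bx;
        endcase

    initial begin
        in0 = 8'h11; in1 = 8'h22; in2 = 8'h33; in3 = 8'h44;
        control = 4'b0100; #1 $display("one-hot  control=%b out=%h", control, out); // prints 22
        control = 4'b0110; #1 $display("two bits control=%b out=%h", control, out); // prints xx
    end
endmodule

That final 'xx' (and anything downstream that consumes it) is exactly the alarm bell we want.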
Xs are also good for figuring out what really needs to be reset, propagating timing failures etc etc
Sadly Verilator doesn't support Xs, while Icarus Verilog does. Icarus Verilog also compiles
a lot faster, making for far faster turnaround on small tests - I had got to the point where I would
use Verilator for long tests and Icarus for actual debugging once I could reproduce a problem "in the
small".
One of the problems I've had with this project is that in simple verilog there really wasn't a
way to pass an arbitrary number of instances of an interface to a module - for example the dcache needs L dcache read ports, where L is the
number of load ports - you can pass bit vectors of 1-bit structures, but not an array of n-bit structures
like addresses, or an array of data results.
System Verilog provides a mechanism to do this via “interfaces” - and this time around I have embraced them
with a passion - here’s NLOAD async dcache lookups:
interface DCACHE_LOAD #(int RV=64, NPHYS=56, NLOAD=2);
    typedef struct packed {
        bit [NPHYS-1:$clog2(RV/8)]addr; // CPU read port
    } DCACHE_REQ;
    typedef struct packed {
        bit hit;
        bit hit_need_o;
        bit [RV-1:0]data;
    } DCACHE_ACK;
    DCACHE_REQ req[0:NLOAD-1];
    DCACHE_ACK ack[0:NLOAD-1];
endinterface
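As a sketch of how this gets used (hypothetical module and signal names, not the actual VRoom! source), a consumer can take the whole bundle of NLOAD request/ack pairs through a single interface port:

module dcache_l1 #(parameter int NLOAD=2)(   // must match the interface's NLOAD
    DCACHE_LOAD dload);                      // carries all NLOAD req/ack pairs

    for (genvar i = 0; i < NLOAD; i++) begin: rd_port
        // a real implementation would index the tag/data SRAMs with
        // dload.req[i].addr here; this placeholder simply always misses
        assign dload.ack[i].hit        = 1'b0;
        assign dload.ack[i].hit_need_o = 1'b0;
        assign dload.ack[i].data       = '0;   // don't care on a miss
    end
endmodule

The parent instances "DCACHE_LOAD #(.NLOAD(2)) dload();" once and hands it to both the load units and the dcache - the parameterisation travels with the interface rather than being repeated at every port.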
Sadly Icarus Verilog (currently) doesn't support interfaces and I've had to drop it from our
supported tools - there's still stuff in the Makefile for its support (I live in hope :-). Even Verilator's
support is relatively minimal: it doesn't support arrays within arrays, nor does it support
unpacked structures (so debug in gtkwave is primitive).
I'm really going to miss 4-value simulation, I hope IV supports interfaces sometime - I wish
I had the time to just do it and offer it up, but realistically one can really only do one Big Thing at a time.
What this does mean though is that longer term a lot of those places where I'm currently making
on-the-fly verilog from small C programs, to get around the inability to pass arbitrary numbers
of a thing to a module (the register file comes to mind), or using ifdefs to enable
particular interfaces, will likely be replaced by interfaces. They won't completely
go away - I still need them to make things like arbitrary-width 1-hot muxes, and more generally
for large repetitive things that could be ruined by fat-finger typos.
You can find the new load/store interfaces in “lstypes.si”.
Next time: Combining ALUs and Branch units
10 Mar 2022
Introduction
I've not posted in 2 months, mostly because I've been spending time redesigning the load/store unit -
this posting is about those changes. It's a long post, but then I've been doing a lot of work.
The problem
Back when I was bringing up Linux on the system I spent a lot of time looking at low level
assembler traces, watching instructions flow through the CPU. It became obvious that one of
the bottlenecks occurs when large numbers of store instructions arrive together: they form an
effective block to earlier scheduling of loads. In particular this happens at the beginning
of subroutine calls, when registers are being saved.
Ideally a subroutine should start loading
data from memory as soon as possible after entry - in practice code can't load a value into, say,
s0 until after s0 has been saved on the stack. However we have an out-of-order CPU with register
renaming: it can happily load s0 BEFORE it is saved - provided there's no memory aliasing between the
place the save is being done to and the place the load is being done from. Ideally we'd like to be
able to push as many load instructions ahead of all those register save instructions as possible; we want to
get any cache miss to main memory started as soon as possible, then we can save the
registers into the storeQ and then into the cache while we wait.
The old design
The old memory design issued N loads and M stores in every clock (variously 2:1, 2:2, and 3:2) - loads and
stores executed in one clock (either into the store queue, or retired if there was a cache/storeQ snoop hit).
We used an old trick: the L1 TLB is fully associative, the L1 data cache is set associative - we can look up
the TLB at the same time that we do the SRAM index portion (with address bits that are not translated by the
VM system) and then compare the output of the TLB (the
physical address) with the fetched data cache tags to choose a set. This gives us a lot of useful
very low level parallelism in the load/store unit - currently we run it in 1 clock (maybe not at 5GHz :-).
These days there are some downsides to
this design - essentially it's a general computer architecture problem: page sizes have not been
increasing while cache line sizes and cache sizes have.
The data cache index is generated from bits of the page offset (the 12 LSBs of a virtual address) that
are the same for a virtual and a physical address - page sizes really haven't changed since the early 80s,
and RISC-V uses the same 4k/12-bits that have been used since then - but cache lines have gotten
longer and caches have got larger. We
use a 512-bit cache line (64 bytes) - that's 6 bits of addressing - leaving 6 bits for indexing the
data cache.
This means that a direct mapped cache (ie 1 set) can at most have 64x64 bytes - ie 4K - to get a 64k
L1 cache we need 16 sets, 128k needs 32 sets. One can’t have a large cache without large numbers
of sets.
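The same arithmetic as a parameter sketch (hypothetical names, using 'sets' the way this post does):

package l1_sizing;
    localparam PAGE_BITS  = 12;                    // 4K pages - 12 untranslated bits
    localparam LINE_BITS  = 6;                     // 64-byte (512-bit) cache lines
    localparam INDEX_BITS = PAGE_BITS - LINE_BITS; // 6 bits left to index a set
    localparam SET_BYTES  = (1 << INDEX_BITS) * (1 << LINE_BITS); // 64 x 64 = 4K
    localparam SETS_64K   = 64*1024  / SET_BYTES;  // 16 sets for a 64K L1
    localparam SETS_128K  = 128*1024 / SET_BYTES;  // 32 sets for a 128K L1
endpackage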
There are some advantages to having large numbers of sets, mostly around Meltdown/Spectre/etc: using random
replacement and large numbers of sets muddies the water for those sorts of exploits. The downsides are
essentially a 16:1 or 32:1 mux (and a larger fanout between the TLBs and the comparators) in a critical path.
Getting back to the reason for this redesign - the big problem is that in order to safely reorder the
execution of load and store instructions we need to be able to compare their physical addresses when
we're doing the scheduling (not their virtual addresses, as there might be aliasing) - and here we have a
design where the TLB lookup and the physically tagged data cache lookup are deeply intertwined. So it has to go ….
There's another issue - the storeQ - this implicitly orders stores (and loads waiting for stores, or
for cache fills). It's a queue, and embedded in its design is an ordering: transactions must be retired in
order. When a store reaches the head of the queue AND its associated instruction has reached the commit
state (ie there's no chance that a branch misprediction or a trap will cause it not to be executed) it attempts
to update the L1 cache; if it gets a miss it triggers a cache line fetch and a subsequent update. This means
that we can't reorder stores to disjoint addresses post commit - and it limits the number of cache line fetches
we can have in parallel. The queue is nice in that it inherently embeds ordering of operations, but in
reality most operations don't require this. Again this also needs to go ….
The new design
So here’s the new design, it got a lot bigger ….:
First thing to note: it's a 2 clock latency design. The first clock is triggered when the address register of
a load or store is available and just performs the TLB lookup portion of an operation.
The second clock is triggered when all the address conflicts have been 'resolved' (meaning that there
are no instructions touching the same bytes, that must be done in instruction order, that haven't been pushed into the storeQ) and, if the instruction
is a store, the data to be stored is available from the pipe.
In the second clock load transactions either
get resolved because they hit in the dcache, or are snooped (from as yet uncommitted stores) from the storeQ.
All store and fence transactions, IO transactions, loads that miss in the dcache/snoop, and loads that suffer
hazards (for example a 4-byte load that discovers a single byte in that range waiting to be stored in the
storeQ) - all of these cases get put into the storeQ.
Pending transactions
The minimum transaction time is now 2 clocks, but it can be many more - we're processing A address
transactions (TLB lookups) per clock - we're simulating with 6, but will likely have to drop back to 4
on the FPGA system. Some
transactions can't be completed right away: some store transactions might still be waiting for the data to
store to become available, some might be hung up because there's an instruction ahead of them that needs
to be executed (in this context 'executed' means a load that hits in the dcache or snoops in the storeQ,
or any other instruction that has been pushed into the storeQ), or because there's a preceding instruction that
hasn't been through the address unit (and therefore we can't tell whether there might be a load/store dependency).
The Pending Transactions buffer is a list of transactions (one entry for every instruction entry in
the commitQ) that have been through the address unit - each entry contains a physical address, a mask of
the bytes it will read or write (8 bits on a 64-bit machine), fence/amo dependency information and a hazard
bitmask. The hazard bitmask is similar to the 'hazard' structure that we used in the old storeQ (and in the
new storeQ described below): essentially each entry contains a bitmask (1 bit for every other pending
transaction), with a bit set if that other pending transaction entry blocks this one.
For example: a load transaction might be blocked by one or more stores to one or more of the bytes the
load reads; it might also be blocked by a fence or amoXXX instruction. It won't leave the pending
transactions store until all of these blocking instructions have also left - and it will likely find these
hazards again (and maybe more, new ones) as it's entered into the storeQ via the snoop operation.
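As a rough sketch of how such a bitmask behaves (hypothetical names and widths, not lifted from the VRoom! source), each entry only needs its mask cleared as blockers retire, and becomes ready when the mask reaches zero:

module hazard_mask #(parameter N=32)(    // N = number of other entries
    input          clk, reset,
    input          load,                 // allocate this entry
    input  [N-1:0] hazards_in,           // blockers found when the queue was snooped
    input  [N-1:0] retired,              // one bit per entry retiring this clock
    output         ready);               // no outstanding blockers left

    reg [N-1:0] r_hazards;
    assign ready = ~|r_hazards;

    always @(posedge clk)
    if (reset)     r_hazards <= 0;
    else if (load) r_hazards <= hazards_in & ~retired; // a blocker may retire in
    else           r_hazards <= r_hazards & ~retired;  // the same clock we allocate
endmodule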
Load/Store Units
The load/store units have a scheduler that looks at how many free storeQ entries are ready (have 0 hazards)
and chooses
N load/store operations to perform. It prioritizes loads, because in general we want to move loads
ahead of stores wherever possible, and because we also use the load unit to retire completed loads (and
things like AMOXXX, LR and SC) - each load unit has a dedicated write port into the general register file.
The load/store scheduler bypasses the pending transaction buffer in the sense that it picks up entries from
the address unit that are about to be written - this is how we can do 2 clock operations.
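A hedged sketch of that scheduling choice (made-up names, no claim this is exactly how VRoom! does it): grant ready loads first, then ready stores, up to the number of free storeQ slots:

module lsu_pick #(parameter N=8)(        // N = pending-transaction entries
    input      [N-1:0] ready_load,       // ready entries that are loads
    input      [N-1:0] ready_store,      // ready entries that are stores
    input      [3:0]   free_slots,       // storeQ entries free this clock
    output reg [N-1:0] grant);           // which entries get scheduled

    integer i;
    reg [3:0] left;
    always @(*) begin
        grant = 0;
        left  = free_slots;
        // loads first - we want them to slip ahead of the stores ...
        for (i = 0; i < N; i = i + 1)
            if (ready_load[i] && left != 0) begin grant[i] = 1; left = left - 1; end
        // ... then stores with whatever slots remain
        for (i = 0; i < N; i = i + 1)
            if (ready_store[i] && left != 0) begin grant[i] = 1; left = left - 1; end
    end
endmodule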
The load unit starts a dcache access and a snoop operation into the storeQ and then ….
- I/O accesses go straight into the storeQ
- if the snoop into the storeQ hits then it returns the newest data (effectively if there are multiple hits it’s the entry that has hazard bits that none of the others have) and the instruction completes
- if the snoop hits but it returns a hazard (something we need to wait on to complete before we can proceed,
like a fence, or a partially written location) we put the load into the storeQ
- if there are no hazards and the dcache hits we return the dcache data
- otherwise we put the entry into the storeQ
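Putting that list into pseudo-RTL, the second-clock load decision looks roughly like this (a hedged sketch with made-up signal names, not the VRoom! source):

module load_decide #(parameter RV=64)(
    input               is_io,         // physical address decodes to I/O space
    input               snoop_hit,     // storeQ holds newer data for these bytes
    input               snoop_hazard,  // fence/partial write/etc we must wait for
    input  [RV-1:0]     snoop_data,
    input               dcache_hit,
    input  [RV-1:0]     dcache_data,
    output reg          to_storeq,     // park it in the storeQ
    output reg          complete,      // retire the load this clock
    output reg [RV-1:0] result);

    always @(*) begin
        to_storeq = 0;
        complete  = 0;
        result    = 'x;
        if (is_io)
            to_storeq = 1;                            // I/O always queues
        else if (snoop_hazard)
            to_storeq = 1;                            // wait it out in the storeQ
        else if (snoop_hit) begin
            complete = 1; result = snoop_data;        // newest matching store wins
        end else if (dcache_hit) begin
            complete = 1; result = dcache_data;
        end else
            to_storeq = 1;                            // miss: queue it and fetch the line
    end
endmodule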
Stores and fences are similar: they always go into the storeQ; we perform a simpler snoop
just to discover hazard information.
The new store queue
As mentioned above, the storeQ is no longer a simple queue of entries where stuff gets put in at one end and
interesting stuff happens to entries that reach the other end (and are ready to be committed).
Now it's more of a heap, with multiple internal ad-hoc queues being created dynamically on the fly. This is
done as part of the snoop that we already used to look for opportunities to speculatively satisfy load
transactions from as yet uncommitted stores, and to search for hazards (as described above) - we still do
that, but now we search for
a wider class of 'hazards' that also represent store-store ordering (making sure that stores to the
same location
occur in the same order), load-store dependencies (loads from the same location as stores occur after the
ones that are before them in instruction order) and store-load dependencies (stores don't occur until after
the loads that read the current data in a location have completed); we also use hazards to represent
fence and amo-style blockages.
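As a sketch of the kind of per-entry test that snoop performs (hypothetical names; the real thing surely handles more cases), the 8-bit byte masks make the overlap check cheap:

module snoop_hazard #(parameter NPHYS=56)(
    input               entry_valid,
    input               entry_is_store,
    input               entry_is_fence,  // fence/amo-style barrier entry
    input  [NPHYS-1:3]  entry_dw,        // existing entry's doubleword address
    input  [7:0]        entry_mask,      // bytes that entry touches
    input               new_is_store,
    input  [NPHYS-1:3]  new_dw,          // new transaction's doubleword address
    input  [7:0]        new_mask,        // bytes the new transaction touches
    output              hazard);         // new transaction must order after entry

    wire overlap = (entry_dw == new_dw) && |(entry_mask & new_mask);
    // ordering matters when the bytes overlap and at least one side writes
    // (store-store, load-store, store-load), or the older entry is a barrier
    assign hazard = entry_valid &&
                    (entry_is_fence || (overlap && (entry_is_store || new_is_store)));
endmodule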
A 'hazard' is actually represented by a bit mask with one bit for each storeQ entry; a storeQ entry is
created with the hazards detected when the storeQ is snooped at the load/store unit stage. Logic keeps
track of which hazard bits are valid, clearing bits as transactions ahead in the ad-hoc queues are retired -
a storeQ entry is not allowed to proceed until all its hazards have been cleared.
Store entries also keep track of activity in the cache lines they would access - the load/store unit has
an implicit understanding with the underlying cache coherency fabric that it will never make multiple
concurrent transaction requests for the same cache line. Now if one storeQ entry makes a request for a
cache line the others must wait (but when it arrives they all get woken up). With this new storeQ we
don’t have to wait for an entry to reach the head of the queue and we can now have many
outstanding cache transactions (before it was just one store and a few loads).
Summary
We're currently at the point where this new design passes the sanity tests. I'm sure it's still buggy,
and I need to spend some time writing architectural white-box tests that try to poke into the
obviously hard to trigger corners.
It's big! Lots of NxM stuff - but then this whole design was always based on asking the question "what if
we throw gates at it?" - we have to do lots of comparisons to discover where the hazards are if we want
to get real memory parallelism.
I'm currently simulating a 6:4:4:6 system - 6 address units, 4 load and 4 store units, and 6 write ports to
the storeQ (so no more than 6 of the load/store units can be scheduled per clock; this area needs some work).
4 load units means 4 read ports on the dcache; there's also still only 1 dcache write port, which
limits write bandwidth and how fast the storeQ can drain - this also needs some work and will change
in the future.
I've done a small amount of micro-benchmarking - close examination of code sequences shows that the area
I was particularly targeting (write bursts at the beginning of subroutines due to register saves) performs
well: the address unit fills to capacity (they're all offsets from the same register, SP, and they're ready
to schedule almost right away), the store unit also fills, and the load units fill as soon as the commitQ
starts presenting loads - at that point stores start sitting as pending transactions and loads pass them
to be handled, which is also one of our goals.
The main downside is that our one clock load has become a 2 clock load, places where we load an address
from memory and then immediately load something from that address suffer.
I had thought I was done with Dhrystone, but it still proves to be useful as a small, easy to run
test that exposes microarchitectural issues, so here are the results:
Dhrystone on the old design sat at 4.27 dhrystones/MHz - we recently switched to a clang compile because we
expected it to expose more parallelism in the instruction stream (all numbers here, before and after,
are running that same code image) - and it is now at 5.88, which meets our original architectural
target (a hand wavy number "5+" pulled out of thin air …) - not a bad increase.
Dhrystone is a very branchy test; we can issue up to 8 instructions per clock, but
running Dhrystone we only get 3.61 - that limits how full our pipe can get and how much parallelism can
likely happen. On this new run we see an average of 2.33 instructions executed per clock (it can't
be larger than that 3.61 issue rate). This is still a useful benchmark because it's worth spending time
figuring out where those IPC are being lost; I'm pretty sure the 2-clock load is now taking some
of it, but on the whole it's been a great trade for an almost 40% performance increase.
Note that the 3.61 issue rate and 2.33 IPC imply that the largest Dhrystone/MHz we can
reach with this code base, with tweakage, is effectively ~9.1 (5.88 x 3.61/2.33).
Switching to clang seems to have exposed a BTC issue which needs to be fixed; that might push us to
~6.2ish (more hand waving here), but after that the next big Dhrystone boost will probably come
from installing an L0 instruction trace cache -
that should bust the 3.61 issue rate up to close to 8 and expose more parallelism to this new load/store
architecture. That's for another day.
Future Work
This has been a big change and it needs more testing. I need to move it back onto the AWS/FPGA platform
for more shaking out; that in itself will be a big job - I'll probably build a 4:2:2:4 system and even
then it may not fit in a VU9P (anyone want to contribute a VU13/VU19 board to the cause?). That's going
to take a long time.
The new storeQ and the pending transactions block have evolved to be quite similar - I can't help but feel
there's fat there that can be removed. The one main sticking point is data lifetimes: the pending
transactions are tied 1-1 with commitQ entries (just located elsewhere, because otherwise routing might be a
nightmare) while storeQ entries can long outlive the commitQ entries that birth them (a storeQ
entry can't issue a dcache change, or the memory request to fill the dcache entry it wants to change,
until after its associated commitQ entry has been committed and is about to be recycled).
I also want to start some more serious work on timing, not to make a real chip but to give me more
feedback on my microarchitecture decisions (I'm sure some stuff will blow up in my face :-). To that end I've
started investigating getting a build into OpenLane/SkyWater - again not with the intent of
taping something out (it's probably too big for any of the open source runs) but more as a reality check.
Next time: Verilog changes
16 Jan 2022
Introduction
Booting Linux isn't going to work without some form of virtual memory. RISC-V has
a well defined spec for VM, and implementation isn't hard - page tables are well defined,
there's nothing particularly unusual or surprising there.
L1 TLBs
We have separate instruction and data level-one TLBs. They're fully associative, which means that
we're not practically limited to power of two sizes (though currently they have 32 entries each);
each entry contains a mapping between an ASID plus the upper bits of a virtual address, and a physical
address - we support the various sizes of pages (both in tags and data).
A fully associative TLB with random replacement makes it harder for Meltdown/Spectre sorts of
attacks to use TLB replacement to attack systems. On VROOM memory accesses that miss in the TLB never result in
data cache changes.
Since the ALUs, and in particular the memory and fetch units, are shared between HARTs (CPUs) in the same
simultaneous multi-threaded core, the TLBs are shared too. We need a way to distinguish mappings between HARTs - this implementation
is simple: we reserve a portion of the ASID which is forced to a unique value for each HART - each HART thinks it has N bits of ASID, in reality there are N+1. There's also
a system flag that we can optionally set that lets SMT HARTs share the full sized ASID (provided the software understands
that ASIDs are system rather than CPU global).
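A hedged sketch of that ASID trick (hypothetical names and widths): the tag actually stored in the TLB has its top bit forced to the HART index unless the sharing flag is set:

module tlb_asid #(parameter N=8)(   // HART-visible ASID width (the width is an assumption)
    input          share_asids,     // system flag: HARTs share one ASID space
    input          hart,            // which SMT HART is doing the lookup
    input  [N:0]   asid,            // ASID field as software wrote it
    output [N:0]   tag_asid);       // value actually stored/compared in the TLB

    // normally the top bit is the HART index, so the two HARTs can never
    // alias each other's mappings; with the flag set software gets the whole
    // field and must manage ASIDs system-wide itself
    assign tag_asid = share_asids ? asid : {hart, asid[N-1:0]};
endmodule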
Currently we use the old trick of doing L1 TLB lookup with the upper bits of a virtual address while using
the lower bits in parallel to do the indexing of the L1 data/instruction caches - large modern cache line sizes
mean that you have to go many ways/sets to get large data and instruction caches - this also helps
with Meltdown/Spectre/etc mitigation.
I’m currently redesigning the whole memory datapath unit to split TLB lookup and data cache access
into separate cycles - mostly to expose more parallelism during scheduling - more about that in
a later posting once it’s all working.
L2 TLBs
TLB misses result in stalled instructions in the commitQ - there's a small queue of pending TLB
lookups in the memory unit and 2 pending lookups in the fetch unit. They feed the table walker
state machine, which starts by dipping into the L2 TLB cache - currently a 4-way, 256-set (so
1k entries total) set associative cache shared between the instruction and data TLBs. TLB data found here is fed directly to the L1 TLBs (a satisfied L1 miss takes 4
clocks).
Table walking
If a request also misses in the L2 TLB cache the table walker state machine starts walking the page table trees.
Full
cache lines of TLB data are fetched into a local read-only cache which contains a small number of entries -
essentially enough for 1 line for each level in the page hierarchy, and 2 for the lowest level, repeated
for the instruction TLB and the data TLB (then repeated again for multiple HARTs).
After initial filling most table walks hit in this cache. This cache is slightly integrated into the L1 I-cache: they share a
read-only access port into the cache coherency fabric, and both can be invalidated by data cache snoop shootdowns.
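For reference, here's the address slicing the walker does at each level - this is just the RISC-V Sv39 layout from the spec (hypothetical module/signal names; the same idea extends to Sv48):

module sv39_vpn(
    input  [38:0] va,          // Sv39 virtual address
    input  [1:0]  level,       // 2 = root of the tree, 0 = leaf
    output [8:0]  vpn);        // index into the 512-entry table at this level

    assign vpn = (level == 2) ? va[38:30] :
                 (level == 1) ? va[29:21] :
                                va[20:12];
endmodule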
TLB Invalidation
TLB invalidation is triggered by executing a TLB flush instruction - these instructions let the instructions
before them in the commitQ execute before they themselves are executed.
At this point they trigger a
commitQ flush (tossing any speculative instructions executed with the old VM mappings). At the same time
they trigger L1 and L2 TLB flushes. Note: there is no need to invalidate the TLB walker’s small data cache
as it will have been invalidated (shot down) by the cache coherency protocols if any page tables were
changed in the process.
Next time: (Once I get it working) Data memory accesses
19 Dec 2021
Introduction
I started doing VLSI design in the early ’90s building graphics
accelerators at 2um and later in the decade building CPUs at 1.8u-0.5u - gates
and pins were expensive - we once bet the company on the viability of a 208
pin plastic package, something that paid off magnificently.
I started this project with the vague idea of "what happens if I
throw a lot of gates at it?" - my original planned development platform was a
board off of Aliexpress, a lot like this one, a Xilinx Kintex 420 based board with embedded
DRAM - my plan was to wire an SD card socket to it and use the internal USB to serial
hardware to boot Linux.
When I first compiled up VRoom! for it I had a slight problem …. the design was 50% too big! oops ….
So I went around looking for bigger targets ….
AWS’s FPGA Instances
AWS provides an FPGA based service based on Xilinx’s Ultrascale VU9Ps - these are much larger
FPGAs, just what the doctor ordered. The F instances seem to be aimed at high
frequency traders - we’re small fry in comparison. They are based on a PCIE board in an
Intel based CPU - we assume they actually put 4 boards into each CPU and sell the
minimum as a virtual instance with 8 virtual cores and a physical FPGA board. This
is the F1 instance that we use for development.
The AWS F instances each include a VU9P, 4 sets of DDR4 (we use one), and several PCIEs
to the host CPU (we use one simple one).
The VU9P is an interesting beast - it seems to actually be a sandwich of 3 dies, with
~1500 wires between the middle die and the upper die and ~1500 wires between the middle die and the lower die. These wires make the thing possible, but they're also a problem:
firstly they are potentially a scarce routing resource (not for us yet), and secondly
they are slow - for speed Xilinx recommends that one carefully pipe all the die crossings (ie a flop on either side). We've
decided not to do that, as it would mean messing with a design where we're actually
trying to carefully debug the actual pipelining for a real CPU. Rather than have this
reality intrude on our design we've had to reduce our target clock from 100MHz to 25MHz -
currently most critical paths have several die crossings.
The above image shows a recent layout plot - you can see the three dies. The upper 25%
of the center and right hand side dies is AWS's "shell", a built-in interface to the
host CPU and one DDR4 controller. There is now a smaller shell available, which we
may switch to soon, that removes a whole lot of functionality that
we don't need and gives us ~10% more gates (but sadly in the wrong places).
Development is done on an AWS FPGA development instance, you don’t need to pay
for a Vivado license if you do your development on AWS. The actual build environment,
documentation (and the shell) is available on github.
Build time is very variable; our big problem is congestion (timing issues come
from that) and builds can take anywhere from 10-20 hours and don't always succeed.
We've had to drop the sizes of our I/D caches by 50% and our BTC by 8x to make this work.
AWS don't want us to trash their hardware, so after you've completed building a new
design you have to submit it for testing - they do a bunch of things, presumably looking for over-current and over-temperature issues (we haven't had one rejected yet). This
takes about an hour.
F1 instances cost ~US$1.50/hour to use, the build systems about US$1/hour.
Chip Architecture
AWS provide their “shell”, a fixed, pre-laid out interface - you can see it in
this block diagram as “AWS support”:
The shell provides AXI interfaces looking inwards (our DRAM controller is a
master, our disk/UART are clients).
We've added to the shell a simple UART, a dumb fake disk controller (really just a 256-byte
FIFO), and a register that allows us to reset the VROOM!. These
simple devices are made visible on the PCIE as registers and mapped into a
user space linux app running on the host Intel CPU.
The VROOM! is instanced with a minimal number of connections (the above devices,
DRAM and clocks). It is essentially the same top level we use in verilog simulation
(from chip.sv down, one CPU and one set of IO devices).
Software
The Linux user space software is a thread that maps the PCIE register
space and then pulls the CPU out of reset. It then loops reading from
the UART and 'disk' registers; if it finds a disk request it provides
data from one of two sources (an OpenSBI/Uboot image or a linux file
system image), and if it finds incoming UART data it wakes a text display thread
to print it on the console. A third thread reads data from the console and
puts it into the UART to send when space is available.
Software release
We haven't included the AWS interface code in the current VROOM! source
release, partly because we don't want to confuse it with the real thing
we're trying to build, but also because it's mostly embedded in code
that is AWS's IP (the shell API is one big verilog module into which one needs
to embed one's own IP) - there are no secrets on our part; we're happy
to share if anyone is interested.
Next time: (probably after Xmas) Virtual Memory
05 Dec 2021
Introduction
A short post this week about physical memory layout and a little bit about booting. I'll talk more
about virtual memory another time.
Physical memory layout
We currently use a 56-bit physical address, this is the address used with the MMU disabled or after a
virtual address has been translated.
Addresses with bit 55 (the most significant bit) set to 0 are treated as cacheable memory space - instruction
and data accesses go through separate L1 caches but the cache coherency protocol makes sure that they see the
same data. Cache lines are 512 bits.
Everything in this space is cache coherent; once the system is booted it's the only place that code can
be executed from. Because it's always coherent, the fence.i instruction is a simple thing for us: all it has to do is
wait for the writes to drain from the store queue and then flush the commitQ (fences go into the commitQ, so all
that happens when the fence reaches the end of the queue); subsequent instruction fetches
pull live modified data from the data cache into the instruction cache.
At the moment we either have a fake memory emulator on the simulator, or connect to DRAM on the AWS FPGA
implementation - both implementations back onto the coherent MOESI bus.
Within this space main memory (DRAM) starts at address 0x00000000. What happens if you access memory
outside of the area where there's real DRAM is undefined, with the proviso that it won't lock up
(writes probably get lost or wrap around, reads may return undefined data - it’s up to the real
memory controller to choose how it behaves - on a big system the memory controller may be on another die).
Addresses with bit 55 set are handled differently for data accesses and instruction accesses:
- instruction accesses go to a boot ROM at a known fixed address; the MMU never generates these addresses
- data accesses go to a shared 64-bit IO bus (can be accessed through the MMU)
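The two bullets above boil down to a simple decode - a hedged sketch with made-up names:

module pa_decode #(parameter NPHYS=56)(
    input  [NPHYS-1:0] addr,       // physical address
    input              is_fetch,   // instruction access?
    output             cacheable,  // coherent cacheable memory space
    output             boot_rom,   // instruction fetches with bit 55 set
    output             io_space);  // data accesses with bit 55 set

    assign cacheable = ~addr[NPHYS-1];
    assign boot_rom  =  addr[NPHYS-1] &  is_fetch;
    assign io_space  =  addr[NPHYS-1] & ~is_fetch;
endmodule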
The current data IO bus contains:
- a ROM containing a device tree image
- timers
- a uart
- gpio
- multiple interrupt controllers PLIC/CLNT/CLIC
- a fake disk controller
Booting
Currently reset causes the CPU to jump to the start of code compiled into hardware in the special instruction space.
That code:
- checks a GPIO pin (reads from IO space)
- if it’s set it jumps to address 0 (this is for the simulator where main memory can be side loaded)
- if it's clear we read a boot image from the fake disk controller (connected to test fixture code on
the simulator, and to linux user space in the AWS FPGA world) into main memory, then jump to it
(currently we load uboot/OpenSBI)
Longer term we plan to put the two L1 caches into a mode at reset where the memory controller is disabled
and the data cache allocates lines on write; the instruction cache will use the existing cache coherency protocol
to pull shared cache lines from the data cache. The on-chip bootstrap will copy data into L1 from an SD/eMMC, validate it
(a crypto signature check), and jump into it. This code will initialize the DRAM controller, run it through
its startup conditioning, initialize the rest of the devices, take the cache controller out of
its reset mode and then finally load OpenSBI/UEFI/uboot/etc into DRAM.
Next time: Building on AWS