VRoom! blog: Trace cache - Part 1

Introduction

I’ve spent the past month working on adding a trace cache to VRoom! - it’s not done yet but is mildly functional. Here’s a quick write up of what I’ve done so far.

What is a Trace cache?

Essentially a trace cache is an instruction cache of already decoded instructions. On a CISC CPU like an Intel/AMD x86 it might contain RISC-like micro-operations decoded from complicated CISC instructions.

On VRoom! we have an instruction decoder that every clock can decode 4 32-bit instructions, 8 16-bit instructions, or some mixture of the two from a 128-bit bundle fetched from the L1 instruction cache. Theoretically we can feed our CPU’s core (the renamer onwards) with 8 instructions on every clock, but in reality we seldom do. There are four main reasons why:

  • we can only get 8 instructions per clock through the decoder if they are all 16-bit instructions - most instruction streams are a mix of both 16-bit and 32-bit instructions
  • there’s a very rough rule of thumb from computer architecture in general that says that about every 5th instruction is a branch - with potentially 8 instructions per icache fetch that means that (very roughly) on average only 5 of them get executed
  • branch targets are not always on a 16-byte (8 instruction) boundary - when this happens we only issue instructions after the target
  • when the BTC makes a micro-misprediction we sometimes lose a clock (and up to 8 instructions) retrying the icache fetch

And indeed, measuring instruction use rates on VRoom! (again on the still useful dhrystone - which currently gets almost perfect branch prediction) we see 3.68 instructions per 128-bit instruction bundle - on average fewer than 4 instructions issued per clock to the core. Of course every instruction stream is different - we look both at largish benchmark streams and drill down into particular hot spots to understand low level issues.

There is also another issue for tight loops - imagine something like this looking for the null at the end of a string:

loop:
	lb	a0, 0(a1)
	addi	a1, a1, 1
	bnez	a0, loop

our current execution core can issue up to 4 loads per clock and execute 3 add/branch instructions per clock - theoretically it should be able to speculatively go around this loop 3 times every 2 clocks - if only it could be fed from the instruction stream quickly enough (9 instructions every 2 clocks). In this case the current fetch/decode logic is limited to 3 instructions per clock (one 128-bit bundle per clock) - it might even be less (3 instructions every 2 clocks) if the instructions happen to straddle a 128-bit bundle boundary.

So a trace cache allows us to record ‘traces’ of instructions - streams of decoded consecutive instructions from the system and then consistently play them back at the full rate so that the core can execute them without the limitations placed on them by the decoder, instruction cache, and the alignment of instructions within an instruction stream.

A trace cache is an interesting beast - when it hits it replaces not only the instruction cache but also part of the branch target cache - and there’s likely some interesting interactions between the BTC and the trace cache.

A trace cache hit returns N consecutive instructions (not necessarily consecutive in memory, but consecutive within the instruction stream) along with the PC of the instruction to execute following the last instruction in the returned trace. On VRoom! we don’t ‘execute’ unconditional branches - they’re handled entirely in the fetch unit, so they never find their way into the execution pipe or the trace cache - which means the ‘next’ address of an instruction might be the address that the branch after it jumped to - so this loop:

loop:	beqz	a1, done
	addi	a1, a1, -1
	j	loop
done:

results in only 2 instructions in the commitQ instruction stream and 2 instructions per loop in a trace cache.

Implementation

Each instruction from VRoom!’s decoders is represented by a roughly 160-bit bundle (registers, PC, decoded opcode, functional unit number, predicted branch destination, immediate value etc) - after renaming (matching registers to the commitQ entries that will produce their values) the renamed instructions are entered as a block into the commitQ.
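
As a rough illustration only (the field names and widths below are invented, not the actual VRoom! encoding), you can picture the decoded bundle as a packed struct along these lines:

// illustrative only - made-up field names/widths, not the real VRoom! format
typedef struct packed {
	bit [4:0]  rd, rs1, rs2;    // architectural register numbers
	bit [47:1] pc;              // PC of the instruction
	bit [7:0]  op;              // decoded internal opcode
	bit [2:0]  unit;            // functional unit number
	bit [47:1] branch_dest;     // predicted branch destination
	bit [31:0] immed;           // immediate value
	bit        short_inst;      // came from a 16-bit encoding
} DECODED_INST;                     // ~150 bits with these made-up widths

After rename the register fields refer to commitQ entries rather than architectural registers.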

The commitQ is an ordered list of instructions to be out-of-order and speculatively executed - because of branch mispredictions, traps/faults, TLB misses etc an instruction may or may not reach the far end of the commitQ, where it will be ‘committed’ (ie will update the architectural state - writes to memory will actually happen, register updates will be committed to the architecturally visible register file etc).

The commit end of the commitQ can currently commit up to 8 instructions per clock (essentially we need a number here that is higher than our target IPC so that we can burst where possible - it’s strongly related to the number of write ports into the architectural register file). It regularly achieves 8 instructions in a clock, but on average it’s more like 2-3 instructions per clock (this is how we measure our system IPC). One of our main goals with VRoom! is to increase this number to something more like 4 IPC - of course it’s going to be different on each instruction stream.

So let’s look at how a trace cache (I$0 - 0th level instruction cache) fits into our system

[Figure: how the trace cache (I$0) fits into the pipeline]

Red arrows are control signals, black ones are instruction bundles, blue ones register access.

Note that the trace cache is filled from the portion of the commitQ containing committed instructions - on every clock it will be filled with 0-8 instructions. On the fetch side, data is read up to 8 instructions at a time and replaces the data from the decoders going into the rename unit.

The fetch side of things is managed by the PC (program counter) unit - on every clock it fetches from the normal instruction cache and the trace cache in parallel; if it hits in the trace cache the instruction cache data is discarded. If the decode unit is currently busy (with the previous fetch) the PC has to wait a clock before allowing the trace cache data to be used (to let the decoded instructions be renamed); subsequent trace cache data can be dispatched on every clock. Because we fetch instruction cache data in parallel with trace cache data, switching back causes a 1 clock (decode time) penalty. Sometimes we gain a clock - for example when we take a branch misprediction, the pipe will start one clock more quickly if the new branch target is in the trace cache.

Normally the PC uses BTC information, or information from the decoder to choose what the next PC fetch will be - on a trace cache hit we use the next PC value from the trace cache instead.
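
As a hedged sketch (the signal names here are invented), the next-PC selection becomes something like:

// illustrative next-PC selection - signal names are made up
always @(*)
casez ({tcache_hit, btc_taken, dec_redirect})
3'b1??:  next_pc = tcache_next_pc;	// trace cache supplies the follow-on PC
3'b01?:  next_pc = btc_target;		// otherwise follow the BTC prediction
3'b001:  next_pc = dec_target;		// or a redirect found by the decoder
3'b000:  next_pc = pc + 16;		// fall through to the next 128-bit bundle
endcase
// (misprediction redirects from the execution end override all of the above)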

Initial Trace Cache Implementation

This is all still very much a work in progress; here’s a brief description of our initial implementation. It’s not good enough yet, but the basic connection described above probably won’t change.

This initial implementation consists of N traces each containing 8 instructions (~1k bits per trace) - on the read side they act as a fully associative cache - so N can be any value, doesn’t have to be a power of 2, we’re starting with N=64, a number pulled out of thin air.

Each trace has the following metadata (sketched in code after this list):

  • PC of the first entry (tag for associative lookup)
  • next PC of the last entry in the trace
  • bit mask describing which instruction entries in the trace are valid (the first M will be set)
  • use tag - a pinned counter - incremented on each valid access, decremented every P CPU clocks, a trace entry with a use tag of 0 can be stolen
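
For illustration only (names and widths invented, sized for N=64 traces of 8 entries each), the per-trace metadata and the associative lookup might look roughly like this:

// illustrative sketch - not the actual VRoom! structures
localparam NTRACE = 64;
typedef struct packed {
	bit        valid;
	bit [47:1] start_pc;	// PC of the first entry - the associative tag
	bit [47:1] next_pc;	// PC to fetch after the last entry in the trace
	bit [7:0]  valid_mask;	// which of the 8 slots hold instructions
	bit [3:0]  use_count;	// aging counter - 0 means the line can be stolen
} TRACE_META;

TRACE_META meta[0:NTRACE-1];	// N need not be a power of 2

// fully associative lookup of the PC coming from the fetch unit (fetch_pc)
logic                      tcache_hit;
logic [$clog2(NTRACE)-1:0] tcache_index;
always @(*) begin
	tcache_hit = 0;
	tcache_index = 0;
	for (int i = 0; i < NTRACE; i++)
		if (meta[i].valid && meta[i].start_pc == fetch_pc[47:1]) begin
			tcache_hit = 1;
			tcache_index = i[$clog2(NTRACE)-1:0];
		end
end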

In general other CPUs’ trace cache implementations have much larger trace entries (far more than 8 instructions) - our implementation is an attempt to create a system that will self-assemble larger trace lines out of these smaller 8-instruction chunks - this part is still a work in progress.

Internally the current implementation looks like this:

[Figure: internal structure of the trace cache]

The fetch interface is described above. The commit-side interface has to handle unaligned incoming data - for example, imagine that on clock N we get 4 instructions and on clock N+1 we get 8 - we want to be able to write 4 in clock N, 4 more into the same line on clock N+1, and the final 4 instructions into a new line on clock N+1. For this reason there’s a temporary holding buffer “waiting” that holds data and allows us to share a single write port. In essence there’s a single write port with a shift mux and a holding register on its input.

There are two pointers to trace lines:

  • “Last” which points to the last line written (in case we want to add more data to the end of it)
  • “Next” which will be the next line we can write to (if one is available) - this is calculated using the “use tag” metadata a clock ahead

During operation a fresh data line goes into Next, at which point Last becomes Next - new data that is in the same trace and fits into the trace pointed to by Last goes there; skipping any input invalidates Last.
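
A very rough sketch of that write-side bookkeeping (names invented; the shift mux and the actual line writes are omitted, and commit_valid/commit_count/same_trace are assumed to come from the commit-side logic):

// illustrative write-side state only - not the real implementation
logic [$clog2(NTRACE)-1:0] next_line;	// free line (use tag == 0), computed a clock ahead
logic [$clog2(NTRACE)-1:0] last_line;	// the line we wrote most recently
logic                      last_valid;	// last_line can still be appended to
logic [4:0]                last_count;	// instructions already in last_line

always @(posedge clk)
	if (commit_valid) begin			// 1-8 committed instructions arrived this clock
		if (last_valid && same_trace && (last_count + commit_count) <= 8)
			last_count <= last_count + commit_count;	// append to Last
		else begin
			last_line  <= next_line;			// start a fresh line at Next
			last_valid <= 1;
			last_count <= commit_count;
		end
	end else
		last_valid <= 0;		// a gap in the input ends the current trace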

Performance

So the above design works and passes our CPU sanity tests (there were lots of interesting bugs around interrupts), however the performance is meh - it’s a little slower than the existing non-trace-cache model, which obviously isn’t good enough.

Why? Mostly it seems that the current mechanism for choosing traces is too simple: it’s greedy and grabs any trace when there is space available - we do start traces at jump targets, but that’s about it.

A particular problem is that we happily collect traces across subroutine calls and returns - this results in branch mispredictions and confuses the branch target cache. Another issue is that we are collecting the same data over and over again essentially in different phases (at different offsets) within different trace entries which limits the effective size of the trace cache.

So the next plan is to keep the existing working design but modify the criteria for when traces are started and when they finish - initially making sure that they end at subroutine calls and returns, and then adding some smarts around looping, perhaps limiting trace length.

Conclusion

Our basic trace cache design seems to be vaguely functional, but doesn’t perform well - originally we’d hoped for a 1.5-2x boost - we think this is still possible - watch this space in a month or so.

Next time: Trace cache - Part 2

VRoom! blog: Combining ALUs and Branch Units

Introduction

Wow, this was a fun week - VRoom! made it to the front page of HackerNews. For those new here, this is an occasional blog post on architectural issues as they are investigated - VRoom! is very much a work in progress.

This particular blog entry is about a recent exploration around the way that we handle branches and ALUs.

Branch units vs ALU units

The current design has 1 branch unit per hart (a ‘hart’ is essentially a RISCV CPU context - an N-way simultaneous multi-threaded machine has N harts even if they share memory interfaces, caches and ALUs). It also has M ALUs - currently two.

After a lot of thinking about branch units and ALU units it seemed that they have a lot in common - both have a 64-bit adder/subtracter at their core, and both use 2 register read ports and a write port. Branch units only use the write port for call instructions - some branches (unconditional relative branches) don’t actually hit the commitQ at all, and conditional branches simply check that the predicted destination is correct: if so they become an effective no-op, otherwise they trigger a commitQ flush and a BTC miss.
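
To see why the merge is cheap, here's a hedged sketch (not the actual VRoom! code, operand names invented) of the shared datapath - a single 64-bit add/subtract provides both the ALU result and everything a conditional branch needs:

// illustrative only: one 64-bit adder/subtracter shared by ALU and branch work
wire [63:0] b_inv = is_sub ? ~b : b;
wire [64:0] sum   = {1'b0, a} + {1'b0, b_inv} + {64'b0, is_sub};

wire [63:0] alu_out = sum[63:0];		// add/sub result

// branch conditions fall out of the same subtract (is_sub = 1 for compares)
wire eq  = (a == b);
wire ltu = ~sum[64];				// no carry out => unsigned a < b
wire lt  = (a[63] ^ b[63]) ? a[63] : sum[63];	// signed a < b

The branch side then only has to compare the resolved outcome against what the BTC predicted.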

So we’ve made some changes to the design to optionally build a combined ALU/branch unit, and to be able to build the system with those instead of the existing independent branch and ALU units (it’s a Makefile option so it’s relatively easy to change). Running a bunch of tests on our still useful Dhrystone we get:

Branch  ALU  Combined  DMips/MHz  Area  Register Ports
   1     2      0        6.33      3x        6/3
   0     0      2        6.14      2x        4/2
   0     0      3        6.49      3x        6/3
   0     0      4        6.49      4x        8/4

Which is interesting - 3 combined Branch/ALU units outperform the existing 1 Branch/2 ALU with roughly the same area/register file ports. So we’ll keep that.

It’s also interesting that 4 combined ALUs perform exactly the same as the 3 ALU system (to the clock) even though the 4th ALU gets scheduled about 1/12 of the time - essentially, because we’re scheduling ALUs out of order (and speculatively), the 3rd ALU happily takes on that extra work without changing how fast the final instructions get retired.

One other interesting thing here, and the likely reason for much of this performance improvement is that we can now retire multiple branches per clock - we need to be able to do something sensible if multiple branches fail branch prediction in the same clock - the correct solution is to give priority to the misprediction closest to the end of the pipe (since the earlier instruction should cause the later one to be flushed from the pipe).

What’s also interesting is: what would happen if we built a 2-hart SMT machine? Previously such a system would have had 2 branch units and 2 ALUs - looking at current simulations a CPU keeps 1 ALU busy close to 90% of the time, the second ~50%, the 3rd ~20% - while we don’t have any good simulation data yet we can guess that 4 combined ALUs (so about the same area) would likely satisfy a dual SMT system - mostly because the 2 threads would share I/D caches and as a result run a little more slowly (add that to the list of future experiments).

VRoom! go Boom!

Scheduling N units is difficult - essentially we need to look at all the entries in the commitQ and choose the one closest to the end of the pipe that is ready to perform an ALU operation, and send it to ALU 0. That’s easy - the core of it looks a bit like this (for an 8 entry Q):

always @(*)
casez (req) 
8'b???????1: begin alu0_enable = 1; alu0_rd = 0; end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1; end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2; end
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; end
8'b00000000: alu0_enable = 0;
endcase

for ALU 1 it looks like (answering the question: what is the 2nd bit if 2 or more bits are set):

always @(*)
casez (req) 
8'b??????11: begin alu1_enable = 1; alu1_rd = 1; end
8'b?????101,
8'b?????110: begin alu1_enable = 1; alu1_rd = 2; end
8'b????1001,
8'b????1010,
8'b????1100: begin alu1_enable = 1; alu1_rd = 3; end
.....
8'b00000000: alu1_enable = 0;
endcase

For more ALUs it gets more complex, and for a 32/64 entry commitQ it’s also much bigger; for dual SMT systems there are 2 commitQs, so 64/128 entries (we interleave the request bits from the two commitQs to give them fair access to the resources).

Simply listing all the bit combinations with 3 bits set out of 128 bits in a case statement gets unruly - but we really do want to express this in a manner that provides the maximum amount of parallelism to the synthesis tools we’re using, in the hope that they’ll find optimizations that are not obvious - so long ago we reformulated it to something like this:

always @(*)
casez (req) 
8'b???????1: begin alu0_enable = 1; alu0_rd = 0;
		casez (req[7:1]) 
		7'b??????1: begin alu1_enable = 1; alu1_rd = 1; end
		7'b?????10: begin alu1_enable = 1; alu1_rd = 2; end
		7'b????100: begin alu1_enable = 1; alu1_rd = 3; end
		.....
		7'b1000000: begin alu1_enable = 1; alu1_rd = 7; end
		7'b0000000: alu1_enable = 0;
		endcase
	     end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1; 
		casez (req[7:2]) 
		6'b?????1: begin alu1_enable = 1; alu1_rd = 2; end
		6'b????10: begin alu1_enable = 1; alu1_rd = 3; end
		6'b???100: begin alu1_enable = 1; alu1_rd = 4; end
		.....
		6'b100000: begin alu1_enable = 1; alu1_rd = 7; end
		6'b000000: alu1_enable = 0;
		endcase
	     end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2; 
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; alu1_enable = 0; end
8'b00000000: begin alu0_enable = 0; alu1_enable = 0; end
endcase

etc etc - we have C code that will spit this out for an arbitrary number of ALUs (and arbitrary depths). The 2 ALU scheduler for a 32 entry commitQ happily compiles under verilator/iverilog and on Vivado (yosys currently goes bang! - we suspect it’s upset by this combinatorial explosion). When we switched to 3 ALUs (we tried this a while ago) it happily compiled on verilator (takes a while) and runs. When we compiled up the 4 ALU scheduler on verilator it went Bang! - the kernel OOM killer got it (on a 96GB laptop) - looking at the machine generated code it was 200k lines of verilog …. oops … 3 ALUs was compiling to 50k lines, 2 ALUs ~10k …. a serious combinatorial explosion!

Luckily we’d already found another way to solve this problem elsewhere (we have 6! address units) so dropping some other code in to generate this scheduler wasn’t hard (900 lines rather than 200k) - we’re going to need to spend some time looking at how well this new code performs; it had always been expected to be one of the more difficult areas for timing - we might need to find some 95% heuristics here that are not perfect but allow us higher core clock speeds - time will tell. Sadly Yosys still goes bang, it must be something else.
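
For the record, one scalable way to formulate an N-wide picker (not necessarily the approach VRoom! actually uses) is to compute a running count of set request bits and have ALU k take the entry whose count equals k - the loops below unroll into priority logic that grows roughly linearly with queue depth rather than exploding combinatorially:

// a sketch only - pick up to NALU requests from an NQ-entry commitQ
localparam NQ = 32, NALU = 4;
logic [NQ-1:0]         req;		// request bits, bit 0 is nearest the commit end
logic [$clog2(NQ):0]   cnt [0:NQ];	// cnt[i] = number of set bits in req[i-1:0]
logic [NALU-1:0]       alu_enable;
logic [$clog2(NQ)-1:0] alu_rd [0:NALU-1];

always @(*) begin
	cnt[0] = 0;
	for (int i = 0; i < NQ; i++)
		cnt[i+1] = cnt[i] + (req[i] ? 1 : 0);
	for (int k = 0; k < NALU; k++) begin
		alu_enable[k] = 0;
		alu_rd[k] = 0;
		for (int i = 0; i < NQ; i++)
			if (req[i] && cnt[i] == k) begin	// the (k+1)th set bit goes to ALU k
				alu_enable[k] = 1;
				alu_rd[k] = i[$clog2(NQ)-1:0];
			end
	end
end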

Conclusion

Combined branch ALUs seem to provide a performance improvement with little increase in area - we’ll keep them.

Next time: Probably something about trace caches, might take a while

VRoom! blog: Verilog changes, new performance numbers

Introduction

Not really an architectural posting, more about tooling - skip it if that’s not really your thing - also some new performance numbers.

New Performance Numbers

I found the bug in the BTC - mispredicted 4-byte branches that crossed a 16-byte boundary (our instruction bundle size). This is now fixed and we have better dhrystone numbers: 6.33 DMIPS/MHz at an IPC of 2.51 - better than I expected! We can still do better.

More System Verilog, Interfaces for the (mostly) win

Last week I posted a description of my recent major changes to the way that the load/store unit works; what I didn’t touch on is some pretty major changes to the way the actual Verilog code is written.

I’m an old-time verilog hack, I’ve even written a compiler (actually 2 of them, you can thank me for the * in the “always @(*)” construct). Back in the 90s I tried to build a cloud biz (before ‘cloud’ was a word) selling time with it for those last couple of months when you need zillions of licenses and machines to run them on. Sadly we were probably ahead of our time, and we got killed by Enron, the “smartest guys in the room”, who made any Californian business that depended on the price of electricity impossible to budget for. I open sourced the compiler and it died a quiet death.

I started this project using Icarus Verilog, using a relatively simple System Verilog subset, making sure I could also build on Verilator once I started running larger sims.

I’m a big fan of Xs, verilog’s 4-value ‘undefined’ values - in particular using them as clues to synthesis about when we don’t care and using them during simulation to propagate errors when assumptions are not met (and to check that we really don’t care)

For example - a 1-hot Mux:

always @(*)
casez (control) // synthesis full_case parallel_case
4'b1000: out = in0;
4'b0100: out = in1;
4'b0010: out = in2;
4'b0001: out = in3;
default: out = 'bx;
endcase

One could just use:

always @(*) 
casez (control) // synthesis full_case parallel_case
4'b1???: out = in0;
4'b?1??: out = in1;
4'b??1?: out = in2;
4'b???1: out = in3;
endcase

Chances are synthesis will make the same gates - but in the first case simulation is more likely to blow up (and here I mean blow up ‘constructively’ - signalling an error) if more than 1 bit of control is set. “full_case parallel_case” is a potentially dangerous thing, it gives the synthesis tool permission to do something that might not be the same as what the verilog compiler does during simulation (if the assumption here that control is one-hot isn’t met) the “out = ‘bx” gets us a warning as the Xs propagate if our design is broken.

Xs are also good for figuring out what really needs to be reset, propagating timing failures etc etc

Sadly Verilator doesn’t support Xs, while Icarus Verilog does. Icarus Verilog also compiles a lot faster making for far faster turn around for small tests - I had got to a point where I would use Verilator for long tests and Icarus for actual debugging once I could reproduce a problem “in the small”.

One of the problems I’ve had with this project is that in simple verilog there really isn’t a way to pass an arbitrary number of interfaces to a module - for example the dcache needs L dcache read ports, where L is the number of load ports - you can pass bit vectors of 1-bit structures, but not an array of n-bit structures like addresses, or an array of data results.

System Verilog provides a mechanism to do this via “interfaces” - and this time around I have embraced them with a passion - here’s NLOAD async dcache lookups:

interface DCACHE_LOAD #(int  RV=64, NPHYS=56, NLOAD=2);     
	typedef struct packed {
		bit [NPHYS-1:$clog2(RV/8)]addr; // CPU read port
	} DCACHE_REQ;
	typedef struct packed {
		bit hit;
		bit hit_need_o;                                     
		bit [RV-1:0]data;
	} DCACHE_ACK;
	DCACHE_REQ req[0:NLOAD-1];
	DCACHE_ACK ack[0:NLOAD-1];
endinterface
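
For what it's worth, a consumer module can take the whole bundle as a single port and index it - a hypothetical sketch, just to show the shape:

// hypothetical consumer - not part of VRoom!, just shows how the interface is used
module load_unit_example(
	input       clk,
	DCACHE_LOAD dcache);		// all NLOAD req/ack ports arrive as one bundle

	always @(posedge clk)
		if (dcache.ack[0].hit)
			$display("port 0 hit, data=%x", dcache.ack[0].data);
endmodule

// elsewhere:
//	DCACHE_LOAD #(.RV(64), .NPHYS(56), .NLOAD(2)) dc();
//	load_unit_example u0(.clk(clk), .dcache(dc));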

Sadly Icarus Verilog (currently) doesn’t support interfaces and I’ve had to drop it from our supported tools - there’s still stuff in the Makefile for its support (I live in hope :-). Even Verilator’s support is relatively minimal - it doesn’t support arrays within arrays, nor does it support unpacked structures (so debug in gtkwave is primitive).

I’m really going to miss 4-value simulation, I hope IV supports interfaces sometime - I wish I had the time to just do it and offer it up, but realistically one can really only do one Big Thing at a time.

What this does mean though is that, longer term, a lot of those places where I’m currently making on-the-fly verilog from small C programs to get around the inability to pass arbitrary numbers of a thing to a module (the register file comes to mind), or using ifdefs to enable particular interfaces, will likely go away, replaced by interfaces. They won’t completely go away - I still need them to make things like arbitrary width 1-hot muxes - and more generally for large repetitive things that could be ruined by fat-finger typos.

You can find the new load/store interfaces in “lstypes.si”.

Next time: Combining ALUs and Branch units

VRoom! blog: Memory Parallelism

Introduction

I’ve not posted in 2 months, mostly because I’ve been spending the time redesigning the load/store unit - this posting is about those changes. It’s a long post, but then I’ve been doing a lot of work.

The problem

Back when I was bringing up Linux on the system I spent a lot of time looking at low level assembler traces, watching instructions flow through the CPU. It became obvious that one of the bottlenecks is that when large numbers of store instructions occur together they form an effective block to earlier scheduling of loads - in particular this happens at the beginning of subroutine calls when registers are being saved.

Ideally a subroutine should start loading data from memory as soon as possible after entry - in practice code can’t load a value into, say, s0 until after s0 has been saved on the stack - however we have an out-of-order CPU with register renaming, so it can happily load s0 BEFORE it is saved - provided there’s no memory aliasing between the place the save goes to and the place the load comes from. Ideally we’d like to be able to push as many load instructions ahead of all those register save instructions as possible - we want to get any cache miss from main memory started as soon as possible, then we can save the registers into the storeQ and then into cache while we wait.

The old design

The old memory design issued N loads and M stores every clock (variously 2:1, 2:2, and 3:2) - loads and stores executed in one clock (either into the store queue, or retired if there was a cache/storeQ snoop hit).

We used an old trick, the L1 TLB is fully associative, the L1 data cache is set associative - we can look up the TLB at the same time that we do the SRAM index portion (with address bits that are not translated by the VM system) and then compare the output of the TLB (the physical address) with the fetched data cache tags to choose a set - this gives us a lot of useful very low level parallelism in the load/store unit - currently we run it in 1 clock (maybe not at 5GHz :-).

These days there are some downsides to this design - essentially it’s a general computer architecture problem: page sizes have not been increasing while cache line sizes and cache sizes have. The data cache index is generated from bits of the page offset (the 12 LSBs of a virtual address) that are the same for a virtual and physical address - page sizes really haven’t changed since the early 80s, and RISC-V uses the same 4k/12-bits that have been used since then - but cache lines have gotten longer and caches have gotten larger. We use a 512-bit cache line (64 bytes) - that’s 6 bits of addressing - leaving 6 bits for indexing the data cache. This means that a direct mapped cache (ie 1 set) can be at most 64x64 bytes - ie 4k - so to get a 64k L1 cache we need 16 sets, and 128k needs 32 sets. One can’t have a large cache without a large number of sets.
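
Spelling that arithmetic out (the parameter names here are just for illustration):

// illustrative parameter arithmetic only
localparam LINE_BYTES  = 64;				// 512-bit cache line
localparam PAGE_BITS   = 12;				// 4k pages - untranslated address bits
localparam OFFSET_BITS = $clog2(LINE_BYTES);		// 6 bits pick a byte within a line
localparam INDEX_BITS  = PAGE_BITS - OFFSET_BITS;	// 6 bits left to index the cache
localparam WAY_BYTES   = (1<<INDEX_BITS)*LINE_BYTES;	// 64 lines x 64 bytes = 4k per set
localparam SETS_64K    = 64*1024/WAY_BYTES;		// 16 sets for a 64k L1
localparam SETS_128K   = 128*1024/WAY_BYTES;		// 32 sets for a 128k L1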

There are some advantages to having large numbers of sets, mostly around Meltdown/Spectre/etc - using random replacement and large numbers of sets muddies the water for those sorts of exploits - the downsides are essentially a 16:1 or 32:1 mux (and a larger fanout between the TLBs and the comparators) in a critical path.

[Figure: the old load/store unit design]

Getting back to the reason for this redesign - the big problem is that in order to safely reorder the execution of load and store instructions we need to be able to compare their physical addresses when we’re doing the scheduling (not their virtual addresses, as there might be aliasing) - and here we have a design where the TLB lookup and the physically tagged data cache lookup are deeply intertwined. So it has to go ….

There’s another issue - the storeQ - this implicitly orders stores (and loads waiting for stores, or for cache fills) - it’s a queue, and embedded in its design is an ordering: transactions must be retired in order. When a store reaches the head of the queue AND its associated instruction has reached the commit state (ie there’s no chance that a branch misprediction or a trap will cause it not to be executed) it attempts to update the L1 cache; if it gets a miss it triggers a cache line fetch and a subsequent update. This means that we can’t reorder stores to disjoint addresses post commit - and it limits the number of cache line fetches we can have in parallel. The queue is nice in that it inherently embeds an ordering of operations, but in reality most operations don’t require this. Again, this also needs to go ….

The new design

So here’s the new design, it got a lot bigger ….:

[Figure: the new load/store unit design]

First thing to note: it’s a 2 clock latency design - the first clock is triggered when the address register of a load or store becomes available and just performs the TLB lookup portion of the operation. The second clock is triggered when all the address conflicts have been ‘resolved’ (that means that there are no instructions touching the same bytes, that must be done in instruction order, that haven’t been pushed into the storeQ) and, if the instruction is a store, the data to be stored is available from the pipe.

In the second clock load transactions either get resolved because they hit in the dcache or are snooped (from as yet uncommitted stores) from the storeQ. All store and fence transactions, IO transactions, loads that miss in dcache/snoop, loads that suffer hazards (for example a 4 byte load that discovers a single byte in that range waiting to be stored in the storeQ), all of these cases get put into the storeQ.

Pending transactions

The minimum transaction time is now 2 clocks, but it can be many more - we’re processing A address transactions (TLB lookups) per clock - we’re simulating with 6, but will likely have to drop back to 4 on the FPGA system. Some transactions can’t be completed right away: some store transactions might still be waiting for the data to store to become available, some might be hung up because there’s an instruction ahead of them that needs to be executed first (in this context ‘executed’ means a load that hits in the dcache or snoops in the storeQ, or any other instruction that has been pushed into the storeQ), or because there’s a preceding instruction that hasn’t been through the address unit (and therefore we can’t tell whether there might be a load/store dependency).

The Pending Transactions buffer is a list of transactions (one entry for every instruction entry in the commitQ) that have been through the address unit - each entry contains a physical address, a mask of the bytes it will read or write (8 bits on a 64-bit machine), fence/amo dependency information and a hazard bitmask. The hazard bitmask is similar to the ‘hazard’ structure that we used in the old storeQ (and in the new storeQ described below) essentially each entry contains a bitmask (1 bit for every other pending transaction), each bit is set if that other pending transaction entry blocks this one.

For example: a load transaction might be blocked by one or more stores to one or more of the bytes the load reads; it might also be blocked by a fence or amoXXX instruction - it won’t leave the pending transactions buffer until all of these blocking instructions have also left - and it will likely find these hazards again (and maybe new ones) as it’s entered into the storeQ via the snoop operation.
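
As a sketch (names invented, pend[] and older[][] are assumed structures tracking the entries and their instruction order), the blocking test between a pending entry and another one boils down to a physical address compare plus a byte mask overlap:

// illustrative hazard detection between pending entries - not the real code
localparam NPEND = 32;			// one entry per commitQ entry (made-up size)
wire [NPEND-1:0] hazard [0:NPEND-1];	// hazard[i][j]: entry j blocks entry i

genvar i, j;
generate
for (i = 0; i < NPEND; i++) begin : gi
	for (j = 0; j < NPEND; j++) begin : gj
		wire same_dword   = pend[i].addr == pend[j].addr;	// physical double-word address
		wire bytes_clash  = |(pend[i].bytes & pend[j].bytes);	// 8-bit byte masks overlap
		wire either_store = pend[i].is_store | pend[j].is_store; // load-load needs no ordering
		assign hazard[i][j] = pend[i].valid && pend[j].valid && older[j][i] &&
				      (pend[j].is_fence || pend[j].is_amo ||
				       (same_dword && bytes_clash && either_store));
	end
end
endgenerate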

Load/Store Units

The load/store units have a scheduler that looks at how many storeQ entries are free and which pending transactions are ready (have 0 hazards) and chooses N load/store operations to perform. It prioritizes loads, because in general we want to move loads before stores wherever possible and because we also use the load unit to retire completed loads (and things like AMOXXX, LR and SC) - each load unit has a dedicated write port into the general register file.

The load/store scheduler bypasses the pending transaction buffer in the sense that it can pick up entries from the address unit that are about to be written into it - this is how we can do 2 clock operations.

The load unit starts a dcache access and a snoop operation into the storeQ and then (as sketched in code after this list) ….

  • I/O accesses go straight into the storeQ
  • if the snoop into the storeQ hits then it returns the newest data (effectively if there are multiple hits it’s the entry that has hazard bits that none of the others have) and the instruction completes
  • if the snoop hits but it returns a hazard (something we need to wait on to complete before we can proceed, like a fence, or a partially written location) we put the load into the storeQ
  • if there are no hazards and the dcache hits we return the dcache data
  • otherwise we put the entry into the storeQ
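
In rough priority order (signal names invented, shown for one load port) that decision looks something like this:

// illustrative per-port outcome - is_io/snoop_*/dcache_* are assumed inputs
always @(*) begin
	load_done      = 0;		// completes this clock, result goes to the register file
	load_to_storeq = 0;		// gets parked in the storeQ instead
	load_data      = 'bx;		// undefined unless load_done is set
	if (is_io)
		load_to_storeq = 1;			// I/O accesses always go to the storeQ
	else if (snoop_hit && !snoop_hazard) begin
		load_done = 1; load_data = snoop_data;	// forwarded from an uncommitted store
	end else if (snoop_hazard)
		load_to_storeq = 1;			// must wait on a fence/partial store etc
	else if (dcache_hit) begin
		load_done = 1; load_data = dcache_data;
	end else
		load_to_storeq = 1;			// dcache miss - the storeQ will handle the fill
end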

Stores and fences are similar, they always go into the storeQ, we perform a simpler snoop just to discover hazard information.

The new store queue

As mentioned above the storeQ is no longer a simple queue of entries where stuff gets put in at one end and interesting stuff happens to entries that reach the other end (and are ready to be committed).

Now it’s more of a heap, with multiple internal ad-hoc queues being created dynamically on the fly. This is done as part of the snoop we already used for speculatively satisfying load transactions from as yet uncommitted stores, and for the search for hazards (as described above) - we still do that, but now we search for a wider class of ‘hazards’ that also represent store-store ordering (making sure that stores to the same location occur in the same order), load-store dependencies (loads from the same location as stores occur after the stores that are before them in instruction order) and store-load dependencies (stores don’t occur until after the loads that read the current data in a location have completed); we also use hazards to represent fence and amo-style blockages.

A ‘hazard’ is actually represented by a bit mask with one bit for each storeQ entry; a storeQ entry is created with the hazards detected when the storeQ is snooped at the load/store unit stage. Logic keeps track of which hazard bits are still valid, clearing bits as the transactions ahead in the ad-hoc queues are retired - a storeQ entry is not allowed to proceed until all its hazards have been cleared.
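
A hedged sketch of that bookkeeping (names invented; allocate/snoop_hazards/valid/committed are assumed inputs): each entry carries its hazard mask, bits get knocked out as blocking entries retire, and an entry with an empty mask whose instruction has committed is free to proceed.

// illustrative hazard-mask maintenance - not the real code
localparam NSTQ = 32;			// made-up storeQ size
logic [NSTQ-1:0] hazards [0:NSTQ-1];	// hazards[e]: storeQ entries that must retire before e
logic [NSTQ-1:0] retiring;		// one bit per storeQ entry retiring this clock
logic [NSTQ-1:0] may_proceed;

always @(posedge clk)
	for (int e = 0; e < NSTQ; e++)
		if (allocate[e])
			hazards[e] <= snoop_hazards[e];		// mask captured at snoop time
		else
			hazards[e] <= hazards[e] & ~retiring;	// clear bits as blockers retire

always @(*)
	for (int e = 0; e < NSTQ; e++)
		may_proceed[e] = valid[e] && committed[e] && (hazards[e] == 0);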

Store entries also keep track of activity in the cache lines they would access - the load/store unit has an implicit understanding with the underlying cache coherency fabric that it will never make multiple concurrent transaction requests for the same cache line. Now if one storeQ entry makes a request for a cache line the others must wait (but when it arrives they all get woken up). With this new storeQ we don’t have to wait for an entry to reach the head of the queue and we can now have many outstanding cache transactions (before it was just one store and a few loads).

Summary

We’re currently at the point where this new design passes the sanity tests. I’m sure it’s still buggy, and I need to spend some time writing architectural white-box tests to poke into the obviously hard to trigger corners.

It’s big! lots of NxM stuff - but then this whole design was always based on asking the question “what if we throw gates at it?” we have to do lots of comparisons to discover where the hazards are if we want to get real memory parallelism.

I’m currently simulating a 6:4:4:6 system - 6 address units, 4 load and 4 store units and 6 write ports to the storeQ (so no more than 6 of the load/store units can be scheduled per clock, this area needs some work). 4 load units means 4 read ports on the dcache, there’s also still only 1 dcache write port there which limits write bandwidth, and how fast the storeQ can drain, this also needs some work, and will change in the future.

Performance

I’ve done a small amount of micro-benchmarking - close examination of code sequences shows that the area I was particularly targeting (write bursts at the beginning of subroutines due to register saves) performs well: the address units fill to capacity (the addresses are all offsets from the same register, SP, and are ready to schedule almost right away), the store units also fill, and the load units fill as soon as the commitQ starts presenting loads - at that point stores start sitting as pending transactions while loads pass them to be handled - which is also one of our goals.

The main downside is that our one clock load has become a 2 clock load, places where we load an address from memory and then immediately load something from that address suffer.

I had thought I was done with dhrystone, but it still proves to be useful as a small easy to run test that exposes microarchitectural issues so here are the results:

Dhrystone on the old design sat at 4.27 dhrystones/MHz - we recently switched to a clang compile because we expected it to expose more parallelism in the instruction stream (all numbers here, before and after, are running that same code image) - and it is now at 5.88, which meets our original architectural target (a hand wavy number, “5+”, pulled out of thin air …) - not a bad increase.

Dhrystone is a very branchy test - we can issue up to 8 instructions per clock, but running dhrystone we only get 3.61 - that limits how full our pipe can get and how much parallelism can likely happen. On this new run we see an average of 2.33 instructions executed per clock (it can’t be larger than that 3.61 issue rate). This is still a useful benchmark, as it’s worth spending time figuring out where those IPC are being lost - I’m pretty sure the 2-clock load is now getting some of them, but on the whole it’s been a great trade for an almost 40% performance increase. Note that the 3.61 issue rate and 2.33 IPC imply that the largest dhrystone/MHz we can effectively reach with this code base, with tweakage, is ~9.1.

Switching to clang seems to have exposed a BTC issue, which needs to be fixed which might push us to ~6.2ish (more hand waving here), but after that probably the next big dhrystone boost will come from installing an L0 instruction trace cache that should bust the 3.61 issue rate up to close to 8 and expose more parallelism to this new load/store architecture - that’s for another day.

Future Work

This has been a big change, it needs more testing, I need to move it back onto the AWS/FPGA platform for more shaking out, that in itself will be a big job, I’ll probably build a 4:2:2:4 system and even then it may not fit in a VU9P (anyone want to contribute a VU13/VU19 board to the cause?). That’s going to take a long time.

The new storeQ and the pending transactions block have evolved to be quite similar - I can’t help but feel there’s fat there that can be removed. The one main sticking point is data life times, the pending transactions are tied 1-1 with commitQ entries (just located elsewhere because otherwise routing might be a nightmare) while storeQ entries can long outlive the commitQ entries that birth them (a storeQ entry can’t issue a dcache change, or the memory request to fill the dcache entry it wants to change until after its associated commitQ entry has been committed, and is about to be recycled).

I also want to start some more serious work on timing - not to make a real chip but to give me more feedback on my microarchitecture decisions (I’m sure some stuff will blow up in my face :-) - to that end I’ve started investigating getting a build into OpenLane/SkyWater - again not with the intent of taping something out (it’s probably too big for any of the open source runs) but more as a reality check.

Next time: Verilog changes

VRoom! blog: Virtual Memory

Introduction

Booting Linux isn’t going to work without some form of virtual memory. RISC-V has a well defined spec for VM, and implementation isn’t hard - page tables are well defined, and there’s nothing particularly unusual or surprising there.

L1 TLBs

We have separate Instruction and Data level one TLBs, they’re fully associative which means that we’re not practically limited to power of two sizes (though currently they have 32 entries each), each entry contains a mapping between an ASID and upper bits of a virtual address and a physical address - we support the various sizes of pages (both in tags and data).

A fully associative TLB with random replacement makes it harder for Meltdown/Spectre sorts of attacks to use TLB replacement to attack systems. On VROOM memory accesses that miss in the TLB never result in data cache changes.

Since the ALUs, and in particular the memory and fetch units, are shared between HARTs (CPUs) in the same simultaneous multi-threaded core, the TLBs are shared too. We need a way to distinguish mappings between HARTs - this implementation is simple: we reserve a portion of the ASID which is forced to a unique value for each HART - each HART thinks it has N bits of ASID, in reality there are N+1. There’s also a system flag that we can optionally set that lets SMT HARTs share the full sized ASID (provided the software understands that ASIDs are system global rather than CPU global).
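
One plausible sketch of the idea (names invented, and only one reading of how the sharing flag behaves): the TLB tag holds an effective ASID with the hart number glued on top unless the sharing flag is set.

// illustrative effective-ASID formation for a 2-hart SMT core - names made up
localparam ASID_BITS = 7;		// what each hart thinks it has (made-up width)
logic [ASID_BITS-1:0] hart_asid;	// the ASID this hart's software wrote
logic                 hart_id;		// which hart the access came from
logic                 share_asids;	// system flag: ASIDs are system-global

wire [ASID_BITS:0] effective_asid =	// N+1 bits actually stored in the TLB tags
	share_asids ? {1'b0,    hart_asid}	// harts can share entries for the same ASID
		    : {hart_id, hart_asid};	// top bit forced per hart - entries never collide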

Currently we use the old trick of doing L1 TLB lookup with the upper bits of a virtual address while using the lower bits in parallel to do the indexing of the L1 data/instruction caches - large modern cache line sizes mean that you have to go many ways/sets to get large data and instruction caches - this also helps with Meltdown/Spectre/etc mitigation.

I’m currently redesigning the whole memory datapath unit to split TLB lookup and data cache access into separate cycles - mostly to expose more parallelism during scheduling - more about that in a later posting once it’s all working.

L2 TLBs

TLB misses result in stalled instructions in the commitQ - there’s a small queue of pending TLB lookups in the memory unit and 2 pending lookups in the fetch unit - they feed the table walker state machine, which starts by dipping into the L2 TLB cache - currently a 4-way set associative cache with 256 entries per way (so 1k entries total) shared between the instruction and data TLBs - TLB data found here is fed directly to the L1 TLBs (a satisfied L1 miss takes 4 clocks).

Table walking

If a request also misses in the L2 TLB cache the table walker state machine starts walking page table trees.

Full cache lines of TLB data are fetched into a local read-only cache which contains a small number of entries, essentially enough for 1 line for each level in the page hierarchy, and 2 for the lowest level, repeated for instruction TLB and the data TLB (then repeated again for multiple HARTs).

After initial filling most table walks hit in this cache. This cache is slightly integrated into the L1 I-cache - they share a read-only access port into the cache coherency fabric, and both can be invalidated by data cache snoop shootdowns.

TLB Invalidation

TLB invalidation is triggered by executing a TLB flush instruction - these instructions let the instructions before them in the commitQ execute before they themselves are executed.

At this point they trigger a commitQ flush (tossing any speculative instructions executed with the old VM mappings). At the same time they trigger L1 and L2 TLB flushes. Note: there is no need to invalidate the table walker’s small data cache, as it will have been invalidated (shot down) by the cache coherency protocols if any page tables were changed in the process.

Next time: (Once I get it working) Data memory accesses