VRoom! blog: Bugs

Bugs

A quick note: still working on FP, but I spent some time fixing bugs - thanks to Hirosh who’s been playing with VRoom! I went back and ran some old code fragments and discovered they locked up; that led to rerunning the standard set of regressions we used before switching to running Linux on AWS - it turns out a lot of the code to do with fence instructions and load-conditional instructions was broken by the memory subsystem overhaul.

These are fixed now, though we probably need a bunch of even more subtle fence tests. Running all the regression tests on every change is now a lot easier.

Next time: more FPU stuff

VRoom! blog: Floating Point 1

Introduction

We started work on FP quite a while ago, building an FP adder and multiplier; that part of the design passed a few million test vectors and has been sitting on the shelf since. What we’re working on now is integrating these blocks, along with the missing instructions, into a schedulable FP unit, N of which can then be added to a VRoom! - we’re not done yet, so this blog post is a write-up of how some of this new stuff works - mostly it’s about the register file and instruction scheduling.

ALUs

We have not spent a lot of time building fast (at the gate level) ALUs for VRoom!, at least not yet - we fully expect that a serious implementation will have carefully hand-built register files, adders, data paths, multipliers, barrel shifters, FPUs, TLBs, caches etc.

Instead we’ve made realistic estimates of the number of clocks we think each unit should take to run in a high speed implementation and have built versions that are bit accurate at the ALU block level, but have left the details of their cores to future implementors (along with this code as working DV targets).

In our AWS test system that boots Linux we’ve just let Vivado do its thing with these models and relaxed the timing targets (with the exception of the caches and register files, where we were limited by the SRAM models available and had to do a little bit of special work).

So this FPU implementation will have all the details required to implement the RISC-V 32- and 64-bit FP instruction sets - but the low-level timing work needs an actual timing and process goal before it can really start. So expect verilog ‘*’ and ‘+’ rather than Wallace trees and hand-built adders.

Thinking forward to physical layout is important when we start to talk about how things like ALUs and register files are placed - read onward ….

Registers

A quick reminder: our integer register file is split into 2 parts:

  • ‘architectural’ registers - the ones that a programmer can see, integer registers 1-31, and
  • ‘commit’ registers - these registers are the results of completed operations that have not yet been architecturally committed

placeholder

Commit registers have 3 states:

  • not yet done
  • completed
  • committed

Instructions that use a “not yet done” register as a source cannot be scheduled to an ALU. Instructions whose operands are “completed” or “committed” can be scheduled - “completed” operands are found in the commit register file, “committed” ones in the architectural register file. The output from each instruction initially goes into the commit register file (ALU write ports only go there); when an instruction is committed (ie it can’t be undone by an interrupt, trap or a mispredicted branch) it is copied into the architectural register file (and the corresponding commitQ entry can be reused).
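
For the curious, here’s roughly what resolving one source operand looks like in SystemVerilog - a minimal sketch with made-up signal names and sizes, not VRoom!’s actual code:

typedef enum logic [1:0] {
	NOT_YET_DONE,   // result still being computed - blocks scheduling
	COMPLETED,      // result is in the commit register file
	COMMITTED       // result has been copied to the architectural file
} commit_state_t;

localparam NCOMMIT = 64;               // illustrative commitQ size
commit_state_t  state [NCOMMIT];       // per-commitQ-entry state (see above)
logic [63:0]    commit_rf [NCOMMIT];   // commit register file
logic [63:0]    arch_rf [32];          // architectural register file
logic [5:0]     src_entry;             // commitQ entry producing this operand
logic [4:0]     src_areg;              // architectural register it commits to
logic           ready;                 // operand available: can be scheduled
logic [63:0]    value;

assign ready = (state[src_entry] != NOT_YET_DONE);
assign value = (state[src_entry] == COMPLETED) ? commit_rf[src_entry]
                                               : arch_rf[src_areg];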

A note on write ports: by construction, no two ALUs write to the same address in the commit registers during the same clock. Equally, by filtering, no two architectural register file write ports ever write the same address in the same clock.

It’s possible to generate a commit transfer that attempts multiple writes to the same location, but the commit logic disables the least recent one(s). Initially we got this logic backwards - it took a surprisingly long time before we found a real-world case where that broke something.
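
A sketch of that filtering, assuming (purely as an illustration) that port NPORTS-1 carries the youngest instruction in the transfer - each write is suppressed if any younger port targets the same register:

// names are ours, not VRoom!'s
localparam NPORTS = 8;
logic       valid [NPORTS];    // this port wants to write
logic [4:0] addr  [NPORTS];    // architectural register number
logic       wen   [NPORTS];    // filtered write enables

always_comb
	for (int i = 0; i < NPORTS; i++) begin
		wen[i] = valid[i];
		for (int j = i + 1; j < NPORTS; j++)
			if (valid[j] && (addr[j] == addr[i]))
				wen[i] = 1'b0;    // a younger write to the same register wins
	end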

So how do we implement FP registers in this world? There are lots of possible solutions; let’s talk about two of them.

Firstly we should realize that:

  • commit registers are just bits, they have no idea whether they are FP or integer, or of any particular size (single, double, i64, i32 etc)
  • they are the uncommitted outputs of ALUs
  • they will be written back into an architectural register file - the commitQ entry attached to them knows where they need to go

The outputs of most instructions end up in the integer register file, as do the outputs of some FP instructions. The outputs of most FP instructions end up in the FP register file, as do the results of FP load instructions.

One possible architecture, and the one we’ve chosen to implement for the moment looks like this:

placeholder

You’ll notice that the FPU and integer sides share a single commit register file - this works well for units that might otherwise need both FP and integer write ports (loads and the FPU), but it has the downside that the number of read and write ports on the commit registers continues to grow.

An alternative design, if we’re being careful about layout and don’t mind spending gates to reduce routing congestion, would split the commit registers into FP and integer register files, each with fewer read and write ports (though at any one time half of the entries would go unused) - then we could have something like this:

placeholder

You have to think of this diagram as a vague hint at layout (FPUs on one side, integer ALUs on the other, registers in the middle).

The load/store unit will still need paths to the FPU registers and the FPUs to the integer ones, but in general routing complexity is reduced and the max number of ports per individual register file is lower.

Switching from the current design to this one would be relatively trivial - less work than one might spend switching to a hand-built register file anyway. This is another architectural choice that really needs to be left to those who are actually laying out a real chip, and different people might make different choices.

Scheduling FP Instructions

First of all, we don’t treat FP load/store instructions as being part of the FPU units - they are handled by the load/store unit and mostly treated like all other load/stores (however single-precision loads are NaN-boxed rather than sign extended). So we’re not addressing them here.

In our current design most FP instructions run in 1 clock, with some important exceptions:

  • add - 2 clocks
  • mul, muladd - 3 clocks
  • div/sqrt - N clocks

For any one FPU we can issue one instruction per clock - whichever instruction is scheduled gets to use the operand read ports in the next clock. If we ignore div/sqrt for the moment, we also need to schedule the write port for when the instruction completes.

Unlike the other ALUs, the FPU scheduler needs to keep track of which instructions are active and how long they will take - this isn’t hard, it’s just a 3-bit vector per FPU saying when the FPU will complete previously scheduled instructions (the LSB of the vector is actually ignored because nothing it can schedule will finish in the next clock - remember the scheduler always runs a clock ahead of the ALUs).
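
Something like this per-FPU state (a sketch - the exact bit alignment depends on how the pipeline stages line up, and these names are ours):

// bit k set means a previously scheduled op will use the write port
// k clocks from now; bit 0 is never consulted because the scheduler
// runs a clock ahead of the ALUs
logic [2:0] busy;          // this FPU's outstanding-completion vector
logic       issue;         // the scheduler picked this FPU this clock
logic [2:0] latency_mask;  // e.g. 3'b010 for a 2-clock add,
                           //      3'b100 for a 3-clock mul/muladd

always_ff @(posedge clk)
	if (reset) busy <= 3'b000;
	else       busy <= (busy >> 1) | (issue ? latency_mask : 3'b000);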

Just like for the other ALUs, all the commit stations with FPU ops pending bid for an FPU when they know their operands will be ready in the next clock; unlike normal ALUs they also specify how many clocks their instruction will take - each request is masked with the current state of each FPU before being presented to the scheduler, so a request that can’t run on FPU 0 because of write port contention might be bumped to FPU 1.
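
A sketch of that masking for a single bidding commit station and two FPUs (the real scheduler arbitrates among many stations; names are illustrative):

logic [2:0] busy0, busy1;   // per-FPU completion vectors (as above)
logic       req;            // station bids: operands ready next clock
logic [2:0] req_lat;        // one-hot latency of the bidding instruction
logic       grant0, grant1;

assign grant0 = req && ((req_lat & busy0) == 3'b000);   // fits on FPU 0
assign grant1 = req && !grant0
                    && ((req_lat & busy1) == 3'b000);   // bumped to FPU 1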

We haven’t implemented div/sqrt yet - they’re essentially going to make two requests, one for starting (operand port contention) and one for write back (write port contention). Unlike all the other operations, an FPU can only perform one div/sqrt at a time; everything else is fully pipelined (we can issue an add or mul on every clock). We’ll talk more about div/sqrt in the next blog post when they’re actually implemented.

Still To Do

We know how to handle exceptions - the way RISC-V manages them works very well with our commit model - but there’s still a bunch of work to do there.

We also have a small number of instructions still to implement - basically divide, sqrt and the multiply-add instructions.

Finally, we’re using our own testing for the moment - once we’re done with the above we’ll pull in the more public verification tests and run them before we declare we’re finished.

Next time: more FPU stuff

VRoom! blog: Rename Optimizations

Introduction

The past few weeks we’ve been working on two relatively simple changes in the register renamer: renaming registers known to be 0 to register x0, and optimizing register to register moves.

These are important classes of optimization on x86-class CPUs, a real hot topic - I previously worked on an x86 clone where instructions were cracked into micro-ops; that process generated lots of these moves, so optimizing them matters there. In RISC-V code streams we see relatively fewer (but not 0) register moves.

This particular exercise isn’t expected to give a great performance bump, but it touches the same spot in the design where we will later do opcode merging, so it’s a chance to start cracking that part of the design open.

Our renamer essentially does one thing - it keeps track of where the live value (from the point of view of instructions coming out of the decoder or the trace cache) of each architectural register is actually stored; it could be in the architectural register file or in the commitQ register file. Once an instruction hits the commitQ it can’t be executed until all its operands are either in the commitQ register file (ie done) or in the architectural register file.
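
A sketch of what one rename-table entry might look like (names and sizes are ours, not VRoom!’s):

// for each architectural register: where does its live value come from?
localparam NCOMMIT = 64;                  // illustrative commitQ size

typedef struct packed {
	logic                       in_commitq;  // 1: produced by a commitQ entry
	logic [$clog2(NCOMMIT)-1:0] entry;       // which entry, if so
} rename_t;

rename_t map [32];   // one entry per architectural register

// renaming a source register is a lookup in map[]; renaming a destination
// sets map[rd] to point at the commitQ entry assigned to this instruction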

placeholder

The goal of these optimizations is not so much to remove move instructions from the instruction stream as to let subsequent instructions that depend on them execute earlier (when the source register of the move instruction is ready, rather than when its destination register is).

0 register operations

This optimization is very simple - on the fly, the renamer recognizes a bunch of arithmetic operations that generate ‘0’: add/xor/and/or.

For example “li a0, 0” is really “addi a0, x0, 0”, or there’s “xor a0, t0, t0” etc - it’s a simple operation and is very cheap.

(by the time they hit the renamer all the subtract operations look like adds)
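
A sketch of that recognition (the opcode encodings here are made up for the example, and the list of cases isn’t exhaustive):

localparam OP_ADD = 0, OP_XOR = 1, OP_OR = 2, OP_AND = 3;

function automatic logic result_is_zero(
	input int          op,        // decoded ALU operation
	input logic [4:0]  rs1, rs2,  // source registers (5'd0 is x0)
	input logic        is_imm,    // second operand is an immediate
	input logic [63:0] imm);
	case (op)
	  OP_ADD: result_is_zero = (rs1 == 5'd0) &&
	            (is_imm ? (imm == 64'd0) : (rs2 == 5'd0)); // addi rd,x0,0
	  OP_XOR: result_is_zero = !is_imm && (rs1 == rs2);    // xor rd,t0,t0
	  OP_OR:  result_is_zero = (rs1 == 5'd0) &&
	            (is_imm ? (imm == 64'd0) : (rs2 == 5'd0));
	  OP_AND: result_is_zero = (rs1 == 5'd0) ||
	            (!is_imm && (rs2 == 5'd0));                // and rd,x0,...
	  default: result_is_zero = 1'b0;
	endcase
endfunction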

Move instruction operations

These essentially optimize:

mov	a0, a1
add	t1, a0, t5

into:

mov	a0, a1
add	t1, a1, t5

Note: we don’t remove the mov instruction (that’s a job for another day) as there’s not enough information in the renamer to tell whether a0 will be needed later. The sole goal here is to allow the two instructions to run in the same clock (or rather, for the second instruction to run earlier if there’s an ALU slot available).

Recognizing a move in the renamer is easy: we’re looking for an add/or/xor with one 0 operand.

Managing the bookkeeping for this is much more complex than the 0 case - largely because the register being moved could be in 1 of 4 different places - imagine some code like:

sub	X, ....
....
mov	Y, X
...
add	,,Y

The live value for operand ‘Y’ to the add instruction could now be in:

  • the actual architectural register X
  • the commitQ entry for the sub instruction
  • the commitQ entry for the mov instruction
  • the actual architectural register Y

The commitQ logic that keeps track of registers now needs to be more subtle and recognize these state changes. So does the existing logic in the renamer that handles the interactions between instructions that are decoded in the same clock and update registers referenced by each other.
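
Extending the hypothetical map[] sketch from above, the renaming step for a recognized move might look like this - the destination simply inherits the source’s current location instead of pointing at the mov’s own commitQ entry:

logic       rename_valid;  // an instruction is being renamed this clock
logic       is_move;       // recognized as add/or/xor with a 0 operand
logic [4:0] rd, rs;        // destination and (real) source registers
logic [5:0] my_entry;      // commitQ entry assigned to this instruction

always_ff @(posedge clk)
	if (rename_valid) begin
		if (is_move)
			map[rd] <= map[rs];             // Y inherits X's live location
		else begin
			map[rd].in_commitq <= 1'b1;     // normal case: the value will
			map[rd].entry      <= my_entry; // come from our own entry
		end
	end
// the commitQ logic must separately cope with X's producer committing
// (or X being overwritten) before the mov commits - the four cases above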

Conclusion

There’s no discernible performance bump on dhrystone here (though we can see the optimizations happening at a micro level) - there’s only 1 move and a couple of 0 optimizations per dhrystone loop, and they’re just reducing pressure on the ALUs - so that’s not surprising.

We think this is worth keeping in though; as we start running more complex benchmarks, especially ALU-bound ones, we expect to see the advantages.

A note on Benchmarking

While checking out performance here we noticed a rounding bug in the code we’ve been running on the simulator (but not on the FPGAs). The least significant digits of the “Microseconds for one run through Dhrystone” number we’ve been quoting have been slightly bogus - that number is not used for calculation of the rest of the numbers (including the Dhrystone numbers) so those numbers remain valid.

We recompiled the offending code and reran it on a few past designs … they gave different (and lower) numbers (6.42 vs. 6.49 DMips/MHz) than we had before …. After a lot of investigation we’ve decided that this is caused by code alignment in the recompiled code (another argument for a fully working trace cache).

We could hack on the code to manually align stuff and make it faster (or as fast as it was) but that would be cheating - we’re just compiling it without any tweaking. Instead we’ll just note that this change has happened and start using the new code and compare with the new numbers. What’s important for this optimization process is knowing when the architecture is doing better.

Next time

We’ve been putting off finishing the FPU(s) - we currently have working FP adders (2 clocks) and multipliers (3 clocks) that pass many millions of test vectors; some of the misc instructions still need work, and it all still needs to be integrated into a new FP renamer/scoreboard. Then on to a big wide vector unit …..

Next time: FPU stuff

VRoom! blog: Trace cache - Part 2

Introduction

It’s been a couple of months since I last posted about our work on a trace cache for VRoom!, and it’s become a bit of a slog - trace cache bugs are the worst: suddenly, out of nowhere, your code branches off into the weeds, often caused by something that happened tens of thousands of clocks before. Short story: we’re pausing work on my trace cache implementation to get more important stuff done. Long story: well, read on.

What is good about our current Trace Cache?

You’ll remember from the previous blog post that our trace cache is pretty simple: the idea was to bolt on a direct-mapped cache, filled with traces extracted from the completion end of our commit engine, that competes with the normal icache/branch-predictor/PC/decoder path to provide decoded instructions directly to the rename units.

After the last blog post we added support to stop traces from crossing subroutine call/return transitions; this preserved the call prediction stacks in the BTC and brought overall performance into the same ballpark as the existing design.

Now that it’s up and working it passes all our sanity tests; direct examination of the pipelines shows a higher issue rate (regularly 8 instructions per clock, which was the goal) but with a higher stall rate (because the ALUs are busy and the commit queue is filling) - which indicates that a well-working trace cache probably needs an increase in the number of ALUs and/or load/store units to further increase performance. That in itself is not a bad thing: it means we have more performance choices when choosing a final architecture.

What is bad about our current Trace Cache?

However, while the back-end pipelines are performing better, the one bad thing the benchmarks show is that our naive direct-mapped trace cache bypasses the branch predictor, and as a result performs worse (about 15% worse) on branch prediction - you will recall we’re using a traditional dual combined branch predictor (bimodal and global history) - the trace cache effectively embeds the bimodal BTC data in the cache, but not the global predictor data.

What we really need is a way to incorporate the global predictor into the trace cache tags; that way we can collect multiple traces for the same starting PC address (there are lots of examples of this in the trace cache literature - it’s not a new idea).

This is not a trivial change: it means pulling apart the BTC and PC fetcher code and integrating the trace cache at a lower level - the BTC needs to be able to track both trace streams and icache streams so it can update its predictors from both sources. The current BTC/PC fetcher is very focused on icache boundaries and predicting multiple branches within them; the data coming out of the trace cache tends to have arbitrary PC boundaries and has unconditional branches elided - merging them more closely than we currently have is a non-trivial piece of work.

Note that getting a direct-mapped trace cache to ~5/6 of the performance of the existing CPU is actually not that bad - it means we’re on the right track, just not there yet.

Conclusion

Our basic trace cache design seems to be functional, but it doesn’t perform well enough yet - in the short to medium term it’s become a bit of a derail from making other progress, so we’re going to set it aside for the moment - probably until the next time the BTC has been opened up and spread over the desktop for some major rearrangement; then we’ll dig into integrating global history into the trace cache tags and BTC predictor updates from trace data.

This is not the last we’ve seen of the trace cache.

Our next work: some simple optimisation of moves and the 0ing of registers - some minor instruction recognition in the renamer will allow commitQ entries to resolve earlier (more in parallel), reducing instruction dependencies on the fly.

Next time: Simple Renamer Optimisations

VRoom! blog: Trace cache - Part 1

Introduction

I’ve spent the past month working on adding a trace cache to VRoom! - it’s not done yet but is mildly functional. Here’s a quick write-up of what I’ve done so far.

What is a Trace cache?

Essentially a trace cache is an instruction cache of already decoded instructions. On a CISC CPU like an Intel/AMD x86 it might contain RISC-like micro-operations decoded from complicated CISC instructions.

On VRoom! we have an instruction decoder that on every clock can decode 4 32-bit instructions, 8 16-bit instructions, or some mixture of the two from a 128-bit bundle fetched from the L1 instruction cache. Theoretically we can feed our CPU’s core (the renamer onwards) with 8 instructions on every clock, but in reality we seldom do. There are four main reasons why:

  • we can only get 8 instructions per clock through the decoder if they are all 16-bit instructions - most instruction streams are a mix of both 16-bit and 32-bit instructions
  • there’s a very rough rule of thumb from computer architecture in general that says that about every 5th instruction is a branch - with potentially 8 instructions per icache fetch that means that (very roughly) on average only 5 of them get executed
  • branch targets are not always on a 16-byte (8 instruction) boundary - when this happens we only issue instructions after the target
  • when the BTC makes a micro-misprediction we sometimes lose a clock (and up to 8 instructions) retrying the icache fetch

And indeed, measuring instruction use rates on VRoom! (again on the still-useful dhrystone - which currently gets almost perfect branch prediction) we see 3.68 instructions per 128-bit instruction bundle - on average fewer than 4 instructions issued per clock to the core. Of course every instruction stream is different - we look both at largish benchmark streams and drill down into particular hot spots, looking to understand low-level issues.

There is also another issue for tight loops - imagine something like this, looking for the null at the end of a string:

loop:
	lbu	a0, 0(a1)
	add	a1, a1, 1
	bnez	a0, loop

our current execution core can issue up to 4 loads per clock and execute 3 add/branch instructions per clock - theoretically it should be able to speculatively go around this loop 3 times every 2 clocks, if only it could be fed from the instruction stream quickly enough (9 instructions every 2 clocks) - but in this case the current fetch/decode logic is limited to 3 instructions per clock (or per 128-bit bundle), and it might even be less (3 instructions every 2 clocks) if the instructions happen to straddle a 128-bit bundle boundary.

So a trace cache lets us record ‘traces’ of instructions - streams of decoded consecutive instructions - and then consistently play them back at full rate, so that the core can execute them without the limitations imposed by the decoder, the instruction cache, and the alignment of instructions within the instruction stream.

A trace cache is an interesting beast - when it hits it replaces not only the instruction cache but also part of the branch target cache - and there’s likely some interesting interactions between the BTC and the trace cache.

A trace cache hit returns N consecutive instructions (not necessarily consecutive in memory, but consecutive within the instruction stream) along with the PC of the instruction to execute after the last instruction in the trace. On VRoom! we don’t ‘execute’ unconditional branches - they’re handled solely in the fetch unit, so they never find their way into the execution pipe or the trace cache - which means the ‘next’ address of an instruction might be the address that the branch after it jumped to - so this loop:

loop:	beqz	a1, done
	add	a1, a1, -1
	j	loop
done:

results in only 2 instructions in the commitQ instruction stream, and 2 instructions per loop iteration in the trace cache.

Implementation

Each instruction from VRoom!’s decoders is represented by a roughly 160-bit bundle (registers, PC, decoded opcode, functional unit number, predicted branch destination, immediate value etc) - after renaming (matching registers to the commitQ entries that will produce their values) the renamed instructions are entered as a block into the commitQ.

The commitQ is an ordered list of instructions to be out-of-order and speculatively executed - because of branch mispredictions, traps/faults, TLB misses etc an instruction may or may not reach the far end of the commitQ, where it will be ‘committed’ (ie will update the architectural state - writes to memory will actually happen, register updates will be committed to the architecturally visible register file etc).

The commit end of the commitQ can currently commit up to 8 instructions per clock (essentially we need a number here that is higher than our target IPC so that we can burst where possible - it’s strongly related to the number of write ports into the architectural register file). It regularly achieves 8 instructions per clock, but on average it’s more like 2-3 instructions per clock (this is how we measure our system IPC). One of our main goals with VRoom! is to increase this number to something more like 4 IPC - though of course it’s going to be different on each instruction stream.

So let’s look at how a trace cache (I$0 - a 0th-level instruction cache) fits into our system:

placeholder

Red arrows are control signals, black ones are instruction bundles, blue ones register access.

Note that the trace cache is filled from the portion of the commitQ containing committed instructions - on every clock it can be filled with 0-8 instructions. On the fetch side, up to 8 instructions are read at a time and replace the data from the decoders going into the rename unit.

The fetch side of things is managed by the PC (program counter) unit; on every clock it fetches from the normal instruction cache and the trace cache in parallel, and if it hits in the trace cache the instruction cache data is discarded. If the decode unit is currently busy (from the previous fetch) the PC has to wait a clock before allowing the trace cache data to be used (to allow the decoded instructions to be renamed); subsequent trace cache data can be dispatched on every clock. Because we fetch instruction cache data in parallel with trace cache data, switching back costs a 1 clock (decode time) penalty. Sometimes we gain a clock instead - for example, when we take a branch misprediction the pipe will start one clock earlier if the new branch target is in the trace cache.

Normally the PC unit uses BTC information, or information from the decoder, to choose the next PC to fetch - on a trace cache hit we use the next-PC value from the trace cache instead.
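
In other words the next-PC mux gains one more input - a sketch, with our own signal names:

logic        tc_hit;                 // the trace cache hit this clock
logic        btc_taken;              // BTC predicts a taken branch
logic [63:0] pc, next_pc, tc_next_pc, btc_target;

always_comb
	if (tc_hit)
		next_pc = tc_next_pc;        // trace cache supplies the next PC
	else if (btc_taken)
		next_pc = btc_target;        // normal predicted-taken path
	else
		next_pc = pc + 64'd16;       // fall through to the next 128-bit bundle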

Initial Trace Cache Implementation

This is all still very much a work in progress; here’s a brief description of our initial implementation. It’s not good enough yet, but the basic connection described above probably won’t change.

This initial implementation consists of N traces, each containing 8 instructions (~1k bits per trace) - on the read side they act as a fully associative cache, so N can be any value - it doesn’t have to be a power of 2. We’re starting with N=64, a number pulled out of thin air.

Each trace has the following metadata:

  • PC of the first entry (tag for associative lookup)
  • next PC of the last entry in the trace
  • bit mask describing which instruction entries in the trace are valid (the first M will be set)
  • use tag - a pinned counter - incremented on each valid access, decremented every P CPU clocks; a trace entry with a use tag of 0 can be stolen
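
In SystemVerilog the metadata might look something like this (sizes illustrative - this is a sketch, not VRoom!’s actual code):

typedef struct packed {
	logic [63:1] pc;        // PC of the first entry - the associative tag
	                        // (bit 0 of a RISC-V PC is always zero)
	logic [63:1] next_pc;   // PC to fetch after the last entry
	logic [7:0]  valid;     // which of the 8 slots hold instructions
	logic [3:0]  use_cnt;   // the use tag: bumped on each valid access,
	                        // decayed every P clocks; 0 => can be stolen
} trace_meta_t;

trace_meta_t meta [64];     // N = 64 lines, looked up associatively by pc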

Other CPUs’ trace cache implementations generally have much larger trace entries (far more than 8 instructions) - ours is an attempt to create a system that will self-assemble larger traces out of these smaller 8-instruction chunks - that part is still a work in progress.

Internally the current implementation looks like this:

placeholder

The fetch interface is described above. The commit-side interface has to handle unaligned incoming data - for example, imagine on clock N we get 4 instructions and on clock N+1 we get 8 - we want to be able to write 4 in clock N, 4 more into the same line on clock N+1, and the final 4 instructions into a new line, also on clock N+1. For this reason there’s a temporary holding buffer, “waiting”, that holds data and allows us to share a single write port. In essence there’s a single write port with a shift mux and a holding register on its input.
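
A sketch of that holding buffer’s bookkeeping (names and widths are ours) - a full line’s worth of instructions goes through the single write port, and any leftovers start accumulating for the next line:

typedef logic [159:0] decoded_t;   // the ~160-bit decoded bundle from above

decoded_t   waiting [8];           // holding buffer in front of the write port
logic [4:0] waiting_cnt;           // slots of "waiting" currently in use (0-7)
logic [4:0] incoming_cnt;          // 0-8 newly committed instructions this clock

always_ff @(posedge clk)
	if (waiting_cnt + incoming_cnt >= 5'd8)
		// a full line: write waiting[] plus the first 8-waiting_cnt incoming
		// entries through the single write port; the leftovers (at most 7)
		// shift down and start filling the next line
		waiting_cnt <= waiting_cnt + incoming_cnt - 5'd8;
	else
		waiting_cnt <= waiting_cnt + incoming_cnt;   // keep accumulating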

There are two pointers to trace lines:

  • “Last” which points to the last line written (in case we want to add more data to the end of it)
  • “Next” which will be the next line we can write to (if one is available) - this is calculated using the “use tag” metadata a clock ahead

During operation a fresh line of data goes into Next, at which time Last becomes Next - new data that is part of the same trace and fits into the line pointed to by Last goes there; skipping any input invalidates Last.
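
A sketch of the pointer handling (again, made-up names; the “use tag” metadata drives the choice of next_line a clock ahead):

logic [5:0] next_line;        // next free/stealable line, chosen a clock ahead
logic [5:0] last_line;        // the last line written
logic       last_valid;       // last_line may still be appended to
logic       wrote_new_line;   // this clock's data started a fresh line
logic       input_skipped;    // a gap in the incoming committed stream

always_ff @(posedge clk)
	if (wrote_new_line) begin
		last_line  <= next_line;   // Last becomes the line just written
		last_valid <= 1'b1;
	end else if (input_skipped)
		last_valid <= 1'b0;        // can no longer append to Last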

Performance

So the above design works and passes our CPU sanity tests (there were lots of interesting bugs around interrupts) - however the performance is meh: it’s a little slower than the existing non-trace-cache model, which obviously isn’t good enough.

Why? Mostly it seems that the current mechanism for choosing traces is too simple - it’s greedy and grabs any trace whenever there’s space available; we do start traces at jump targets, but that’s about it.

A particular problem is that we happily collect traces across subroutine calls and returns - this results in branch mispredictions and confuses the branch target cache. Another issue is that we collect the same data over and over again, essentially in different phases (at different offsets) within different trace entries, which limits the effective size of the trace cache.

So the next plan is to keep the existing working design but modify the criteria for when traces are started and finished - initially making sure that they end at subroutine calls and returns, and then adding some smarts around looping, perhaps limiting trace length.

Conclusion

Our basic trace cache design seems to be vaguely functional, but doesn’t perform well - originally we’d hoped for a 1.5-2x boost - we think this is still possible - watch this space in a month or so.

Next time: Trace cache - Part 2