VRoom! blog - More Experiments in Macro Op Fusion

Introduction

This week we’ve done some more experiments in macro op fusion, and found an unrelated, surprising performance boost - ~12%, taking us to 9.76 DMips/MHz - almost at 10!

Limits

A reminder from last time, we’ve set some limits on our macro op fusion experiments:

  • we’re going to avoid creating new instructions that would require adding more read/write register ports to ALUs
  • we’re also going to avoid (this time around) trying to deal with instructions where the second of a pair might fault and not be executed. For example, if we fuse ‘lui t0, hi(xx);ld t0, lo(xx)(t0)’ the load might take a page fault and be restarted expecting the side effect of the lui to have already happened - it’s not that this can’t be done, just that it’s hard and we’re performing a simple experiment

Choices

Last time we merged return instructions with preceding add instructions, and constant loads with compares.

This time around we’ve added 4 more optimizations: merging lui/auipc instructions with add instructions, and lui instructions with load/store instructions. These are the ‘traditional’ candidates for fusion.

Planning ahead for this we’d previously plumbed 32-bit constants throughout our design; up until now they’ve either been high 20-bit lui/auipc values or low 12-bit signed constants. Now we’ve added logic to combine these values - if we limit merging to constants that the decoder actually produces then it doesn’t require a full 32-bit adder per renamer (7 of them for 8 renamers, to merge the data between two adjacent units): the lower 12 bits just get passed through and the upper 20 bits get an optional decrementer. We share this logic across the 4 newly merged instruction pairs, as sketched below.
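To make that concrete, here’s a minimal Python sketch of the merge (our illustration, not the actual Verilog; merge_hi_lo and sign_extend12 are invented names): the low 12 bits pass straight through, and the upper 20 bits only ever need an optional decrement, because a negative low offset borrows exactly one from the upper part.

def sign_extend12(lo12):
    """Interpret a 12-bit immediate as a signed value (-2048..2047)."""
    lo12 &= 0xFFF
    return lo12 - 0x1000 if lo12 & 0x800 else lo12

def merge_hi_lo(hi20, lo12):
    """Merge a lui/auipc hi20 field with an add/load/store lo12 offset.

    The lower 12 bits are passed through unchanged; the upper 20 bits are
    decremented by one when lo12 is negative (bit 11 set) - no full 32-bit
    adder is needed per renamer."""
    upper = (hi20 - 1) & 0xFFFFF if lo12 & 0x800 else hi20 & 0xFFFFF
    return (upper << 12) | (lo12 & 0xFFF)

# sanity check against the obvious 32-bit addition
X = 0xDEADBEEF
hi20 = ((X + 0x800) >> 12) & 0xFFFFF        # what the compiler puts in the lui
lo12 = X & 0xFFF                            # what it puts in the add/ld/st
assert merge_hi_lo(hi20, lo12) == ((hi20 << 12) + sign_extend12(lo12)) & 0xFFFFFFFF
assert merge_hi_lo(hi20, lo12) == X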

The 4 new instruction pairs we’re supporting are:

1)

lui	a, #hi(X)
add	a, a, #lo(X)

becomes:

nop
lui	a, #X

2)

auipc	a, #hi(X)
add	a, a, #lo(X)

becomes:

nop
auipc	a, #X 

3)

lui	a, #hi(X)
ld.x	y, #lo(X)(a)

becomes:

lui	a, #hi(X)
ld.x	y, #X(x0)

4)

lui	a, #hi(X)
st.x	y, #lo(X)(a)

becomes:

lui	a, #hi(X)
st.x	y, #X(x0)

Some notes:

  • as per last time nops get swallowed in the commit stage
  • the new auipc created in 2) gets the pc of the original auipc
  • load/store instructions still perform the lui (so that exceptions can be restarted - this doesn’t break the second limitation above), but the load/store can run a clock earlier because it doesn’t have to wait for the lui - we may remove this limitation in the future
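To show how the four rewrites above hang together, here’s a hedged Python sketch of the pattern matching (the decoded-instruction fields ‘op’, ‘rd’, ‘rs1’, ‘imm’ are invented for illustration, and merge_hi_lo is the helper from the sketch above - none of this is the renamer’s real data structure):

from copy import deepcopy

def try_fuse_constant_pair(first, second):
    """Attempt to fuse an adjacent (first, second) decoded pair, returning a
    replacement pair or None if no rule applies."""
    # 1) & 2): lui/auipc a, hi ; add a, a, lo  ->  nop ; lui/auipc a, full constant
    if first['op'] in ('lui', 'auipc') and second['op'] == 'addi' \
            and second['rd'] == first['rd'] and second['rs1'] == first['rd']:
        fused = deepcopy(first)
        fused['imm'] = merge_hi_lo(first['imm'], second['imm'])
        return {'op': 'nop'}, fused

    # 3) & 4): lui a, hi ; ld/st y, lo(a)  ->  keep the lui, rebase the access on x0
    if first['op'] == 'lui' and second['op'] in ('load', 'store') \
            and second['rs1'] == first['rd']:
        mem = deepcopy(second)
        mem['rs1'] = 0                                    # x0
        mem['imm'] = merge_hi_lo(first['imm'], second['imm'])
        return first, mem

    return None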

Performance

Like previous optimizations this doesn’t affect Dhrystone numbers at all, particularly because Dhrystone doesn’t use any of these sequences in its core. Nothing surprising here …. however ….

More Performance

However this week we also did some ‘maintenance’. We’d been using an old library for our benchmarks; a few weeks back we switched to supporting the B and K RISC-V architectural extensions, and the B extension in particular contains instructions designed to help with string processing - strcmp is a core part of Dhrystone. This week we upgraded to the new library routines, and we also optimized them for VRoom! (mostly this involved aligning code to 16-byte boundaries).

The results have been better than we expected - a 12% increase in Dhrystone numbers, to 9.76 DMips/MHz - a better number than we’d previously assumed possible without a trace cache. We’re now seeing an IPC of 3.58 (max 8) and an issue rate of 3.66 (out of 8). In the future, with a functioning trace cache (and an issue rate close to 8), we predict that the maximum performance could be in the ~20 DMips/MHz range (assuming that our execution-side improvements keep up - at this point they’re already a bit ahead).
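For what it’s worth, the ~20 DMips/MHz projection is just a linear scaling of the measured issue rate - a back-of-the-envelope check in Python:

# if a trace cache pushes the issue rate from 3.66 towards 8 and the execution
# side keeps up, Dhrystone should scale roughly linearly with issue rate
current_dmips_per_mhz = 9.76
current_issue_rate = 3.66
peak_issue_rate = 8.0

projected = current_dmips_per_mhz * peak_issue_rate / current_issue_rate
print(f"projected ceiling ~{projected:.1f} DMips/MHz")    # ~21.3, ie the ~20 range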

So excellent news - not from this week’s changes, but essentially due to updates made a month or so ago whose effects had not been fully counted yet.

Next time: More misc extensions

VRoom! blog - Experiments in Macro Op Fusion

Introduction

Spent some time this week experimenting with macro op fusion. Many of the arguments around the RISC-V ISA have been answered with “you can solve that with macro op fusion” - the idea that you can take adjacent instructions in an instruction stream and merge them into a single instruction that can’t itself be expressed directly in the instruction set.

We’ve always planned that VRoom! would support fusion that made sense in the renamer stage (post decode) - today’s work is a trial experiment fusing some common sequences just to get a feel for how hard they are to implement.

Limits

First some limits:

  • we’re going to avoid creating new instructions that would require adding more read/write register ports to ALUs
  • we’re also going to avoid (this time around) trying to deal with instructions where the second of a pair might fault and not be executed. For example, if we fuse ‘lui t0, hi(xx);ld t0, lo(xx)(t0)’ the load might take a page fault and be restarted expecting the side effect of the lui to have already happened - it’s not that this can’t be done, just that it’s hard and we’re performing a simple experiment

Choices

We decided to choose 3 simple instruction pairs. These occur reasonably often in instruction streams, and because they involve fused branches and ALU operations they work particularly well with our change a while back to merge our ALU and branch units. The downside is that previously it was possible to build ALUs that shared the adder used by normal addition operations with the magnitude comparator used for branches; now we need to be able to do both at once.

The first choice is to merge an add immediate (or an add to x0) instruction with an indirect branch that doesn’t write an output register; this catches a whole bunch of common code that occurs at the end of a subroutine:

add	sp, sp, 40
ret

li	a0, 0
ret

mv	a0, s0
ret

The one limitation is that the destination of the add must not be the indirect register used by the jump/ret.
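A hedged Python sketch of that first pattern check (the field names are ours, invented for illustration): the first instruction must be an add-immediate or an add involving x0, the second must be an indirect jump that writes no register, and the add’s destination must differ from the jump’s base register.

def can_fuse_add_ret(add, jump):
    """First fusion rule: addi/li/mv followed by a plain indirect jump (ret)."""
    is_simple_add = add['op'] == 'addi' or (add['op'] == 'add' and 0 in (add['rs1'], add['rs2']))
    is_plain_ret = jump['op'] == 'jalr' and jump['rd'] == 0      # writes no output register
    return is_simple_add and is_plain_ret and add['rd'] != jump['rs1']

For ‘add sp, sp, 40; ret’ this passes because the add writes sp while the ret jumps via ra.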

The second and third choices marry a load-immediate-to-a-register (add immediate to x0) instruction with a conditional branch comparing that register against another, into one instruction:

li	a0, #5
beq	a0, a1, target

li	a0, #5
beq	a1, a0, target

becomes:

li      a0, #5;beq	#5, a1, target

li      a0, #5;beq	a1, #5, target

This mostly means the branch instruction runs one clock earlier, which makes no difference in performance if it is correctly predicted (a predicted branch just checks that it did the right thing and quietly continues on), but reduces mispredicted branch times by effectively 1 clock.

All 3 optimizations free up an ALU slot for another instruction.

Implementation

We perform this optimization in our renamer - remember we have 8 renamers renaming up to 8 decoded instructions from the previous clock. We add logic to each renamer unit to recognize the addition-style operations we want as the first instruction of a pair, and pass that information to the next renamer (the first renamer gets told “no instruction matches”).

Each renamer also recognizes the second instruction of the pair and gates that with the match from the previous renamer in the chain. Care is taken to make logic paths that do not go through the full layer of renamers.

If a renamer matches a branch (and a previous add) then it modifies the ‘control’ field (the portion that describes what sort of operation is being done), tags itself as now creating an output register (the addition’s output register), and copies over the immediate portion of the addition and its source register (which, in the ret case, it fetches as rs2).

If a renamer discovers that it’s feeding an add to a fused pair then it marks itself as not creating an output (not writing a register); this effectively makes it a no-op - the commit stage will mark the instruction as done and it will never be scheduled to an ALU. The renamer logic already contains a layer that packs down decoded instructions, but if we used that to remove this instruction it would create circular logic that wouldn’t be synthesisable. If we removed it in the next stage instead (by adding a ‘pack’ layer to the logic that feeds the commitQ) it would be very expensive and would break the renaming (which renames to future commitQ locations) - so instead we create a no-op entry (but see below).
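Putting the last two paragraphs together, the match chain behaves roughly like this (a Python model of the idea, not the Verilog; the predicate names are invented):

def rename_bundle(decoded, matches_first_of_pair, matches_second_of_pair):
    """decoded is the (up to 8) decoded instructions from the previous clock.
    Returns one disposition per slot: 'normal', 'fused', or 'nop'."""
    dispositions = []
    prev_matched = False                  # renamer 0 is told "no instruction matches"
    for insn in decoded:
        if prev_matched and matches_second_of_pair(insn):
            dispositions[-1] = 'nop'      # the feeding add no longer writes a register
            dispositions.append('fused')  # this slot absorbs its immediate / source reg
        else:
            dispositions.append('normal')
        prev_matched = matches_first_of_pair(insn)   # passed on to the next renamer
    return dispositions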

The actual implementation in the ALU is pretty simple: we have to allow for an r2+immediate addition and for immediates to be fed into the branch comparator.
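In software terms the fused operation the ALU/branch unit now performs looks roughly like this (a sketch with our own argument packing, not the real datapath):

def fused_alu_branch(rs1, rs2, imm, branch_cmp=None):
    """One clock's worth of fused work: an rs2 + imm addition (the folded
    add/li/mv, whose operands were remapped onto rs2/imm by the renamer) plus,
    for conditional branches, a comparison with the immediate fed straight
    into the comparator. rs1 stays free for the branch side (the ret base, or
    the other compared register). Returns (add_result, branch_taken)."""
    add_result = (rs2 + imm) & ((1 << 64) - 1)
    taken = branch_cmp(imm, rs1) if branch_cmp else True    # ret is always taken
    return add_result, taken

So ‘li a0, #5; beq a0, a1, target’ becomes one call with rs2 = x0, imm = 5 and rs1 = a1, and ‘add sp, sp, 40; ret’ becomes one call with rs2 = sp, imm = 40 and rs1 = ra.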

Further optimization

You’ve likely noticed that we can’t optimize add/branch pairs that cross instruction bundle boundaries (the boundaries in the instruction stream that our 4-8 instruction decoder decodes from). Once we implement a trace cache, instruction streams will be presented to the renamer again each time they are replayed; if such a split instruction pair shows up again there it can be re-optimized.

If we’re running a trace cache we can also purge no-op instructions from the cached instruction stream as it’s filled.

Performance

Here’s a simple example of an add/ret optimization - before, with the return shown in black:

placeholder

and after:

placeholder

Notice that the add (at 0x36d4) is nullified. All that green in the instructions around there means that the ALUs are busy, so removing one instruction helps. Those blue dotted lines coming down out of the add instruction (dependencies on the new SP) now come out of the branch (return) instead.

So we know it’s working - how much faster is it on our favorite Dhrystone benchmark? Like last time it makes no difference at all - at the moment performance on that benchmark is completely limited by the issue rate from the instruction fetch/decoders.

We didn’t really expect these optimizations to help Dhrystone much anyway, as all its branches are predicted, and the big advantage of executing branches early is that it speeds up mispredictions. However it will also speed up instruction streams that are ALU bound, as it frees up a slot.

Any design is a balance between the decode rate and the rate at which the decoded instructions are executed, and that balance is going to change with different sets of code. Essentially our current bunch of changes, and last week’s, are building pressure on the execution side of that trade-off. We can’t easily speed up the decoder, so we will return to our trace cache design, which should be able to issue 8 instructions/clock - to support that we’ll need more execution capability.

Next time: More misc extensions

VRoom! blog - one more clock

Introduction

Over the past few weeks we’ve announced a bunch of incremental VRoom! performance increases; this week we found another.

To recap

One of the fixes we’d found a couple of weeks ago was able to pull in an extra clock from the load data path - essentially our commit Q entries predict when an instruction will be completed a clock before its data becomes ready; this way the instructions that use that output can be scheduled a clock ahead. That bought us a 6% Dhrystone performance bump.

We also increased the depth of the commit Q, something we’d always planned on doing - bumping it from 32 entries to 64 bought us another ~22%. You can see it in the trace below from Dhrystone: along the green vertical cursor there are 35 active entries (more than half the 64-entry commit Q size). In addition there’s a lot of green in the boxes - that indicates instructions waiting for ALUs to become available. That’s a good thing: as we reported last time, increasing the number of ALUs from 3 to 4 has no effect here, but the longer commit Q gives the pipe more leeway to move instructions in time to make better use of the 3 ALUs we do have.

placeholder

Another Clock ….

Bruce Hoult introduced me to a benchmark of his. It’s based on primes, which makes it a great test of branch predictors - not so much of how good they are, because no one builds branch predictors to predict primes (if we could, that whole P vs. NP thing would be moot), more a test of what happens when branch predictors fail. In our case it shows our miss cost, ameliorated by the fact that sometimes we can resolve a missed branch high in the commit Q (because we execute instructions out of order) and refill it before it empties.

One thing I noticed in this benchmark is that it does one of the worst-case things: a load followed by a (possibly mispredicted) dependent branch. That sort of latency tends to slow a CPU’s pipe to a crawl - terrible if you find it in your program, but as a benchmark it’s excellent, because we can look at it and see if there’s anything we can do to make things go faster …..

In this case it appeared that the previous load clock bug I’d found and fixed didn’t really apply; it turns out there was a path there which could also be tightened up. It’s an interesting one because it’s all tangled up in the logic that handles restarting mispredicted instructions …. the sort of logic you can’t compile into synchronous logic because it has circular feedback paths in it.

However it turns out that one can tease apart the load completion logic and the “instruction is being killed” logic and effectively speculatively complete the load, knowing that if the load instruction is going to be killed anyway there’s another mechanism that’s going to do that a clock later, before the instruction is committed. The part of the logic this was entangled with also stops a store Q entry being allocated for a killed instruction; that still needs to happen, but it’s now unrelated.

(by ‘killed’ here we’re talking about the instructions following a mispredicted branch or a trap that need to be cleared from the commitQ in order to restart the instruction stream)

So we got one less clock on a lot of instructions - how much faster is Dhrystone now? The answer is none, 0%, no faster at all. As we mentioned last week, at this point Dhrystone’s performance on VRoom! is essentially limited by its issue rate (the speed at which the front end can decode and issue instructions), and that’s limited by the branchy nature of the benchmark. We issue bundles of instructions on 16-byte boundaries, so we can issue up to 8 instructions per clock if they’re all 16-bit instructions and 4 if they’re all 32-bit (we’re operating on a mixture); also, whenever we branch into or out of the middle of a 16-byte bundle we have to ignore the instructions not executed. Dhrystone is pretty ‘branchy’ and loses a lot to these issues - currently we see an issue rate of 3.58 instructions per clock (the max would be 8).
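A rough model of why the issue rate tops out where it does (purely illustrative arithmetic, not the fetch logic):

def issuable(insn_lengths, enter_offset=0, exit_offset=None):
    """Count the instructions of one 16-byte fetch bundle that actually issue.
    insn_lengths are 2-byte (compressed) or 4-byte instructions laid out in the
    bundle; enter_offset/exit_offset model branching into or out of its middle."""
    count, offset = 0, 0
    for length in insn_lengths:
        if offset >= enter_offset and (exit_offset is None or offset < exit_offset):
            count += 1
        offset += length
    return count

print(issuable([2] * 8))                            # 8 - all 16-bit, straight line
print(issuable([4] * 4))                            # 4 - all 32-bit
print(issuable([2, 4, 2, 4, 4], enter_offset=6))    # 3 - branched into the middle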

A while back we made an attempt to up issue rates by creating a trace cache (a level 0 virtual i-cache that holds decoded instructions). That experiment was a bit of a failure, but now that we’re starting to become issue limited it’s probably worth reopening that particular can of worms ….

And possibly yet another Clock ….

Here’s another trace from dhrystone:

placeholder

It’s another case where the commitQ is pretty full - the interesting stuff starts at the top: there’s a “2a BR 386a” - that’s a subroutine return - followed by an indirect load from the GP. The address read from there is then used as the base address for another load, and then those two addresses are used as the base addresses for a sequence of load/store pairs.

The load/store unit is getting a great workout there, at times issuing 6 instructions/clock (for address processing) and 4 loads and 4 stores per clock (those vertical red dotted lines are the data moving). What’s not so obvious is that there are load instructions there whose blue portion is not 2 clocks wide … that means they are waiting for store instructions ahead of them whose addresses we don’t yet know (and which therefore can’t yet be entered into the storeQ). Because we don’t know the (physical) addresses those instructions will store to, the following load instructions can’t bypass them in time, because they might alias (ie a store instruction might change the data a load instruction is trying to load).

Anyway, looking at this closely, we suspect there might be an extra clock to be had here for the case where the store resolves but the addresses don’t alias (when they do alias the load has to wait until it can snoop the stored data from the storeQ).
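The check in question is conceptually something like this (a simplified Python sketch at byte granularity; the real storeQ logic is rather more involved):

def load_may_proceed(load_addr, load_size, older_stores):
    """Can a load bypass the stores ahead of it in program order?

    older_stores is a list of (addr, size, addr_known) tuples for older stores
    still in flight. The load must wait if any older store's address is still
    unknown, or overlaps the loaded bytes (in which case it will later snoop
    the data from the storeQ instead). Illustrative only."""
    lo, hi = load_addr, load_addr + load_size
    for addr, size, addr_known in older_stores:
        if not addr_known:
            return False                       # might alias - can't tell yet
        if addr < hi and lo < addr + size:
            return False                       # overlaps - wait and snoop the storeQ
    return True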

However, even if we hunt down this clock (it’s likely going to be a difficult timing path), benchmarked performance probably won’t increase, for all the reasons we’ve explained above.

The core instruction retirement rate here is also strange - it’s retiring 8 instructions per clock every other clock (2f-36, waits a clock, then 37-3e, waits a clock, 3f-06, waits a clock, 07…), which is a bit odd and worth spending some time on.

The next step is going to be to build some more visualization tools to represent these relationships alongside the existing pipeline tool; there’s likely more subtlety to understand.

Next time: More misc extensions

VRoom! blog - faster still!

Introduction

Last week we announced a new Dhrystone number: 7.16 DMips/MHz at an average ~2.8 IPC (instructions per clock). We keep quoting DMips/MHz because it’s a great number for measuring what a particular architectural implementation can do. IPC is more specific - it tells us something about how our particular architecture is doing. In our case we can issue a peak of 8 instructions per clock (if we have no branches and all RISC-V C 16-bit instructions); of course real instruction streams don’t do this - on Dhrystone we measure on average ~3.5 (out of 8) instructions per 16-byte bundle. VRoom! can also retire a peak of 8 instructions per clock (limited by the number of write ports in the architectural register file); the IPC numbers we quote are the average number of instructions retired per clock.

Our IPC number is actually slightly low, as a small number of instructions (mostly no-ops and unconditional branch instructions) don’t make it past the decode stage, and so never get ‘retired’.

Micro-architectural Tools

Our major microarchitectural tool has been a simple trace, it logs instructions entering commit buffers, their execution and their retirement - it’s great for debugging pipe problems and particularly pipe wedges.

What it doesn’t do is give us a lot of insight into when and why individual instructions wait in the commit Q. Now we have a new tool for exploring the pipe:

placeholder

Each box represents an instruction; to the right is the commit Q index, the functional unit and the PC. Instructions that align on the left were issued in the same clock (ie they come from the same 16-byte fetch bundle); instructions that align on the right are retired together (we never retire out of order - things retired together are committed together).

We can click on any instruction and see it highlighted (black); the instructions it depends on also get highlighted (red and blue). Clicking back in time we can explore dependency chains.

Within each instruction we can explore why it’s being delayed: pink means it’s waiting for other instructions to complete, green means we’re short of functional units, blue means we’re executing, yellow is for store instructions (which can start executing when the address is ready but wait for the store data to come along), and gray just means an instruction is done and waiting to be committed (an instruction can’t be committed until all the instructions ahead of it are committed).

This is great - but it doesn’t (yet) show load/store dependencies. In our new load/store units, load and store instructions get put into a temporary space if there are load/store instructions ahead of them in the instruction stream for which we don’t yet know the address. We allow accesses to different addresses to pass each other in the store and execution queues, but until we know all the addresses we have to wait.

What are we waiting for?

It turns out that resolving those low-level memory dependencies is important - but more importantly there’s a third issue: we can issue 8 instructions per clock and retire 8 (if they’re done), but we have a 32-entry commit Q - which means that really there are only 16 effectively active entries in the middle of the queue.

We have 32 entries in our commit Q because we couldn’t fit any more into the AWS FPGA test environment; we’d always expected to increase this to 64 or 96 entries (comparable big CPUs in the field tend to have something of the order of 100-200 instructions in flight at a time). By doubling the size of the commit Q we give more things time to finish without causing stalls - with a 64-entry commit Q we can have ~150 instructions in flight at a time (including the store Q).

The results are great - better than we expected. For Dhrystone our previous 7.16 DMips/MHz improves by ~22% to 8.71 DMips/MHz - 3.39 IPC. That’s pretty amazing - we know that Dhrystone, as we’ve compiled it for RISC-V, issues instructions at an average of 3.52 instructions per 16-byte bundle (per clock). Since we can’t retire more than we issue, that means we only have about 4% more to get from the execution unit; for this workload the two sides are pretty well matched.
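The ~4% figure is just the ratio of the two averages:

issued_per_clock = 3.52      # average instructions issued per 16-byte bundle / clock
retired_per_clock = 3.39     # measured IPC with the 64-entry commit Q

headroom = issued_per_clock / retired_per_clock - 1
print(f"execution-side headroom left ~{headroom:.1%}")    # ~3.8%, call it 4%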

Note: there’s a lot of green in some parts of the above trace (and even more in other places), which implies that we might perform better with 4 rather than 3 ALUs (‘green’ means an instruction is waiting for an ALU). We did do that experiment, but it turns out that, like the last time we tried it, the performance is identical (to the same number of clocks). Even though the 4th ALU gets a lot of use, there’s enough slack in the larger commit Q that instructions get shifted in time but end up being retired at the same rate. So we’ll stay with 3 ALUs for the moment; if we build a 2-way SMT version we’ll likely need to increase this to at least 4.

Of course Dhrystone is just one benchmark, and we’re pretty much done working on it, time to try something different. It may also be a good time to revisit a trace cache (which will up the issue rate of inner loops, and since Dhrystone is effectively perfectly predicted should up the issue rate to 8 instructions per bundle).

Next time: More misc extensions

VRoom! blog - quick note on a bug

TLDR

After the recent new work we went to spend some time just going over the core pipes at the micro level, making sure we hadn’t broken anything - and found something that’s been broken since we rewrote the load/store unit.

Essentially the commit units are usually able to predict that an instruction will be completed one clock before its resulting data is available to be bypassed, so that units dependent on that data can be scheduled for the next clock. Usually this is easy because most instructions have a fixed latency - but loads take a variable amount of time, and even though we know a clock before writing the loaded data back into the register file, we weren’t signalling it - this is now fixed, though it will be a difficult timing path.
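A toy model of the scheduling trick (the clock numbering and names are ours):

def earliest_consumer_clock(data_ready_clock, signals_one_clock_early):
    """A dependent instruction is picked the clock after its source signals
    completion; signalling one clock before the data is ready lets the
    consumer line up exactly with the bypassed data."""
    signal_clock = data_ready_clock - (1 if signals_one_clock_early else 0)
    return signal_clock + 1

print(earliest_consumer_clock(10, True))     # 10 - back-to-back with the load data
print(earliest_consumer_clock(10, False))    # 11 - the extra clock the bug was costing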

Results: a new Dhrystone number of ~7.16 DMips/MHz - an increase of ~6%

Pulling another ~11% more performance out of the design warrants some more exploration in this area. We know that for the above benchmark we’re retiring on average ~2.8 instructions per clock, but the front end is decoding on average ~3.5 instructions per 16-byte bundle (per clock - Dhrystone is very ‘branchy’, so it’s not really possible to build a faster decoder). That means we can estimate a potential top Dhrystone number of ~9 DMips/MHz, so we have about another 25% to find on the execution side.
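That estimate is a simple linear scaling of the measured rates:

current_dmips = 7.16
retired_per_clock = 2.8      # measured average IPC
issued_per_clock = 3.5       # average front-end issue per 16-byte bundle / clock

ceiling = current_dmips * issued_per_clock / retired_per_clock
print(f"issue-limited ceiling ~{ceiling:.2f} DMips/MHz")                        # ~8.95, ie ~9
print(f"left to find on the execution side ~{ceiling / current_dmips - 1:.0%}")  # ~25%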

Off to work on some better pipeline visualization tooling, and to experiment with a 64 entry commit queue (something intended for a real chip anyway).

Next time: More misc extensions