31 Aug 2022
Introduction
We started work on FP quite a while ago, building an FP adder and multiplier - that part of the design
passed a few million test vectors and has been sitting on the shelf since. What we’re working on now is integrating those blocks, along
with the missing instructions, into a schedulable FP unit, N of which can then be added to a VRoom! core. We’re not done yet, so this
blog post is a write up of how some of this new stuff works - mostly it’s about the register
file and instruction scheduling.
ALUs
We have not spent a lot of time building fast (at the gate level) ALUs for VRoom!, at least not yet - we fully
expect a serious implementation will have carefully hand-built register files, adders, data paths, multipliers, barrel shifters, FPUs, TLBs, caches etc.
Instead we’ve made realistic estimates of the number of clocks we think each unit should take
in a high speed implementation, and have
built versions that are bit accurate at the ALU block level, leaving the details of their cores to
future implementors (with this code as working DV targets).
In our AWS test system that boots Linux we’ve
just let Vivado do its thing with these models and relaxed the timing targets (with the exception
of the caches and register files,
where we were limited by the SRAM models available and had to do a little bit of special work).
So this FPU implementation has all the details required to implement the RISC-V 32- and 64-bit FP
instruction sets, but low level timing work needs an actual timing and process goal before it can really
start. So expect Verilog ‘*’ and ‘+’ rather than Wallace trees and adders.
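As a concrete (if simplified) illustration of that style, a behavioural block looks something like the sketch below - the module name, widths and the 3-clock latency are illustrative assumptions, not the actual VRoom! code:

    module behavioural_mul(
        input             clk,
        input      [63:0] a, b,
        output reg [63:0] prod);

        // pipeline registers exist purely to model the expected latency;
        // the fast multiplier core itself is left to a future implementor
        reg [63:0] p0, p1;
        always @(posedge clk) begin
            p0   <= a * b;     // let the synthesis tool infer a multiplier for now
            p1   <= p0;
            prod <= p1;
        end
    endmodule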
Thinking forward to physical layout is still important, though, when we start to talk about how things like ALUs and register files are arranged - read onward ….
Registers
A quick reminder: our integer register file is split into 2 parts:
- ‘architectural’ registers - the ones that a programmer can see, integer registers 1-31, and
- ‘commit’ registers - these registers are the results of completed operations that have not yet been architecturally committed
Commit registers have 3 states:
- not yet done
- completed
- committed
Instructions that use a “not yet done” register as a source register cannot be scheduled to an ALU.
Instructions whose operands are “completed” or “committed” can be scheduled - “completed” operands are found
in the commit register file, “committed” ones in the architectural register file. The output from each
instruction initially goes into the commit register file (ALU write ports only go there); when an
instruction is committed (ie it can’t be undone by an interrupt, trap or a mispredicted branch) it
is copied into the architectural register file (and the corresponding commitQ entry can be reused).
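A minimal sketch of how those states might be encoded and used to gate scheduling - the encodings and signal names here are assumptions, not the actual VRoom! source:

    // illustrative state encoding for a commit register
    localparam NOT_DONE  = 2'b00;   // result not yet produced
    localparam COMPLETED = 2'b01;   // result is sitting in the commit register file
    localparam COMMITTED = 2'b10;   // result has been copied to the architectural register file

    // a source operand is usable once its producer is completed or committed;
    // completed values are read from the commit RF, committed ones from the arch RF
    wire src_ready     = (src_state != NOT_DONE);
    wire src_from_arch = (src_state == COMMITTED);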
A note on write ports: by construction no two ALUs write to the same address in the commit registers during the
same clock. Equally by filtering no two architectural register file write ports are ever written with
the same address in the same clock.
It’s possible to generate a commit transfer that attempts multiple
writes to the same location, but the commit logic disables the least recent one(s). Initially we got this
logic backwards - it took a surprisingly long time before we found a real-world case where that broke
something.
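A rough sketch of that filtering - the youngest-wins ordering, widths and signal names here are all assumptions for illustration:

    // of the (up to) 8 instructions committing this clock, drop an architectural
    // write if a younger committing instruction writes the same register
    // (a higher index is assumed to be younger here)
    genvar i, j;
    generate
        for (i = 0; i < 8; i = i + 1) begin : dedup
            wire [7:0] younger_same;
            for (j = 0; j < 8; j = j + 1) begin : cmp
                assign younger_same[j] = (j > i) && commit_valid[j] &&
                                         (commit_rd[j] == commit_rd[i]);
            end
            assign arch_we[i] = commit_valid[i] && (commit_rd[i] != 5'd0) &&
                                !(|younger_same);
        end
    endgenerate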
So how do we implement FP registers in this world? There are lots of possible solutions; let’s talk about
two of them.
Firstly we should realize that:
- commit registers are just bits, they have no idea whether they are FP or integer, or of any particular size (single, double, i64, i32 etc)
- they are the uncommitted outputs of ALUs
- they will be written back into an architectural register file - the commitQ entry attached to them knows where they need to go
The outputs of most instructions end up in the integer register file, as do the outputs of some FP
instructions. The outputs of most FP instructions end up in the FP register file, as do the results of
FP load instructions.
One possible architecture, and the one we’ve chosen to implement for the moment looks like this:
You’ll notice that the FPU and integer sides share a single commit register file - this works well
for units that might otherwise need both FP and integer write ports (loads and the FPU); it has
the downside that
the number of read and write ports on the commit register file continues to grow.
An alternative design, if we’re being careful about layout and don’t mind spending gates to reduce routing congestion, could
split the commit registers into FP and integer register files, each with fewer read and write ports (but
with half of the entries unused at any one time) - then we could have something like this:
You have to think of this diagram as a vague hint at layout (FPUs on one side, integer ALUs on the other, registers in the middle).
The load/store unit will still need paths to the FPU registers and the FPUs to the integer ones, but
in general routing complexity is reduced and the max number of ports per individual register file is lower.
Switching from the current design to this one would be relatively trivial - less than the time one might
spend switching to a hand built register file anyway. This is another architectural choice that really
needs to be left to those who are actually laying out a real chip, and different people might make different choices.
Scheduling FP Instructions
First of all, we don’t treat FP load/store instructions as part of the FPU units - they are handled
by the load/store unit and are mostly treated like all other load/stores (except that single-precision loads
are NaN-boxed rather than sign extended). So we’re not addressing them here.
In our current design most FP instructions run in 1 clock, with some important exceptions:
- add - 2 clocks
- mul, muladd - 3 clocks
- div/sqrt - N clocks
For any one FPU we can issue one instruction per clock - whichever instruction is scheduled gets to
use the operand read ports in the next clock - if we ignore div/sqrt for the moment we also need to
schedule the write port when the instruction completes.
Unlike the other ALUs, the FPU scheduler needs to
keep track of which instructions are active and how long they will take - this isn’t hard, it’s just a
3-bit vector per FPU saying when the FPU will complete previously scheduled instructions (the LSB of the vector is
actually ignored because nothing it can schedule will finish in the next clock - remember the scheduler is always running a clock ahead of the ALUs).
Just like with the other ALUs, all the commit stations holding FPU ops make a bid for an FPU when they
know their operands
will be ready in the next clock; unlike normal ALU ops they also specify how many clocks their
instruction will take. That request is masked with each FPU’s current state before being presented to
the scheduler - a request that can’t run on FPU 0 because of write port contention might be
bumped to FPU 1.
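A hedged sketch of that bookkeeping for a single FPU - the exact bit/clock alignment, widths and signal names are illustrative assumptions:

    // fpu0_busy[k] set means a previously granted op will need the write port
    // k clocks from now (latencies here are 1, 2 or 3 clocks; bit 0 never
    // matters because the scheduler runs a clock ahead of the ALUs)
    reg [3:0] fpu0_busy;

    // a commit station bidding with an op of latency 'lat' can only be granted
    // this FPU if that write-back slot is still free; a request that fails here
    // is presented to the next FPU instead
    wire fpu0_ok = ~fpu0_busy[lat];

    always @(posedge clk)
        // outstanding ops move one clock closer to write-back every clock,
        // and a newly granted op books its own slot
        fpu0_busy <= (fpu0_busy >> 1) | (grant_fpu0 ? (4'b0001 << lat) : 4'b0000);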
We haven’t implemented div/sqrt yet - they’re essentially going to have two requests, one for starting
(operand port contention) and one for write back (write port contention). Unlike all the other functions an FPU can only
perform one div/sqrt at a time; everything else is fully pipelined (we can issue an add or mul on every clock). We’ll talk
more about div/sqrt in the next blog post when they’re actually implemented.
Still To Do
We know how to handle exceptions - the way RISC-V manages them works very well with our commit model -
but there’s still a bunch of work to do there.
We also have a small number of instructions still to implement - basically divide, sqrt and the
multiple-add instructions.
Finally we’re using our own testing for the moment, once we’re done with the above we’ll pull in
the more public verification tests and run them before we declare we’re finished.
Next time: more FPU stuff
05 Aug 2022
Introduction
The past few weeks we’ve been working on two relatively simple changes in the register renamer: renaming
registers known to be 0 to register x0, and optimizing register to register moves.
These are important classes of optimization on x86-class CPUs, a real hot topic - previously I worked
on an x86 clone
where instructions were cracked into micro-ops; lots of moves were generated by that process, so
optimizing them matters there. On RISC-V code streams we see relatively fewer (but not 0) register moves.
This particular exercise isn’t expected to have a great performance bump, but it’s the same spot in the
design that we will also be doing opcode merging so it’s a chance to start to crack that part of the
design open.
Our renamer essentially does one thing - it keeps track of where the live value of each architectural register
(from the point of view of instructions coming out of the decoder or the trace cache) is actually stored;
it could be in the architectural register file or in the commit Q register file. Once an instruction
hits the commit Q it can’t be executed until all its operands are either in the commit Q register file (ie
done) or in the architectural register file.
The goal of these optimizations is not so much to remove
move instructions from the instruction stream as to allow subsequent instructions that depend on them to
execute earlier (when the move’s source register is ready rather than when its destination register is).
0 register operations
This optimization is very simple - on the fly the renamer recognizes a bunch of arithmetic operations as generating
‘0’ - add/xor/and/or.
For example “li a0, 0” is really “add a0,x0,0”, or “xor a0,t0,t0” etc - it’s a
simple operation and is very cheap.
(by the time they hit the renamer all the subtract operations look like adds)
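A rough sketch of the sort of check involved - the opcode and field names below are invented, and the real renamer check covers more patterns than this:

    wire rs1_zero = (rs1 == 5'd0);
    wire rs2_zero = (rs2 == 5'd0);
    wire gen_zero = (op_addi & rs1_zero & (imm == 0))           // li rd, 0  (addi rd, x0, 0)
                  | (op_xor  & (rs1 == rs2))                    // xor rd, t0, t0
                  | (op_and  & (rs1_zero | rs2_zero))           // and with x0
                  | ((op_add | op_or | op_xor) & rs1_zero & rs2_zero);
    // when gen_zero is set the renamer simply maps rd to the always-zero x0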
Move instruction operations
These essentially optimize:
mov a0, a1
add t1, a0, t5
into:
mov a0, a1
add t1, a1, t5
Note: they don’t remove the mov instruction (that’s a job for another day) as there’s not enough
information in the renamer to tell if a0 will be subsequently needed. The sole goal here is to allow the two
instructions to be able to run in the same clock (or rather for the second instruction to run earlier
if there’s an ALU slot available).
Recognizing a move in the renamer is easy: we’re looking for add/or/xor with one 0 operand.
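Something like this hedged sketch - again with invented names, and the immediate ‘mv’ form is included only as an assumption:

    wire rs1_zero = (rs1 == 5'd0);
    wire rs2_zero = (rs2 == 5'd0);
    wire is_move  = ((op_add | op_or | op_xor) & (rs1_zero ^ rs2_zero))
                  | (op_addi & (imm == 0) & ~rs1_zero);         // mv rd, rs == addi rd, rs, 0
    wire [4:0] move_src = rs1_zero ? rs2 : rs1;                 // the register actually being copied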
Managing the bookkeeping for this is much more complex than the 0 case, largely because the
register being moved could be in 1 of 4 different places - imagine some code like:
sub X, ....
....
mov Y, X
...
add ,,Y
The live value for operand ‘Y’ to the add instruction could now be in:
- the actual architectural register X
- the commitQ entry for the sub instruction
- the commitQ entry for the mov instruction
- the actual architectural register Y
The commitQ logic that keeps track of registers now needs to be more subtle and recognize these state changes.
As does the existing logic in the renamer that handles the interactions between instructions
that are decoded in the same clock and that update registers referenced by each other.
Conclusion
There’s no discernible performance bump for Dhrystone here (though we can see the operations happening at
a micro level) - there’s only 1 move and a couple of 0 optimizations per Dhrystone loop and they’re
just reducing pressure on the ALUs - so that’s not surprising.
We think this is worth keeping in though - as we start running more complex benchmarks, especially
ALU bound ones, we expect to see the advantages here.
A note on Benchmarking
While checking out performance here we noticed a rounding bug in the code we’ve been running on the simulator
(but not on the FPGAs). The least significant digits of the “Microseconds for one run through Dhrystone”
number we’ve been quoting have been slightly bogus - that number is not used for calculation of the rest of the
numbers (including the Dhrystone numbers) so those numbers remain valid.
We recompiled the offending code and reran it for a few past designs … they gave different (and lower)
numbers (6.42 vs. 6.49 DMips/MHz) than we had before …. after a lot of investigation we’ve decided
that this is caused by code alignment in the recompiled code (another argument for a
fully working trace cache).
We could hack on the code to manually align stuff and make it faster (or as fast as it was) but
that would be cheating - we’re just compiling it without any tweaking. Instead we’ll just note that this
change has happened and start using the new code and compare with the new numbers. What’s important
for this optimization process is knowing when the architecture is doing better.
Next time
We’ve been putting off finishing the FPU(s) - we currently have working FP adders (2 clocks) and
multipliers (3 clocks) that pass many millions of test vectors, some of the misc instructions still need work, and it all still needs to be integrated
into a new FP renamer/scoreboard. Then on to a big wide vector unit …..
Next time: FPU stuff
14 Jul 2022
Introduction
It’s been a couple of months since I last posted about our work on a trace cache for VRoom!, and it’s
become a bit of a slog - trace cache bugs are the worst: suddenly out of nowhere your code branches off
into the weeds, often caused by something that happened tens of thousands of clocks before. Short
story: we’re pausing work on my trace cache implementation to get more important stuff done. Long story: well, read on.
What is good about our current Trace Cache?
You’ll remember from the previous blog post about our trace cache that it’s pretty simple: the idea
was to bolt on a direct mapped cache, filled with traces extracted from the completion end of our commit engine,
that competes with the normal icache/branch-predictor/PC/decoder to provide decoded data directly to the
rename units.
After the last blog post we added support to stop traces from crossing call/return
transitions; this preserved the call prediction stacks in the BTC and brought overall performance
into the same ballpark as the existing design.
Now that it is up and working it passes all our sanity tests, and direct examination of the pipelines shows a higher
issue rate (regularly 8 instructions per clock, which was the goal) but also a higher stall rate (because the ALUs are busy and the commit queue is filling) - which indicates that a well working trace cache probably
needs an increase in the number of ALUs and/or load/store units to further increase performance. That in
itself is not a bad thing: it means we have more performance choices when choosing a final architecture.
What is bad about our current Trace Cache?
However, while the back end pipelines are performing better, the one bad thing the benchmarks show is that
our naive direct mapped trace cache bypasses the branch predictor and as a result performs worse (about 15% worse) on
branch prediction - you will recall we’re using a traditional dual combined branch predictor (bimodal and global history) - the trace
cache effectively embeds the bimodal BTC data in the cache, but not the global predictor data.
What we really need is a way to incorporate the global predictor into the trace cache tags, that way we
can collect multiple traces for the same starting PC address (there are lots of examples of this in the
literature on trace caches, it’s not a new idea).
This is not a trivial change - it means pulling apart the BTC and PC fetcher code and more closely
integrating the trace cache at a lower level, since the BTC needs to be able to track both trace streams and icache streams to update its predictors from both sources. The current BTC/PC fetcher is very focused
on icache boundaries and predicting multiple branches within them, while the data coming out of the trace cache
tends to have arbitrary PC boundaries and has unconditional branches elided - merging the two more
than we currently have is a non-trivial piece of work.
Note that getting a direct mapped trace cache to ~5/6 the performance of the existing CPU is actually not
that bad - it means we’re on the right track, just not there yet.
Conclusion
Our basic trace cache design seems to be functional, but doesn’t perform well enough yet. Short to medium
term it has become a bit of a derail from making other progress, so we’re going to set it aside for the moment -
probably until
the next time the BTC has been opened up and is spread over the desktop for some major rearrangement, at which point we’ll
dig into integrating global history into the trace cache tags and into BTC predictor updates from trace data.
This is not the last we’ve seen of the trace cache.
Our next work: some simple optimisation of moves and zeroing of registers - some minor instruction recognition
in the renamer will allow commit Q entries to resolve earlier (more in parallel), reducing instruction
dependencies on the fly.
Next time: Simple Renamer Optimisations
30 Apr 2022
Introduction
I’ve spent the past month working on adding a trace cache to VRoom! - it’s not done yet but is mildly
functional. Here’s a quick write up of what I’ve done so far.
What is a Trace cache?
Essentially a trace cache is an instruction cache of already decoded instructions. On a CISC CPU like an Intel/AMD x86 it might contain RISC-like micro-operations decoded from complicated CISC instructions.
On VRoom! we have an instruction decoder that every clock can decode 4 32-bit instructions, 8 16-bit instructions, or some mixture of the two from a 128-bit bundle fetched from the L1 instruction cache. Theoretically we can feed our CPU’s core (the renamer onwards) with 8 instructions on every clock, but in reality we seldom do. There are four main reasons why:
- we can only get 8 instructions per clock through the decoder if they are all 16-bit instructions - most instruction streams are a mix of both 16-bit and 32-bit instructions
- there’s a very rough rule of thumb from computer architecture in general that says that about every 5th instruction is a branch - with potentially 8 instructions per icache fetch that means that (very roughly) on average only 5 of them get executed
- branch targets are not always on a 16-byte (8 instruction) boundary - when this happens we only issue instructions after the target
- when the BTC makes a micro-misprediction we sometimes lose a clock (and up to 8 instructions) retrying the icache fetch
And indeed, measuring instruction use rates on VRoom! (again on the still-useful Dhrystone, which currently gets almost perfect branch prediction) we see 3.68 instructions per 128-bit instruction bundle - on average less than 4 instructions issued per clock to the core. Of course every instruction stream is different - we look both at largish benchmark streams and drill down into particular hot spots to understand low level issues.
There is also another issue for tight loops - imagine something like this looking for the null at the end of a string:
loop:
    lb   a0, 0(a1)
    addi a1, a1, 1
    bnez a0, loop
our current execution core can issue up to 4 loads per clock and execute 3 add/branch instructions per clock - theoretically it should be able to speculatively go 3 times around this loop every 2 clocks, if only it could be fed from the instruction stream quickly enough (9 instructions every 2 clocks). But in this case the current fetch/decode logic is limited to 3 instructions per clock (or per 128-bit bundle) - it might even be less (3 instructions every 2 clocks) if the instructions happen to straddle a 128-bit bundle boundary.
So a trace cache allows us to record ‘traces’ of instructions - streams of consecutive decoded instructions -
and then consistently play them back at full rate, so that the core can execute them without the limitations imposed by the decoder, the instruction cache, and the alignment of instructions within the instruction stream.
A trace cache is an interesting beast - when it hits it replaces not only the instruction cache but also part of the branch target cache - and there’s likely some interesting interactions between the BTC and the trace cache.
A trace cache hit returns N consecutive instructions (not necessarily consecutive in memory, but consecutive within the instruction stream) along with the PC of the instruction to execute after the last instruction in the trace that’s returned. On VRoom! we don’t ‘execute’ unconditional branches - they’re handled solely in the
fetch unit so they don’t find their way into the execution pipe or the trace cache - so the ‘next’ address of
an instruction might be the address that the branch after it jumped to. So this loop:
loop: bnez a1, done
      addi a1, a1, -1
      j    loop
done:
results in only 2 instructions in the commitQ instruction stream and 2 instructions per loop in a trace cache.
Implementation
Each instruction from VRoom!’s decoders is represented by a roughly 160-bit bundle (registers, PC, decoded opcode, functional unit number, predicted branch destination, immediate value etc) - after renaming (matching registers to the commitQ entry that will produce their values) the renamed instructions are entered as a block into the commitQ.
The commitQ is an ordered list of instructions to be executed out of order and speculatively - because of branch mispredictions, traps/faults, TLB misses etc an instruction may or may not reach the far end of the commitQ, where it will be ‘committed’ (ie will update the architectural state - writes to memory will actually happen, register updates will be committed to the architecturally visible register file etc).
The commit end of the commitQ can currently commit up to 8 instructions per clock (essentially we need a number here that is higher than our target IPC so that we can burst where possible - it’s strongly related to the number of write ports into the architectural register file). It regularly achieves 8 instructions per clock, but on average it’s more like 2-3 instructions per clock (this is how we measure our system IPC).
One of our main goals with VRoom! is to increase this number to something more like 4 IPC - of course it’s going to be
different on each instruction stream.
So let’s look at how a trace cache (I$0 - 0th level instruction cache) fits into our system
Red arrows are control signals, black ones are instruction bundles, blue ones register access.
Note that the trace cache is filled from the portion of the commitQ containing committed instructions; on
every clock it can be filled with 0-8 instructions. On the fetch side, data is read up to 8 instructions at a time and replaces the data from the decoders going into the rename unit.
The fetch side of things is managed by the PC (program counter) unit - on every clock it fetches from the normal instruction cache and the trace cache in parallel, and if it hits in the trace cache the instruction cache data is discarded. If the decode unit is currently busy (from the previous fetch) the PC has to wait a
clock before allowing the trace cache data to be used (to let the decoded instructions be renamed);
subsequent trace cache data can be dispatched on every clock. Because we fetch instruction cache data
in parallel with trace cache data, switching back causes a 1 clock (decode time) penalty. Sometimes we
gain a clock - for example when we take a branch misprediction the pipe will restart one clock more quickly if the new branch target is in the trace cache.
Normally the PC uses BTC information, or information from the decoder to choose what the next PC fetch will be - on a trace cache hit we use the next PC value from the trace cache instead.
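A minimal sketch of that selection - the signal names are assumptions, and the real priority structure (mispredicts, returns, traps etc) is more involved:

    // trace cache data replaces the decoders' output when it hits and the
    // decoder isn't still busy with the previous fetch
    assign use_trace = tc_hit & ~decode_busy;

    assign next_pc = tc_hit            ? tc_next_pc :   // the trace supplies its own next PC
                     btc_predict_taken ? btc_target :   // otherwise normal BTC prediction
                                         pc + 16;       // fall through to the next 128-bit bundle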
Initial Trace Cache Implementation
This is all still very much a work in progress; here’s a brief description of our initial implementation. It’s not good enough
yet, but the basic connection described above probably won’t change.
This initial implementation consists of N traces, each containing up to 8 instructions (~1k bits per trace) - on the read side they act as a fully associative cache, so N can be any value and doesn’t have to be a power of 2; we’re starting with N=64, a number pulled out of thin air.
Each trace has the following metadata:
- PC of the first entry (tag for associative lookup)
- next PC of the last entry in the trace
- bit mask describing which instruction entries in the trace are valid (the first M will be set)
- use tag - a pinned counter - incremented on each valid access, decremented every P CPU clocks, a trace entry with a use tag of 0 can be stolen
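In Verilog terms the per-trace metadata might look roughly like this - the field widths and the 4-bit use counter are guesses for illustration only:

    reg [63:1] tc_tag     [0:NTRACE-1];   // PC of the first instruction (associative tag)
    reg [63:1] tc_next_pc [0:NTRACE-1];   // PC to fetch after the trace
    reg [7:0]  tc_valid   [0:NTRACE-1];   // which of the 8 slots hold valid instructions
    reg [3:0]  tc_use     [0:NTRACE-1];   // use tag: ++ on each hit, -- every P clocks, 0 => stealable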
In general other CPUs’ trace cache implementations have much larger trace entries (far more than 8 instructions) - our implementation is
an attempt to create a system that will self-assemble larger traces out of the smaller 8-instruction chunks; this part is still a work in progress.
Internally the current implementation looks like this:
The fetch interface is described above; the commit side interface has to handle unaligned incoming data - for example, imagine on clock N we get 4 instructions and on clock N+1 we get 8 - we want to be able to write 4 in clock N, 4 more into the same line on clock N+1, and the final 4 instructions into a new line on clock N+1. For this reason there’s a temporary holding buffer, “waiting”, that holds data and allows us to share a single write port - in essence there’s a single write port with a shift mux and a holding register on its input.
There are two pointers to trace lines:
- “Last” which points to the last line written (in case we want to add more data to the end of it)
- “Next” which will be the next line we can write to (if one is available) - this is calculated using the “use tag” metadata a clock ahead
During operation a fresh data line goes into Next, at which time Last becomes Next; new data that is part of the same trace and fits into the line pointed to by Last goes there, and skipping any input invalidates Last.
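A very rough sketch of that fill-side pointer handling - all of the names here are invented, and the real logic (including how Next is chosen a clock ahead) is more involved:

    always @(posedge clk) begin
        if (fill_valid) begin
            if (last_valid && same_trace && fits_in_last) begin
                // still the same trace and there's room: append to the line Last points at
                tc_valid[last] <= tc_valid[last] | new_slots;
            end else begin
                // otherwise start a new line in Next, which becomes the new Last
                tc_valid[next] <= new_slots;
                last           <= next;
                last_valid     <= 1'b1;
            end
        end else if (fill_skipped) begin
            // a gap in the committed stream ends the trace being assembled
            last_valid <= 1'b0;
        end
    end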
So the above design works and passes our CPU sanity tests (there were lots of interesting bugs around interrupts); however the performance is meh - it’s a little slower than the existing non-trace-cache model, which obviously isn’t good enough.
Why? Mostly it seems that the current mechanism for choosing traces is too simple -
it’s greedy and grabs any trace when there is space available; we do start traces at jump targets, but that’s about it.
A particular problem is that we happily collect traces across subroutine calls and returns - this results in
branch mispredictions and confuses the branch target cache. Another issue is that we are collecting
the same data over and over again essentially in different phases (at different offsets) within
different trace entries which limits the effective size of the trace cache.
So the next plan is to keep the existing working design but modify the criteria for when traces are started and when they finish - initially making sure that they end at calls and returns, and then adding some smarts around looping, perhaps limiting trace length.
Conclusion
Our basic trace cache design seems to be vaguely functional, but doesn’t perform well - originally we’d hoped for a 1.5-2x boost - we think this is still possible - watch this space in a month or so.
Next time: Trace cache - Part 2
25 Mar 2022
Introduction
Wow, this was a fun week - VRoom! made it to the front page of Hacker News. For those new here, this is an occasional blog about architectural issues as
they are investigated - VRoom! is very much a work in progress.
This particular blog entry is about a recent exploration around the way that we handle branches
and ALUs.
Branch units vs ALU units
The current design has 1 branch unit per hart (a ‘hart’ is essentially a RISC-V CPU
context - an N-way simultaneous multi-threaded machine has N harts even if they share
memory interfaces, caches and ALUs). It also has M ALUs - currently two.
After a lot of thinking about branch units and ALU units it seemed that they have a lot in
common - both have a 64-bit adder/subtracter in their core, and use 2 register read ports and
a write port. Branch units only use the write port for call instructions - some branches, unconditional
relative branches don’t actually hit the commitQ at all, conditional branches simply check that the
predicted destination is correct, if so they become an effective no-op, otherwise they
trigger a commitQ flush and a BTC miss.
So we’ve made some changes to the design to be able to optionally make a combined ALU/branch unit,
and to be able to build the system with those instead of the existing independent branch and ALU units
(it’s a Makefile option so relatively easy to change). Running a bunch of tests on our still
useful Dhrystone we get:
| Branch | ALU | Combined | DMips/MHz | Area | Register Ports |
|--------|-----|----------|-----------|------|----------------|
| 1      | 2   | 0        | 6.33      | 3x   | 6/3            |
| 0      | 0   | 2        | 6.14      | 2x   | 4/2            |
| 0      | 0   | 3        | 6.49      | 3x   | 6/3            |
| 0      | 0   | 4        | 6.49      | 4x   | 8/4            |
Which is interesting - 3 combined Branch/ALU units outperform the existing 1 Branch/2 ALU with
roughly the same area/register file ports. So we’ll keep that.
It’s also interesting that 4 combined ALUs perform exactly the same as the 3 ALU system (to the clock) even
though the 4th ALU gets scheduled about 1/12 of the time - essentially, because we’re scheduling ALUs
out of order (and speculatively), the 3rd ALU happily takes on that extra work without changing how fast the
final instructions get retired.
One other interesting thing here, and the likely reason for much of this performance improvement, is
that we can now retire multiple branches per clock - we need to be able to do something sensible if
multiple branches fail branch prediction in the same clock; the correct solution is to give priority
to the misprediction closest to the end of the pipe (since the earlier instruction should cause the later one
to be flushed from the pipe).
What’s also interesting is: what would happen if we built a 2-hart SMT machine? Previously such a system
would have had 2 branch units and 2 ALUs - looking at current simulations a single CPU keeps 1 ALU busy
close to 90% of the time, the second ~50%, the 3rd ~20% - while we don’t have any good simulation data yet, we
can guess that 4 combined ALUs (so about the same area) would likely satisfy a dual SMT system, mostly
because the 2 threads
would share I/D caches and as a result each run a little more slowly (add that to the list of future
experiments).
VRoom! go Boom!
Scheduling N units is difficult - essentially we need to look at all the entries in the commitQ and
choose the one closest to the end of the pipe that is ready to perform an ALU operation, for ALU 0. That’s easy -
the core of it looks something like this (for an 8 entry Q):
always @(*)
casez (req)
8'b???????1: begin alu0_enable = 1; alu0_rd = 0; end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1; end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2; end
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; end
8'b00000000: alu0_enable = 0;
endcase
for ALU 1 it looks like (answering the question: what is the 2nd bit if 2 or more bits are set):
always @(*)
casez (req)
8'b??????11: begin alu1_enable = 1; alu1_rd = 1; end
8'b?????101,
8'b?????110: begin alu1_enable = 1; alu1_rd = 2; end
8'b????1001,
8'b????1010,
8'b????1100: begin alu1_enable = 1; alu1_rd = 3; end
.....
8'b00000000: alu1_enable = 0;
endcase
For more ALUs it gets more complex, and for a 32/64 entry commitQ it’s also much bigger; for dual
SMT systems there are 2 commitQs, so 64/128 entries (we interleave the request bits from the
two commitQs to give them fair access to the resources).
Simply listing all the bit combinations with 3 bits set out of 128 bits in a case statement
gets unruly - but we do want to express this in a manner
that provides the maximum amount of parallelism to the synthesis tools we’re using, in the hope that
they’ll find optimizations that are not obvious - so long ago we reformulated it
to something like this:
always @(*)
casez (req)
8'b???????1: begin alu0_enable = 1; alu0_rd = 0;
casez (req[7:1])
7'b??????1: begin alu1_enable = 1; alu1_rd = 1; end
7'b?????10: begin alu1_enable = 1; alu1_rd = 2; end
7'b????100: begin alu1_enable = 1; alu1_rd = 3; end
.....
7'b1000000: begin alu1_enable = 1; alu1_rd = 7; end
7'b0000000: alu1_enable = 0;
endcase
end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1;
casez (req[7:2])
6'b?????1: begin alu1_enable = 1; alu1_rd = 2; end
6'b????10: begin alu1_enable = 1; alu1_rd = 3; end
6'b???100: begin alu1_enable = 1; alu1_rd = 4; end
.....
6'b100000: begin alu1_enable = 1; alu1_rd = 7; end
6'b000000: alu1_enable = 0;
endcase
end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2;
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; alu1_enable = 0; end
8'b00000000: begin alu0_enable = 0; alu1_enable = 0; end
endcase
etc etc - we have C code that will spit this out for an arbitrary number of ALUs (and arbitrary depths);
the 2 ALU scheduler for a 32 entry commitQ happily compiles under verilator/iverilog and on
Vivado (yosys currently goes bang! - we suspect it’s upset by this combinatorial explosion). When we switched
to 3 ALUs (we tried this a while ago) it happily compiled on verilator (takes a while) and runs. When
we compiled up the 4 ALU scheduler on verilator it went bang! - the kernel OOM killer got it (on a
96GB laptop). Looking at the machine generated code: it was 200k lines of verilog …. oops … 3 ALUs
was compiling to 50k, 2 ALUs ~10k …. a serious combinatorial explosion!
Luckily we’d already found another way to solve this problem elsewhere (we have 6! address units),
so dropping in some other code to generate this scheduler wasn’t hard (900 lines rather than 200k).
We’re going to need to
spend some time looking at how well this new code performs - it had always been expected to be
one of the more difficult areas for timing - we might need to find some 95% heuristics here that are
not perfect but allow us higher core clock speeds; time will tell. Sadly Yosys still goes bang, so it must be something else.
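For reference, the general trick for picking the first N requesters without enumerating bit patterns is a running count over the request vector - the sketch below shows the idea only; it is not the actual VRoom! address-unit code, and real timing closure may want something cleverer:

    parameter NQ   = 32;                 // commitQ entries
    parameter NALU = 3;                  // ALUs being scheduled
    reg        alu_enable [0:NALU-1];
    reg [4:0]  alu_rd     [0:NALU-1];    // 5 bits indexes a 32 entry queue
    integer    i, cnt;

    always @(*) begin
        for (i = 0; i < NALU; i = i + 1) begin
            alu_enable[i] = 1'b0;
            alu_rd[i]     = 5'd0;
        end
        cnt = 0;
        for (i = 0; i < NQ; i = i + 1)
            if (req[i] && cnt < NALU) begin
                alu_enable[cnt] = 1'b1;   // entry i is the cnt'th ready requester from the commit end
                alu_rd[cnt]     = i;
                cnt             = cnt + 1;
            end
    end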
Conclusion
Combined branch ALUs seem to provide a performance improvement with little increase in area - we’ll
keep them.
Next time: Probably something about trace caches, might take a while