05 Aug 2022
Introduction
The past few weeks we’ve been working on two relatively simple changes in the register renamer: renaming
registers known to be 0 to register x0, and optimizing register to register moves.
These are important classes of optimizations on x86 class CPUs, a real hot topic - previously I worked
on an x86 clone
where instructions were cracked into micro-ops - lots of these moves were generated by this process and
optimization is important there. On RISC-V code streams we see relatively fewer (but not 0) register moves.
This particular exercise isn’t expected to have a great performance bump, but it’s the same spot in the
design that we will also be doing opcode merging so it’s a chance to start to crack that part of the
design open.
Our renamer essentially does one thing - it keeps track of where the live value (from the point of view of
instructions coming out of the decoder or the trace cache) of architectural registers are actually stored,
they could either be in the architectural register files, or the commit Q register file - once an instruction
hits the commit Q it can’t be executed until all it’s operands are either in the commit Q register file (ie
done), or in the architectural register file.

The goal of these optimizations is not so much to remove
move instructions from the instruction stream but to allow subsequent instructions that depend on them
execute earlier (when the source register for the move instruction is ready rather than it’s destination register is).
0 register operations
This optimization is very simple - on the fly the renamer recognizes a bunch of arithmetic operations as generating
‘0’ - add/xor/and/or.
For example “li a0, 0” is really “add a0,x0,0”, or “xor a0,t0,t0” etc - it’s a
simple operation and is very cheap.
(by the time they hit the renamer all the subtract operations look like adds)
Move instruction operations
These essentially optimize:
mov a0, a1
add t1, a0, t5
into:
mov a0, a1
add t1, a1, t5
Note: they don’t remove the mov instruction (that’s a job for another day) as there’s not enough
information in the renamer to tell if a0 will be subsequently needed. The sole goal here is to allow the two
instructions to be able to run in the same clock (or rather for the second instruction to run earlier
if there’s an ALU slot available).
Recognizing a move in the renamer is easy, we’re looking for add/or/xor with one 0 operand.
Managing the book keeping for this is much more complex than the 0 case - largely because the
register being moved could be in 1 of 4 different places - imagine some code like:
sub X, ....
....
mov Y, X
...
add ,,Y
The live value for operand ‘Y’ to the add instruction could now be in:
- the actual architectural register X
- the commitQ entry for the sub instruction
- the commitQ entry for the mov instruction
- the actual architectural register Y
The commitQ logic that keeps track of registers now needs be more subtle and recognize these state changes.
As does the existing logic in the renamer that handles the interactions between instructions
that are decoded in the same clock and that update registers referenced by each other.
Conclusion
There’s not discernible performance bump for dhrystone here (though we can see the operations happening at
a micro level) - there’s only 1 move and a couple of 0 optimizations per dhrystone loop and they’re
just reducing pressure on the ALUs - that’s not surprising.
We think this is worth keeping in though, as we start running more complex benchmarks, especially
ALU bound ones we’re going to see the advantages here.
A note on Benchmarking
While checking out performance here we noticed a rounding bug in the code we’ve been running on the simulator
(but not on the FPGAs). The least significant digits of the “Microseconds for one run through Dhrystone”
number we’ve been quoting have been slightly bogus - that number is not used for calculation of the rest of the
numbers (including the Dhrystone numbers) so those numbers remain valid.
We recompiled the offending code and reran it for a few past designs … they gave different (and lower)
numbers (6.42 vs. 6.49 DMips/MHz) than we had before …. after a lot of investigation we’ve decided
that this is caused by code alignment in the recompiled code (another argument for a
fully working trace cache).
We could hack on the code to manually align stuff and make it faster (or as fast as it was) but
that would be cheating - we’re just compiling it without any tweaking. Instead we’ll just note that this
change has happened and start using the new code and compare with the new numbers. What’s important
for this optimization process is knowing when the architecture is doing better.
Next time
We’ve been putting off finishing the FPU(s) - we currently have working FP adders (2 clocks) and
multipliers (3 clocks) that pass many millions of test vectors, some of the misc instructions still need work, and it all still needs to be integrated
into a new FP renamer/scoreboard. Then on to a big wide vector unit …..
Next time: FPU stuff
14 Jul 2022
Introduction
It’s been a couple of months since I last posted about our work on a trace cache for VRroom, it’s
become a bit of slog - trace cache bugs are the worst: suddenly out of nowhere your code branches off
into the weeds, often caused by something that happened tens of thousands of clocks before. Short
story, we’re pausing work on my trace cache implementation to get more important stuff done, long story: well read on
What is good about our current Trace Cache?
You’ll remember from the previous blog post about our trace cache that it’s pretty simple, the idea
was to bolt on a direct mapped cache filled with traces extracted from the completion end of our commit engine
that competed with the normal icache/branch-predictor/PC/decoder to provide decoded data directly to the
rename units.
After the last blog post we integrated support to avoid traces from crossing instruction call/return
transitions, this preserved the call prediction stacks in the BTC and made overall performance
in the same ballpark as the existing performance.
Now that is up and working, it passes all our sanity tests, direct examination of the pipelines shows a higher
issue rate (8 instructions per clock regularly which was the goal) but with a higher stall rate (because the ALUs are busy and the commit queue is filling) - which indicates that a well working trace cache probably
needs an increase in the number of ALUs and/or load/store units to further increase performance - that in
itself is not a bad thing it means we have more performance choices when choosing a final architecture.
What is bad about our current Trace Cache?
However while the back end pipelines are performing better the one bad thing that benchmarks show is that because
our naive direct mapped trace cache bypasses the branch predictor and as a result it performs worse (about 15% worse) on
branch prediction - you will recall we’re using a traditional dual combined branch predictor (bimodal and global history) - the trace
cache is effectively embedding the bimodal BTC data in the cache, but not the global predictor data.
What we really need is a way to incorporate the global predictor into the trace cache tags, that way we
can collect multiple traces for the same starting PC address (there are lots of examples of this in the
literature on trace caches, it’s not a new idea).
This is not a trivial change, it means pulling apart the BTC and PC fetcher code and more closely
integrating the trace cache at a lower level - the BTC needs to be able to track both trace streams and icache streams to update its predictors from both sources - the current BTC/PC fetcher are very focused
on icache boundaries and predicting multiple branches within them, the data coming out of the trace cache
tends to have arbitrary PC boundaries, and has unconditional branches elided - merging them more
than we currently have is a non trivial piece of work.
Note that getting a direct mapped trace cache to ~5/6 the performance of the existing CPU is actually not
that bad - it means we’re on the right track, just not there yet.
Conclusion
Our basic trace cache design seems to be functional, but doesn’t perform well enough yet - short to medium
term it’s become a bit derail against making other progress so we’re going to set it aside for the moment,
probably until
the next time the BTC has been opened up and is spread over the desktop for some major rearrangement and
then dig into integrating global history into the trace cache tags and BTC predictor updates from trace data.
This is not the last we’ve seen of the trace cache.
Our next work - some simple work on optimising moves and 0ing of registers - some minor instruction recognition
in the renamer will allow commit Q entries to resolve earlier (more in parallel) reducing instruction
dependencies on the fly.
Next time: Simple Renamer Optimisations
30 Apr 2022
Introduction
I’ve spent the past month working on adding a trace cache to VRoom! - it’s not done yet but is mildly
functional. Here’s a quick write up of what I’ve done so far.
What is a Trace cache?
Essentially a trace cache is an instruction cache of already decoded instructions. On a CISC CPU like an Intel/AMD x86 it might contain RISC-like micro-operations decoded from complicated CISC instructions.
On VRoom! we have an instruction decode that every clock can decode 4 32-bit instructions, 8 16-bit instructions, or some mixture of the two from an 128-bit bundle fetched from the L1 instruction cache. Theoretically we can feed our CPU’s core (the renamer onwards) with 8 instructions on every clock, but in reality we seldom do. There are four main reasons why:
- we can only get 8 instructions per clock through the decoder if they are all 16-bit instructions - most instruction streams are a mix of both 16-bit and 32-bit instructions
- there’s a very rough rule of thumb from computer architecture in general that says that about every 5th instruction is a branch - with potentially 8 instructions per icache fetch that means that (very roughly) on average only 5 of them get executed
- branch targets are not always on a 16-byte (8 instruction) boundary - when this happens we only issue instructions after the target
- when the BTC make a micro-misprediction we sometimes lose a clock (and up to 8 instructions) retrying the icache fetch
And indeed measurement of instruction use rates on VRoom! (again on the still useful dhrystone - which currently does almost perfect branch prediction) we measure 3.68 instructions/128-bit instruction bundle - on average less than 4 instructions being issued per clock to the core. Of course every instruction stream is different - we look both at largish benchmark streams and drill down into particular hot spots looking to understand low level issues.
There is also another an issue for tight loops - imagine something like this looking for the null at the end of a string:
loop:
ldb a0, (a1)
add a1, a1, 1
bne a0, loop
our current execution core can issue up to 4 loads per clock and execute 3 add/branch instructions per clock - theoretical it should be able to speculatively go 3 times around this loop every 2 clocks - if only it can be fed from the instruction stream quickly enough (9 instructions every 2 clocks), but in this case the current fetch/decode logic is limited to 3 instructions per clock (or per 128-bit bundle) - it might even be less (3 instructions every 2 clocks) if the instructions happen to straddle a 128-bit bundle boundary.
So a trace cache allows us to record ‘traces’ of instructions - streams of decoded consecutive instructions from the
system and then consistently play them back at the full rate so that the core can execute them without the limitations placed on them by the decoder, instruction cache, and the alignment of instructions within an instruction stream.
A trace cache is an interesting beast - when it hits it replaces not only the instruction cache but also part of the branch target cache - and there’s likely some interesting interactions between the BTC and the trace cache.
A trace cache hit returns N consecutive instructions (not necessarily consecutive in memory, but consecutive within the instruction stream) along with the PC of the instruction to execute following the last instruction in the trace that’s returned. On VRoom! we don’t ‘execute’ unconditional branches, they’re handled solely in the
fetch unit so they don’t find their way into the execution pipe or the trace cache - the ‘next’ address of
an instruction might be the address the branch after it jumped to - so this loop:
loop: bne a1, done
add a1, a1, -1
j loop
done:
results in only 2 instructions in the commitQ instruction stream and 2 instructions per loop in a trace cache.
Implementation
Each instruction from VRoom!s decoders is represented by a roughly 160-bit bundle (registers, PC, decoded opcode, functional unit number, predicted branch destination, immediate value etc) - after renaming (matching registers to the commitQ entry that will produce their values) the renamed instructions are entered as a block into the commitQ.
The commitQ is an ordered list of instructions to be out-of-order and speculatively executed - because of branch mispredictions, traps/faults, TLB misses etc an instruction may or may not reach the far end of the commitQ, where it will be ‘committed’ (ie will update the architectural state - writes to memory will actually happen, register updates will be committed to the architecturally visible register file etc).
The commit end of the commitQ can currently commit up to 8 instructions per clock (essentially we need a number here that is higher than our target IPC so that we can burst where possible - - it’s strongly related to the number of write ports into the architectural register file) - it regularly achieves 8 instructions per clock, but on average it’s more like 2-3 instructions per clock (it’s how we measure our system IPC)
one of our main goals with VRoom! is to increase this number to something more like 4 IPC - of course it’s going to be
different on each instruction stream.
So let’s look at how a trace cache (I$0 - 0th level instruction cache) fits into our system

Red arrows are control signals, black ones are instruction bundles, blue ones register access.
Note that the trace cache is filled from the portion of the commitQ containing committed instructions, on
every clock it will be filled with 0-8 instructions. On the fetch side data is read up to 8 at a time and replaces data from the decoders into the rename unit.
The fetch side of things is managed by the PC (program counter) unit, on every clock it fetches from the normal instruction cache and the trace cache in parallel, if it hits in the trace cache the instruction cache data is discarded. If the decode unit is currently busy (from the previous fetch) the PC has to wait a
clock before allowing the trace cache data be used (to allow the decode instructions to be renamed),
subsequent trace cache data can be dispatched on the every clock. Because we fetch instruction cache data
in parallel with trace cache data switching back causes a 1 clock (decode time) penalty. Some of the times we
gain a clock - for example when when we take a branch misprediction, the pipe will start one clock more quickly if the new branch target is in the trace cache.
Normally the PC uses BTC information, or information from the decoder to choose what the next PC fetch will be - on a trace cache hit we use the next PC value from the trace cache instead.
Initial Trace Cache Implementation
This is all still very much a work in process, here’s a brief description of our initial implementation, it’s not good enough
yet, but the basic connection described above probably wont change.
This initial implementation consists of N traces each containing 8 instructions (~1k bits per trace) - on the read side they act as a fully associative cache - so N can be any value, doesn’t have to be a power of 2, we’re starting with N=64, a number pulled out of thin air.
Each trace has the following metadata:
- PC of the first entry (tag for associative lookup)
- next PC of the last entry in the trace
- bit mask describing which instruction entries in the trace are valid (the first M will be set)
- use tag - a pinned counter - incremented on each valid access, decremented every P CPU clocks, a trace entry with a use tag of 0 can be stolen
In general some other CPU’s trace cache implementations have much larger trace entries (far more than 8) - our implementation is
an attempt to create a system that will self assemble larger trace lines out of the smaller 8 instruction chunks - this part is still a work in progress.
Internally the current implementation looks like this:

The fetch interface is described above, the Commit side interface has to handle unaligned incoming data - for example imagine on clock N we get 4 instructions and on clock N+1 we get 8 - we want to be able to write 4 in clock N, 4 more into the same line on clock N+1 and the final 4 instructions into a new line at clock N+1. For this reason there’s a temporary holding buffer “waiting” that holds data and allows us to share a single write port. In essence there’s a single write port with a shift mux and a holding register in its input.
There are two pointers to trace lines:
- “Last” which points to the last line written (in case we want to add more data to the end of it)
- “Next” which will be the next line we can write to (if one is available) - this is calculated using the “use tag” metadata a clock ahead
During operation fresh data line goes into Next at which time Last become Next - new data that is in the same trace and fits into the trace pointed to by Last goes there, skipping any input invalidates Last.
So the above design works, passes our CPU sanity tests (there were lots of interesting bugs around interrupts) however the performance is meh - it’s a little slower than the existing non-trace-cache model which obviously isn’t good enough.
Why? mostly it seems that the current mechanism for choosing traces is too simple,
it’s greedy and it grabs any trace when there is space available - we do start traces at jump targets but that’s about it.
A particular problem is that we happily collect traces across subroutine calls and returns - this results in
branch mispredictions and confuses the branch target cache. Another issue is that we are collecting
the same data over and over again essentially in different phases (at different offsets) within
different trace entries which limits the effective size of the trace cache.
So the next plan is to work with the existing working design but modifying the criteria for when traces are started and when they finish - initially making sure that they end at instruction calls and returns, and then some smarts around looping, perhaps limiting trace length.
Conclusion
Our basic trace cache design seems to be vaguely functional, but doesn’t perform well - originally we’d hoped for a 1.5-2x boost - we think this is still possible - watch this space in a month or so.
Next time: Trace cache - Part 2
25 Mar 2022
Introduction
Wow, this was a fun week VRoom! made it to the front page of HackerNews - for those new here this is an occasional blog post on architectural issues as
they are investigated - VRoom! is very much a work in progress.
This particular blog entry is about a recent exploration around the way that we handle branches
and ALUs.
Branch units vs ALU units
The current design has 1 branch unit per hart (a ‘hart’ is essentially a RISCV CPU
context - an N-way simultaneous multi-threaded machine has N harts even if they share
memory interfaces, caches and ALUs). It also has M ALUs - currently two.
After a lot of thinking about branch units and ALU units it seemed that they have a lot in
common - both have a 64-bit adder/subtracter in their core, and use 2 register read ports and
a write port. Branch units only use the write port for call instructions - some branches, unconditional
relative branches don’t actually hit the commitQ at all, conditional branches simply check that the
predicted destination is correct, if so they become an effective no-op, otherwise they
trigger a commitQ flush and a BTC miss.
So we’ve made some changes to the design to be able to optionally make a combined ALU/branch unit,
and to be able to build the system with those instead of the existing independent branch and ALUs units
(it’s a Makefile option so relatively easy to change). Running a bunch of tests on our still
useful Dhrystone we get:
Branch |
ALU |
Combined |
DMips/MHz |
Area |
Register Ports |
|
1 |
2 |
0 |
6.33 |
3x |
6/3 |
2022-03-25-combined-branches.md |
0 |
0 |
2 |
6.14 |
2x |
4/2 |
|
0 |
0 |
3 |
6.49 |
3x |
6/3 |
|
0 |
0 |
4 |
6.49 |
4x |
8/4 |
|
Which is interesting - 3 combined Branch/ALU units outperform the existing 1 Branch/2 ALU with
roughly the same area/register file ports. So we’ll keep that.
It’s also interesting that 4 combined ALUs performs exactly the same as the 3 ALU system (to the clock) even
though the 4th ALU gets scheduled about 1/12 of the time - essentially this is because because we’re scheduling ALUs
out of order (and speculatively) the 3rd ALU happily takes on that extra work without changing how fast the
final instructions get retired.
One other interesting thing here, and the likely reason for much of this performance improvement is
that we can now retire multiple branches per clock - we need to be able to do something sensible if
multiple branches fail branch prediction in the same clock - the correct solution is to give priority
to the misprediction closest to the end of the pipe (since the earlier instruction should cause the later one
to be flushed from the pipe).
What’s also interesting is: what would happen if we build a 2-hart SMT machine? previously such a system
would have had 2 branch units and 2 ALUs - looking at current simulations a CPU is keeping 1 ALU busy
close to 90%, the second to ~50%, the 3rd ~20% - while we don’t have any good simulation data yet we
can guess that 4 combined ALUs (so about the same area) would likely satisfy a dual SMT system - mostly
because the 2 threads
would share I/Dcaches and as a result run a little more slowly (add that to the list of future
experiments).
VRoom! go Boom!
Scheduling N units is difficult - essentially we need to look at all the entries in the commitQ and
choose the one closest to the end of the pipe ready to perform an ALU operation on ALU 0. That’s easy
the core of it looks something a bit like this (for an 8 entry Q):
always @(*)
casez (req)
8'b???????1: begin alu0_enable = 1; alu0_rd = 0; end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1; end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2; end
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; end
8'b00000000: alu0_enable = 0;
endcase
for ALU 1 it looks like (answering the question: what is the 2nd bit if 2 or more bits are set):
always @(*)
casez (req)
8'b??????11: begin alu1_enable = 1; alu1_rd = 1; end
8'b?????101,
8'b?????110: begin alu1_enable = 1; alu1_rd = 2; end
8'b????1001,
8'b????1010,
8'b????1100: begin alu1_enable = 1; alu1_rd = 3; end
.....
8'b00000000: alu1_enable = 0;
endcase
For more ALUs it gets more complex for a 3264 entry commitQ it’s also much bigger, for dual
SMT systems there are 2 commitQs so 64/128 entries (we interleave the request bits from the
two commitQs to give them fair access to the resources).
Simply listing all the bit combinations with 3 bits set out of 128 bits in a case statement
just listing all of them gets unruly - but really we do want to express this in a manner
where we can provide a maximum amount of parallelism to the synthesis tools we’re using, hopefully
they’ll find optimizations that are not obvious - so long ago we had reformulated it
to something like this:
always @(*)
casez (req)
8'b???????1: begin alu0_enable = 1; alu0_rd = 0;
casez (req[7:1])
7'b??????1: begin alu1_enable = 1; alu1_rd = 1; end
7'b?????10: begin alu1_enable = 1; alu1_rd = 2; end
7'b????100: begin alu1_enable = 1; alu1_rd = 3; end
.....
7'b1000000: begin alu1_enable = 1; alu1_rd = 7; end
7'b0000000: alu1_enable = 0;
endcase
end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1;
casez (req[7:2])
6'b?????1: begin alu1_enable = 1; alu1_rd = 2; end
6'b????10: begin alu1_enable = 1; alu1_rd = 3; end
6'b???100: begin alu1_enable = 1; alu1_rd = 4; end
.....
6'b100000: begin alu1_enable = 1; alu1_rd = 7; end
6'b000000: alu1_enable = 0;
endcase
end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2;
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; alu1_enable = 0; end
8'b00000000: begin alu0_enable = 0; alu1_enable = 0; end
endcase
etc etc we have C code that will spit this out for an arbitrary number of ALUs (arbitrary depths)
the 2 ALU scheduler for a 32 entry commitQ happily compiles under verilator/iverilog and on
Vivado (yosys currently goes bang! we suspect upset by this combinatorial explosion). When we switched
to 3 ALUs (we tried this a while ago) it happily compiled on verilator (takes a while) and runs. When
we compiled up the 4 ALU scheduler on verilator it went Bang! the kernel OOM killer got it (on a
96Gb laptop) - looking at the machine generated code it was 200K lines of verilog …. oops … 3 ALUs
was compiling 50k, 2 ALUs ~10k …. serious combinatorial explosion!
Luckily we’d already found another way to solve this problem elsewhere (we have 6! address units)
so dropping some other code in to generate this scheduler wasn’t hard (900 lines rather than 200k) -
we’re going to need to
spend some time looking at how well this new code performs, it had always been expected to be
one of the more difficult areas for timing - we might need to find some 95% heuristics here that are
not perfect but allow us higher core clock speeds - time will tell. Sadly Yosys still goes bang, must be something else.
Conclusion
Combined branch ALUs seem to provide a performance improvement with little increase in area - we’ll
keep them.
Next time: Probably something about trace caches, might take a while
12 Mar 2022
Introduction
Not really an architectural posting, more about tooling, skip if it’s not really your thing, also new
some performance numbers.
I found the bug in the BTC - mispredicted 4-byte branches that crossed a 16 byte boundary (our
instruction bundle size) this is now fixed and we have better dhrystone numbers: 6.33 DMIPS/MHz
at an IPC of 2.51 - better than I expected! We can still do better.
More System Verilog, Interfaces for the (mostly) win
Last week I posted a description about my recent major changes to the way that the load/store
unit works, what I didn’t touch on is some pretty major changes to the ways the actual Verilog code
is written.
I’m an old-time verilog hack, I’ve even written a compiler (actually 2 of them, you can
thank me for the * in the “always @(*)” construct). Back in the
90s, I tried to build a cloud biz (before ‘cloud’ was a word) selling time with it for that last
couple of months when you need zillions of licenses and machines to run them on. Sadly we were
probably ahead of our time, and we got killed by Enron the “smartest guys in the room” who made
any Californian business that depended on the price of electricity impossible to budget for. I
opened sourced the compiler and it died a quiet death.
I started this project using Icarus Verilog, using a relatively simple System Verilog subset,
making sure I could also build on Verilator once I started running larger sims.
I’m a big fan of Xs, verilog’s 4-value ‘undefined’ values - in particular using them as clues to
synthesis about when we don’t care and using them during simulation to propagate errors when
assumptions are not met (and to check that we really don’t care)
For example - a 1-hot Mux:
always @(*)
casez (control) // synthesis full_case parallel_case
4'b1000: out = in0;
4'b0100: out = in1;
4'b0010: out = in2;
4'b0001: out = in3;
default: out = 'bx;
endcase
One could just use:
always @(*)
casez (control) // synthesis full_case parallel_case
4'b1???: out = in0;
4'b?1??: out = in1;
4'b??1?: out = in2;
4'b???1: out = in3;
endcase
Chances are synthesis will make the same gates - but in the first case simulation is more likely to blow up
(and here I mean blow up ‘constructively’ - signalling an error) if more than 1 bit of
control is set. “full_case parallel_case” is a potentially dangerous thing, it gives the synthesis tool
permission to do something that might not be the same as what the verilog compiler does during simulation
(if the assumption here that control is one-hot isn’t met) the “out = ‘bx” gets us a warning as the Xs
propagate if our design is broken.
Xs are also good for figuring out what really needs to be reset, propagating timing failures etc etc
Sadly Verilator doesn’t support Xs, while Icarus Verilog does. Icarus Verilog also compiles
a lot faster making for far faster turn around for small tests - I had got to a point where I would
use Verilator for long tests and Icarus for actual debugging once I could reproduce a problem “in the
small”.
One of the problems I’ve had with this project is that in simple verilog there really wasn’t a
way to pass an arbitrary
number of an interface to an object - for example the dcache needs L dcache read ports, where L is the
number of load ports - you can pass bit vectors of 1-bit structures, but not an array of n-bit structures
like addresses, or an array of data results.
System Verilog provides a mechanism to do this via “interfaces” - and this time around I have embraced them
with a passion - here’s NLOAD async dcache lookups:
interface DCACHE_LOAD #(int RV=64, NPHYS=56, NLOAD=2);
typedef struct packed {
bit [NPHYS-1:$clog2(RV/8)]addr; // CPU read port
} DCACHE_REQ;
typedef struct packed {
bit hit;
bit hit_need_o;
bit [RV-1:0]data;
} DCACHE_ACK;
DCACHE_REQ req[0:NLOAD-1];
DCACHE_ACK ack[0:NLOAD-1];
endinterface
Sadly Icarus Verilog (currently) doesn’t support interfaces and I’ve had to remove them from our
supported tools - there’s still stuff in the Makefile for its support (I live in hope :-). Even Verilator’s
support is relatively minimal, it doesn’t support arrays within arrays, nor does it support
unpacked structures (so debug in gtkwave is primitive).
I’m really going to miss 4-value simulation, I hope IV supports interfaces sometime - I wish
I had the time to just do it and offer it up, but realisticly one can really only do one Big Thing at a time.
What this does mean though is that longer term a lot of those places where I’m currently making
on-the-fly verilog from small C programs to get around the inability to pass arbitrary numbers
of a thing to a module (the register file comes to mind), or using ifdef’s to enable
particular interfaces will likely go away in the future to be replaced by interfaces. They won’t completely
go away, I still need them to make things like arbitrary width 1-hot muxes - and more generally
for large repetitive things that could be ruined by fat finger typos.
You can find the new load/store interfaces in “lstypes.si”.
Next time: Combining ALUs and Branch units