VRoom! blog: Rename Optimizations

Introduction

The past few weeks we’ve been working on two relatively simple changes in the register renamer: renaming registers known to be 0 to register x0, and optimizing register to register moves.

These are important classes of optimizations on x86 class CPUs, a real hot topic - previously I worked on an x86 clone where instructions were cracked into micro-ops - lots of these moves were generated by this process and optimization is important there. On RISC-V code streams we see relatively fewer (but not 0) register moves.

This particular exercise isn’t expected to have a great performance bump, but it’s the same spot in the design that we will also be doing opcode merging so it’s a chance to start to crack that part of the design open.

Our renamer essentially does one thing - it keeps track of where the live value (from the point of view of instructions coming out of the decoder or the trace cache) of architectural registers are actually stored, they could either be in the architectural register files, or the commit Q register file - once an instruction hits the commit Q it can’t be executed until all it’s operands are either in the commit Q register file (ie done), or in the architectural register file.

placeholder

The goal of these optimizations is not so much to remove move instructions from the instruction stream but to allow subsequent instructions that depend on them execute earlier (when the source register for the move instruction is ready rather than it’s destination register is).

0 register operations

This optimization is very simple - on the fly the renamer recognizes a bunch of arithmetic operations as generating ‘0’ - add/xor/and/or.

For example “li a0, 0” is really “add a0,x0,0”, or “xor a0,t0,t0” etc - it’s a simple operation and is very cheap.

(by the time they hit the renamer all the subtract operations look like adds)

Move instruction operations

These essentially optimize:

mov	a0, a1
add	t1, a0, t5

into:

mov	a0, a1
add	t1, a1, t5

Note: they don’t remove the mov instruction (that’s a job for another day) as there’s not enough information in the renamer to tell if a0 will be subsequently needed. The sole goal here is to allow the two instructions to be able to run in the same clock (or rather for the second instruction to run earlier if there’s an ALU slot available).

Recognizing a move in the renamer is easy, we’re looking for add/or/xor with one 0 operand.

Managing the book keeping for this is much more complex than the 0 case - largely because the register being moved could be in 1 of 4 different places - imagine some code like:

sub	X, ....
....
mov	Y, X
...
add	,,Y

The live value for operand ‘Y’ to the add instruction could now be in:

  • the actual architectural register X
  • the commitQ entry for the sub instruction
  • the commitQ entry for the mov instruction
  • the actual architectural register Y

The commitQ logic that keeps track of registers now needs be more subtle and recognize these state changes. As does the existing logic in the renamer that handles the interactions between instructions that are decoded in the same clock and that update registers referenced by each other.

Conclusion

There’s not discernible performance bump for dhrystone here (though we can see the operations happening at a micro level) - there’s only 1 move and a couple of 0 optimizations per dhrystone loop and they’re just reducing pressure on the ALUs - that’s not surprising.

We think this is worth keeping in though, as we start running more complex benchmarks, especially ALU bound ones we’re going to see the advantages here.

A note on Benchmarking

While checking out performance here we noticed a rounding bug in the code we’ve been running on the simulator (but not on the FPGAs). The least significant digits of the “Microseconds for one run through Dhrystone” number we’ve been quoting have been slightly bogus - that number is not used for calculation of the rest of the numbers (including the Dhrystone numbers) so those numbers remain valid.

We recompiled the offending code and reran it for a few past designs … they gave different (and lower) numbers (6.42 vs. 6.49 DMips/MHz) than we had before …. after a lot of investigation we’ve decided that this is caused by code alignment in the recompiled code (another argument for a fully working trace cache).

We could hack on the code to manually align stuff and make it faster (or as fast as it was) but that would be cheating - we’re just compiling it without any tweaking. Instead we’ll just note that this change has happened and start using the new code and compare with the new numbers. What’s important for this optimization process is knowing when the architecture is doing better.

Next time

We’ve been putting off finishing the FPU(s) - we currently have working FP adders (2 clocks) and multipliers (3 clocks) that pass many millions of test vectors, some of the misc instructions still need work, and it all still needs to be integrated into a new FP renamer/scoreboard. Then on to a big wide vector unit …..

Next time: FPU stuff

VRoom! blog: Trace cache - Part 2

Introduction

It’s been a couple of months since I last posted about our work on a trace cache for VRroom, it’s become a bit of slog - trace cache bugs are the worst: suddenly out of nowhere your code branches off into the weeds, often caused by something that happened tens of thousands of clocks before. Short story, we’re pausing work on my trace cache implementation to get more important stuff done, long story: well read on

What is good about our current Trace Cache?

You’ll remember from the previous blog post about our trace cache that it’s pretty simple, the idea was to bolt on a direct mapped cache filled with traces extracted from the completion end of our commit engine that competed with the normal icache/branch-predictor/PC/decoder to provide decoded data directly to the rename units.

After the last blog post we integrated support to avoid traces from crossing instruction call/return transitions, this preserved the call prediction stacks in the BTC and made overall performance in the same ballpark as the existing performance.

Now that is up and working, it passes all our sanity tests, direct examination of the pipelines shows a higher issue rate (8 instructions per clock regularly which was the goal) but with a higher stall rate (because the ALUs are busy and the commit queue is filling) - which indicates that a well working trace cache probably needs an increase in the number of ALUs and/or load/store units to further increase performance - that in itself is not a bad thing it means we have more performance choices when choosing a final architecture.

What is bad about our current Trace Cache?

However while the back end pipelines are performing better the one bad thing that benchmarks show is that because our naive direct mapped trace cache bypasses the branch predictor and as a result it performs worse (about 15% worse) on branch prediction - you will recall we’re using a traditional dual combined branch predictor (bimodal and global history) - the trace cache is effectively embedding the bimodal BTC data in the cache, but not the global predictor data.

What we really need is a way to incorporate the global predictor into the trace cache tags, that way we can collect multiple traces for the same starting PC address (there are lots of examples of this in the literature on trace caches, it’s not a new idea).

This is not a trivial change, it means pulling apart the BTC and PC fetcher code and more closely integrating the trace cache at a lower level - the BTC needs to be able to track both trace streams and icache streams to update its predictors from both sources - the current BTC/PC fetcher are very focused on icache boundaries and predicting multiple branches within them, the data coming out of the trace cache tends to have arbitrary PC boundaries, and has unconditional branches elided - merging them more than we currently have is a non trivial piece of work.

Note that getting a direct mapped trace cache to ~5/6 the performance of the existing CPU is actually not that bad - it means we’re on the right track, just not there yet.

Conclusion

Our basic trace cache design seems to be functional, but doesn’t perform well enough yet - short to medium term it’s become a bit derail against making other progress so we’re going to set it aside for the moment, probably until the next time the BTC has been opened up and is spread over the desktop for some major rearrangement and then dig into integrating global history into the trace cache tags and BTC predictor updates from trace data.

This is not the last we’ve seen of the trace cache.

Our next work - some simple work on optimising moves and 0ing of registers - some minor instruction recognition in the renamer will allow commit Q entries to resolve earlier (more in parallel) reducing instruction dependencies on the fly.

Next time: Simple Renamer Optimisations

VRoom! blog: Trace cache - Part 1

Introduction

I’ve spent the past month working on adding a trace cache to VRoom! - it’s not done yet but is mildly functional. Here’s a quick write up of what I’ve done so far.

What is a Trace cache?

Essentially a trace cache is an instruction cache of already decoded instructions. On a CISC CPU like an Intel/AMD x86 it might contain RISC-like micro-operations decoded from complicated CISC instructions.

On VRoom! we have an instruction decode that every clock can decode 4 32-bit instructions, 8 16-bit instructions, or some mixture of the two from an 128-bit bundle fetched from the L1 instruction cache. Theoretically we can feed our CPU’s core (the renamer onwards) with 8 instructions on every clock, but in reality we seldom do. There are four main reasons why:

  • we can only get 8 instructions per clock through the decoder if they are all 16-bit instructions - most instruction streams are a mix of both 16-bit and 32-bit instructions
  • there’s a very rough rule of thumb from computer architecture in general that says that about every 5th instruction is a branch - with potentially 8 instructions per icache fetch that means that (very roughly) on average only 5 of them get executed
  • branch targets are not always on a 16-byte (8 instruction) boundary - when this happens we only issue instructions after the target
  • when the BTC make a micro-misprediction we sometimes lose a clock (and up to 8 instructions) retrying the icache fetch

And indeed measurement of instruction use rates on VRoom! (again on the still useful dhrystone - which currently does almost perfect branch prediction) we measure 3.68 instructions/128-bit instruction bundle - on average less than 4 instructions being issued per clock to the core. Of course every instruction stream is different - we look both at largish benchmark streams and drill down into particular hot spots looking to understand low level issues.

There is also another an issue for tight loops - imagine something like this looking for the null at the end of a string:

loop:
	ldb	a0, (a1)
	add	a1, a1, 1
	bne	a0, loop

our current execution core can issue up to 4 loads per clock and execute 3 add/branch instructions per clock - theoretical it should be able to speculatively go 3 times around this loop every 2 clocks - if only it can be fed from the instruction stream quickly enough (9 instructions every 2 clocks), but in this case the current fetch/decode logic is limited to 3 instructions per clock (or per 128-bit bundle) - it might even be less (3 instructions every 2 clocks) if the instructions happen to straddle a 128-bit bundle boundary.

So a trace cache allows us to record ‘traces’ of instructions - streams of decoded consecutive instructions from the system and then consistently play them back at the full rate so that the core can execute them without the limitations placed on them by the decoder, instruction cache, and the alignment of instructions within an instruction stream.

A trace cache is an interesting beast - when it hits it replaces not only the instruction cache but also part of the branch target cache - and there’s likely some interesting interactions between the BTC and the trace cache.

A trace cache hit returns N consecutive instructions (not necessarily consecutive in memory, but consecutive within the instruction stream) along with the PC of the instruction to execute following the last instruction in the trace that’s returned. On VRoom! we don’t ‘execute’ unconditional branches, they’re handled solely in the fetch unit so they don’t find their way into the execution pipe or the trace cache - the ‘next’ address of an instruction might be the address the branch after it jumped to - so this loop:

loop:	bne     a1, done
        add     a1, a1, -1
	j	loop
done:

results in only 2 instructions in the commitQ instruction stream and 2 instructions per loop in a trace cache.

Implementation

Each instruction from VRoom!s decoders is represented by a roughly 160-bit bundle (registers, PC, decoded opcode, functional unit number, predicted branch destination, immediate value etc) - after renaming (matching registers to the commitQ entry that will produce their values) the renamed instructions are entered as a block into the commitQ.

The commitQ is an ordered list of instructions to be out-of-order and speculatively executed - because of branch mispredictions, traps/faults, TLB misses etc an instruction may or may not reach the far end of the commitQ, where it will be ‘committed’ (ie will update the architectural state - writes to memory will actually happen, register updates will be committed to the architecturally visible register file etc).

The commit end of the commitQ can currently commit up to 8 instructions per clock (essentially we need a number here that is higher than our target IPC so that we can burst where possible - - it’s strongly related to the number of write ports into the architectural register file) - it regularly achieves 8 instructions per clock, but on average it’s more like 2-3 instructions per clock (it’s how we measure our system IPC) one of our main goals with VRoom! is to increase this number to something more like 4 IPC - of course it’s going to be different on each instruction stream.

So let’s look at how a trace cache (I$0 - 0th level instruction cache) fits into our system

placeholder

Red arrows are control signals, black ones are instruction bundles, blue ones register access.

Note that the trace cache is filled from the portion of the commitQ containing committed instructions, on every clock it will be filled with 0-8 instructions. On the fetch side data is read up to 8 at a time and replaces data from the decoders into the rename unit.

The fetch side of things is managed by the PC (program counter) unit, on every clock it fetches from the normal instruction cache and the trace cache in parallel, if it hits in the trace cache the instruction cache data is discarded. If the decode unit is currently busy (from the previous fetch) the PC has to wait a clock before allowing the trace cache data be used (to allow the decode instructions to be renamed), subsequent trace cache data can be dispatched on the every clock. Because we fetch instruction cache data in parallel with trace cache data switching back causes a 1 clock (decode time) penalty. Some of the times we gain a clock - for example when when we take a branch misprediction, the pipe will start one clock more quickly if the new branch target is in the trace cache.

Normally the PC uses BTC information, or information from the decoder to choose what the next PC fetch will be - on a trace cache hit we use the next PC value from the trace cache instead.

Initial Trace Cache Implementation

This is all still very much a work in process, here’s a brief description of our initial implementation, it’s not good enough yet, but the basic connection described above probably wont change.

This initial implementation consists of N traces each containing 8 instructions (~1k bits per trace) - on the read side they act as a fully associative cache - so N can be any value, doesn’t have to be a power of 2, we’re starting with N=64, a number pulled out of thin air.

Each trace has the following metadata:

  • PC of the first entry (tag for associative lookup)
  • next PC of the last entry in the trace
  • bit mask describing which instruction entries in the trace are valid (the first M will be set)
  • use tag - a pinned counter - incremented on each valid access, decremented every P CPU clocks, a trace entry with a use tag of 0 can be stolen

In general some other CPU’s trace cache implementations have much larger trace entries (far more than 8) - our implementation is an attempt to create a system that will self assemble larger trace lines out of the smaller 8 instruction chunks - this part is still a work in progress.

Internally the current implementation looks like this:

placeholder

The fetch interface is described above, the Commit side interface has to handle unaligned incoming data - for example imagine on clock N we get 4 instructions and on clock N+1 we get 8 - we want to be able to write 4 in clock N, 4 more into the same line on clock N+1 and the final 4 instructions into a new line at clock N+1. For this reason there’s a temporary holding buffer “waiting” that holds data and allows us to share a single write port. In essence there’s a single write port with a shift mux and a holding register in its input.

There are two pointers to trace lines:

  • “Last” which points to the last line written (in case we want to add more data to the end of it)
  • “Next” which will be the next line we can write to (if one is available) - this is calculated using the “use tag” metadata a clock ahead

During operation fresh data line goes into Next at which time Last become Next - new data that is in the same trace and fits into the trace pointed to by Last goes there, skipping any input invalidates Last.

Performance

So the above design works, passes our CPU sanity tests (there were lots of interesting bugs around interrupts) however the performance is meh - it’s a little slower than the existing non-trace-cache model which obviously isn’t good enough.

Why? mostly it seems that the current mechanism for choosing traces is too simple, it’s greedy and it grabs any trace when there is space available - we do start traces at jump targets but that’s about it.

A particular problem is that we happily collect traces across subroutine calls and returns - this results in branch mispredictions and confuses the branch target cache. Another issue is that we are collecting the same data over and over again essentially in different phases (at different offsets) within different trace entries which limits the effective size of the trace cache.

So the next plan is to work with the existing working design but modifying the criteria for when traces are started and when they finish - initially making sure that they end at instruction calls and returns, and then some smarts around looping, perhaps limiting trace length.

Conclusion

Our basic trace cache design seems to be vaguely functional, but doesn’t perform well - originally we’d hoped for a 1.5-2x boost - we think this is still possible - watch this space in a month or so.

Next time: Trace cache - Part 2

VRoom! blog: Combining ALUs and Branch Units

Introduction

Wow, this was a fun week VRoom! made it to the front page of HackerNews - for those new here this is an occasional blog post on architectural issues as they are investigated - VRoom! is very much a work in progress.

This particular blog entry is about a recent exploration around the way that we handle branches and ALUs.

Branch units vs ALU units

The current design has 1 branch unit per hart (a ‘hart’ is essentially a RISCV CPU context - an N-way simultaneous multi-threaded machine has N harts even if they share memory interfaces, caches and ALUs). It also has M ALUs - currently two.

After a lot of thinking about branch units and ALU units it seemed that they have a lot in common - both have a 64-bit adder/subtracter in their core, and use 2 register read ports and a write port. Branch units only use the write port for call instructions - some branches, unconditional relative branches don’t actually hit the commitQ at all, conditional branches simply check that the predicted destination is correct, if so they become an effective no-op, otherwise they trigger a commitQ flush and a BTC miss.

So we’ve made some changes to the design to be able to optionally make a combined ALU/branch unit, and to be able to build the system with those instead of the existing independent branch and ALUs units (it’s a Makefile option so relatively easy to change). Running a bunch of tests on our still useful Dhrystone we get:

Branch ALU Combined DMips/MHz Area Register Ports  
1 2 0 6.33 3x 6/3 2022-03-25-combined-branches.md
0 0 2 6.14 2x 4/2  
0 0 3 6.49 3x 6/3  
0 0 4 6.49 4x 8/4  

Which is interesting - 3 combined Branch/ALU units outperform the existing 1 Branch/2 ALU with roughly the same area/register file ports. So we’ll keep that.

It’s also interesting that 4 combined ALUs performs exactly the same as the 3 ALU system (to the clock) even though the 4th ALU gets scheduled about 1/12 of the time - essentially this is because because we’re scheduling ALUs out of order (and speculatively) the 3rd ALU happily takes on that extra work without changing how fast the final instructions get retired.

One other interesting thing here, and the likely reason for much of this performance improvement is that we can now retire multiple branches per clock - we need to be able to do something sensible if multiple branches fail branch prediction in the same clock - the correct solution is to give priority to the misprediction closest to the end of the pipe (since the earlier instruction should cause the later one to be flushed from the pipe).

What’s also interesting is: what would happen if we build a 2-hart SMT machine? previously such a system would have had 2 branch units and 2 ALUs - looking at current simulations a CPU is keeping 1 ALU busy close to 90%, the second to ~50%, the 3rd ~20% - while we don’t have any good simulation data yet we can guess that 4 combined ALUs (so about the same area) would likely satisfy a dual SMT system - mostly because the 2 threads would share I/Dcaches and as a result run a little more slowly (add that to the list of future experiments).

VRoom! go Boom!

Scheduling N units is difficult - essentially we need to look at all the entries in the commitQ and choose the one closest to the end of the pipe ready to perform an ALU operation on ALU 0. That’s easy the core of it looks something a bit like this (for an 8 entry Q):

always @(*)
casez (req) 
8'b???????1: begin alu0_enable = 1; alu0_rd = 0; end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1; end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2; end
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; end
8'b00000000: alu0_enable = 0;
endcase

for ALU 1 it looks like (answering the question: what is the 2nd bit if 2 or more bits are set):

always @(*)
casez (req) 
8'b??????11: begin alu1_enable = 1; alu1_rd = 1; end
8'b?????101,
8'b?????110: begin alu1_enable = 1; alu1_rd = 2; end
8'b????1001,
8'b????1010,
8'b????1100: begin alu1_enable = 1; alu1_rd = 3; end
.....
8'b00000000: alu1_enable = 0;
endcase

For more ALUs it gets more complex for a 3264 entry commitQ it’s also much bigger, for dual SMT systems there are 2 commitQs so 64/128 entries (we interleave the request bits from the two commitQs to give them fair access to the resources).

Simply listing all the bit combinations with 3 bits set out of 128 bits in a case statement just listing all of them gets unruly - but really we do want to express this in a manner where we can provide a maximum amount of parallelism to the synthesis tools we’re using, hopefully they’ll find optimizations that are not obvious - so long ago we had reformulated it to something like this:

always @(*)
casez (req) 
8'b???????1: begin alu0_enable = 1; alu0_rd = 0;
		casez (req[7:1]) 
		7'b??????1: begin alu1_enable = 1; alu1_rd = 1; end
		7'b?????10: begin alu1_enable = 1; alu1_rd = 2; end
		7'b????100: begin alu1_enable = 1; alu1_rd = 3; end
		.....
		7'b1000000: begin alu1_enable = 1; alu1_rd = 7; end
		7'b0000000: alu1_enable = 0;
		endcase
	     end
8'b??????10: begin alu0_enable = 1; alu0_rd = 1; 
		casez (req[7:2]) 
		6'b?????1: begin alu1_enable = 1; alu1_rd = 2; end
		6'b????10: begin alu1_enable = 1; alu1_rd = 3; end
		6'b???100: begin alu1_enable = 1; alu1_rd = 4; end
		.....
		6'b100000: begin alu1_enable = 1; alu1_rd = 7; end
		6'b000000: alu1_enable = 0;
		endcase
	     end
8'b?????100: begin alu0_enable = 1; alu0_rd = 2; 
.....
8'b10000000: begin alu0_enable = 1; alu0_rd = 7; alu1_enable = 0; end
8'b00000000: begin alu0_enable = 0; alu1_enable = 0; end
endcase

etc etc we have C code that will spit this out for an arbitrary number of ALUs (arbitrary depths) the 2 ALU scheduler for a 32 entry commitQ happily compiles under verilator/iverilog and on Vivado (yosys currently goes bang! we suspect upset by this combinatorial explosion). When we switched to 3 ALUs (we tried this a while ago) it happily compiled on verilator (takes a while) and runs. When we compiled up the 4 ALU scheduler on verilator it went Bang! the kernel OOM killer got it (on a 96Gb laptop) - looking at the machine generated code it was 200K lines of verilog …. oops … 3 ALUs was compiling 50k, 2 ALUs ~10k …. serious combinatorial explosion!

Luckily we’d already found another way to solve this problem elsewhere (we have 6! address units) so dropping some other code in to generate this scheduler wasn’t hard (900 lines rather than 200k) - we’re going to need to spend some time looking at how well this new code performs, it had always been expected to be one of the more difficult areas for timing - we might need to find some 95% heuristics here that are not perfect but allow us higher core clock speeds - time will tell. Sadly Yosys still goes bang, must be something else.

Conclusion

Combined branch ALUs seem to provide a performance improvement with little increase in area - we’ll keep them.

Next time: Probably something about trace caches, might take a while

VRoom! blog: Verilog changes, new performance numbers

Introduction

Not really an architectural posting, more about tooling, skip if it’s not really your thing, also new some performance numbers.

New Performance Numbers

I found the bug in the BTC - mispredicted 4-byte branches that crossed a 16 byte boundary (our instruction bundle size) this is now fixed and we have better dhrystone numbers: 6.33 DMIPS/MHz at an IPC of 2.51 - better than I expected! We can still do better.

More System Verilog, Interfaces for the (mostly) win

Last week I posted a description about my recent major changes to the way that the load/store unit works, what I didn’t touch on is some pretty major changes to the ways the actual Verilog code is written.

I’m an old-time verilog hack, I’ve even written a compiler (actually 2 of them, you can thank me for the * in the “always @(*)” construct). Back in the 90s, I tried to build a cloud biz (before ‘cloud’ was a word) selling time with it for that last couple of months when you need zillions of licenses and machines to run them on. Sadly we were probably ahead of our time, and we got killed by Enron the “smartest guys in the room” who made any Californian business that depended on the price of electricity impossible to budget for. I opened sourced the compiler and it died a quiet death.

I started this project using Icarus Verilog, using a relatively simple System Verilog subset, making sure I could also build on Verilator once I started running larger sims.

I’m a big fan of Xs, verilog’s 4-value ‘undefined’ values - in particular using them as clues to synthesis about when we don’t care and using them during simulation to propagate errors when assumptions are not met (and to check that we really don’t care)

For example - a 1-hot Mux:

always @(*)
casez (control) // synthesis full_case parallel_case
4'b1000: out = in0;
4'b0100: out = in1;
4'b0010: out = in2;
4'b0001: out = in3;
default: out = 'bx;
endcase

One could just use:

always @(*) 
casez (control) // synthesis full_case parallel_case
4'b1???: out = in0;
4'b?1??: out = in1;
4'b??1?: out = in2;
4'b???1: out = in3;
endcase

Chances are synthesis will make the same gates - but in the first case simulation is more likely to blow up (and here I mean blow up ‘constructively’ - signalling an error) if more than 1 bit of control is set. “full_case parallel_case” is a potentially dangerous thing, it gives the synthesis tool permission to do something that might not be the same as what the verilog compiler does during simulation (if the assumption here that control is one-hot isn’t met) the “out = ‘bx” gets us a warning as the Xs propagate if our design is broken.

Xs are also good for figuring out what really needs to be reset, propagating timing failures etc etc

Sadly Verilator doesn’t support Xs, while Icarus Verilog does. Icarus Verilog also compiles a lot faster making for far faster turn around for small tests - I had got to a point where I would use Verilator for long tests and Icarus for actual debugging once I could reproduce a problem “in the small”.

One of the problems I’ve had with this project is that in simple verilog there really wasn’t a way to pass an arbitrary number of an interface to an object - for example the dcache needs L dcache read ports, where L is the number of load ports - you can pass bit vectors of 1-bit structures, but not an array of n-bit structures like addresses, or an array of data results.

System Verilog provides a mechanism to do this via “interfaces” - and this time around I have embraced them with a passion - here’s NLOAD async dcache lookups:

interface DCACHE_LOAD #(int  RV=64, NPHYS=56, NLOAD=2);     
	typedef struct packed {
		bit [NPHYS-1:$clog2(RV/8)]addr; // CPU read port
	} DCACHE_REQ;
	typedef struct packed {
		bit hit;
		bit hit_need_o;                                     
		bit [RV-1:0]data;
	} DCACHE_ACK;
	DCACHE_REQ req[0:NLOAD-1];
	DCACHE_ACK ack[0:NLOAD-1];
endinterface

Sadly Icarus Verilog (currently) doesn’t support interfaces and I’ve had to remove them from our supported tools - there’s still stuff in the Makefile for its support (I live in hope :-). Even Verilator’s support is relatively minimal, it doesn’t support arrays within arrays, nor does it support unpacked structures (so debug in gtkwave is primitive).

I’m really going to miss 4-value simulation, I hope IV supports interfaces sometime - I wish I had the time to just do it and offer it up, but realisticly one can really only do one Big Thing at a time.

What this does mean though is that longer term a lot of those places where I’m currently making on-the-fly verilog from small C programs to get around the inability to pass arbitrary numbers of a thing to a module (the register file comes to mind), or using ifdef’s to enable particular interfaces will likely go away in the future to be replaced by interfaces. They won’t completely go away, I still need them to make things like arbitrary width 1-hot muxes - and more generally for large repetitive things that could be ruined by fat finger typos.

You can find the new load/store interfaces in “lstypes.si”.

Next time: Combining ALUs and Branch units