VRoom! blog - quick note on a bug
11 Feb 2023
After the recent round of new work we set aside some time to go over the core pipes at the micro level, checking that we hadn’t broken anything - and found something that’s been broken since we rewrote the load/store unit.
Essentially, the commit units can usually predict that an instruction will complete one clock before its result becomes available for bypassing, so we can schedule units dependent on that data for the next clock. Usually this is easy because most instructions take a fixed number of clocks - but loads take a variable amount of time, and even though we know a clock before the loaded data is written back into the register file, we weren’t signalling it. This is now fixed, though it will be a difficult timing path.
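To make the cost of the missing signal concrete, here is a toy model of a chain of dependent loads (think pointer chasing). It is only a sketch: the function names and the latencies are made up for illustration and have nothing to do with the actual RTL. The one assumption carried over from above is that the scheduler decides in one clock what executes in the next.

```python
def dependent_issue_clock(data_ready_clock: int, early_signal: bool) -> int:
    """Clock in which an instruction that needs the loaded data can execute.

    Assumption: the scheduler picks in clock t what runs in clock t + 1.
    With the early signal it learns about completion one clock before the
    data can be bypassed; without it, only in the clock the data shows up.
    """
    notice_clock = data_ready_clock - 1 if early_signal else data_ready_clock
    return notice_clock + 1


def chain_clocks(load_latencies, early_signal: bool) -> int:
    """Clocks needed for a chain of loads, each one using the previous result."""
    clock = 0
    for latency in load_latencies:
        data_ready = clock + latency  # clock in which the data can be bypassed
        clock = dependent_issue_clock(data_ready, early_signal)
    return clock


# Hypothetical latencies: two fast loads and one slower one.
latencies = [3, 3, 5]
print(chain_clocks(latencies, early_signal=True))   # 11
print(chain_clocks(latencies, early_signal=False))  # 14
```

In this model every load in a dependent chain costs one extra clock when the early signal is missing, which is why a small fix like this can show up in a benchmark.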
Results - new Dhrystone numbers ~7.16 DMips/MHz - increased by ~11%
Pulling another ~11% more performance out of the design warrants some more exploration in this area. We know that for the above benchmark we’re retiring on average ~2.8 instructions per clock, but the front end is decoding on average ~3.5 instructions per 16-byte bundle (that is, per clock - Dhrystone is very ‘branchy’, so it’s not really possible to write a faster decoder). Scaling the current score by 3.5/2.8 gives a potential top Dhrystone number of ~9 DMips/MHz, so we have about another 25% to find on the execution side.
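The estimate is just a ratio, and can be checked in a few lines. This sketch assumes the Dhrystone score scales linearly with instructions retired per clock at a fixed clock rate (a simplification); the function name is made up for illustration.

```python
def estimate_peak(current_dmips_per_mhz: float,
                  retired_per_clock: float,
                  decoded_per_clock: float) -> float:
    """Score we would get if the back end retired everything the decoder supplies.

    Assumption: DMips/MHz scales linearly with instructions retired per clock.
    """
    return current_dmips_per_mhz * decoded_per_clock / retired_per_clock


current = 7.16                       # DMips/MHz, measured
peak = estimate_peak(current, retired_per_clock=2.8, decoded_per_clock=3.5)
headroom = peak / current - 1        # fraction still to be found

print(f"{peak:.2f} DMips/MHz, {headroom:.0%} headroom")
# prints: 8.95 DMips/MHz, 25% headroom
```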
Off to work on some better pipeline visualization tooling, and to experiment with a 64-entry commit queue (something intended for a real chip anyway).
Next time: More misc extensions