10 Feb 2023
Introduction
With FP out of the way the decks have been cleared and we’re moving along quickly, adding support for more ratified extensions. Lots of new stuff here today!
Results - new Dhrystone numbers ~6.74 DMips/MHz - increased by ~5%
Note: we use the term “crypto” here - in this context it has all to do with
cryptography and absolutely nothing to do with cryptocurrencies.
Bitfield instructions
We had implemented the bitfield extensions from the original proposal quite a while ago, but had not tested them, and they were disabled.
Now that we’ve spent time with the ratified extensions, a lot of the work has been removing instructions that were dropped from the spec before ratification.
Most of the instructions ended up in the shift ALU (which is essentially a machine-generated bunch of muxes), where all instructions run in 1 clock. The instructions that involve adders or arithmetic comparisons (shXadd and min/max) went into the ALU, and clmul, which takes 2 clocks, went into the multiplier (which itself mostly runs with 2-clock latencies).
All in all this came up really fast - we have now implemented all of the bitfield extensions (Zba, Zbb, Zbc, and Zbs) and they pass all the compliance tests.
Crypto instructions
The crypto K extensions include some of the bitfield extensions along with a bunch of more crypto-oriented
instructions, in particular support for the NIST-mandated SHA256/512 and AES32/64, and the Chinese SM3
and SM4 algorithms.
We have implemented all of the scalar crypto extensions (Zbkb, Zbkc, Zbkx, Zknd, Zkne, Zknh, Zksed, Zksh) and they pass compliance testing.
We can attest that we meet the “Data Independent Execution Latency Subset: Zkt”, as all of VRoom!’s instructions
(with the exception of integer and FP division, which is allowed) run in fixed time irrespective of the values
of their operands.
We have also implemented the entropy source register (Zkr) - we have not implemented a real entropy source
in the Verilog model, nor a whitener - those are really things that need to be decided in
a final implementation. Instead we’ve created a dummy plugin for testing, and hardware to distribute
entropy to multiple harts.
We’ve completed support for all of the Zkn (NIST Algorithm Suite), Zks (ShangMi Algorithm Suite) and
Zk (standard scalar cryptography extension) extensions, and they have passed compliance testing.
RV32
In the past we had not been treating 32-bit mode (running RV32 instructions in our RV64 world) as a full
part of our instruction set - the design was all in there, but none of it was tested. With the advent of a
whole bunch of new 32-bit-only crypto instructions it needed to be tested, so we’ve brought up RV32 as a
compliance testing target and can now report that VRoom! passes all of the RV32 compliance tests as well.
Compliance testing now takes ~8 hours to run ….
Misc
We’re now working our way through the newly ratified extensions - along with adding the security bits for the crypto
seed we also implemented the rest of the security register (the new PMP modes for the OPTEE people).
Summary
Retesting Dhrystone (recompiled with the new instructions) we see a minor (5%) increase in Dhrystone numbers,
from 6.42 to 6.74 DMips/MHz - this is almost entirely due to the compiler now using the new sh1add/sh2add/zextb
instructions from the bitfield extensions.
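To give a feel for where that gain comes from, here is a hypothetical example (not taken from Dhrystone itself): a scaled array index normally needs a shift and an add, and with Zba the compiler can fold those into a single sh2add.

// Hypothetical example (not from Dhrystone): the scaled index below normally
// compiles to a shift plus an add (slli + add); with Zba enabled the compiler
// can emit a single sh2add for the address calculation instead.
int sum(const int *p, unsigned n) {
    int s = 0;
    for (unsigned i = 0; i < n; i++)
        s += p[i];          // address = p + (i << 2): one sh2add with Zba
    return s;
}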
Next time: More misc extensions
31 Jan 2023
Introduction
We finally finished the FP unit, it passes all the compliance tests and our extended
FP data tests.
Exceptions
After finishing the FP sub-units (add/multiply/divide), the one main missing piece
was generating exceptions - on a RISC-V CPU, FP exceptions are not system traps, they are
accumulated bit masks in the CSR unit. Generating them is not hard; getting them correct
is harder, and requires lots and lots of testing.
Because we can retire/commit multiple instructions per clock (up to 8) we need to be able to
store exception results in each instruction’s commit unit until they are ready to commit (alongside
each instruction’s result data in the commit register file).
When the time comes to commit an instruction (write back its commit register data into the
architectural register file) we logically OR together the 5 FP exception bits from each instruction being retired
in the same clock (non-FP instructions simply report 0), along with a bit that reports that FP state
has changed (for example, FP loads are non-FP instructions that change FP state). The results are accumulated
into the CSR unit’s exception register and the FP MSTATUS.FS (clean/dirty) state.
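As a rough illustration of that merge (not VRoom!’s actual RTL - the names and structure here are made up for the sketch), assuming up to 8 retiring slots, each carrying a 5-bit fflags field and an “FP state changed” bit:

// Minimal sketch (illustrative names, not the RTL): merging FP exception
// state for the instructions retiring in one clock.
#include <stdint.h>

#define NRETIRE 8            // up to 8 instructions commit per clock

typedef struct {
    int     valid;           // this slot retires an instruction this clock
    uint8_t fflags;          // 5 FP exception bits (NV, DZ, OF, UF, NX)
    int     fp_dirty;        // instruction changed FP state (includes FP loads)
} retire_t;

void retire_fp_state(const retire_t slot[NRETIRE],
                     uint8_t *csr_fflags, int *mstatus_fs_dirty) {
    uint8_t flags = 0;
    int     dirty = 0;
    for (int i = 0; i < NRETIRE; i++) {      // non-FP slots simply report 0
        if (slot[i].valid) {
            flags |= slot[i].fflags;
            dirty |= slot[i].fp_dirty;
        }
    }
    *csr_fflags |= flags;                    // accumulate into the CSR
    if (dirty)
        *mstatus_fs_dirty = 1;               // MSTATUS.FS -> dirty
}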
Speculative Execution
We’re an out-of-order/speculative system - floating point instructions can be executed out of order.
This means that, like FP register data, we can’t commit exception state until an instruction is no
longer speculative; we also need to retire exception information in order with respect to CSR
instructions that access and update that same state.
Committing accumulated exception data as described in the previous section, combined with the fact that the
CPU only retires CSR access instructions alone (not with the instructions before and after them), means that
at that point everything is synchronized.
One problem with speculative and OoO execution is that an instruction that changes the global FP rounding
mode may execute AFTER an instruction that depends on it. We solve this problem using an existing
mechanism: there is a set of registers in the CSR unit that, if you change their contents, trigger a pipe
flush (essentially a null trap that flushes all the subsequent uncommitted instructions and then refetches
the next instruction). We reuse this mechanism by adding the rounding-mode register to that list, but
only if its contents actually change.
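A minimal sketch of that “flush only on change” rule, assuming a hypothetical csr_frm register and trigger_pipe_flush() hook (neither is a real VRoom! name):

// Minimal sketch (illustrative names only): a write to the rounding-mode
// register triggers the pipe-flush/refetch mechanism only when the value
// actually changes, so rewriting the same mode costs nothing.
void trigger_pipe_flush(void);           // hypothetical hook: null trap that kills
                                         // younger uncommitted instructions and refetches
static unsigned csr_frm;                 // current rounding mode (3 bits)

void write_frm(unsigned new_frm) {
    new_frm &= 0x7;
    if (new_frm != csr_frm) {            // only a real change forces the flush
        csr_frm = new_frm;
        trigger_pipe_flush();
    }
}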
One of the hardest bugs to find when we were bringing up Linux on the design on the AWS FPGA instance
was in the integer divider - a speculative divide would be shot down by a mispredicted branch
but would keep on going, and 20 clocks later it would write random data into a commit register.
For most functional units this is not an issue: it takes 3 clocks for a pipe break, so units that run
in under 3 clocks simply write back into commit registers and the results are ignored. For the FP divide
and sqrt we’ve made sure that this case works correctly.
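As a sketch of the fix (again illustrative C, not the RTL): a long-latency unit like divide/sqrt has to check, at writeback time, that the commit entry it was going to write is still live.

// Illustrative sketch, not the RTL: a multi-cycle unit remembers which commit
// entry it will write; if that entry has been invalidated by a mispredict
// flush before the result is ready, the writeback is simply dropped.
typedef struct {
    int       busy;          // divide/sqrt in flight
    unsigned  dest_tag;      // commit entry it will eventually write
    int       cycles_left;   // remaining latency
    long long result;
} div_unit_t;

void div_tick(div_unit_t *u, const int entry_valid[], long long commit_regs[]) {
    if (!u->busy)
        return;
    if (--u->cycles_left == 0) {
        u->busy = 0;
        if (entry_valid[u->dest_tag])               // suppress the write if the
            commit_regs[u->dest_tag] = u->result;   // entry was flushed
    }
}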
Compliance Testing
Previously we’d been using an older, rather ad-hoc chunk of the standard compliance
tests. In this latest checkin we’ve updated to the current tip of tree with the new
mechanisms (thanks to the authors and the riscv-sail people) - our design currently passes all
the IMCFD tests. We don’t yet have RISC-V 16-bit FP tests, so that support is not nearly as well verified.
It also passes a much larger group of stand-alone FP tests (240M vectors) - we’re intending to use the same
FP submodules in the vector unit, so they’ve been built in a modular fashion.
Conclusion/Future Work
Time to work on something different for a while - next up will be the bit field instructions (the B extension);
this part of the design is already coded, but not at all tested.
We’ll revisit FP in a while, adding a second FP unit and doing some benchmarking.
Next time: Bit field Instructions
14 Jan 2023
Introduction
This blog entry is about floating point divide and square root and in particular the algorithms we use.
Divide
So let’s talk about how we do division - starting with integer divide. In the multiplier unit we use
a pretty simple integer divide algorithm - essentially it’s the same algorithm we all
learned in primary school. A 64-bit unsigned divide goes something like:
count = 64;
remainder = {64'b0, dividend};
quotient = 0;
while (count > 0) {
    tmp = remainder[127:63] - divisor;   // compare the left-shifted remainder with the divisor
    if (tmp >= 0) {                      // no borrow: tmp < divisor, so it fits in 64 bits
        remainder = {tmp, remainder[62:0], 1'b0};
        quotient = (quotient<<1) | 1;
    } else {                             // borrow: just shift
        remainder = {remainder[126:0], 1'b0};
        quotient = (quotient<<1) | 0;
    }
    count--;
}
There’s a small counter, a subtraction, and a couple of shifters, one 128 bits and one 64 bits (the “tmp >= 0” test is just the borrow out of the subtraction).
After 64 clocks the quotient contains the answer and the high 64 bits of remainder contain the remainder (in case you are really doing a rem operation rather than a div).
64 clocks is a long time - we can speed things up by calculating 2 bits per clock, which goes like this:
count = 64/2;
remainder = {64'b0, dividend};
quotient = 0;
while (count > 0) {
    tmp3 = remainder[127:62] - 3*divisor;   // compare the remainder shifted left by 2
    tmp2 = remainder[127:62] - 2*divisor;   //   against 3x, 2x and 1x the divisor
    tmp1 = remainder[127:62] - divisor;
    if (tmp3 >= 0) {
        remainder = {tmp3, remainder[61:0], 2'b00};
        quotient = (quotient<<2) | 3;
    } else if (tmp2 >= 0) {
        remainder = {tmp2, remainder[61:0], 2'b00};
        quotient = (quotient<<2) | 2;
    } else if (tmp1 >= 0) {
        remainder = {tmp1, remainder[61:0], 2'b00};
        quotient = (quotient<<2) | 1;
    } else {
        remainder = {remainder[125:0], 2'b00};
        quotient = (quotient<<2) | 0;
    }
    count--;
}
That takes 3 subtraction units rather than 1 (plus an adder to make the 3*divisor term).
You can do 3 or 4 or more bits per clock - but it takes 2**N-1 subtraction units and 2**(N-1)-1 adders,
which gets really expensive really fast - somewhere in the 2 (3 sub/1 add),
3 (7 sub/3 add), or 4 (15 sub/7 add) bits/clock range is probably reasonable.
For floating point we don’t need a 64-bit remainder, but we do need 3 guard bits - we also only have
at most 51 bits of mantissa, so our 64-bit FP divider is actually smaller and a little faster than the
equivalent integer one. On the other hand, for integer division we can speed up division of small numbers by
skipping the initial 0s, while floating point mantissas always have an MSB of 1, which makes that optimization
largely pointless.
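As a sketch of that early-out (illustrative only, not the actual hardware): count the dividend’s leading zeros, pre-shift it up by that amount, and run correspondingly fewer rounds - the skipped rounds would only have produced leading 0 quotient bits.

// Illustrative sketch of the leading-zero skip for integer divides; FP
// mantissas are already normalized (MSB set) so there is nothing to skip.
#include <stdint.h>

static inline int div_rounds(uint64_t dividend, int *preshift) {
    if (dividend == 0) {                    // quotient and remainder are trivially 0
        *preshift = 0;
        return 0;
    }
    int k = __builtin_clzll(dividend);      // leading zeros of the dividend
    *preshift = k;                          // start with remainder = {64'b0, dividend << k}
    return 64 - k;                          // rounds actually needed
}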
The rest of the work in FP divide is dealing with the exponents (subtracting the divisor’s exponent from the dividend’s)
and handling denorms, NaNs and divide-by-0s, as well as the underflow/overflow logic, which is similar to what we use elsewhere.
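For a feel of that surrounding logic, here is a rough sketch (simplified, illustrative field handling - not the actual design) of the exponent arithmetic and the special cases that bypass the iteration for a double-precision divide:

// Rough sketch of the non-mantissa work for an FP64 divide: exponents
// subtract, and NaNs/infinities/zeros/denorms are dealt with before (or
// instead of) running the divide loop. Rounding/normalization is omitted.
#include <stdint.h>

#define BIAS 1023

// returns 1 if the result is decided without running the divide iteration
static int fdiv_special(uint64_t a, uint64_t b, int *exp_out) {
    int ea = (a >> 52) & 0x7ff, eb = (b >> 52) & 0x7ff;
    uint64_t ma = a & ((1ull << 52) - 1), mb = b & ((1ull << 52) - 1);

    if ((ea == 0x7ff && ma) || (eb == 0x7ff && mb)) return 1;  // NaN in -> NaN out
    if (ea == 0x7ff || eb == 0x7ff)                 return 1;  // infinities
    if (eb == 0 && mb == 0)                         return 1;  // x/0: DZ/infinity
                                                               // (0/0 is NV/NaN)
    if (ea == 0 || eb == 0)                         return 1;  // zeros/denorms handled separately

    *exp_out = (ea - BIAS) - (eb - BIAS) + BIAS;    // result exponent, before the
    return 0;                                       // 1-bit normalize and rounding
}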
Square Root
First I’d like to thank Michael Field and Bruce Hoult for this Twitter thread.
Looking at the algorithm it seemed obvious that it looked a lot like our divider …. so the question is “can we cheaply leverage the FP divide logic to do sqrt?” (the answer is “yes”).
First, here’s where we started (54 is from 51 bits of precision plus 3 guard bits), recoded from the Twitter thread:
long long sq(long long v)
{
    long long n = 0;                // result, built one bit per iteration
    long long bit = 1ull << (54/2);
    long long a = 0;
    long long b = 1ull << 54;
    while (bit != 0) {
        long long t = v - (a|b);    // trial subtract
        a = a >> 1;
        if (t >= 0) {
            v = t;
            a |= b;
            n |= bit;
        }
        b = b >> 2;
        bit = bit >> 1;
    }
    return n;
}
Let’s start by noticing that ‘v’ behaves a lot like the remainder. Let’s also change things to count like
the divider does (with a count variable) and get rid of the ‘bit’ shift register - ‘n’ is
now the quotient and we’ll calculate it the same way we did for division. Finally, ‘a’ is similar to the divisor, so we’ll rename it:
long long sq(long long remainder)
{
    long long quotient = 0;
    long long divisor = 0;          // 'a' renamed - it behaves like the divisor
    long long b = 1ull << 54;
    int count = 54/2;
    while (count >= 0) {
        long long tmp = remainder - (divisor|b);
        if (tmp >= 0) {
            remainder = tmp;
            divisor = (divisor>>1) | b;
            quotient = (quotient<<1) | 1;
        } else {
            divisor = divisor>>1;
            quotient = (quotient<<1) | 0;
        }
        b = b >> 2;
        count--;
    }
    return quotient;
}
There’s one problem here - it calculates integer square roots, and FP mantissas always start with a 1
in the MSB - this algorithm generates half the number of bits you put in, and we want to generate a full
precision result. We could run the algorithm for twice as many clocks - but b would run out of bits; we’d need to make it 108 bits wide, and we’d also need to do the same for remainder and tmp.
It would be a lot bigger and slower.
Luckily there’s a simple hack we can use to run the algorithm in place - simply shift everything (divisor, b, remainder) left
by one at the end of every round. Since they’re mostly already being shifted right, this also simplifies a whole bunch of stuff.
long long sq(long long remainder)
{
    long long quotient = 0;
    long long divisor = 0;
    long long b = 1ull << 54;
    int count = 54;
    while (count >= 0) {
        long long tmp = remainder - (divisor|b);
        if (tmp >= 0) {
            remainder = tmp << 1;
            divisor = divisor | (b<<1);
            quotient = (quotient<<1) | 1;
        } else {
            remainder = remainder << 1;
            divisor = divisor;      // divisor is unchanged
            quotient = (quotient<<1) | 0;
        }
        b = b >> 1;
        count--;
    }
    return quotient;
}
What’s interesting here is that the divisor doesn’t shift anymore, it just accumulates ‘b’s in place, and the remainder
does all its math in place too, so we can deal with 54 bits (plus a couple of extras) rather
than 108. ‘b’ is also interesting - now it’s exactly equal to 1<<count, and all its flops
could be replaced by a demux.
This now looks very much like our division code, the only real differences are the second argument to
the subtraction that creates tmp and that the divisor accumulates.
Reality is a little more complicated - mostly because of the nature of square roots we have
to deal with mantissas between 0.5 and 2, rather than the usual case of 1 up to 2 (that adds an extra
bit of precision). This is because we calculate the result’s exponent by halving the incoming exponent,
and if its LSB is non-zero we have to do a 1-bit shift of the incoming mantissa before we start the square root
operation.
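A rough sketch of that exponent handling (illustrative, not the actual RTL): halve the unbiased exponent, and when it is odd shift the mantissa right one place first, which is what widens the mantissa range to [0.5, 2).

// Illustrative sketch of square-root exponent handling for a normalized
// input m * 2^e with m in [1,2): when e is odd the mantissa is pre-shifted
// right by one, so the iteration has to accept mantissas in [0.5, 2).
#include <stdint.h>

#define BIAS 1023

static void sqrt_prep(int exp_biased, uint64_t mant /* fixed-point 1.xx */,
                      int *exp_out, uint64_t *mant_out) {
    int e = exp_biased - BIAS;          // unbiased exponent
    if (e & 1) {                        // odd exponent:
        mant >>= 1;                     //   mantissa now lies in [0.5, 1)
        e += 1;                         //   exponent becomes even
    }
    *exp_out  = e / 2 + BIAS;           // halve and re-bias
    *mant_out = mant;                   // feed this into the sqrt iteration
}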
We’ve currently only implemented the 1-bit/clock version for sqrt - we could do 2/3/4 bits/clock etc. using the same number of
subtraction units as divide (in fact the same subtraction units, with a mux on one input). Generating the
terms for this is a little more complex, but not a lot - the main
difference is that the divisor changes on every clock, whereas for division you can calculate the divisor multiples once at the start.
Still To Do
Next up FP exceptions, and running the full suite of verification tests.
Next time: Final FPU stuff.
27 Dec 2022
Introduction
Work has slowed, we’ve not been getting a lot of time to work on FP …. this is an update
as we move from one part of the FP unit to the next.
New Stuff
Most of the work in the past couple of months has been a rewrite of the multiplier to add fused
multiply-add to the FP ALU - in the end this involved pulling the whole thing apart and doing a
major rewrite; this is now done.
At the same time we added 16-bit floating point all the way through the FP unit (and decode/load/store/etc).
Finally in order to test fused mul-add and FP16 we upgraded our stand-alone FP multiplier/adder test infrastructure
and found far better sources for test vectors - and in the process found lots of wonderfully
annoying corner cases - which is where most of the time has been spent over the past few months.
Our FP test vectors are now many gigabytes in size and too big to fit in git, they’re easier to just
regenerate - they all ran clean for the first time last night which is why we’re checking
everything in today - when we’re done we’ll run something 100 times larger.
Still To Do
We know how to handle exceptions - the way RISC-V manages them works very well with our commit model -
but there’s still a bunch of work to do there.
We also have a small number of instructions still to implement - basically divide and sqrt, they will be next.
Finally we’re using our own testing for the moment, once we’re done with the above we’ll pull in
the more public verification tests and run them before we declare we’re finished with FP.
Next time: more FPU stuff - divide/sqrt
01 Oct 2022
Bugs
A quick note: still working on FP, but we spent some time fixing bugs - thanks to Hirosh,
who’s been playing with VRoom! I
went back and ran some old code fragments and discovered they locked up; that led to rerunning our
standard set of regressions we used before switching to running Linux on AWS - it turns out a lot
of the code to do with fence instructions and load conditional instructions was broken by the memory
subsystem overhaul.
These are fixed now, though we probably need a bunch of even more subtle fence tests. Running
all the regression tests on every change is now a lot easier.
Next time: more FPU stuff