VRoom! blog - Lots of progress: Bitfields, Crypto etc

Introduction

With FP out of the way the decks have been cleared and we're moving along quickly, adding support for more ratified extensions. Lots of new stuff here today!

Results - new Dhrystone numbers ~6.74 DMips/MHz - increased by ~5%

Note: we use the term "crypto" here - in this context it has everything to do with cryptography and absolutely nothing to do with cryptocurrencies.

Bitfield instructions

We had implemented the bitfield extensions from the original proposal quite a while ago, but they were untested and disabled. Now that we've spent time with the ratified extensions, a lot of the work has been dropping instructions that were removed from the spec. Most of the instructions ended up in the shift ALU (which is essentially a machine-generated bunch of muxes), where everything runs in 1 clock. The instructions that involve adders or arithmetic comparisons (shXadd and min/max) went into the ALU, and clmul, which takes 2 clocks, went into the multiplier (which itself mostly runs with 2-clock latencies).
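
For reference, here's a rough C model of the Zba shift-and-add instructions mentioned above - the helper names are ours, purely for illustration:

    #include <stdint.h>

    // shNadd rd, rs1, rs2 computes rs2 + (rs1 << N) - one instruction instead of a
    // separate shift and add, which is why address calculations collapse so nicely
    // (see the Dhrystone note in the summary below)
    static inline uint64_t sh1add(uint64_t rs1, uint64_t rs2) { return rs2 + (rs1 << 1); }
    static inline uint64_t sh2add(uint64_t rs1, uint64_t rs2) { return rs2 + (rs1 << 2); }
    static inline uint64_t sh3add(uint64_t rs1, uint64_t rs2) { return rs2 + (rs1 << 3); }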

All in all this came up really fast - we have now implemented all of the bitfield extensions (Zba, Zbb, Zbc, and Zbs) and they pass all the compliance tests.

Crypto instructions

The crypto K extensions include a subset of the bitfield instructions along with a bunch of more crypto-oriented instructions, in particular support for the NIST-mandated SHA256/512 and AES32/64, and the Chinese SM3 and SM4 algorithms.
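
As a concrete example of what these instructions buy you, the Zknh sha256sig0 instruction computes the SHA-256 σ0 function (FIPS 180-4) in a single instruction - roughly the following C, shown here just for illustration:

    #include <stdint.h>

    static inline uint32_t ror32(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

    // SHA-256 sigma0 - a rotate/rotate/shift/xor dance that sha256sig0 does in one go
    static inline uint32_t sha256_sigma0(uint32_t x)
    {
        return ror32(x, 7) ^ ror32(x, 18) ^ (x >> 3);
    }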

We have implemented all of the crypto extensions (Zbkb, Zbkc, Zbkx, Zknd, Zkne, Zknh, Zksed, and Zksh) and they pass compliance testing.

We can attest that we meet the "Data Independent Execution Latency Subset: Zkt", as all of VRoom!'s instructions (with the exception of integer and FP division, which is allowed) run in a fixed time irrespective of the values of their operands.

We have also implemented the entropy source register (Zkr) - we have not implemented a real entropy source in the verilog model, nor a whitener; those are decisions for a final implementation. Instead we've created a dummy plugin for testing, plus hardware to distribute entropy to multiple HARTs.
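
For the curious, here's a rough sketch of how software pulls entropy out of the seed CSR - this is our reading of the Zkr spec rather than anything VRoom!-specific, and the CSR name may need spelling out numerically for older toolchains. The CSR has to be accessed with an instruction that also writes it (hence csrrw), the top two bits return a BIST/WAIT/ES16/DEAD status, and 16 bits of entropy are only valid in the ES16 state:

    #include <stdint.h>

    // returns 16 bits of entropy, spinning while the source reports BIST or WAIT
    static uint32_t get_seed16(void)
    {
        for (;;) {
            uint32_t v;
            // read-with-write access to the 'seed' CSR; the written value is ignored
            __asm__ volatile ("csrrw %0, seed, zero" : "=r"(v));
            uint32_t opst = v >> 30;
            if (opst == 2)          // ES16: bits 15:0 hold fresh entropy
                return v & 0xffff;
            if (opst == 3)          // DEAD: the entropy source has failed
                return 0;           // (real code should report an error here)
            // BIST or WAIT: poll again
        }
    }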

We’ve completed support for all of the Zkn (NIST Algorithm Suite), Zks (ShangMi Algorithm Suite) and Zk (standard scalar cryptography) extensions, and they have passed compliance testing.

RV32

In the past we haven’t been treating 32-bit mode (running RV32 instructions in our RV64 world) as a full part of our instruction set - the design was all there but none of it was tested. With the advent of a whole bunch of new 32-bit-only crypto instructions it all needed to be tested, so we’ve brought up RV32 as a compliance testing target and can now report that VRoom! passes all of the RV32 compliance tests as well. Compliance testing now takes ~8 hours to run ….

Misc

We’re now working our way through the newly ratified extensions - along with the security bits for the crypto seed we also implemented the rest of the security register (the new PMP modes for the OPTEE people).

Summary

Retesting Dhrystone (recompiled with the new instructions) we see a minor (5%) increase in the Dhrystone numbers, from 6.42 to 6.74 DMips/MHz - this is almost entirely due to the compiler now using the new sh1add/sh2add/zextb instructions from the bitfield extensions.
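
To illustrate where that win comes from (this is our own example, not a dump from the Dhrystone build): indexing an array of 4-byte elements needs base + (index << 2), which used to take a shift plus an add and now compiles to a single sh2add:

    // C source
    int get(int *a, long i) { return a[i]; }

    // without Zba (RV64I):
    //     slli  a1, a1, 2
    //     add   a0, a0, a1
    //     lw    a0, 0(a0)
    //     ret
    //
    // with Zba the shift+add pair becomes one instruction:
    //     sh2add a0, a1, a0
    //     lw     a0, 0(a0)
    //     ret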

Next time: More misc extensions

VRoom! blog - Floating Point 4 - exceptions etc

Introduction

We finally finished the FP unit, it passes all the compliance tests and our extended FP data tests.

Exceptions

After finishing the sub-FP functional units (add/multiply/divide), the main missing piece was generating exceptions. On a RISC-V CPU, FP exceptions are not system traps - they are bit masks accumulated in the CSR unit - so generating them is not hard; getting them correct is harder, and requires lots and lots of testing.

Because we can retire/commit multiple instructions per clock (up to 8) we need to be able to store exception results in each instruction’s commit unit until they are ready to commit (alongside each instruction’s result data in the commit register file).

When the time comes to commit an instruction (write its commit register data back into the architectural register file) we logically OR together the 5 FP exception bits from each instruction being retired in the same clock (non-FP instructions simply report 0), along with a bit that reports that FP state has changed (for example, FP loads are non-FP instructions that change FP state). The results are accumulated into the CSR unit's exception register and the FP MSTATUS.FS (clean/dirty) state.
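
Here's a rough C model of that commit-time accumulation (the structure and field names are ours, purely for illustration) - the 5 bits are the usual IEEE flags (invalid, divide-by-zero, overflow, underflow, inexact) that RISC-V keeps in fflags:

    #include <stdint.h>

    struct retire_slot {
        uint8_t fflags;             // NV/DZ/OF/UF/NX bits; 0 for non-FP instructions
        uint8_t fp_state_change;    // set by anything that dirties FP state (e.g. FP loads)
    };

    // called once per clock with the (up to 8) instructions retiring together
    static void commit_fp_state(struct retire_slot *slot, int n,
                                uint8_t *csr_fflags, uint8_t *mstatus_fs)
    {
        uint8_t flags = 0, dirty = 0;
        for (int i = 0; i < n; i++) {
            flags |= slot[i].fflags;            // logical OR of the exception bits
            dirty |= slot[i].fp_state_change;
        }
        *csr_fflags |= flags;                   // sticky accumulation into fflags
        if (dirty)
            *mstatus_fs = 3;                    // FS = dirty
    }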

Speculative Execution

We’re an out-of-order/speculative machine - floating point instructions can be executed out of order. This means that, like FP register data, we can’t commit exception state until an instruction is no longer speculative; we also need to retire exception information in order with respect to CSR instructions that access and update that same state.

Committing accumulated exception data as described in the previous section, combined with the fact that the CPU only retires CSR access instructions alone (not alongside the instructions before and after them), means that at that point everything is synchronized.

One problem with speculative and OO execution is that an instruction that changes the global FP rounding mode may execute AFTER an instruction that depends on it. We solve this problem using an existing mechanism: there is a set of registers in the CSR unit that, if their contents change, trigger a pipe flush (essentially a null trap that flushes all the subsequent uncommitted instructions and then refetches the next instruction). We reuse this mechanism by adding the rounding-mode register to that list - but only if its contents actually change.

One of the hardest bugs to find when we were bringing up Linux on the design on the AWS FPGA instance was in the integer divider - a speculative divide would be shot down by a mispredicted branch but would keep on going, and 20 clocks later it would write random data into a commit register. For most functional units this is not an issue: a pipe break takes 3 clocks, so units that run in under 3 clocks simply write back into commit registers and the results are ignored. For the FP divide and sqrt we’ve made sure that this shoot-down works correctly.

Compliance Testing

Previously we’d been using an older, rather ad-hoc chunk of the standard compliance tests. In this latest checkin we’ve updated to the current tip of tree with the new mechanisms (thanks to the authors and the riscv-sail people) - our design currently passes all the IMCFD tests. We don’t yet have RISC-V 16-bit FP tests, so that part is not nearly as well verified.

It also passes a much larger group of stand-alone FP tests (240M vectors) - we’re intending to use the same FP submodules in the vector unit, so they’ve been built in a modular fashion.

Conclusion/Future Work

Time to work on something different for a while - next up will be the bitfield instructions (the B extension); this part of the design is already coded, but not at all tested.

We’ll revisit FP in a while, adding a second FP unit and doing some benchmarking.

Next time: Bitfield instructions

VRoom! blog - Floating Point 3 - divide/sqrt

Introduction

This blog entry is about floating point divide and square root and in particular the algorithms we use.

Divide

So let’s talk about how we do division, starting with integer divide. In the multiplier unit we use a pretty simple integer divide algorithm - essentially it’s the same algorithm we all learned in primary school. A 64-bit unsigned divide goes something like this:

    count = 64;
    remainder = {64'b0, dividend};
    quotient = 0;
    while (count > 0) {
        tmp = remainder[126:63] - divisor;      // trial subtraction of the partial remainder
        if (tmp >= 0) {                         // no borrow - this quotient bit is a 1
            remainder = {tmp, remainder[62:0], 1'b0};
            quotient  = (quotient<<1) | 1;
        } else {
            remainder = {remainder[126:0], 1'b0};
            quotient  = (quotient<<1) | 0;
        }
        count--;
    }

There’s a small counter, a 64-bit subtraction, and a couple of shifters - one 128 bits, one 64 bits (the “tmp >= 0” test is just the borrow out of the subtraction).

After 64 clocks the quotient contains the answer and the high 64 bits of remainder contain the remainder (in case you are actually doing a rem operation rather than a div).

64 clocks is a long time - we can speed things up by calculating 2 bits per clock which goes like this:

    count = 64/2;
    remainder = {64'b0, dividend};
    quotient = 0;
    while (count > 0) {
        tmp3 = remainder[125:62] - 3*divisor;
        tmp2 = remainder[125:62] - 2*divisor;
        tmp1 = remainder[125:62] - divisor;
        if (tmp3 >= 0) {
            remainder = {tmp3, remainder[61:0], 2'b00};
            quotient  = (quotient<<2) | 3;
        } else
        if (tmp2 >= 0) {
            remainder = {tmp2, remainder[61:0], 2'b00};
            quotient  = (quotient<<2) | 2;
        } else
        if (tmp1 >= 0) {
            remainder = {tmp1, remainder[61:0], 2'b00};
            quotient  = (quotient<<2) | 1;
        } else {
            remainder = {remainder[125:0], 2'b00};
            quotient  = (quotient<<2) | 0;
        }
        count--;
    }

That takes 3 subtraction units rather than 1 (plus an adder to make the 3*divisor).

You can do 3 or 4 or more bits per clock - but it takes 2**N-1 subtraction units and 2**(N-1)-1 adders, which gets really expensive really fast - somewhere in the 2 (3 sub/1 add), 3 (7 sub/3 add), or 4 (15 sub/7 add) bits/clock range is probably reasonable.

For floating point we don’t need a 64-bit remainder, though we do need 3 guard bits - we also have at most 51 bits of mantissa, so our 64-bit FP divider is actually smaller and a little faster than the equivalent integer one. On the other hand, for integer division we can speed up division of small numbers by skipping the initial 0s, while floating point mantissas always have an MSB of 1, which makes that optimization largely pointless.
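
To make that integer early-out concrete, here's a sketch in the style of the pseudocode above (our illustration of the idea, not the actual RTL): pre-shift the dividend past its leading zeros and only iterate over the bits that are left:

    skip      = count_leading_zeros(dividend);  // 0..63 for a non-zero dividend
    remainder = {64'b0, dividend} << skip;      // the skipped rounds would only shift in 0s
    quotient  = 0;
    count     = 64 - skip;                      // small dividends finish much sooner
    // ... then run the same subtract/shift loop as above for 'count' rounds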

The rest of the work in FP divide is dealing with the exponents (subtracting the divisor’s exponent from the dividend’s), handling denorms, NaNs and divide-by-0s, and the underflow/overflow logic, which is similar to what we use elsewhere.

Square Root

First I’d like to thank Michael Field and Bruce Hoult for this Twitter thread. Looking at the algorithm, it seemed obvious that it looked a lot like our divider …. so the question was “can we cheaply leverage the FP divide logic to do sqrt?” (the answer is “yes”).

First, here’s where we started (54 is from 51 bits of precision plus 3 guard bits), recoded from the Twitter thread:

    long long sq(long long v)
    {
        long long n = 0;                // the result (the square root)
        long long bit = 1ull<<(54/2);   // current result bit
        long long a = 0;
        long long b = 1ull<<54;
        while (bit != 0) {
            long long t = v - (a|b);    // trial subtraction
            a = a>>1;
            if (t >= 0) {
                v = t;
                a |= b;
                n |= bit;
            }
            b = b>>2;
            bit = bit >> 1;
        }
        return n;
    }

Let’s start by noticing that ‘v’ behaves a lot like the remainder. Let’s also change things to count the way the divider does (with a count variable) and get rid of the ‘bit’ shift register - ‘n’ is now the quotient and we’ll calculate it the same way we did for division. Finally, ‘a’ is similar to the divisor, so we’ll rename it:

    long long sq(long long remainder)
    {
        long long quotient = 0;
        long long divisor = 0;
        long long b = 1ull<<54;
        int count = 54/2;
        while (count >= 0) {
            long long tmp = remainder - (divisor|b);    // trial subtraction
            if (tmp >= 0) {
                remainder = tmp;
                divisor = (divisor>>1)|b;
                quotient = (quotient<<1)|1;
            } else {
                divisor = (divisor>>1);
                quotient = (quotient<<1)|0;
            }
            b = b>>2;
            count--;
        }
        return quotient;
    }

There’s one problem here - it calculates integer square roots, and FP mantissas always start with a 1 in the MSB. This algorithm generates half the number of bits you put in, and we want a full precision result. We could run the algorithm for twice as many clocks, but b would run out of bits - we’d need to make it 108 bits wide, and do the same for remainder and tmp. It would be a lot bigger and slower.

Luckily there’s a simple hack we can use to run the algorithm in place - simply shift everything (divisor, b, remainder) left one bit at the end of every round. Since they’re mostly already being shifted right this also simplifies a whole bunch of stuff.

    long long sq(long long remainder)
    {
        long long quotient = 0;
        long long divisor = 0;
        long long b = 1ull<<54;
        int count = 54;
        while (count >= 0) {
            long long tmp = remainder - (divisor|b);    // trial subtraction
            if (tmp >= 0) {
                remainder = tmp<<1;
                divisor = divisor|(b<<1);               // accumulate the result bit in place
                quotient = (quotient<<1)|1;
            } else {
                remainder = remainder<<1;
                divisor = divisor;                      // unchanged
                quotient = (quotient<<1)|0;
            }
            b = b>>1;
            count--;
        }
        return quotient;
    }

What’s interesting here is that the divisor doesn’t shift anymore, it just accumulates ‘b’s in place, and the remainder does all its math in place too, so we can deal with 54 bits (plus a couple of extras) rather than 108. ‘b’ is also interesting - now it’s exactly equal to 1<<count, and all its flops could be replaced by a demux.
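
To make that last point concrete, here's the same loop with 'b' replaced by (1<<count) - in hardware that term is just a demux driven by the round counter (a sketch, equivalent to the code above):

    long long sq(long long remainder)
    {
        long long quotient = 0;
        long long divisor = 0;
        int count = 54;
        while (count >= 0) {
            long long tmp = remainder - (divisor | (1ull<<count));
            if (tmp >= 0) {
                remainder = tmp<<1;
                divisor = divisor | (1ull<<(count+1));
                quotient = (quotient<<1)|1;
            } else {
                remainder = remainder<<1;
                quotient = (quotient<<1)|0;
            }
            count--;
        }
        return quotient;
    }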

This now looks very much like our division code - the only real differences are the second argument of the subtraction that creates tmp, and the fact that the divisor accumulates.

Reality is a little more complicated - because of the nature of square roots we have to deal with mantissas between 0.5 and 2 rather than the usual 1 up to 2 (that adds an extra bit of precision). This is because we calculate the result’s exponent by halving the incoming one, and if its LSB is non-zero we have to do a 1-bit shift of the incoming mantissa before we start the square root operation.
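
In rough C that exponent handling looks something like the sketch below (our illustration of the idea, not VRoom!'s actual logic) - since sqrt(m * 2^e) = sqrt(m) * 2^(e/2), an odd exponent gets folded into the mantissa first, which is where the 0.5-to-2 range comes from:

    // 'mant' is a fixed-point mantissa (1.0 <= mant < 2.0), 'exp' the unbiased exponent
    long long fp_sqrt(long long mant, int *exp)
    {
        int e = *exp;
        if (e & 1) {            // odd exponent: fold one factor of 2 into the mantissa
            mant = mant >> 1;   // mantissa now in [0.5, 1) - hence the extra bit of precision
            e = e + 1;
        }
        *exp = e / 2;           // the result's exponent is half the (now even) exponent
        return sq(mant);        // run the square root iteration shown above
    }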

We’ve currently only implemented the 1-bit/clock version of sqrt - we could do 2/3/4 bits/clock etc. using the same number of subtraction units as divide (in fact the same subtraction units with a mux on one input). Generating the terms for this is a little more complex, but not a lot - the main difference is that the value being subtracted changes on every clock, whereas for division the divisor multiples can be calculated once at the start.

Still To Do

Next up FP exceptions, and running the full suite of verification tests.

Next time: Final FPU stuff.

VRoom! blog - Floating Point 2

Introduction

Work has slowed, we’ve not been getting a lot of time to work on FP …. this is an update as we move from one part of the FP unit to the next.

New Stuff

Most of the work in the past couple of months has been adding fused multiply-add to the FP ALU - in the end this involved pulling the multiplier apart and doing a major rewrite; this is now done.

At the same time we added 16-bit floating point all the way through the FP unit (and decode/load/store/etc).

Finally, in order to test fused mul-add and FP16 we upgraded our stand-alone FP multiplier/adder test infrastructure and found far better sources of test vectors - and in the process found lots of wonderfully annoying corner cases, which is where most of the time has been spent over the past few months. Our FP test vectors are now many gigabytes in size - too big to fit in git, and easier to just regenerate. They all ran clean for the first time last night, which is why we’re checking everything in today - when we’re done we’ll run something 100 times larger.

Still To Do

We know how to handle exceptions - the way RISC-V manages them works very well with our commit model - but there’s still a bunch of work to do there.

We also have a small number of instructions still to implement - basically divide and sqrt, they will be next.

Finally we’re using our own testing for the moment, once we’re done with the above we’ll pull in the more public verification tests and run them before we declare we’re finished with FP.

Next time: more FPU stuff - divide/sqrt

VRoom! blog - Bugs

Bugs

A quick note: we’re still working on FP, but spent some time fixing bugs - thanks to Hirosh, who’s been playing with VRoom! I went back and ran some old code fragments and discovered they locked up; that led to rerunning the standard set of regressions we used before switching to running Linux on AWS - it turns out a lot of code to do with fence instructions and load conditional instructions was broken by the memory subsystem overhaul.

These are fixed now, though we probably need a bunch of even more subtle fence tests. Running all the regression tests on every change is now a lot easier.

Next time: more FPU stuff