VRoom! blog - Floating Point 3 - divide/sqrt

Introduction

This blog entry is about floating point divide and square root and in particular the algorithms we use.

Divide

So let’s talk about how we do division - starting with integer divide - in the multiplier unit we use a pretty simple integer divide algorithm, essentially it’s the the same algorithm we all learned in primary school - a 64-bit unsigned divide goes something like:

    count = 64;
    remainder = {64'b0, dividend};
    quotient = 0;
    while (count > 0) {
	tmp = remainder[127:64] - divisor;
	if (tmp >= 0) {
	    remainder = {tmp, r_remainder[62:0], 1'b0};
	    quotient  = (quotient<<1) | 1;
	} else {
	    remainder = {r_remainder[126:0], 1'b0};
	    quotient  = (quotient<<1) | 0;
	}
        count--;
    }

There’s a small counter, a 64-bit subtraction and a couple of shifters, one 128 bits, one 64 bits (the “tmp >= 0” is just the borrow from the subtraction).

After 64 clock the quotient contains the answer and the high 64 bits of remainder contain the remainder (in case you are really doing a rem operation rather than a div).

64 clocks is a long time - we can speed things up by calculating 2 bits per clock which goes like this:

    count = 64/2;
    remainder = {64'b0, dividend};
    quotient = 0;
    while (count > 0) {
	tmp3 = remainder[127:64] - 3*divisor;
	tmp2 = remainder[127:64] - 2*divisor;
	tmp1 = remainder[127:64] - divisor;
	if (tmp3 >= 0) {
	    remainder = {tmp3, r_remainder[62:0], 2'b00};
	    quotient  = (quotient<<2) | 3;
	} else 
	if (tmp2 >= 0) {
	    remainder = {tmp2, r_remainder[62:0], 2'b00};
	    quotient  = (quotient<<2) | 2;
	} else 
	if (tmp1 >= 0) {
	    remainder = {tmp1, r_remainder[62:0], 2'b00};
	    quotient  = (quotient<<2) | 1;
	} else {
	    remainder = {r_remainder[126:0], 2'b00};
	    quotient  = (quotient<<2) | 0;
	}
        count--;
    }

That takes 3 subtraction units rather than 1 (plus an adder to make the 3*divisor).

You can do 3 or 4 or more bits per clock - but it takes 2**N-1 subtraction units and a bunch of adders 2**(N-1)-1 which gets really expensive really fast - somewhere in the 2 (3 sub/1 add), 3 (7 sub/3 add), or 4 (15 sub/7 add) bits/clock range is probably reasonable

For floating point we don’t need a 64-bit remainder, we do however need 3 guard bits - we also only have at most 51 bits of mantissa so our 64-bit FP divider is actually smaller and a little faster than the equivalent integer one. On the other hand for integer division we can speed up division of small numbers by skipping initial 0s while floating point numbers all have a MSB of 1 which makes that optimization largely pointless.

The rest of the work in FP divide is dealing with the exponents (subtracting the divisor from the dividend) and handling denorms and nans/divide by 0s, as well as the underflow/overflow logic which is similar to what we use elsewhere.

Square Root

First I’d like to thank Michael Field and Bruce Hoult for this Twitter thread. Looking at the algorithm it seemed obvious that this looked a lot like our divider …. so the question is “can we cheaply leverage the FP divide logic to do sqrt”? (the answer is “yes”)

First here’s where we started (54 is from 51 bits of precision plus 3 guard bits), recoded from the twitter thread:

    long long sq(long long v)
    {
        long long n = 0;
        long long bit = 1ul<<(54/2);
        long long a = 0;
        long long b = 1ul<<54;
        while (bit != 0) {
            long long t = v - (a|b); 
            a = a>>1;
            if (t >= 0) {
                v = t;
                a |= b;
                n |= bit;
            }
            b = b>>2;
            bit = bit >> 1;
        }
        return n;
    }

Let’s start by noticing that ‘v’ behaves a lot like the remainder, let’s also change things to count like the divider does (with a count variable) and get rid of the ‘bit’ shift register - ‘n’ is now the quotient and we’ll start calculating it the same way we did for division, finally ‘a’ is similar to the divisor:

    long long sq(long long remainder)
    {
        long long quotient = 0;
        long long a = 0;
        long long b = 1ul<<54;
        int count = 54/2;
        while (count >= 0) {
            long long tmp = remainder - (a|b); 
            if (tmp >= 0) {
                remainder = tmp;
                divisor = (divisor>>1)|b;
                quotient = (quotient<<1)|1;
            } else {
                divisor = (divisor>>1);
                quotient = (quotient<<1)|0;
            }
            b = b>>2;
            count--;
        }
        return quotient;
    }

There’s one problem here - it calculates integer square roots - and FP mantissas always start with a 1 in the MSB - this algorithm generates half the number of bits you put in, we want to generate a full precision result - we could run this algorithm for twice as many clocks - but b would run out of bits, we’d need to make it 108 bits wide as we would also need to do the same for remainder and tmp. It would be a lot bigger and slower.

Luckily there’s a simple hack we can use to run the algorithm in place - simply shift everything (a, b, remainder) left one at the end of every round. Since they’re mostly already being shifted right this also simplifies a whole bunch of stuff.

    long long sq(long long remainder)
    {
        long long quotient = 0;
        
        long long divisor = 0;
        long long b = 1ul<<54;
        int count = 54;
        while (count >= 0) {
            long long tmp = remainder - (a|b); 
            if (tmp >= 0) {
                remainder = tmp<<1;
                divisor = divisor|(b<<1);
                quotient = (quotient<<1)|1;
            } else {
                remainder = remainder<<1;
                divisor = divisor;
                quotient = (quotient<<1)|0;
            }
            b = b>>1;
            count--;
        }
        return quotient;
    }

What’s interesting here is that the divisor doesn’t shift anymore, it just accumulates ‘b’s in place, the remainder does all it’s math in place too, so we can deal with 54 bits (plus a couple of extras) rather than 108. ‘b’ is also interesting - now it’s exactly equal to 1<<count and all it’s flops could be replaced by a demux.

This now looks very much like our division code, the only real differences are the second argument to the subtraction that creates tmp and that the divisor accumulates.

Reality is a little more complicated - mostly because of the nature of square roots we have to deal with mantissas between 0.5 and 2 rather than the usual cases of 1 up to 2 (that adds an extra bit of precision). This is because when we deal with the exponent we calculate it by halving it but if the LSB is non-zero then that means we have to do a 1 bit shift of the incoming mantissa before we the start the square root operation.

We’ve currently only implemented the 1-bit/clock for sqrt - we could do 2/3/4 bits/clock/etc using the same number of subtraction units as divide (in fact the same subtraction units with a mux on one input). Generating the terms for this is a little more complex, but not a lot, the main difference is that the divisor changes on every clock, for division you can calculate the divisors once at the start.

Still To Do

Next up FP exceptions, and running the full suite of verification tests.

Next time: Final FPU stuff.