[patch V3] lib: GCD: add binary GCD algorithm
George Spelvin
linux at horizon.com
Thu Apr 28 12:15:51 PDT 2016
> How does a CPU lack an efficient ffs/ctz anyway? There are all sorts
> of ways to implement it without a native insn, some of which are
> almost or just as fast as the native insn on cpus that have the
> latter. On anything with a fast multiply, the de Bruijn sequence
> approach is near-optimal, and otherwise one of the binary-search type
> approaches (possibly branchless) can be used. If the compiler doesn't
> generate an appropriate one for __builtin_ctz, that's arguably a
> compiler bug.
What's wanted here is something faster than any of those.
Yes, there's a simple constant-time branch-free implementation:
#include <stdint.h>

/* Classic branch-free population count. */
unsigned inline __attribute__((const))
hweight32(uint32_t x)
{
        x -= (x >> 1) & 0x55555555;             /* 2-bit sums */
        x = ((x >> 2) & 0x33333333) + (x & 0x33333333); /* 4-bit sums */
        x += x >> 4;
        x &= 0x0f0f0f0f;                        /* 8-bit sums */
        x += x >> 8;
        x += x >> 16;
        return x & 63;                          /* total is at most 32 */
}
unsigned inline __attribute__((const))
__ffs32(uint32_t x)
{
        /* ~x & (x - 1) is a mask of exactly the bits below the lowest
         * set bit, so its weight is the trailing zero count. */
        return hweight32(~x & (x - 1));
}
but if you work it through, that's about 19 instructions, and a few more
on platforms without 32-bit immediates. The shift itself makes an even 20,
and the long chain of sequential dependencies (I count a 17-op chain
including the shift) puts a floor under the execution time.
The de Bruijn hack reduces the length but adds a memory access for
the table lookup. (http://supertech.csail.mit.edu/papers/debruijn.pdf)
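For reference, a de Bruijn ctz looks something like this (the constant
and table are the well-known ones from the bit-twiddling literature;
ctz32 is just an illustrative name, and x must be nonzero):

#include <stdint.h>

/* Each 5-bit window of the de Bruijn constant 0x077CB531 is unique,
 * so isolating the lowest set bit, multiplying, and keeping the top
 * five bits indexes a table that recovers the bit position. */
static const unsigned char ctz_table[32] = {
         0,  1, 28,  2, 29, 14, 24,  3,
        30, 22, 20, 15, 25, 17,  4,  8,
        31, 27, 13, 23, 21, 19, 16,  7,
        26, 12, 18,  6, 11,  5, 10,  9
};

static inline unsigned ctz32(uint32_t x)        /* x != 0 */
{
        return ctz_table[((x & -x) * 0x077CB531u) >> 27];
}

That's four or five ALU ops plus the load, so it's short, but the load
is exactly the memory access at issue.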
In the GCD code, the number to normalize is essentially random, so the
normalization loop shifts an average of one bit: a first bit half the
time, a second bit a quarter of the time, and so on, for an expected
1/2 + 1/4 + 1/8 + ... = 1 bit in total.
(In the FAST_FFS case, the posted code omits the one guaranteed shift at
the end of the loop, since constant-time normalization makes that
optimization pointless.)
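Spelled out, the two normalization strategies being compared look like
this (just a sketch, using __ffs32 from above):

        /* Loop form: data-dependent; ~1 iteration on average, with
         * one unpredictable branch per bit shifted out. */
        while (!(b & 1))
                b >>= 1;

        /* Constant-time form, as in the FAST_FFS case. */
        b >>= __ffs32(b);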
So "fast __ffs" basically means faster than *one* iteration of
"while (!(x & 1)) x >>= 1;".
In this case "fast" means cheaper than *one* unpredictable branch, which
is a very small handful of instructions.
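For context, here's a binary GCD in the same general shape as the posted
patch (an illustrative sketch only, not the patch itself; gcd32 is a
made-up name), showing where the normalization sits:

#include <stdint.h>

uint32_t gcd32(uint32_t a, uint32_t b)  /* assumes a, b nonzero */
{
        unsigned shift = __ffs32(a | b);        /* shared factors of 2 */

        a >>= __ffs32(a);                       /* make a odd */
        do {
                b >>= __ffs32(b);               /* the normalization step;
                                                 * the slow-ffs variant uses
                                                 * the shift loop instead */
                if (a > b) {                    /* keep a <= b */
                        uint32_t t = a;
                        a = b;
                        b = t;
                }
                b -= a;                         /* odd - odd = even */
        } while (b);

        return a << shift;
}

Every trip around that loop runs the normalization once, which is why
beating one average iteration of the shift loop is the whole game.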