gcc miscompiles csum_tcpudp_magic() on ARMv5
Willy Tarreau
w at 1wt.eu
Thu Dec 12 12:35:37 EST 2013
On Thu, Dec 12, 2013 at 05:20:49PM +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> > Another thing that can be done to improve the folding of the 16-bit
> > checksum is to swap the values to be added, sum them and only keep
> > the high half integer which already contains the carry. At least on
> > x86 I save some cycles doing this :
> >
> >              31:24  23:16  15:8  7:0
> >    sum32 =     D      C      B     A
> >
> > To fold this into 16-bit at a time, I just do this :
> >
> >                     31:24   23:16                15:8   7:0
> >      sum32            D       C                    B      A
> >    + sum32swapped     B       A                    D      C
> >    =                 D+B    C+A+carry(B+D/C+A)    B+D    C+A
> >
> > so just take the upper result and you get the final 16-bit word at
> > once.
> >
> > In C it does :
> >
> > fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
> >
> > When the CPU has a rotate instruction, it's fast :-)
>
> Indeed - and if your CPU can do the rotate and add at the same time,
> it's just a single instruction, and it ends up looking remarkably
> similar to this:
>
> static inline __sum16 csum_fold(__wsum sum)
> {
> 	__asm__(
> 	"add %0, %1, %1, ror #16 @ csum_fold"
> 	: "=r" (sum)
> 	: "r" (sum)
> 	: "cc");
> 	return (__force __sum16)(~(__force u32)sum >> 16);
> }
Marvelous :-)
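
For anyone who wants to check the rotate-and-add fold outside the kernel,
here is a minimal sketch in portable C. The names fold16_rotate() and
fold16_twostep() are made up for illustration and are not kernel functions.
It assumes uint32_t arithmetic that wraps modulo 2^32, and unlike
csum_fold() it does not complement the result.

```c
#include <stdint.h>

/* Hypothetical helper: rotate-and-add fold as described above.
 * Adding sum32 to its half-swapped copy puts hi+lo in both halves;
 * the carry out of the lower half lands in the upper half, so the
 * upper 16 bits are already the folded value. */
static inline uint16_t fold16_rotate(uint32_t sum32)
{
	uint32_t swapped = (sum32 >> 16) | (sum32 << 16);

	return (uint16_t)((swapped + sum32) >> 16);
}

/* Hypothetical helper: conventional two-step fold, for comparison.
 * The second step absorbs the carry produced by the first one. */
static inline uint16_t fold16_twostep(uint32_t sum32)
{
	sum32 = (sum32 & 0xffff) + (sum32 >> 16);
	sum32 = (sum32 & 0xffff) + (sum32 >> 16);
	return (uint16_t)sum32;
}
```

Both helpers should return the same value for any 32-bit input; the first
one simply lets a single add handle the end-around carry.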
Willy