gcc miscompiles csum_tcpudp_magic() on ARMv5
Russell King - ARM Linux
linux at arm.linux.org.uk
Thu Dec 12 12:20:49 EST 2013
On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> Another thing that can be done to improve the folding of the 16-bit
> checksum is to swap the values to be added, sum them and only keep
> the high half integer which already contains the carry. At least on
> x86 I save some cycles doing this :
>
>           31:24  23:16  15:8  7:0
>   sum32 =   D      C     B     A
>
> To fold this into 16-bit at a time, I just do this :
>
>                    31:24  23:16                 15:8  7:0
>     sum32            D      C                    B     A
>   + sum32swapped     B      A                    D     C
>   =                 A+B   C+A+carry(B+D/C+A)    B+D   C+A
>
> so just take the upper result and you get the final 16-bit word at
> once.
>
> In C it does :
>
> fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
>
> When the CPU has a rotate instruction, it's fast :-)
Indeed - and if your CPU can do the rotate and add at the same time,
it's just a single instruction, and it ends up looking remarkably
similar to this:
static inline __sum16 csum_fold(__wsum sum)
{
	__asm__(
	"add	%0, %1, %1, ror #16	@ csum_fold"
	: "=r" (sum)
	: "r" (sum)
	: "cc");
	return (__force __sum16)(~(__force u32)sum >> 16);
}
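
For anyone who wants to check the rotate-and-add fold outside the kernel,
here is a minimal user-space sketch in portable C. It is only an
illustration: the names fold16_rotate() and fold16_classic() are made up
for this example and are not kernel functions, and neither applies the
final one's complement that csum_fold() does. The first follows the quoted
C expression, the second is the usual two-step end-around-carry fold, and
the two should agree for every 32-bit input.

```c
#include <stdint.h>

/*
 * Hypothetical helper: fold a 32-bit partial sum to 16 bits by adding the
 * value to itself rotated by 16 bits and keeping the upper half. The upper
 * half then holds hi + lo plus the carry out of the lower half.
 */
static uint16_t fold16_rotate(uint32_t sum32)
{
	uint32_t rot = (sum32 >> 16) | (sum32 << 16);	/* rotate by 16 */

	return (uint16_t)((rot + sum32) >> 16);
}

/*
 * Hypothetical helper: conventional fold, adding the two halves and then
 * adding any carry back in (end-around carry).
 */
static uint16_t fold16_classic(uint32_t sum32)
{
	sum32 = (sum32 & 0xffff) + (sum32 >> 16);	/* may carry into bit 16 */
	sum32 = (sum32 & 0xffff) + (sum32 >> 16);	/* fold that carry back */

	return (uint16_t)sum32;
}
```

As a worked case, 0x0001ffff has halves 0x0001 and 0xffff, which sum to
0x10000; the carry wraps around and both helpers return 0x0001.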