gcc miscompiles csum_tcpudp_magic() on ARMv5
Willy Tarreau
w at 1wt.eu
Thu Dec 12 12:11:08 EST 2013
On Thu, Dec 12, 2013 at 04:47:48PM +0000, Russell King - ARM Linux wrote:
> > Then changing the type of the function argument would probably be safer!
>
> Actually, I think we can do a bit better with this code. We really don't
> need much of this messing around here, we can combine some of these steps.
>
> We have:
>
> 16-bit protocol in host endian
> 16-bit length in host endian
>
> and we need to combine them into a 32-bit checksum which is then
> subsequently folded down to 16-bits by adding the top and bottom halves.
>
> Now, what we can do is this:
>
> 1. Construct a combined 32-bit protocol and length:
>
> unsigned lenproto = len | proto << 16;
>
> 2. Pass this into the assembly thusly:
>
> __asm__(
> "adds %0, %1, %2 @ csum_tcpudp_nofold \n\t"
> "adcs %0, %0, %3 \n\t"
> #ifdef __ARMEB__
> "adcs %0, %0, %4 \n\t"
> #else
> "adcs %0, %0, %4, ror #8 \n\t"
> #endif
> "adc %0, %0, #0"
> : "=&r"(sum)
> : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenproto)
> : "cc");
>
> with no swabbing at this stage. So where do we get the endian
> conversion? See that ror #8 - that's a 32-bit rotate by 8 bits.
> As these are two 16-bit quantities, we end up with this:
>
> original:
> 31..24 23..16 15..8 7..0
> pro_h pro_l len_h len_l
>
> accumulated:
> 31..24 23..16 15..8 7..0
> len_l pro_h pro_l len_h
>
> And now when we fold it down to 16-bit:
>
> 15..8 7..0
> pro_l len_h
> len_l pro_h
Amusingly, I used the same optimization yesterday when computing a
TCP pseudo-header checksum.
Another thing that can be done to improve the folding of the 32-bit
sum into the 16-bit checksum is to add the sum to a copy of itself
with the halves swapped, and keep only the upper half, which already
contains the carry. At least on x86 this saves me a few cycles:
                 31:24 23:16 15:8  7:0
  sum32     =      D     C     B     A

To fold this into 16 bits at once, I just do this:

                 31:24 23:16 15:8  7:0
  sum32            D     C     B     A
+ sum32 swapped    B     A     D     C
= result          D+B   C+A   B+D   A+C

where each column also receives the carry from the column to its
right, so the upper half already includes the carry out of the
lower half.
So just take the upper half of the result and you get the final
16-bit word at once. In C it looks like this:
fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
When the CPU has a rotate instruction, it's fast :-)
Cheers,
Willy