gcc miscompiles csum_tcpudp_magic() on ARMv5

Willy Tarreau w at 1wt.eu
Thu Dec 12 12:11:08 EST 2013


On Thu, Dec 12, 2013 at 04:47:48PM +0000, Russell King - ARM Linux wrote:
> > Then changing the type of the function argument would probably be safer!
> 
> Actually, I think we can do a bit better with this code.  We really don't
> need much of this messing around here, we can combine some of these steps.
> 
> We have:
> 
> 16-bit protocol in host endian
> 16-bit length in host endian
> 
> and we need to combine them into a 32-bit checksum which is then
> subsequently folded down to 16-bits by adding the top and bottom halves.
> 
> Now, what we can do is this:
> 
> 1. Construct a combined 32-bit protocol and length:
> 
> 	unsigned lenprot = len << 16 | proto;
> 
> 2. Pass this into the assembly thusly:
> 
>                 __asm__(
>                 "adds   %0, %1, %2      @ csum_tcpudp_nofold    \n\t"
>                 "adcs   %0, %0, %3                              \n\t"
> #ifdef __ARMEB__
>                 "adcs   %0, %0, %4                              \n\t"
> #else
>                 "adcs   %0, %0, %4, ror #8                      \n\t"
> #endif
>                 "adc    %0, %0, #0"
>                 : "=&r"(sum)
>                 : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
>                 : "cc");
> 
> with no swabbing at this stage.  So where does the endian
> conversion happen?  See that ror #8 - that's a 32-bit rotate right
> by 8 bits.  As these are two 16-bit quantities, we end up with this:
> 
> original:
> 	31..24	23..16	15..8	7..0
> 	len_h	len_l	pro_h	pro_l
> 
> accumulated:
> 	31..24	23..16	15..8	7..0
> 	pro_l	len_h	len_l	pro_h
> 
> And now when we fold it down to 16-bit:
> 
> 			15..8	7..0
> 			len_l	pro_h
> 			pro_l	len_h

Amusingly, I used the same optimization yesterday when computing a
TCP pseudo-header checksum.

Another thing that can be done to speed up the final fold from 32
bits down to 16 is to swap the two halves of the sum, add them, and
keep only the upper half, which already contains the end-around
carry. At least on x86 I save a few cycles doing this:

              31:24  23:16  15:8  7:0
     sum32 =    D      C      B    A

     To fold this into 16 bits in one step, I swap the two 16-bit
halves and add them:

                   31:24  23:16  15:8  7:0
     sum32           D      C      B    A
  +  sum32swapped    B      A      D    C

The low 16 bits of the result hold B:A + D:C, and the carry out of
that addition lands in the high 16 bits, which end up holding
D:C + B:A + carry(B:A + D:C).

So just take the upper half of the result and you get the final
16-bit word at once.

In C it reads:

       fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16;

When the CPU has a rotate instruction, it's fast :-)

Cheers,
Willy
