gcc miscompiles csum_tcpudp_magic() on ARMv5

Thu Dec 12 12:35:37 EST 2013

On Thu, Dec 12, 2013 at 05:20:49PM +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> > Another thing that can be done to improve the folding of the 16-bit
> > checksum is to swap the values to be added, sum them and only keep
> > the high half integer which already contains the carry. At least on
> > x86 I save some cycles doing this :
> > 
> >               31:24  23:16  15:8  7:0
> >      sum32 =    D      C      B    A
> > 
> >      To fold this into 16-bit at a time, I just do this :
> > 
> >                    31:24     23:16          15:8  7:0
> >      sum32           D         C              B    A
> >   +  sum32swapped    B         A              D    C
> >   =                 A+B  C+A+carry(B+D/C+A)  B+D  C+A
> > 
> > so just take the upper result and you get the final 16-bit word at
> > once.
> > 
> > In C it does :
> > 
> >        fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
> > 
> > When the CPU has a rotate instruction, it's fast :-)
> 
> Indeed - and if your CPU can do the rotate and add at the same time,
> it's just a singe instruction, and it ends up looking remarkably
> similar to this:
> 
> static inline __sum16 csum_fold(__wsum sum)
> {
>         __asm__(
>         "add    %0, %1, %1, ror #16     @ csum_fold"
>         : "=r" (sum)
>         : "r" (sum)
>         : "cc");
>         return (__force __sum16)(~(__force u32)sum >> 16);
> }

Marvelous :-)

Willy