gcc miscompiles csum_tcpudp_magic() on ARMv5
Maxime Bizon
mbizon at freebox.fr
Thu Dec 12 10:26:23 EST 2013
On Thu, 2013-12-12 at 15:00 +0000, Russell King - ARM Linux wrote:
> This is different - the compiler has recognised in both cases that the
> addition od 42 to foo is useless as the result is not used, and
> therefore has optimised the addition away. In the second case, it has
> realised that the narrowing cast used to then add 42 to is also not
> used, and it has also optimised that away.
I was just trying to show that in the inline case the generated code was
not the same (apparently there was a misunderstanding on what I was
arguing about).
Now if you want a more interesting/on topic example:
#include <stdint.h>
static __attribute__((noinline)) unsigned int set_bit_2(unsigned int foo)
{
foo |= 0x4;
return foo;
}
#define local_swab16(x) ((uint16_t)( \
(((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \
(((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))
int func(uint16_t a)
{
a = set_bit_2(local_swab16(a));
if ((a & 0xffff) == 0x1234)
return 1;
return 0;
}
00000000 <set_bit_2>:
0: e3800004 orr r0, r0, #4
4: e12fff1e bx lr
00000008 <func>:
8: e92d4008 push {r3, lr}
c: e1a03420 lsr r3, r0, #8
10: e1830400 orr r0, r3, r0, lsl #8
14: e1a00800 lsl r0, r0, #16
18: e1a00820 lsr r0, r0, #16
1c: ebfffff7 bl 0 <set_bit_2>
20: e3a03c12 mov r3, #4608 ; 0x1200
24: e1a00800 lsl r0, r0, #16
28: e2833034 add r3, r3, #52 ; 0x34
2c: e1530820 cmp r3, r0, lsr #16
30: 03a00001 moveq r0, #1
34: 13a00000 movne r0, #0
38: e8bd8008 pop {r3, pc}
you can see here the full swab16 code, with the narrowing before calling
set_bit_2() per calling convention.
Now remove the noninline:
00000000 <func>:
0: e3a03c12 mov r3, #4608 ; 0x1200
4: e1a02420 lsr r2, r0, #8
8: e1820400 orr r0, r2, r0, lsl #8
c: e3802004 orr r2, r0, #4
10: e1a02802 lsl r2, r2, #16
14: e2833034 add r3, r3, #52 ; 0x34
18: e1530822 cmp r3, r2, lsr #16
1c: 03a00001 moveq r0, #1
20: 13a00000 movne r0, #0
24: e12fff1e bx lr
and you see that the double shift is removed because gcc knows it is not
needed by the remaining code, which uses only the lower 16 bits.
This is IMO exactly what happen in the csum case, the inline assembly
rule (the one we are not aware of) must be that you cannot use the 16
higher bits of the register in the assembly code, and so gcc takes the
liberty not to zero extend the register.
--
Maxime
More information about the linux-arm-kernel
mailing list