[PATCH v3 3/3] arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y

Linus Torvalds torvalds at linux-foundation.org
Wed Feb 18 12:18:40 PST 2026


On Wed, 18 Feb 2026 at 11:34, Boqun Feng <boqun at kernel.org> wrote:
>
> > I  just checked a few test-cases, and I don't think anything has changed.
> >
> > All the trivial things where __atomic_load_n(__ATOMIC_RELAXED) *could*
> > do better than just our old code using "cast pointer to volatile"
> > resulted in no better code generation.
> >
>
> By any chance, could you share the test cases? It'll be really helpful
> to represent what semantics we really want to compiler/language folks.

Lol, it was just a one-liner test where I compared a volatile pointer
with a regular pointer and a __atomic_load_n(), I didn't even save it.
It was checking two different bits in the same word.

It was something trivial like "(*ptr & 1) && (*ptr & 2)".

With a "volatile", that obviously has to give two loads, because
that's the required behavior of "the load is visible as an IO".

But what I want from __atomic_load_n() is that it does all the
optimizations that *hardware* is allowed to do on a load.  Loads are
not "visible as an IO", but they have ordering requirements, and they
cannot be torn.

That means that basic CSE is obviously allowed, so loading the pointer
value twice results in the compiler just doing a single load.

But it has a subtler meaning: while CSE is allowed, a load has to be
done at least once. That means that you can combine them, but you
can't hoist them outside loops and make them go away entirely inside
the loop.

So my basic test-case literally just checks for that trivial case.
And the atomics fail that completely. Which makes them entirely
pointless. They don't generate any better code than "volatile" does.

And if they don't generate better code, then they are worthless
complexity and garbage.

And yes, that "check two different bits" actually comes from real use,
ie it comes from my annoyance with "test_bit()" kind of semantics.

[ Time passes ]

Bah. Based on the above, I just recreated a trivial test-case of what
I think *should* work. Here:

    #ifdef BAD
    #define ACCESS(p) (*(p))
    #elif defined(VOLATILE)
    #define ACCESS(p) (*(volatile unsigned long *)(p))
    #else
    #define ACCESS(p) __atomic_load_n(p,__ATOMIC_RELAXED)
    #endif

    static inline bool test_bit(int n, unsigned long *p)
    {
        return (1ul << n) & ACCESS(p);
    }

    int myfn(unsigned long *a)
    {
        while (!test_bit(3, a))
                /* Busy loop doing nothing */ ;
        return test_bit(1, a) && test_bit(2,a);
    }

and then just do something like this:

    gcc -O2 -c -DVOLATILE t.c && objdump --disassemble --no-show-raw-insn t.o

to see what gcc generates.

The -DVOLATILE case works:

0000000000000000 <myfn>:
   0: mov    (%rdi),%rax
   3: test   $0x8,%al
   5: je     0 <myfn>
   7: mov    (%rdi),%rdx
   a: xor    %eax,%eax
   c: and    $0x2,%edx
   f: je     1b <myfn+0x1b>
  11: mov    (%rdi),%rax
  14: shr    $0x2,%rax
  18: and    $0x1,%eax
  1b: ret

Look, it loops until bit 3 is set, then it ands bits 1 and 2, but it
takes up to three loads to do all this.

With just plain loads (the -DBAD case), you get this:

0000000000000000 <myfn>:
   0: mov    (%rdi),%rax
   3: test   $0x8,%al
   5: je     13 <myfn+0x13>
   7: not    %rax
   a: test   $0x6,%al
   c: sete   %al
   f: movzbl %al,%eax
  12: ret
  13: jmp    13 <myfn+0x13>

which just results in an endless loop if bit three wasn't set, because
the plain load was hoisted outside the loop. Plus there was no
guarantee that you couldn't have TOCTOU races (which this example
doesn't show). So that BAD case really is unacceptable, because it's
actively buggy.

And the __atomic_load_n() case generates pretty much the same thing as volatile:

0000000000000000 <myfn>:
   0: mov    (%rdi),%rax
   3: test   $0x8,%al
   5: je     0 <myfn>
   7: mov    (%rdi),%rdx
   a: xor    %eax,%eax
   c: and    $0x2,%edx
   f: jne    18 <myfn+0x18>
  11: ret
  12: nopw   0x0(%rax,%rax,1)
  18: mov    (%rdi),%rax
  1b: shr    $0x2,%rax
  1f: and    $0x1,%eax
  22: ret

except it's actually randomly worse for some unfathomable reason.

So tell me: why should anybody ever use that "modern" thing that
generates worse code?

What I *WANT* is something like this:

0000000000000030 <myfn>:
  30: mov    (%rdi),%rax
  33: test   $0x8,%al
  35: je     30 <myfn2>
  37: not    %rax
  3a: test   $0x6,%al
  3c: sete   %al
  3f: movzbl %al,%eax
  42: ret

and I can make the compiler do that by just doing the CSE myself and
not using a test_bit() helper.
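For concreteness, the manual-CSE version looks something like this (a
hypothetical sketch; read_once() is just a local name for the volatile
cast, not an existing helper):

```c
#include <stdbool.h>

/* Hypothetical helper: force one real load via the volatile cast. */
static inline unsigned long read_once(const unsigned long *p)
{
        return *(const volatile unsigned long *)p;
}

int myfn(unsigned long *a)
{
        /* The loop still does one load per iteration. */
        while (!(read_once(a) & (1ul << 3)))
                /* Busy loop doing nothing */ ;

        /* Do the CSE by hand: one load, then test both bits
         * on the plain (non-volatile) local value. */
        unsigned long v = read_once(a);
        return (v & (1ul << 1)) && (v & (1ul << 2));
}
```

Because the second and third bit tests use the ordinary local "v", the
compiler is free to combine them into a single and/compare, which is
exactly the code generation asked for above.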

But until the atomics do this *trivial* case right, there is
absolutely no way that we'll use them in the kernel.

And no, clang does no better than gcc. It's possible that the C
standards have specified the semantics of atomics badly and compilers
are forced to generate crap - I wouldn't be surprised.

But more likely it's that nobody cares, and compilers generate crap
because the atomics are complicated enough that it's not worth the
pain to do anything better.

But as long as nobody cares about those code generation issues,
"volatile" is simply superior.

It is standard and portable, doesn't rely on modern compilers, and it
generates equivalent or better code.

I feel like I'm not asking for much. I'm literally asking for an
__atomic_load_n() that is better than volatile.

Right now it is WORSE.

              Linus
