[PATCH] mm/arm: pgtable: remove young bit check for pte_valid_user

Fri Apr 10 04:01:41 PDT 2026

On Apr 09, Russell King (Oracle) wrote:
> 
> On Thu, Apr 09, 2026 at 06:17:36PM +0300, Brian Ruley wrote:
> > However, in the case I describe, if VA_B is mapped immediately to pfn_q
> > after it been has unmapped and freed for VA_A, then it's quite possible
> > that the page is still indexed in the cache.
> 
> True.
> 
> > The hypothesis is that if
> > VA_A and VA_B land in the same I-cache set and VA_A old cache entry
> > still exists (tagged with pfn_q), then the CPU can fetch stale
> > instructions because the tag will match. That's one reason why we need
> > to invalidate the cache, but that will be skipped in the path:
> >
> >     migrate_pages
> >      migrate_pages_batch
> >       migrate_folio_move
> >        remove_migration_ptes
> >         remove_migration_pte
> >          set_pte_at
> >           set_ptes
> >            __sync_icache_dcache  (skipped if !young)
> >             set_pte_ext
> 
> In this case, if the old PTE was marked !young, then the new PTE will
> have:
>         pte = pte_mkold(pte);
> 
> on it, which marks it !young. As you say, __sync_icache_dcache() will
> be skipped. While a PTE entry will be set for the kernel, the code in
> set_pte_ext() will *not* establish a hardware PTE entry. For the
> 2-level pte code:
> 
>         tst     r1, #L_PTE_YOUNG        @ <- results in Z being set
>         tstne   r1, #L_PTE_VALID        @ <- not executed
>         eorne   r1, r1, #L_PTE_NONE     @ <- not executed
>         tstne   r1, #L_PTE_NONE         @ <- not executed
>         moveq   r3, #0                  @ <- hardware PTE value
>  ARM(   str     r3, [r0, #2048]! )      @ <- writes hardware PTE
> 
> So, for a !young PTE, the hardware PTE entry is written as zero,
> which means accesses should fault, which will then cause the PTE to
> be marked young.
> 
> For the 3-level case, the L_PTE_YOUNG bit corresponds with the AF bit
> in the PTE, and there aren't split Linux / hardware PTE entries. AF
> being clear should result in a page fault being generated for the
> kernel to handle making the PTE young.
> 
> In both of these cases, set_ptes() will need to be called with the
> updated PTE which will now be marked young, and that will result in
> the I-cache being flushed.

Hi Russell,

Thank you for the clarification, this is very educational for me.
I understand your scepticism, and I can't explain what's going on based
on what you replied. However, I do honestly believe there is a problem
here. I'll share the exact testing details and the instrumentation
we added that convinced us to reach out at the end. One idea we also
had was that could cache aliasing be happening here.

To clarify any potential misunderstanding, we've observed the
following:

- Sporadic SIGILL and SIGSEGV under memory pressure
- Scales with core count, i.e., quad core more likely to reproduce
  than dual core. We haven't observed an issue on single core.
- Coredumps show valid instructions at the faulting PC.
  The CPU executed something different from what's in memory.
  This pointed us to stale I-cache.
- Instrumentation indicates a correlation.
  A per-CPU ring buffer tracking exec page migrations was dumped on
  SIGILL. The faulting PC matched a recently migrated pages.
- We started seeing this after upgrade 6.1->6.12->6.18. We bisected
  two commits which had an impact, but we weren't convinced that
  either was the root cause: 5dfab109d5193e6c224d96cabf90e9cc2c039884
  and 6faea3422e3b4e8de44a55aa3e6e843320da66d2.
- Failed processes include systemd, tar, bash, ...
- Debug options, e.g., page poisoning, seems to hide the bug

> So you're saying that stress-ng doesn't reproduce this bug but
triggers the OOM-killer... confused.

Apologies for the confusion. I meant that with `stress-ng' we created
the memory pressure and OOM might have played a role in exposing the
"bug" as we (at the time) believed that anything that would trigger
memory free/reclaims and page migration was the key. One note I'll add
is that in our test we invoked stress-ng for 2 minutes (--timeout 2m)
and after each we would reboot the device. We had observed that reboots
seemed to have a discernible effect on the occurence in earlier testing
so we kept that in. I'm beginning to doubt if it had an effect now,
and unfortunately it's all anecdotal.

One more thing, even if you don't accept the patch, is this patch
harmful in any way or is it just sub-optimal?

I'll send the instrumentation patch as a follow-up, migh be there's a
flaw in it.

Best regards,
Brian

###TESTING###

1. stress-ng --vm 4 --vm-bytes 2G --vm-method zero-one --verify \
             --timeout 2m
2. reboot
3. repeat

Cleaned up logs of instrumentation:
```
kernel: [  104.610248] SIGILL at b6e6f1c0 pid 896, recent exec migrations:
kernel: [  104.610313]   cpu0: addr=b6e6f000 old_pfn=467d3 new_pfn=577fe pid=34 flushed=0
[...]
kernel: [  456.066661] SIGILL at b6d99f40 pid 455, recent exec migrations:
[...]
kernel: [  456.066963]   cpu0: addr=b6d99000 old_pfn=44270 new_pfn=7c9ea pid=34 flushed=0
[...] 
```