[PATCH 1/2] arm64/entry: Fix involuntary preemption exception masking

Mark Rutland <mark.rutland@arm.com>
Fri Mar 20 07:57:37 PDT 2026


On Fri, Mar 20, 2026 at 03:11:20PM +0100, Thomas Gleixner wrote:
> On Fri, Mar 20 2026 at 14:04, Peter Zijlstra wrote:
> > On Fri, Mar 20, 2026 at 11:30:25AM +0000, Mark Rutland wrote:
> >> Thomas, Peter, I have a couple of things I'd like to check:
> >> 
> >> (1) The generic irq entry code will preempt from any exception (e.g. a
> >>     synchronous fault) where interrupts were unmasked in the original
> >>     context. Is that intentional/necessary, or was that just the way the
> >>     x86 code happened to be implemented?
> >> 
> >>     I assume that it'd be fine if arm64 only preempted from true
> >>     interrupts, but if that was intentional/necessary I can go rework
> >>     this.
> >
> > So NMI-from-kernel must not trigger resched IIRC. There is some code
> > that relies on this somewhere. And on x86 many of those synchronous
> > exceptions are marked as NMI, since they can happen with IRQs disabled
> > inside locks etc.
> >
> > But for the rest I don't think we care particularly. Notably page-fault
> > will already schedule itself when possible (faults leading to IO and
> > blocking).
> 
> Right. In general we allow preemption on any interrupt, trap and exception
> when:
> 
>   1) the interrupted context had interrupts enabled
> 
>   2) RCU was watching in the original context
> 
> This _is_ intentional as there is no reason to defer preemption in such
> a case. The RT people might get upset if you do so.

Ok. Thanks for confirming!

As above, I'll go see what I can do to address that. I suspect I'll need
something like irqentry_exit_to_kernel_mode_prepare(), analogous to
irqentry_exit_to_user_mode_prepare(), so that the preemption can happen
before the exception masking, but the rest of the exit logic can happen
afterwards.
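Roughly, I'm imagining a split along these lines (a rough, untested
sketch only: the prepare helper's internals are guesses, and I've
borrowed raw_irqentry_exit_cond_resched() from the generic code and
local_daif_mask() from arm64 to illustrate the ordering, not to claim
this is the final shape):

```c
/*
 * Sketch: run the preemption check while interrupts can still be
 * taken, then mask all exceptions before the remaining exit work.
 */
static __always_inline void irqentry_exit_to_kernel_mode_prepare(struct pt_regs *regs)
{
	/* Only preempt if the interrupted context had IRQs enabled. */
	if (IS_ENABLED(CONFIG_PREEMPTION) && !regs_irqs_disabled(regs))
		raw_irqentry_exit_cond_resched();
}

static void noinstr exit_to_kernel_mode(struct pt_regs *regs, irqentry_state_t state)
{
	irqentry_exit_to_kernel_mode_prepare(regs);
	local_daif_mask();		/* mask all exceptions ... */
	irqentry_exit(regs, state);	/* ... then the rest of the exit logic */
}
```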

I know that arm64 currently uses exit_to_user_mode_prepare_legacy(), and
I want to go clean that up too. :)

> NMI like exceptions, which are not allowed to schedule, should therefore
> never go through irqentry_irq_entry() and irqentry_irq_exit().
> 
> irqentry_nmi_enter() and irqentry_nmi_exit() exist for a technical
> reason and are not just of decorative nature. :)

Sorry, I should have been clearer: I was only asking about the cases
where irqentry_exit() would preempt. Understood and agreed that
NMI-class exceptions must go through irqentry_nmi_enter() and
irqentry_nmi_exit(), and that irqentry_nmi_exit() never preempts.
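i.e. on arm64 the NMI-like paths would keep the shape below (handler
name is made up here, purely to illustrate the pairing):

```c
static void noinstr el1_nmi_handler(struct pt_regs *regs)
{
	irqentry_state_t state;

	state = irqentry_nmi_enter(regs);	/* NMI-safe RCU/lockdep entry */
	do_handle_nmi(regs);			/* hypothetical handler body */
	irqentry_nmi_exit(regs, state);		/* never calls the scheduler */
}
```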

> >> (2) The generic irq entry code only preempts when RCU was watching in
> >>     the original context. IIUC that's just to avoid preempting from the
> >>     idle thread. Is it functionally necessary to avoid that, or is that
> >>     just an optimization?
> >> 
> >>     I'm asking because historically arm64 didn't check that, and I
> >>     haven't bothered checking here. I don't know whether we have a
> >>     latent functional bug.
> >
> > Like I told you on IRC, I *think* this is just an optimization, since if
> > you hit idle, the idle loop will take care of scheduling. But I can't
> > quite remember the details here, and wish we'd have written a sensible
> > comment at that spot.
> 
> There is one, but it's obviously not detailed enough.
> 
> > Other places where RCU isn't watching are userspace and KVM. The first
> > isn't relevant because this is return-to-kernel, and the second I'm not
> > sure about.
> >
> > Thomas, can you remember?
> 
> Yes. It's not an optimization. It's a correctness issue.
> 
> If the interrupted context is RCU idle then you have to carefully go
> back to that context. So that the context can tell RCU it is done with
> the idle state and RCU has to pay attention again. Otherwise all of this
> becomes imbalanced.
> 
> This is about context-level nesting:
> 
>         ...
> L1.A    ct_cpuidle_enter();
> 
>                         -> interrupt
>  L2.A                           ct_irq_enter();
>                                 ...             // Set NEED_RESCHED
>  L2.B                           ct_irq_exit();
>                                
>         ...
> L1.B    ct_cpuidle_exit();
> 
> Scheduling between #L2.B and #L1.B makes RCU rightfully upset. 

I suspect I'm missing something obvious here:

* Regardless of nesting, I see that scheduling between L2.B and L1.B is
  broken because RCU isn't watching.

* I'm not sure whether there's a problem with scheduling between L2.A
  and L2.B, which is what arm64 used to do, and what arm64 would do
  after this patch.

I *think* I just don't understand how context tracking actually works,
so I'll go dig into that and learn how the struct context_tracking
fields are manipulated by ct_cpuidle_{enter,exit}() and
ct_irq_{enter,exit}().

If there's something else I should go look at, please let me know!

> Think about it this way:
> 
> L1.A    preempt_disable();
> L2.A    local_bh_disable();
>         ..
> L2.B    local_bh_enable();
>         if (need_resched())
>            schedule();
> L1.B    preempt_enable();
> 
> RCU is not any different. For context-level nesting of any kind the only
> valid order is:
> 
>    L1.A -> L2.A -> L2.B -> L1.B
> 
> Pretty obvious if you actually think about it, no?

I guess I'll need to think a bit harder ;)

Thanks for all of this. Even if I'm confused right now, it's very
helpful!

Mark.


