[PATCH v5 3/5] iommu/arm-smmu: Fix spurious interrupts with stall-on-fault

Tue May 6 07:53:25 PDT 2025

On Tue, May 06, 2025 at 10:08:05AM -0400, Connor Abbott wrote:
> On Tue, May 6, 2025 at 8:24 AM Will Deacon <will at kernel.org> wrote:
> > On Wed, Mar 19, 2025 at 10:44:02AM -0400, Connor Abbott wrote:
> > > diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > index c7b5d7c093e71050d29a834c8d33125e96b04d81..9927f3431a2eab913750e6079edc6393d1938c98 100644
> > > --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> > > @@ -470,13 +470,52 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> > >       if (!(cfi->fsr & ARM_SMMU_CB_FSR_FAULT))
> > >               return IRQ_NONE;
> > >
> > > +     /*
> > > +      * On some implementations FSR.SS asserts a context fault
> > > +      * interrupt. We do not want this behavior, because resolving the
> > > +      * original context fault typically requires operations that cannot be
> > > +      * performed in IRQ context but leaving the stall unacknowledged will
> > > +      * immediately lead to another spurious interrupt as FSR.SS is still
> > > +      * set. Work around this by disabling interrupts for this context bank.
> > > +      * It's expected that interrupts are re-enabled after resuming the
> > > +      * translation.
> >
> > s/translation/transaction/
> >
> > > +      *
> > > +      * We have to do this before report_iommu_fault() so that we don't
> > > +      * leave interrupts disabled in case the downstream user decides the
> > > +      * fault can be resolved inside its fault handler.
> > > +      *
> > > +      * There is a possible race if there are multiple context banks sharing
> > > +      * the same interrupt and both signal an interrupt in between writing
> > > +      * RESUME and SCTLR. We could disable interrupts here before we
> > > +      * re-enable them in the resume handler, leaving interrupts enabled.
> > > +      * Lock the write to serialize it with the resume handler.
> > > +      */
> >
> > I'm struggling to understand this last part. If the resume handler runs
> > synchronously from report_iommu_fault(), then there's no need for
> > locking because we're in interrupt context. If the resume handler can
> > run asynchronously from report_iommu_fault(), then the locking doesn't
> > help because the code below could clear CFIE right after the resume
> > handler has set it.
> 
> The problem is indeed when the resume handler runs asynchronously.
> Clearing CFIE right after the resume handler has set it is normal and
> expected. The issue is the opposite, i.e. something like:
> 
> - Resume handler writes RESUME and stalls for some reason
> - The interrupt handler runs through and clears CFIE while it's already cleared
> - Resume handler sets CFIE, assuming that the handler hasn't run yet
> but it actually has
> 
> This wouldn't happen with only one context bank, because we wouldn't
> get an interrupt until the resume handler sets CFIE, but with multiple
> context banks and a shared interrupt line we could get a "spurious"
> interrupt due to a fault in an earlier context bank that becomes not
> spurious if the resume handler writes RESUME before the context fault
> handler for this bank reads FSR above.

Ah, gotcha. Thanks for the explanation.

If we moved the RESUME+CFIE into the interrupt handler after the call
to report_iommu_fault(), would it be possible to run the handler as a
threaded irq (see 'context_fault_needs_threaded_irq') and handle the
callback synchronously? In that case, I think we could avoid taking the
lock if we wrote CFIE _before_ RESUME.

Will