[PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

Jordan Crouse jcrouse at codeaurora.org
Tue Dec 6 16:00:25 PST 2016


On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon at arm.com> wrote:
> > Enabling stalling faults can result in hardware deadlock on poorly
> > designed systems, particularly those with a PCI root complex upstream of
> > the SMMU.
> >
> > Although it's not really Linux's job to save hardware integrators from
> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> > clients) from hosing the system for everybody else, even if they might
> > already be required to have elevated privileges.
> >
> > Given that the fault handling code currently executes entirely in IRQ
> > context, there is nothing that can sensibly be done to recover from
> > things like page faults anyway, so let's rip this code out for now and
> > avoid the potential for deadlock.
> 
> Hi Will,
> 
> so, I'd like to re-introduce this feature, I *guess* as some sort of
> opt-in quirk (ie. disabled by default unless something in DT tells you
> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> hw was having problems due to this feature.)
> 
> On newer snapdragon devices we are using arm-smmu for the GPU, and
> halting the GPU so the driver's fault handler can dump some GPU state
> on faults is enormously helpful for debugging and tracking down where
> in the gpu cmdstream the fault was triggered.  In addition, we will
> eventually want the ability to update pagetables from fault handler
> and resuming the faulting transition.
> 
> Some additional comments below..
> 
> > Cc: <stable at vger.kernel.org>
> > Reported-by: Matt Evans <matt.evans at arm.com>
> > Signed-off-by: Will Deacon <will.deacon at arm.com>
> > ---
> >  drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
> >  1 file changed, 7 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > index 4f49fe29f202..2db74ebc3240 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
> >
> >  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >  {
> > -       int flags, ret;
> > -       u32 fsr, fsynr, resume;
> > +       u32 fsr, fsynr;
> >         unsigned long iova;
> >         struct iommu_domain *domain = dev;
> >         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >         if (!(fsr & FSR_FAULT))
> >                 return IRQ_NONE;
> >
> > -       if (fsr & FSR_IGN)
> > -               dev_err_ratelimited(smmu->dev,
> > -                                   "Unexpected context fault (fsr 0x%x)\n",
> > -                                   fsr);
> > -
> >         fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
> > -       flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
> > -
> >         iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
> > -       if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
> > -               ret = IRQ_HANDLED;
> > -               resume = RESUME_RETRY;
> > -       } else {
> > -               dev_err_ratelimited(smmu->dev,
> > -                   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
> > -                   iova, fsynr, cfg->cbndx);
> 
> I would like to decouple this dev_err_ratelimit() print from the
> RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
> indicate by return from my fault handler whether to resume or
> terminate.  But I already have my own ratelimted prints and would
> prefer not to spam dmesg twice.
> 
> I'm thinking about report_iommu_fault() returning:
> 
>   0 => RESUME_RETRY
>   -EFAULT => RESUME_TERMINATE but don't print
>   anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print
> 
> thoughts?
> 
> > -               ret = IRQ_NONE;
> > -               resume = RESUME_TERMINATE;
> > -       }
> > -
> > -       /* Clear the faulting FSR */
> > -       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
> >
> > -       /* Retry or terminate any stalled transactions */
> > -       if (fsr & FSR_SS)
> > -               writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
> 
> This might be a bug in qcom's implementation of the smmu spec, but
> seems like we don't have SS bit set, yet we still require RESUME reg
> to be written, otherwise gpu is perma-wedged.  Maybe topic for a
> separate quirk?  I'm not sure if writing RESUME reg on other hw when
> SS bit is not set is likely to cause problems?  If not I suppose we
> could just unconditionally write it.
> 
> Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
> in between debugging freedreno I'll try to put together some patches.

>From what I can tell we need SCTLR_CFCFG to make the stall happen otherwise
the operation just gets terminated immediately and *then* we get notification
but by then the system keeps going.

I think SCTLR_HUPCF helps control that behavior (i.e. we don't go off faulting
through eternity) but I don't know how it works.

>From my very unlearned understanding I think we do want to set CFCFG and then
stall and let the interrupt handler decide to retry/terminate.

Jordan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project



More information about the linux-arm-kernel mailing list