[PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

Mon Sep 18 10:33:52 PDT 2017

On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon at arm.com> wrote:
> > Hi Rob,
> >
> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon at arm.com> wrote:
> >> > Enabling stalling faults can result in hardware deadlock on poorly
> >> > designed systems, particularly those with a PCI root complex upstream of
> >> > the SMMU.
> >> >
> >> > Although it's not really Linux's job to save hardware integrators from
> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> >> > clients) from hosing the system for everybody else, even if they might
> >> > already be required to have elevated privileges.
> >> >
> >> > Given that the fault handling code currently executes entirely in IRQ
> >> > context, there is nothing that can sensibly be done to recover from
> >> > things like page faults anyway, so let's rip this code out for now and
> >> > avoid the potential for deadlock.
> >>
> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
> >> opt-in quirk (ie. disabled by default unless something in DT tells you
> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> >> hw was having problems due to this feature.)
> >>
> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
> >> halting the GPU so the driver's fault handler can dump some GPU state
> >> on faults is enormously helpful for debugging and tracking down where
> >> in the gpu cmdstream the fault was triggered.  In addition, we will
> >> eventually want the ability to update pagetables from fault handler
> >> and resuming the faulting transition.
> >
> > I'm not against reintroducing this, but it would certainly need to be
> > opt-in, as you suggest. If we want to try to use stall faults to enable
> > demand paging for DMA, then that means running core mm code to resolve
> > the fault and we can't do that in irq context. Consequently, we have to
> > hand this off to a thread, which means the hardware must allow the SS
> > bit to remain set without immediately reasserting the interrupt line.
> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
> > need to restrict ourselves to one device (i.e. faulting stream) per
> > domain (CB).
> >
> > I think that means we want both specific compatible strings to identify
> > the SS bit behaviour, but also a way to opt-in for the stall model as a
> > separate property to indicate that the SoC integration can support this
> > without e.g. deadlocking.
> >
> 
> How do you feel about a short-term step to keep things in irq context,
> but enable stall mode?  I'm debugging an issue, where it appears that
> the gpu cannot handle a non-stalled fault (triggers some fairly
> bizarre follow-on problems).  So I think even if we don't add a fault
> handler callback, we still want to set the CFCFG bit and
> RESUME_TERMINATE in the fault handler on this hardware.

Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
I'd rather we bite the bullet and implement things properly on top of
a workqueue so that you can build on the same basic infrastructure as
the SVM work that Jean-Philippe is looking at, particular as you also
have a use-case for running fault code in non-atomic context.

Will