[PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints

Wed Sep 13 12:31:20 PDT 2017

On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon at arm.com> wrote:
> Hi Rob,
>
> On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon at arm.com> wrote:
>> > Enabling stalling faults can result in hardware deadlock on poorly
>> > designed systems, particularly those with a PCI root complex upstream of
>> > the SMMU.
>> >
>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>
> I'm not against reintroducing this, but it would certainly need to be
> opt-in, as you suggest. If we want to try to use stall faults to enable
> demand paging for DMA, then that means running core mm code to resolve
> the fault and we can't do that in irq context. Consequently, we have to
> hand this off to a thread, which means the hardware must allow the SS
> bit to remain set without immediately reasserting the interrupt line.
> Furthermore, we can't handle multiple faults on a context-bank, so we'd
> need to restrict ourselves to one device (i.e. faulting stream) per
> domain (CB).
>
> I think that means we want both specific compatible strings to identify
> the SS bit behaviour, but also a way to opt-in for the stall model as a
> separate property to indicate that the SoC integration can support this
> without e.g. deadlocking.
>

How do you feel about a short-term step to keep things in irq context,
but enable stall mode?  I'm debugging an issue, where it appears that
the gpu cannot handle a non-stalled fault (triggers some fairly
bizarre follow-on problems).  So I think even if we don't add a fault
handler callback, we still want to set the CFCFG bit and
RESUME_TERMINATE in the fault handler on this hardware.

BR,
-R