[PATCH v12 10/10] iommu/arm-smmu-v3: Add stall support for platform devices

Fri Feb 26 22:40:01 EST 2021

On 2021/2/27 0:29, Jean-Philippe Brucker wrote:
> Hi Zhou,
> 
> On Fri, Feb 26, 2021 at 05:43:27PM +0800, Zhou Wang wrote:
>> On 2021/2/1 19:14, Jean-Philippe Brucker wrote:
>>> Hi Zhou,
>>>
>>> On Mon, Feb 01, 2021 at 09:18:42AM +0800, Zhou Wang wrote:
>>>>> @@ -1033,8 +1076,7 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_domain *smmu_domain, int ssid,
>>>>>  			FIELD_PREP(CTXDESC_CD_0_ASID, cd->asid) |
>>>>>  			CTXDESC_CD_0_V;
>>>>>  
>>>>> -		/* STALL_MODEL==0b10 && CD.S==0 is ILLEGAL */
>>>>> -		if (smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
>>>>> +		if (smmu_domain->stall_enabled)
>>>>
>>>> Could we add ssid checking here? like: if (smmu_domain->stall_enabled && ssid).
>>>> The reason is if not CD.S will also be set when ssid is 0, which is not needed.
>>>
>>> Some drivers may want to get stall events on SSID 0:
>>> https://lore.kernel.org/kvm/20210125090402.1429-1-lushenming@huawei.com/#t
>>>
>>> Are you seeing an issue with stall events on ssid 0?  Normally there
>>> shouldn't be any fault on this context, but if they happen and no handler
>>> is registered, the SMMU driver will just abort them and report them like a
>>> non-stall event.
>>
>> Hi Jean,
>>
>> I notice that there is problem. In my case, I expect that CD0 is for kernel
>> and other CDs are for user space. Normally there shouldn't be any fault in
>> kernel, however, we have RAS case which is for some reason there may has
>> invalid address access from hardware device.
>>
>> So at least there are two different address access failures: 1. hardware RAS problem;
>> 2. software fault fail(e.g. kill process when doing DMA). Handlings for these
>> two are different: for 1, we should reset hardware device; for 2, stop related
>> DMA is enough.
> 
> Right, and in case 2 there should be no report printed since it can be
> triggered by user, while you probably want to be loud in case 1.
> 
>> Currently if SMMU returns the same signal(by SMMU resume abort), master device
>> driver can not tell these two kinds of cases.
> 
> This part I don't understand. So the SMMU sends a RESUME(abort) command,
> and then the master reports the DMA error to the device driver, which
> cannot differentiate 1 from 2?  (I guess there is no SSID in this report?)
> But how does disabling stall change this?  The invalid DMA access will
> still be aborted by the SMMU.

This is about the hardware design. In D06 board, an invalid DMA access from
accelerator devices will be aborted, and an hardware error signal will be
returned to accelerator devices, which reports it as a RAS error irq.
while for the stall case, error signal triggered by SMMU resume abort is
also reported as same RAS error irq. This is problem in D60 board.

In next generation of hardware, a new irq will be added to report SMMU resume
abort information, it works with related registers in accelerator devices to
get related hardware queue, which need to be stopped.

So if CD0.S is 1, invalid DMA access in kernel will be reported into above
new added irq, which has not enough information to tell RAS errors(there are 10+
hardware RAS errors) from SMMU resume abort.

> 
> Hypothetically, would it work if all stall events that could not be
> handled went to the device driver?  Those reports would contain the SSID
> (or lack thereof), so you could reset the device in case 1 and ignore case
> 2. Though resetting the device in the middle of a stalled transaction

As above, it is hard to tell RAS errors and SMMU resume abort in SMMU resume abort
now :(

> probably comes with its own set of problems.
> 
>> From the basic concept, if a CD is used for kernel, its S bit should not be set.
>> How about we add iommu domain check here too, if DMA domain we do not set S bit for
>> CD0, if unmanaged domain we set S bit for all CDs?
> 
> I think disabling stall for CD0 of a DMA domain makes sense in general,
> even though I don't really understand how that fixes your issue. But

As above, if disabling stall for CD0, an invalid DMA access will be handled
by RAS error irq.

> someone might come up with a good use-case for receiving stall events on

If A DMA access in kernel fails, I think there should be a RAS issue :)
So better to disable CD0 stall for DMA domain.

Best,
Zhou

> DMA mappings, so I'm wondering whether the alternative solution where we
> report unhandled stall events to the device driver would also work for
> you.
> 
> Thanks,
> Jean
> 
> .
>