[PATCH v1 2/2] iommu/arm-smmu-v3: Recover ATC invalidate timeouts

Baolu Lu baolu.lu at linux.intel.com
Thu Mar 5 19:22:52 PST 2026


On 3/5/26 23:39, Jason Gunthorpe wrote:
> On Wed, Mar 04, 2026 at 09:21:42PM -0800, Nicolin Chen wrote:
>> +	/*
>> +	 * ATC timeout indicates the device has stopped responding to coherence
>> +	 * protocol requests. The only safe recovery is a reset to flush stale
>> +	 * cached translations. Note that pci_reset_function() internally calls
>> +	 * pci_dev_reset_iommu_prepare/done() as well and ensures to block ATS
>> +	 * if PCI-level reset fails.
>> +	 */
>> +	if (!pci_reset_function(pdev)) {
>> +		/*
>> +		 * If reset succeeds, set BME back. Otherwise, fence the system
>> +		 * from a faulty device, in which case user will have to replug
>> +		 * the device to invoke pci_set_master().
>> +		 */
>> +		pci_dev_lock(pdev);
>> +		pci_set_master(pdev);
>> +		pci_dev_unlock(pdev);
>> +	}
> I thought we talked about this, the iommu driver cannot just blindly
> issue a reset like this, the reset has to come from the actual device
> driver through the AERish mechanism. Otherwise the driver RAS is going
> to explode.
> 
> The smmu driver should immediately block the STE (reject translated
> requests) to protect the system before resuming whatever command
> submission has encountered the error.
> 
> You could delegate the STE change to the interrupted command
> submission to avoid doing it from an ISR, that makes a lot of sense
> because the submission thread is already operating a cmdq so it could
> stick in a STE invalidation command, possibly even in place of the
> failed ATC command.
> 
> I think I'd break this up into smaller steps, just focus on this STE
> mechanism at start and have any future attach callback fix the STE.
> 
> Then we can talk about how to properly trigger the PCI RAS flow and so
> on.

I believe this issue is not unique to the arm-smmu-v3 driver. Device ATC
invalidation timeout is a generic challenge across all IOMMU
architectures that support PCI ATS. Would it be feasible to implement a
common 'fencing and recovery' mechanism in the IOMMU core so that all
IOMMU drivers could benefit?
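
To make the question concrete, here is a rough userspace model of what
such a core-level split might look like. It is only a sketch under
stated assumptions: every name in it (model_dev, model_iommu_ops,
model_iommu_report_atc_timeout, model_iommu_attach and the toy_*
callbacks) is hypothetical and does not exist in the kernel. The idea is
that the core owns the fenced/active state while each driver supplies
only the arch-specific way to reject translated requests, such as an STE
change on SMMUv3.

```c
/* Userspace model only; none of these names are real kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_fence_state { MODEL_ATS_ACTIVE, MODEL_ATS_FENCED };

struct model_dev {
	enum model_fence_state state;
	bool translated_allowed;	/* stands in for the hardware setting */
	unsigned int timeouts;		/* ATC invalidation timeouts seen */
};

/* Hypothetical per-driver hooks: how to reject/accept translated requests. */
struct model_iommu_ops {
	void (*block_translated)(struct model_dev *dev);
	void (*unblock_translated)(struct model_dev *dev);
};

/*
 * Core side: called by the thread whose command submission hit the ATC
 * invalidation timeout. It fences the device and never resets it.
 */
static void model_iommu_report_atc_timeout(const struct model_iommu_ops *ops,
					   struct model_dev *dev)
{
	assert(ops != NULL && dev != NULL);
	dev->timeouts++;
	if (dev->state == MODEL_ATS_FENCED)
		return;			/* already fenced, nothing more to do */
	ops->block_translated(dev);
	dev->state = MODEL_ATS_FENCED;
}

/*
 * Core side: a later attach lifts the fence, but only once the device
 * driver's own recovery flow reports that the device was reset.
 */
static bool model_iommu_attach(const struct model_iommu_ops *ops,
			       struct model_dev *dev, bool device_was_reset)
{
	assert(ops != NULL && dev != NULL);
	if (dev->state == MODEL_ATS_FENCED) {
		if (!device_was_reset)
			return false;	/* keep the system protected */
		ops->unblock_translated(dev);
		dev->state = MODEL_ATS_ACTIVE;
	}
	return true;
}

/* Toy driver callbacks standing in for an arch-specific implementation. */
static void toy_block(struct model_dev *dev)
{
	dev->translated_allowed = false;
}

static void toy_unblock(struct model_dev *dev)
{
	dev->translated_allowed = true;
}

static const struct model_iommu_ops toy_ops = {
	.block_translated = toy_block,
	.unblock_translated = toy_unblock,
};
```

In this model the timeout path only fences, and recovery is left to a
later attach that happens after the device driver has handled the error,
which I think matches the ordering Jason describes above.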

Thanks,
baolu


