[PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device

Jason Gunthorpe jgg at nvidia.com
Wed May 20 10:51:23 PDT 2026


On Wed, May 20, 2026 at 12:20:25AM -0700, Nicolin Chen wrote:
> > > I see you suggest to treat the entire batch as ATS-broken. Just to
> > > confirm: without per-SID retry, that might falsely block a healthy
> > > device in the ATC batch, right? The driver now batches all ATC_INV
> > > commands via arm_smmu_invs_end_batch().
> > 
> > Yes, it is not good, but a giant complex series is not reviewable. So
> > I'd start with trashing all the devices, then come with a narrowing.
> 
> I can take that path for now and leave a FIXME.
> 
> Another option is to not batch multiple devices, until we support
> retry (which shouldn't be hard to add since we've already done the
> coding)?

That's an interesting idea, it undoes some of the meaningful
optimization we have recently done though :\

> > We cannot eliminate parallel ATS invalidation. Two threads could be
> > concurrently processing the invs list. So it has to handle it, the driver
> > is going to have to tolerate a number of redundant error events.
> 
> OK. That sounds like we still need a flag or locking so that at
> least pci_disable_ats() would not be called again. I will see
> what I can do.

I think we can call pci_disable_ats() as many times as we want, we
mostly need the driver to merge multiple error notifications for the
same event.

> > It depends on the driver, mlx5 has a FLR RAS flow for instance.
> 
> I assume a driver like that would trigger FLR flow on its own?

Yes
 
> > A driver with a device that can blow up ATS should implement the FLR
> > flow if it wants automatic RAS. It requires driver co-ordination.
> 
> Or FLR via sysfs, which I have been doing...

Yes
 
> > But I wasn't thinking we can rely on existing AER events here, yes
> > probably there will be AERs associated with the device exploding so
> > badly it cannot do ATS, but also maybe not..
> 
> So, should I put the AER injection on hold for a future work? To
> be honest, I am still not very clear how AER injection could help
> here; or is it for a case where ATC times out while device isn't
> aware of any AER fault?

Right, if we don't get an AER fault then we should ensure the ATC error
is surfaced, but you have a reasonable point that it isn't so likely to
get an ATC invalidation timeout without a corresponding AER.

Still, I'd feel better if it was definitive and we didn't rely on
this. This further suggests that the driver has to merge multiple error
notifications if it gets some AERs and a new "ATC ERROR" all for the
same key event.
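
To illustrate what "merge" could look like, here is a minimal userspace
sketch, not kernel code. Every name in it (dev_err_state, report_error,
recovery_done, the ERR_SRC_* bits) is hypothetical and only meant to show
the idea: each notification ORs its source into a pending mask, and only
the caller that sees the mask go from empty to non-empty starts recovery.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical error sources, for illustration only. */
enum err_source {
	ERR_SRC_AER = 1 << 0,	/* an AER event was reported */
	ERR_SRC_ATC = 1 << 1,	/* an ATC invalidation error was reported */
};

/* Hypothetical per-device state kept by the driver. */
struct dev_err_state {
	atomic_uint pending;	/* sources seen since recovery last finished */
};

static void dev_err_init(struct dev_err_state *st)
{
	atomic_init(&st->pending, 0);
}

/*
 * Record one notification. Returns true only for the first notification
 * of an event, i.e. the caller that should kick off recovery. Later and
 * concurrent notifications are merged into the same event.
 */
static bool report_error(struct dev_err_state *st, unsigned int src)
{
	unsigned int old = atomic_fetch_or(&st->pending, src);

	return old == 0;
}

/*
 * Called when recovery has finished. Returns the merged set of sources
 * and re-arms the state for the next event.
 */
static unsigned int recovery_done(struct dev_err_state *st)
{
	return atomic_exchange(&st->pending, 0);
}
```

With this shape, an ATC error followed by a few AERs for the same failure
results in a single recovery, and recovery_done() reports both sources. A
real driver would also have to decide what to do with a notification that
races with the end of recovery, which this sketch does not address.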

Jason


