[PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
Nicolin Chen
nicolinc at nvidia.com
Wed May 20 11:13:14 PDT 2026
On Wed, May 20, 2026 at 02:51:23PM -0300, Jason Gunthorpe wrote:
> On Wed, May 20, 2026 at 12:20:25AM -0700, Nicolin Chen wrote:
> > > > I see you suggest to treat the entire batch as ATS-broken. Just to
> > > > confirm: without per-SID retry, that might falsely block a healthy
> > > > device in the ATC batch, right? The driver now batches all ATC_INV
> > > > commands via arm_smmu_invs_end_batch().
> > >
> > > Yes, it is not good, but a giant complex series is not reviewable. So
> > > I'd start with trashing all the devices, then come with a narrowing.
> >
> > I can take that path for now and leave a FIXME.
> >
> > Another option is to not batch multiple devices, until we support
> > retry (which shouldn't be hard to add since we've already done the
> > coding)?
>
> That's an interesting idea, it undoes some of the meaningful
> optimization we have recently done though :\

I remember you didn't like it. That's why we had the retry(), which
I feel we should keep.

> > > We cannot eliminate parallel ATS invalidation. Two threads could be
> > > concurrently processing the invs list. So it has to handle it, the driver
> > > is going to have to tolerate a number of redundant error events.
> >
> > OK. That sounds like we still need a flag or locking so that at
> > least pci_disable_ats() would not be called again. I will see
> > what I can do.
>
> I think we can call pci_disable_ats() as many times as we want

That triggers WARN_ON(!dev->ats_enabled) in pci_disable_ats() :-(

> we
> mostly need the driver to merge multiple error notifications for the
> same event.
Yes.
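
To make the "flag" idea concrete, here is a rough userspace model of a
once-only guard. It is only a sketch under assumptions: every name in it
(model_dev, model_report_broken, model_disable_ats) is made up for
illustration and is not from this series, and model_disable_ats() merely
stands in for the one-time pci_disable_ats() call. In the kernel the same
shape would more likely use an atomic bitop such as test_and_set_bit().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a device; not kernel code. */
struct model_dev {
	atomic_bool quarantined;	/* set by the first error report */
	int ats_disable_calls;		/* how many times the teardown ran */
};

/* Stand-in for the one-time teardown (e.g. the pci_disable_ats() call). */
static void model_disable_ats(struct model_dev *dev)
{
	dev->ats_disable_calls++;
}

/*
 * Report a broken device. Only the caller that flips the flag from
 * false to true runs the teardown and gets 'true' back; every later
 * (redundant) report for the same device returns 'false' and does
 * nothing, so duplicate notifications are merged into one.
 */
static bool model_report_broken(struct model_dev *dev)
{
	/* atomic_exchange() returns the previous value of the flag. */
	if (atomic_exchange(&dev->quarantined, true))
		return false;

	model_disable_ats(dev);
	return true;
}
```

With this shape, two threads walking the invs list can both report the
same device, and only one of them reaches the teardown.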
> > > But I wasn't thinking we can rely on existing AER events here, yes
> > > probably there will be AERs associated with the device exploding so
> > > badly it cannot do ATS, but also maybe not..
> >
> > So, should I put the AER injection on hold for a future work? To
> > be honest, I am still not very clear how AER injection could help
> > here; or is it for a case where ATC times out while device isn't
> > aware of any AER fault?
>
> Right, if we don't get an AER fault then we should ensure the ATC is
> surfaced, but you have a reasonable point that it isn't so likely to
> get an ATC invalidation timeout without a corresponding related AER..
>
> Still, I'd feel better if it was definitive and we didn't rely on
> this. This further points that the driver has to merge multiple error
> notifications if it gets some AERs and a new "ATC ERROR" all for the
> same key event.
I feel some race here... Part of the complexity of this v4 is to deal
with concurrent device reset during the async report() between IOMMU
core and driver. Now, we add AER that could compete on the device side
as well...
I will see what I can do here, but will likely defer it to a followup
series, given the direction is to shrink the size of the series.
Thanks
Nicolin