[PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device
Jason Gunthorpe
jgg at nvidia.com
Tue May 19 17:30:23 PDT 2026
On Tue, May 19, 2026 at 05:21:36PM -0700, Nicolin Chen wrote:
> On Tue, May 19, 2026 at 08:02:04PM -0300, Jason Gunthorpe wrote:
> > > OK. So you are suggesting a quarantine at the driver-level only:
> > >
> > > 1. Driver detects ATC_INV timeout during an invalidation.
> > > 2. Driver retries the commands to identify the master.
> >
> > I might argue to push even this out to a followup series given it is
> > complex and I suspect it becomes much simpler after the batch
> > removal...
>
> I see you suggest to treat the entire batch as ATS-broken. Just to
> confirm: without per-SID retry, that might falsely block a healthy
> device in the ATC batch, right? The driver now batches all ATC_INV
> commands via arm_smmu_invs_end_batch().
Yes, it is not good, but a giant complex series is not reviewable. So
I'd start with trashing all the devices in the batch, then follow up
with a narrowing.
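To illustrate the "trash everything first" stage, here is a rough
userspace model, not the actual arm-smmu-v3 code. struct master_model
and quarantine_batch() are made-up names for illustration only; the
assumption is simply that a timed-out batch does not tell us which SID
was at fault, so every master in it loses ATS:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-device driver state (not the real struct). */
struct master_model {
	unsigned int sid;
	bool ats_enabled;
};

/*
 * First stage: an ATC_INV timeout on a batch does not identify the
 * offending SID, so disable ATS for every master that was in the batch.
 * Returns how many masters were newly disabled by this call.
 */
static size_t quarantine_batch(struct master_model **batch, size_t n)
{
	size_t newly_disabled = 0;

	assert(batch != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (batch[i]->ats_enabled) {
			batch[i]->ats_enabled = false;
			newly_disabled++;
		}
	}
	return newly_disabled;
}
```

A later narrowing would replace the loop with a per-SID retry that only
disables the master that keeps timing out.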
> > > 5. Driver sets master->ats_broken to fence concurrent attach:
> > > arm_smmu_write_ste() and arm_smmu_ats_supported().
> >
> > Not sure this is needed, if we race some attach then the attach will
> > re-set EATS, get another timeout and clear EATS. Doesn't seem worth
> > trying to optimize for.
>
> I didn't see that coming. master->ats_enabled && state->ats_enabled
> in the commit() for a concurrent attachment would issue an ATC that
> may timeout again to re-start the step 1.
>
> And since arm_smmu_atc_inv_master() doesn't use domain->invs, it is
> not affected by INV_TYPE_ATS_BROKEN. So, ATC_INV can continue to be
> issued in this case.
>
> Ah, I feel that we are walking in the mine field where every single
> step could be a kaboom. But your insight is clearly a safe pathway.
We cannot eliminate parallel ATS invalidation. Two threads could be
concurrently processing the invs list. So it has to handle it: the
driver is going to have to tolerate a number of redundant error events.
It's OK if the unlikely case of parallel attach also generates redundant
error events.
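As a sketch of what "tolerate redundant error events" could look like,
here is a small userspace model using C11 atomics. struct ats_model,
report_atc_timeout() and racing_attach() are hypothetical names, and
eats_set only stands in for the EATS state; this is not how the driver
is structured today. The point is that the timeout path is idempotent:
any number of reporters can call it, and a racing attach that re-sets
EATS just leads to one more timeout that clears it again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; field names are illustrative only. */
struct ats_model {
	atomic_bool eats_set;     /* models EATS being enabled for the device */
	atomic_uint error_events; /* how many timeout reports have arrived */
};

/*
 * Called by any thread that sees an ATC_INV timeout. Safe to call
 * redundantly: every call is counted, and EATS ends up cleared.
 * Returns true only for the caller that actually cleared EATS.
 */
static bool report_atc_timeout(struct ats_model *m)
{
	assert(m != NULL);
	atomic_fetch_add(&m->error_events, 1);
	return atomic_exchange(&m->eats_set, false);
}

/* Models a parallel attach that re-sets EATS after it was cleared. */
static void racing_attach(struct ats_model *m)
{
	assert(m != NULL);
	atomic_store(&m->eats_set, true);
}
```

With this shape, two threads reporting the same timeout produce two
events but only one of them sees the true return, and a later
racing_attach() followed by another timeout converges back to EATS
cleared.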
> > We do need to push a pci error event (didn't see that in this series)
> > so the driver can catch it and start the FLR process. I suppose that
> > will still need to bounce through a workqueue, and once you have that
> > it can also set the blocked domain prior to calling out to the driver.
>
> In the specific case that I am trying to tackle with this series, I
> do see AER error prints from the device already but there is no FLR
> process.
It depends on the driver, mlx5 has a FLR RAS flow for instance.
A driver with a device that can blow up ATS should implement the FLR
flow if it wants automatic RAS. It requires driver co-ordination.
But I wasn't thinking we can rely on existing AER events here. Yes,
probably there will be AERs associated with the device exploding so
badly it cannot do ATS, but also maybe not.
This is also a problem if we shoot healthy devices as the first stage:
there will not be an AER from the healthy ones.
So I guess we need to decide which is better to tackle, the dedicated
event or the single invalidation sequence.
Jason