[PATCH v4 11/24] iommu: Add iommu_report_device_broken() to quarantine a broken device

Nicolin Chen nicolinc at nvidia.com
Tue May 19 15:30:45 PDT 2026


On Tue, May 19, 2026 at 04:16:26PM -0300, Jason Gunthorpe wrote:
> On Tue, May 19, 2026 at 11:29:23AM -0700, Nicolin Chen wrote:
> > On Tue, May 19, 2026 at 09:07:37AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 18, 2026 at 08:38:54PM -0700, Nicolin Chen wrote:
> > Then, the core needs to block the device using a routine similar
> > to the reset prepare(). And that needs to hold group->mutex, so it
> > needs an async worker.
> > 
> > Do you see a much simpler way?
> 
> Put the work on the dev_iommu and forget about rcu.
> 
> But this is all probably better as some later series if at all. The
> driver can block the ATS and the expectation is something will FLR the
> device. The FLR will set the blocking and then restore the
> domain. None of this async work seems functionally necessary, though
> it would be a nice to have. Let's focus on the bare minimum here; it
> is already a difficult enough problem without tacking on these
> extras.

OK. So you are suggesting a quarantine at the driver-level only:

1. Driver detects ATC_INV timeout during an invalidation.
2. Driver retries the commands to identify the master.
3. Driver calls pci_disable_ats() and clears STE.EATS.
4. Driver marks domain->invs ATS entries as BROKEN.
   (optional since pci_disable_ats() is done?)
5. Driver sets master->ats_broken to fence concurrent attach:
   arm_smmu_write_ste() and arm_smmu_ats_supported().
6. Something external triggers an FLR (sysfs or AER).
7. FLR goes through pci_dev_reset_iommu_prepare()/done(). done()
   reverts steps 3 and 4 and calls the reset_device_done callback,
   which clears master->ats_broken (step 5).

Right?

Then, we'll have very limited work in the core for this series.
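
To make sure I read the flow correctly, here is a rough user-space mock of
the state transitions in steps 3-7. It is only a sketch, not driver code:
struct mock_master and every mock_*() function are hypothetical stand-ins,
and the booleans merely model the PCI ATS enable bit, STE.EATS, the BROKEN
marks on the domain->invs ATS entries, and master->ats_broken.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space mock only: this is NOT the real arm_smmu_master. */
struct mock_master {
	bool ats_enabled;	/* models the PCI ATS enable state (step 3) */
	bool ste_eats;		/* models STE.EATS being set (step 3) */
	bool invs_broken;	/* models BROKEN ATS entries in domain->invs (step 4) */
	bool ats_broken;	/* models master->ats_broken (step 5) */
};

/* Steps 3-5: quarantine once the ATC_INV timeout is pinned to this master. */
static void mock_quarantine(struct mock_master *m)
{
	m->ats_enabled = false;	/* step 3: pci_disable_ats() */
	m->ste_eats = false;	/* step 3: clear STE.EATS */
	m->invs_broken = true;	/* step 4: mark ATS entries BROKEN */
	m->ats_broken = true;	/* step 5: fence concurrent attach */
}

/* Step 5 fence: the attach path must not see ATS as usable while broken. */
static bool mock_ats_supported(const struct mock_master *m)
{
	return !m->ats_broken;
}

/* A concurrent attach only turns ATS on if the fence allows it. */
static void mock_attach(struct mock_master *m)
{
	if (!mock_ats_supported(m))
		return;
	m->ats_enabled = true;
	m->ste_eats = true;
}

/* Step 7: reset done() reverts steps 3 and 4, then clears the flag (5). */
static void mock_reset_done(struct mock_master *m)
{
	m->invs_broken = false;
	m->ats_enabled = true;
	m->ste_eats = true;
	m->ats_broken = false;
}

/* Walks the whole flow once; returns 0 if every expectation holds. */
int mock_flow_demo(void)
{
	struct mock_master m = { .ats_enabled = true, .ste_eats = true };

	mock_quarantine(&m);
	mock_attach(&m);		/* racing attach while quarantined */
	assert(!m.ats_enabled && !m.ste_eats);

	mock_reset_done(&m);		/* step 6 FLR ends in done() */
	assert(m.ats_enabled && m.ste_eats);
	assert(!m.invs_broken && !m.ats_broken);
	return 0;
}
```

The point of the mock is that an attach racing with the quarantine stays
ATS-less until the FLR's done() path runs, with nothing required from the
core beyond the existing reset prepare()/done() hooks.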

Nicolin


