[PATCH v4 18/24] iommu/arm-smmu-v3: Introduce master->ats_broken flag

Nicolin Chen nicolinc at nvidia.com
Mon Jun 1 22:44:36 PDT 2026


On Mon, Jun 01, 2026 at 09:15:47PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 01, 2026 at 01:41:26PM -0700, Nicolin Chen wrote:
> > On Mon, Jun 01, 2026 at 09:32:31AM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 29, 2026 at 06:27:40PM -0700, Nicolin Chen wrote:
> > > > On Tue, May 19, 2026 at 09:06:58AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, May 18, 2026 at 08:39:01PM -0700, Nicolin Chen wrote:
> > > > So I've tried INV_TYPE_ATS_BROKEN: during per-domain invalidation,
> > > > each batch is built from domain->invs so it can carry the "invs";
> > > > if the batch times out, we can immediately mutate its ATS entries.
> > > > 
> > > > But I realized a limitation. E.g., if a device attaches to two SVA
> > > > domains on two SSIDs. An invalidation timing out on one of the SVA
> > > > domains could mark INV_TYPE_ATS_BROKEN in its own invs, but not in
> > > > the other SVA domain's invs?
> > > 
> > > You'd have to mark all the S1's sharing the STE.
> > 
> > That would be a bit convoluted as we would have to go through all
> > other domains' invs arrays.
> 
> Ok, that is certainly an annoying problem.
> 
> I don't have a better idea than storing the master unfortunately
> 
> But I think the locking for that is going to be tricky, I'm not sure it does
> actually fully work..

Yes, there can be a race that sets STE.EATS back while per-master
flag is set, which would skip the ATC_INV in commit(), so no more
ATC_INV timeout that resets STE.EATS=0. To close it, we can force 
STE.EATS=0 at the end of commit() when state->ats_enabled and the
per-master flag are both set, which is only possible in a race.

However, after more thought, I found that there's a big trade-off
to pay in the invalidation path, if we go on this path. The invs
array only has SID, so it needs to convert it to master using the
rb_tree for every entry (holding streams_lock).

I asked AI to list and compare the three solutions that we have:

(1) Per-invs marker: INV_TYPE_ATS_BROKEN + master_domains
    disable_ats() in the timeout path walks master->master_domains
    and flips matching ATS invs entries to the BROKEN type.

  + invs walker is free (one case label in the existing type switch).
  + No lock or pointer deref in the invs walker.
  + No master pointer stored in invs; no lifetime concern.

  - disable_ats() walks every (master, domain) and marks each invs
    set; the list needs locking usable from atomic.
  - No master-level flag, so atc_inv_master loses the cheap early
    skip and attach has nothing to gate ats_enabled on. Concurrent
    quarantine self-heals only after the next ATC_INV times out (~1s),
    unless we also keep a master flag (i.e. a hybrid of (1) and (2)).

(2) Per-master flag + streams_lock
    invs walker resolves SID -> master via streams_lock and reads
    master->ats_broken.

  + Single source of truth on the master.
  + disable_ats() is one WRITE_ONCE.
  + atc_inv_master early-skips via one READ_ONCE.
  + attach gates ats_enabled on the flag; a concurrent quarantine
    race can be closed by a short post-attach re-check in commit()
  + No master pointer in invs; no lifetime concern.

  - invs walker pays streams_lock + rb_find(SID) per ATS entry on
    every invalidation. Measurable on ATS-heavy workloads.

(3) Per-master flag + inv->master pointer (v4)
    invs entry carries a master pointer; the invs walker reads
    cur->master->ats_broken directly.

  + invs walker is one READ_ONCE through a cached pointer.
  + disable_ats is one WRITE_ONCE.
  + atc_inv_master early-skip via one READ_ONCE.
  + attach gate + post-attach re-check, same as (2).

  - invs holds a master ptr, so release_device must synchronize_rcu()
    before freeing the master to drain walkers under rcu_read_lock().
    We dropped this from v4 for that reason.

Side-by-side:
                (1)      (2)      (3)
  invs walker   fastest  slow     fast
  disable_ats   slow     fast     fast
  atc_inv skip  slow     fast     fast
  attach race   slow     fast     fast
  master free   fast     fast     slow

FWIW, AI picks (3), but this would reintroduce synchronize_rcu()
that you dislike. My personal feeling is that (1) is the cleanest
shirt in the dirty laundry?

I'd like to hear your thought before finalizing the design.

Thanks
Nicolin



More information about the linux-arm-kernel mailing list