[PATCH v8 07/12] iommu/arm-smmu-v3: Add CMDQ_PROD_STOP_FLAG to gate CMDQ submissions

Tue Jun 9 11:20:52 PDT 2026

On Tue, Jun 9, 2026 at 3:05 AM Pranjal Shrivastava <praan at google.com> wrote:
> >
> > > Even if the worker CPU reorders the PTE write after the STOP_FLAG check,
> > > it is benign because the SMMU is incapable of fetching that (or any) PTE
> > > while the gate is closed. Because GATE_CLOSED == SMMUEN = 0, implying no
> > > access to any HW structures whatsoever.
> > >
> > > The real synchronization happens in the Resume Path:
> > >
> > > 1. arm_smmu_device_reset() clears all caches / TLBs.
> > >    (None of these can have entries before SMMUEN=1)
> > >
> > > 2. We execute a full smp_mb() before setting SMMUEN=1. (that's why we
> > >    need smp_mb before SMMUEN=1). This barrier ensures that any PTE
> > >    writes made by any thread—including those that were elided while the
> > >    gate was closed, are globally visible before the SMMU hardware starts
> > >    fetching into TLBs again. (This is why Jason suggested this in v6 [1])
> >
> > A barrier on one CPU has no bearing on whether writes by any other CPU
> > can be observed by any particular agent in the system.
> >
> > Let's compare this with the long comment in
> > arm_smmu_domain_inv_range() which is what I believe Jason was
> > referring to. In that example, you see smp_mb() in the code path on
> > CPU0 and dma_wmb() in the code path on CPU1. Hence, barriers exist on
> > both sides. If you compare the runtime PM design with
> > arm_smmu_domain_inv_range(), then smp_mb() belongs in the CPU thread
> > that performs the translation table updates not the one that performs
> > the suspend/resume operation.
> >
>
> I might be missing something here, so please bear with me. My
> understanding it that's needed because the IOMMU is live & actively
> caching, which is not true for our case.

I think the "invs" design (per-domain invalidation array) is more
similar than you think! An SMMU being absent from invs is equivalent
to the STOP flag, and the STE pointing to TTB0 is roughly the
equivalent of SMMUEN=1, i.e. the IOMMU is not actively caching a
particular translation domain until an STE (or CD) points to it.

> [Assuming we use non-relaxed semantics & ordering for the STOP flag,
> i.e. set STOP_FLAG + barrier & clear STOP_FLAG (implicit dma_wmb())]
>
> In our case, during the resume op, we first clear the STOP_FLAG before
> setting SMMUEN=1 in program order. Thus, any PTE invalidations occurring
> before SMMUEN=1 are executed, i.e. EVEN when the SMMU is guaranteed not
> to access any structures, we've resumed invalidations.

"[...] we first clear the STOP_FLAG before setting SMMUEN=1 in program
order." I think this should be modified to "we first clear the
STOP_FLAG and ensure that the cleared STOP_FLAG is observable by all
other CPUs before setting SMMUEN=1".

Re "Thus, any PTE invalidations occurring before SMMUEN=1 are
executed,": I think that "a PTE invalidation occurring" is not clearly
defined. Also, it's not clear to me what this statement implies. It's
paramount that invalidations are performed when SMMUEN=1. The fact
that we perform invalidations before SMMUEN=1 is more of a side effect
of our methodology.

I would define a set of invariants:

 * If an agent observes the STOP flag, it is guaranteed that SMMUEN=0
   (with ABORT set) at the time of observation.
 * Any transition from a set STOP flag to SMMUEN=1 involves an
   invalidate-all operation prior to setting SMMUEN=1.

Hence, if a CPU observes the STOP flag, it is assured that (a)
transactions are blocked and (b) if the SMMU is ever re-enabled, an
invalidate-all is performed prior to it being enabled.

I would then argue that all operations support these invariants. For
example, we need proper barriers in the iommu_unmap path to ensure
that the STOP flag is only checked *after* the translation table
update is made. Hence, we need a memory barrier.
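
To make the ordering concrete, here is a rough userspace sketch of the
unmap side. Everything in it is a stand-in I made up for illustration:
pte, stop_flag, the counters, sketch_unmap() and full_barrier() are
hypothetical names, and the C11 seq_cst fence only approximates what
smp_mb() does in the kernel. The point is just the order: table update,
full barrier, and only then the STOP flag load.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the driver's real data structures. */
static _Atomic uint64_t pte = 0x1001;	/* one "valid" table entry */
static atomic_bool stop_flag;		/* models the STOP flag */
static unsigned int invs_issued;	/* invalidations sent to the CMDQ */
static unsigned int invs_elided;	/* invalidations skipped */

/* Userspace approximation of the kernel's smp_mb(). */
static inline void full_barrier(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Returns true if an invalidation was issued, false if it was elided. */
static bool sketch_unmap(void)
{
	/* 1. Translation table update. */
	atomic_store_explicit(&pte, 0, memory_order_relaxed);

	/* 2. Order the update before the flag load below. */
	full_barrier();

	/* 3. Only now check the STOP flag. */
	if (atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
		invs_elided++;
		return false;
	}

	invs_issued++;
	return true;
}
```

Without step 2, the flag load could be satisfied before the table
update is visible to anyone else, which is the reordering discussed
earlier in this thread.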

I look at it this way: Every elided invalidation creates an
"invalidation deficit", and this deficit is tolerable for two reasons:
(a) SMMU blocks all transactions while there is a deficit. (b) An
invalidate-all eliminates any deficit accrued while the STOP flag was
set.
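
The invariants and the deficit idea can be written down as a small
single-threaded model. This is only a sketch under my own assumptions:
struct gate_model and the model_*() helpers are hypothetical, the
invalidate-all is reduced to zeroing a counter, and ABORT handling on
resume is not modelled. It says nothing about barriers; it only checks
that each step preserves the invariants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the gate; field names are illustrative only. */
struct gate_model {
	bool stop_flag;		/* invalidations are elided while set */
	bool smmuen;		/* SMMU may walk tables and fill caches */
	bool abort;		/* incoming transactions are blocked */
	unsigned int deficit;	/* elided since the last invalidate-all */
};

static void check_invariants(const struct gate_model *g)
{
	/* Invariant 1: STOP flag set implies SMMUEN=0 with ABORT set. */
	if (g->stop_flag)
		assert(!g->smmuen && g->abort);

	/* Consequence of invariant 2: no deficit survives into SMMUEN=1. */
	if (g->smmuen)
		assert(g->deficit == 0);
}

static void model_suspend(struct gate_model *g)
{
	g->abort = true;	/* block transactions first */
	g->smmuen = false;	/* then disable the SMMU */
	g->stop_flag = true;	/* only now may invalidations be elided */
	check_invariants(g);
}

static void model_invalidate(struct gate_model *g)
{
	if (g->stop_flag)
		g->deficit++;	/* elided: adds to the deficit */
	check_invariants(g);
}

static void model_resume(struct gate_model *g)
{
	g->stop_flag = false;	/* invalidations are issued again */
	g->deficit = 0;		/* invalidate-all wipes the deficit */
	g->smmuen = true;	/* enable only after the invalidate-all */
	check_invariants(g);
}
```

Swapping the last two assignments in model_resume() makes
check_invariants() fire whenever something was elided, which is the
failure the second invariant is meant to rule out.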

> Let's consider a few examples:
>
> 1. SUSPEND (say CPU0 is suspending)
>
> [CPU0] SMMUEN = 0 ==> SMMU stops accessing HW structures (ABORT NOT set)

I thought we never disable the SMMU unless ABORT is set.

>                       HW structures not accessed means no TLB / CFG
>                       cache accesses as well according to the spec.
>
> [CPU1] ==> PTE update => Invalidate => Succeeds (although SMMUEN = 0)
>
> [CPU0] GBPA.Abort set ==> Txns are blocked
>
> [CPU2] => PTE update => Invalidate => Succeeds [Txns blocked + SMMUEN=0]
>
> [CPU0] ==> SET STOP_FLAG ==> Elision begins
>
> [CPU3] ==> PTE update ==> Invalidation ==> Elided [Txns blocked + SMMUEN=0]
>
> Hence, the races in the suspend sequence are handled correctly.

I'm not sure if this description demonstrates that every possible race
is handled correctly. If I compare this with Nicolin's presentation in
arm_smmu_domain_inv_range(), I like that presentation, as it explicitly
mentions loads and barriers. For example, it has an smp_mb() followed
by "// load the updated invs". I think you should have something
like "smp_mb() ; CHECK STOP_FLAG" in your presentation. Currently, the
STOP_FLAG checking is somehow implicit in "Invalidation".
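
For what it's worth, once the loads and barriers are spelled out, the
resume race reduces to a store-buffering pattern that can be tried in
userspace. The sketch below is a model under loud assumptions: all
names are hypothetical, the SMMU's first table walk is modelled as an
ordinary CPU load (which it is not), and C11 seq_cst fences stand in
for smp_mb(). The outcome to rule out is "the unmap side saw STOP set
(so it elided) and the walk still read the old PTE".

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical model: none of these names come from the driver. */
static atomic_int pte;		/* 0 = old mapping, 1 = updated mapping */
static atomic_int stop_flag;	/* 1 = gate closed, 0 = gate open */
static int saw_stop;		/* what the unmap side loaded */
static int walked_pte;		/* what the modelled table walk loaded */

static void *unmap_side(void *arg)
{
	(void)arg;
	/* PTE update */
	atomic_store_explicit(&pte, 1, memory_order_relaxed);
	/* smp_mb() stand-in */
	atomic_thread_fence(memory_order_seq_cst);
	/* CHECK STOP_FLAG */
	saw_stop = atomic_load_explicit(&stop_flag, memory_order_relaxed);
	return NULL;
}

static void *resume_side(void *arg)
{
	(void)arg;
	/* clear STOP_FLAG */
	atomic_store_explicit(&stop_flag, 0, memory_order_relaxed);
	/* smp_mb() stand-in */
	atomic_thread_fence(memory_order_seq_cst);
	/* invalidate-all and SMMUEN=1 would sit here; then the first walk */
	walked_pte = atomic_load_explicit(&pte, memory_order_relaxed);
	return NULL;
}

/* Returns the number of forbidden outcomes seen, or -1 on thread error. */
static int count_forbidden(int rounds)
{
	int forbidden = 0;

	for (int i = 0; i < rounds; i++) {
		pthread_t a, b;

		atomic_store(&pte, 0);
		atomic_store(&stop_flag, 1);

		if (pthread_create(&a, NULL, unmap_side, NULL))
			return -1;
		if (pthread_create(&b, NULL, resume_side, NULL)) {
			pthread_join(a, NULL);
			return -1;
		}
		pthread_join(a, NULL);
		pthread_join(b, NULL);

		/* Elided invalidation AND a stale walk: must not happen. */
		if (saw_stop == 1 && walked_pte == 0)
			forbidden++;
	}
	return forbidden;
}
```

With both fences in place the C11 memory model forbids that outcome,
so count_forbidden() should return 0. Removing the fence in
unmap_side() is the analogue of checking STOP_FLAG without the
smp_mb(), and the forbidden outcome then becomes permitted.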

>
> 2. RESUME (say CPU0 is resuming)
>
> [CPU1] ==> Update PTE ==> Invalidate ==> Elided [Txns blocked + SMMUEN=0]
>
> [CPU0] ==> Clear STOP_FLAGs [Txns still blocked + SMMUEN=0]
>
> [CPU2] ==> Update PTE ==> Invalidate ==> Succeeds [Txns blocked + SMMUEN=0]
>
> [CPU0] ==> Invalidate all TLB ==> Succeeds [Txns still blocked + SMMUEN=0]
> [CPU0] ==> Invalidate all CFG ==> Succeeds [Txns still blocked + SMMUEN=0]
>
> [CPU2] ==> Update PTE ==> Invalidate ==> Succeeds [Txns still blocked + SMMUEN=0]
>
> [CPU0] ==> Set SMMUEN = 1 [SMMU can now access in memory structures]
>            However, the TLBs and CFG caches are clean because everything
>            until this point couldn't have cached anything anyway.

My concern with this diagram is that it appears sequential, suggesting
operations happen in a specific order across CPUs when they, in fact,
occur in parallel. I find these diagrams more useful for describing
failure cases than for proving that every race is handled correctly.

>
> Hence, right after clearing the STOP_FLAG, we're taking in invalidations
> as normal in the resume, much before the real caching can begin.
>
> Thus, by resuming invalidations before SMMUEN=1, we guarantee a
> consistent state before the very first translation is performed.
>
> Apart from this, I guess I'll drop the can_elide check from all
> invalidation paths.
>
> Does that sound fine?

Dropping can_elide sounds fine. However, if you still use this
function, for example in the gerror handler, then you might consider
renaming it.