[PATCH v8 07/12] iommu/arm-smmu-v3: Add CMDQ_PROD_STOP_FLAG to gate CMDQ submissions
Daniel Mentz
danielmentz at google.com
Mon Jun 8 21:20:42 PDT 2026
On Sun, Jun 7, 2026 at 11:19 PM Pranjal Shrivastava <praan at google.com> wrote:
> > >
> > > static void queue_poll_init(struct arm_smmu_device *smmu,
> > > @@ -718,8 +719,25 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> > > do {
> > > u64 old;
> > >
> > > + /*
> > > + * If the SMMU is suspended/suspending, any new CMDs are elided.
> > > + * This loop is the Point of Commitment. If we haven't cmpxchg'd
> > > + * our new indices yet, we can safely bail. Once the indices are
> > > + * committed, we MUST write valid commands to those slots to
> > > + * avoid indefinite polling in the drain function.
> > > + */
> > > + if (Q_STOP(llq.prod)) {
> > > + local_irq_restore(flags);
> > > + return 0;
> >
> > On second thought, I no longer believe that this is safe. I understand
> > that READ_ONCE(cmdq->q.llq.val) implies no ordering guarantees with
> > respect to any writes to translation tables. In the non-stopped case,
> > we can rely on the call to dma_wmb() further down in this function.
> > However, for the stopped case, I can't identify any barriers that
> > would ensure that the STOP flag is checked only after the writes are
> > visible to SMMU. Here is an example: Let's assume the following
> > program order:
> >
> > * Write to invalidate PTE
> > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> >
> > What prevents the CPU from reordering these operations as follows?
> >
> > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > * Write to invalidate PTE
Now, I believe that the smp_mb() call in arm_smmu_domain_inv_range()
actually prevents this (if we remove arm_smmu_cmdq_can_elide or move
it to after the smp_mb() call)
> >
> > Can the following situation occur?
>
> Not really because:
>
> >
> > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > * (Different CPU resumes SMMU)
>
> We do a full smp_mb() here to ensure all writes are seen before SMMUEN=1
> (note that the STOP_FLAG is already unset at that point, so we stop
> eliding commands much before the SMMU is physically able to access any
> config/xlation structures, I've explained everything below).
>
> > * SMMU loads old PTE value into TLB
> > * Write to invalidate PTE
> > * (stale PTE remains in TLB)
> >
> > I propose the following: If you find the STOP flag set, run dma_mb()
> > and check again. I'm afraid that running dma_mb() unconditionally
> > might incur too much of a performance penalty.
>
> IMHO, I think this might be overcomplicating things here..
> Here's why the current version works according to me:
>
> Till SMMUEN=0, the SMMU is spec-guaranteed to never access translation
> or configuration structures (Section 6.3.9.6).
>
> Our runtime_suspend callback sets SMMUEN=0 before setting the STOP_FLAG.
Does "before" refer to program order or the order in which other
agents can observe the STOP_FLAG relative to when SMMUEN=0 is set?
Your program might read "Set SMMUEN=0 then set STOP_FLAG". However, my
question is whether other observers might see the STOP_FLAG before
SMMUEN is cleared. I couldn't identify any barriers in
arm_smmu_runtime_suspend() that would prevent this type of
re-ordering.
> Even if the worker CPU reorders the PTE write after the STOP_FLAG check,
> it is benign because the SMMU is incapable of fetching that (or any) PTE
> while the gate is closed. Because GATE_CLOSED == SMMUEN = 0, implying no
> access to any HW structures whatsoever.
>
> The real synchronization happens in the Resume Path:
>
> 1. arm_smmu_device_reset() clears all caches / TLBs.
> (None of these can have entries before SMMUEN=1)
>
> 2. We execute a full smp_mb() before setting SMMUEN=1. (that's why we
> need smp_mb before SMMUEN=1). This barrier ensures that any PTE
> writes made by any thread—including those that were elided while the
> gate was closed, are globally visible before the SMMU hardware starts
> fetching into TLBs again. (This is why Jason suggested this in v6 [1])
A barrier on one CPU has no bearing on whether writes by any other CPU
can be observed by any particular agent in the system.
Let's compare this with the long comment in
arm_smmu_domain_inv_range() which is what I believe Jason was
referring to. In that example, you see smp_mb() in the code path on
CPU0 and dma_wmb() in the code path on CPU1. Hence, barriers exist on
both sides. If you compare the runtime PM design with
arm_smmu_domain_inv_range(), then smp_mb() belongs in the CPU thread
that performs the translation table updates not the one that performs
the suspend/resume operation.
Actually, now that I think of it: We could just piggy pack on the
existing smp_mb() in arm_smmu_domain_inv_range(). For your
arm_smmu_cmdq_can_elide() to work correctly, though, you would need to
move it to *after* the smp_mb() call. This being said, I would prefer
to remove arm_smmu_cmdq_can_elide() as I wrote on the other thread.
If we do take advantage of the existing smp_mb() in
arm_smmu_domain_inv_range(), then we should probably update that
comment to state that the runtime PM implementation now also depends
on it.
>
> Adding a dma_mb() to the elision path would be redundant given the
> SMMU can't access any structures and the resume barrier. We'd be
> needlessly injecting dma_mbs when no entity is actually accessing those
> structures while the STOP_FLAG is set.
> >
> > I have the same concerns with arm_smmu_cmdq_can_elide(): That
> > READ_ONCE in arm_smmu_cmdq_can_elide() provides no guarantees in
> > regards to ordering relative to PTE writes.
> >
>
> The same as above, No structures are accessed during SMMUEN=0 (spec) and
> during resume before setting SMMUEN=1, we do a full smp_mb() to ensure
> all concurrent PTE writes are acquired in our resume thread before we
> enable SMMUEN=1
I think the definition of "all concurrent PTE writes" is a bit vague.
I don't think the architecture allows you to wait for writes performed
by other CPUs to be observable by all other agents. The only thing you
can do is to ensure that other agents can observe that the STOP_FLAG
has been cleared *before* SMMUEN is set. However, this is already
achieved by the existing dma_wmb() in arm_smmu_cmdq_issue_cmdlist().
Hence, the extra smp_mb() is not required.
>
> Thanks,
> Praan
>
> [1] https://lore.kernel.org/all/20260424151639.GE3611611@ziepe.ca/
More information about the linux-arm-kernel
mailing list