[PATCH v8 07/12] iommu/arm-smmu-v3: Add CMDQ_PROD_STOP_FLAG to gate CMDQ submissions

Pranjal Shrivastava praan at google.com
Sun Jun 7 23:19:35 PDT 2026


On Sun, Jun 07, 2026 at 02:42:37PM -0700, Daniel Mentz wrote:
> On Mon, Jun 1, 2026 at 2:59 PM Pranjal Shrivastava <praan at google.com> wrote:
> >
> > Introduce a new bit flag, CMDQ_PROD_STOP_FLAG (bit 30), in the command
> > queue's producer index to safely gate command submissions during device
> > suspension.
> >
> > The flag embeds the suspend state directly into the existing global state
> > The flag checked in the cmpxchg loop in arm_smmu_cmdq_issue_cmdlist(),
> > which acts as a Point of Commitment, ensuring that no indices are
> > reserved or committed once the SMMU begins suspending.
> >
> > This prevents a situation of "abandoned batches" where indices are
> > incremented but commands are never written, which would otherwise
> > lead to timeout during the drain poll.
> >
> > Update queue_inc_prod_n() to preserve this flag during index
> > calculations, ensuring that any in-flight commands that successfully
> > passed the point of commitment can proceed to completion while the
> > flag remains set.

[...]

> >
> >  static void queue_poll_init(struct arm_smmu_device *smmu,
> > @@ -718,8 +719,25 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> >         do {
> >                 u64 old;
> >
> > +               /*
> > +                * If the SMMU is suspended/suspending, any new CMDs are elided.
> > +                * This loop is the Point of Commitment. If we haven't cmpxchg'd
> > +                * our new indices yet, we can safely bail. Once the indices are
> > +                * committed, we MUST write valid commands to those slots to
> > +                * avoid indefinite polling in the drain function.
> > +                */
> > +               if (Q_STOP(llq.prod)) {
> > +                       local_irq_restore(flags);
> > +                       return 0;
> 
> On second thought, I no longer believe that this is safe. I understand
> that READ_ONCE(cmdq->q.llq.val) implies no ordering guarantees with
> respect to any writes to translation tables. In the non-stopped case,
> we can rely on the call to dma_wmb() further down in this function.
> However, for the stopped case, I can't identify any barriers that
> would ensure that the STOP flag is checked only after the writes are
> visible to SMMU. Here is an example: Let's assume the following
> program order:
> 
>  * Write to invalidate PTE
>  * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> 
> What prevents the CPU from reordering these operations as follows?
> 
>  * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
>  * Write to invalidate PTE
> 
> Can the following situation occur?

Not really because:

> 
>  * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
>  * (Different CPU resumes SMMU)

We do a full smp_mb() here to ensure all writes are seen before SMMUEN=1
(note that the STOP_FLAG is already unset at that point, so we stop
eliding commands much before the SMMU is physically able to access any
config/xlation structures, I've explained everything below).

>  * SMMU loads old PTE value into TLB
>  * Write to invalidate PTE
>  * (stale PTE remains in TLB)
> 
> I propose the following: If you find the STOP flag set, run dma_mb()
> and check again. I'm afraid that running dma_mb() unconditionally
> might incur too much of a performance penalty.

IMHO, I think this might be overcomplicating things here.. 
Here's why the current version works according to me:

Till SMMUEN=0, the SMMU is spec-guaranteed to never access translation
or configuration structures (Section 6.3.9.6). 

Our runtime_suspend callback sets SMMUEN=0 before setting the STOP_FLAG.

Even if the worker CPU reorders the PTE write after the STOP_FLAG check,
it is benign because the SMMU is incapable of fetching that (or any) PTE
while the gate is closed. Because GATE_CLOSED == SMMUEN = 0, implying no
access to any HW structures whatsoever.

The real synchronization happens in the Resume Path:

1. arm_smmu_device_reset() clears all caches / TLBs.
   (None of these can have entries before SMMUEN=1)

2. We execute a full smp_mb() before setting SMMUEN=1. (that's why we
   need smp_mb before SMMUEN=1). This barrier ensures that any PTE 
   writes made by any thread—including those that were elided while the
   gate was closed, are globally visible before the SMMU hardware starts 
   fetching into TLBs again. (This is why Jason suggested this in v6 [1])

Adding a dma_mb() to the elision path would be redundant given the
SMMU can't access any structures and the resume barrier. We'd be
needlessly injecting dma_mbs when no entity is actually accessing those
structures while the STOP_FLAG is set. 
> 
> I have the same concerns with arm_smmu_cmdq_can_elide(): That
> READ_ONCE in arm_smmu_cmdq_can_elide() provides no guarantees in
> regards to ordering relative to PTE writes.
> 

The same as above, No structures are accessed during SMMUEN=0 (spec) and
during resume before setting SMMUEN=1, we do a full smp_mb() to ensure
all concurrent PTE writes are acquired in our resume thread before we
enable SMMUEN=1

Thanks,
Praan

[1] https://lore.kernel.org/all/20260424151639.GE3611611@ziepe.ca/



More information about the linux-arm-kernel mailing list