[PATCH] arm64/dma-mapping: Fix arch_sync_dma_for_device to respect dir parameter
John Cox
john.cox at raspberrypi.com
Wed Aug 20 08:35:42 PDT 2025
On Wed, 20 Aug 2025 at 16:16, Catalin Marinas <catalin.marinas at arm.com> wrote:
>
> On Wed, Aug 20, 2025 at 03:43:16PM +0100, John Cox wrote:
> > On Wed, 20 Aug 2025 at 15:08, Robin Murphy <robin.murphy at arm.com> wrote:
> > >
> > > On 20/08/2025 2:25 pm, Catalin Marinas wrote:
> > > > On Wed, Aug 20, 2025 at 11:28:06AM +0100, John Cox via B4 Relay wrote:
> > > >> This patch makes the arch_sync_dma_for_device() function on arm64
> > > >> behave differently depending on the value of the dir parameter. In
> > > >> particular, it performs a cache invalidate if the dir flag is
> > > >> set to DMA_FROM_DEVICE. The current code does a writeback without
> > > >> invalidate under all circumstances. Nearly all other architectures do
> > > >> an invalidate if the direction is FROM_DEVICE, which seems like the
> > > >> correct thing to do to me.
> > > >
> > > > So does arm64, but in arch_sync_dma_for_cpu(). That's the correct
> > > > place to do it; otherwise, after arch_sync_dma_for_device(), you may
> > > > have speculative loads by the CPU populating the caches with stale
> > > > data before the device has finished writing.
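
For reference, the current arm64 hooks look roughly like the sketch
below (a simplified paraphrase of arch/arm64/mm/dma-mapping.c from
memory, not the exact source):

void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
                              enum dma_data_direction dir)
{
        unsigned long start = (unsigned long)phys_to_virt(paddr);

        /* Clean (writeback) only: publish CPU writes to memory and
         * ensure no dirty line can later be evicted over DMA data. */
        dcache_clean_poc(start, start + size);
}

void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
                           enum dma_data_direction dir)
{
        unsigned long start = (unsigned long)phys_to_virt(paddr);

        if (dir == DMA_TO_DEVICE)
                return;

        /* Invalidate only after the device has finished writing, so
         * any speculatively refilled stale lines are discarded. */
        dcache_inval_poc(start, start + size);
}
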
> > >
> > > Exactly: not only is it unnecessary, it's not even guaranteed to have
> > > any lasting effect. arch_sync_dma_for_device() has two jobs to do: 1)
> > > ensure that any new data going in the DMA_TO_DEVICE direction is visible
> > > to the device; a clean is sufficient for that. 2) ensure that no dirty
> > > cachelines may be written back over new DMA_FROM_DEVICE data; a clean is
> > > sufficient for that also. Adding an invalidate at this point serves no
> > > purpose since the CPU is still free to immediately speculatively fetch
> > > the same cleaned data back into the cache.
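
A minimal driver-side sketch of that contract, for anyone following
along; rx_one_buffer() is a hypothetical helper assuming
<linux/dma-mapping.h>, not code from any real driver:

/* Hypothetical receive path for one streaming DMA_FROM_DEVICE buffer. */
static int rx_one_buffer(struct device *dev, void *buf, size_t len)
{
        dma_addr_t handle;

        handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, handle))
                return -ENOMEM;

        /* for_device: clean, so no dirty line can later be evicted
         * over the data the device is about to write. */
        dma_sync_single_for_device(dev, handle, len, DMA_FROM_DEVICE);

        /* ... device DMA runs; the CPU remains free to speculatively
         * refill cachelines covering this buffer ... */

        /* for_cpu: invalidate after the device has finished; only now
         * may the CPU safely read the freshly written data. */
        dma_sync_single_for_cpu(dev, handle, len, DMA_FROM_DEVICE);

        dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
        return 0;
}
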
> >
> > An invalidate at this point for DMA_FROM_DEVICE satisfies (2) at least
> > as well as a clean does, and has the side benefit that any used cache
> > lines are now free for use by the CPU. (1) is unaffected by this patch.
> >
> > I believe that my patch is no less functional than the current code.
>
> It's no less functional, but it does not address the problem; it simply
> makes it less likely. Both Robin and I explained how speculative loads
> can lead to inconsistent data if the for_cpu API is not called.
>
> So I'd rather keep the current behaviour, where we can spot problems
> early, than hide them and make them harder to debug.
>
> > > > An attempt at a udmabuf fix (untested):
> > > >
> > > > diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> > > > index 40399c26e6be..9ab4a6c01143 100644
> > > > --- a/drivers/dma-buf/udmabuf.c
> > > > +++ b/drivers/dma-buf/udmabuf.c
> > > > @@ -256,10 +256,11 @@ static int begin_cpu_udmabuf(struct dma_buf *buf,
> > > >                          ret = PTR_ERR(ubuf->sg);
> > > >                          ubuf->sg = NULL;
> > > >                  }
> > > > -        } else {
> > > > -                dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > > >          }
> > > >
> > > > +        if (ubuf->sg)
> > > > +                dma_sync_sgtable_for_cpu(dev, ubuf->sg, direction);
> > > > +
> > > >          return ret;
> > > >  }
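
If I read the diff right, the effect is that dma_sync_sgtable_for_cpu()
now also runs on the first call, immediately after get_sg_table() has
created and mapped the table, instead of only on later calls.
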
> >
> > Indeed, though that has the annoyance that you run two sets of cache
> > ops over the entire buffer rather than one, and for a decent-sized
> > buffer (e.g. a video frame) that is not free, given that you have to
> > loop over every cache line.
>
> The udmabuf code is slightly dodgy anyway. I'd expect the DMA buffer to
> have been mapped already, and a sync for_device issued, long before
> begin_cpu_udmabuf() is called, without needing an additional dma_map
> here just to populate the sg list. Otherwise it's quite late to be doing
> the sync for_device just before the CPU is about to start reading. I'm
> not familiar with this code, but at this point we really should only
> have the for_cpu loop.
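
In other words, something like the shape below, sketched with the
generic sgtable helpers; this is a guess at the intended lifecycle, not
a claim about what udmabuf does today:

/* once, at attach/map time (dma_map_sgtable() includes a for_device sync): */
ret = dma_map_sgtable(dev, sgt, direction, 0);

/* begin_cpu_access: */
dma_sync_sgtable_for_cpu(dev, sgt, direction);
/* ... CPU reads/writes the buffer ... */

/* end_cpu_access: */
dma_sync_sgtable_for_device(dev, sgt, direction);
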
>
> Maybe the problem is elsewhere, I just spotted a corner case where the
> sync for_cpu is not called.
>
> > I take your points, but this patch should be no less functional and no
> > slower than the current code, and it does work around existing cases
> > where other drivers haven't got it right.
>
> It does not work around them; it just makes them rarer. Other
> architectures may get away with doing such maintenance in the for_device
> sync only if they don't have aggressive speculative loads into the
> cache. That's not the case for arm64.
Fair enough. I do understand your point about speculative loads, and I
accept that my patch only makes the problem rarer. I still believe that
making such issues rarer is worthwhile when they are already rare enough
that they will often simply be dismissed as general instability rather
than leading to better debugging.

Having said that, I accept your point of view. Thank you for your time.
John Cox
> --
> Catalin