[PATCH v2 5/8] dma-mapping: Support batch mode for dma_direct_sync_sg_for_*
Barry Song
21cnbao at gmail.com
Tue Jan 6 11:47:36 PST 2026
On Wed, Jan 7, 2026 at 8:12 AM Robin Murphy <robin.murphy at arm.com> wrote:
>
> On 2026-01-06 6:41 pm, Barry Song wrote:
> > On Mon, Dec 29, 2025 at 3:50 AM Leon Romanovsky <leon at kernel.org> wrote:
> >>
> >> On Sun, Dec 28, 2025 at 09:52:05AM +1300, Barry Song wrote:
> >>> On Sun, Dec 28, 2025 at 9:09 AM Leon Romanovsky <leon at kernel.org> wrote:
> >>>>
> >>>> On Sat, Dec 27, 2025 at 11:52:45AM +1300, Barry Song wrote:
> >>>>> From: Barry Song <baohua at kernel.org>
> >>>>>
> >>>>> Instead of performing a flush per SG entry, issue all cache
> >>>>> operations first and then flush once. This ultimately benefits
> >>>>> __dma_sync_sg_for_cpu() and __dma_sync_sg_for_device().
> >>>>>
> >>>>> Cc: Leon Romanovsky <leon at kernel.org>
> >>>>> Cc: Catalin Marinas <catalin.marinas at arm.com>
> >>>>> Cc: Will Deacon <will at kernel.org>
> >>>>> Cc: Marek Szyprowski <m.szyprowski at samsung.com>
> >>>>> Cc: Robin Murphy <robin.murphy at arm.com>
> >>>>> Cc: Ada Couprie Diaz <ada.coupriediaz at arm.com>
> >>>>> Cc: Ard Biesheuvel <ardb at kernel.org>
> >>>>> Cc: Marc Zyngier <maz at kernel.org>
> >>>>> Cc: Anshuman Khandual <anshuman.khandual at arm.com>
> >>>>> Cc: Ryan Roberts <ryan.roberts at arm.com>
> >>>>> Cc: Suren Baghdasaryan <surenb at google.com>
> >>>>> Cc: Tangquan Zheng <zhengtangquan at oppo.com>
> >>>>> Signed-off-by: Barry Song <baohua at kernel.org>
> >>>>> ---
> >>>>> kernel/dma/direct.c | 14 +++++++-------
> >>>>> 1 file changed, 7 insertions(+), 7 deletions(-)
> >>>>
> >>>> <...>
> >>>>
> >>>>> -                if (!dev_is_dma_coherent(dev)) {
> >>>>> +                if (!dev_is_dma_coherent(dev))
> >>>>>                          arch_sync_dma_for_device(paddr, sg->length, dir);
> >>>>> -                        arch_sync_dma_flush();
> >>>>> -                }
> >>>>>          }
> >>>>> +        if (!dev_is_dma_coherent(dev))
> >>>>> +                arch_sync_dma_flush();
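
In other words, the hunk above changes the shape of the loop from

        for_each_sg(sgl, sg, nents, i) {
                arch_sync_dma_for_device(paddr, sg->length, dir);
                arch_sync_dma_flush();          /* flush per entry */
        }

to

        for_each_sg(sgl, sg, nents, i)
                arch_sync_dma_for_device(paddr, sg->length, dir);
        arch_sync_dma_flush();                  /* one flush for the list */

(Sketch only, quoting just the relevant statements; the hunk above has
the real code.)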
> >>>>
> >>>> This patch should be squashed into the previous one. You introduced
> >>>> arch_sync_dma_flush() there, and now you are placing it elsewhere.
> >>>
> >>> Hi Leon,
> >>>
> >>> The previous patch replaces all arch_sync_dma_for_* calls with
> >>> arch_sync_dma_for_* plus arch_sync_dma_flush(), without any
> >>> functional change. The subsequent patches then implement the
> >>> actual batching. I feel this is a better approach for reviewing
> >>> each change independently. Otherwise, the previous patch would
> >>> be too large.
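
(For reference, the previous patch mechanically turned every

        arch_sync_dma_for_device(paddr, sg->length, dir);

into

        arch_sync_dma_for_device(paddr, sg->length, dir);
        arch_sync_dma_flush();

with no functional change; this patch then hoists the flush out of the
sg loop, as the hunk above shows.)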
> >>
> >> Don't worry about it. Your patches are small enough.
> >
> > My hardware does not require a bounce buffer, but I am concerned that
> > this patch may be incorrect for systems that do require one.
> >
> > Now it is:
> >
> > void dma_direct_sync_sg_for_cpu(struct device *dev,
> >                 struct scatterlist *sgl, int nents, enum dma_data_direction dir)
> > {
> >         struct scatterlist *sg;
> >         int i;
> >
> >         for_each_sg(sgl, sg, nents, i) {
> >                 phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
> >
> >                 if (!dev_is_dma_coherent(dev))
> >                         arch_sync_dma_for_cpu(paddr, sg->length, dir);
> >
> >                 swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
> >
> >                 if (dir == DMA_FROM_DEVICE)
> >                         arch_dma_mark_clean(paddr, sg->length);
> >         }
> >
> >         if (!dev_is_dma_coherent(dev)) {
> >                 arch_sync_dma_flush();
> >                 arch_sync_dma_for_cpu_all();
> >         }
> > }
> >
> > Should we call swiotlb_sync_single_for_cpu() and
> > arch_dma_mark_clean() after the flush to ensure the CPU sees the
> > latest data and that the memcpy is correct? I mean:
>
> Yes, this and the equivalents in the later patches are broken for all
> the sync_for_cpu and unmap paths which may end up bouncing (beware some
> of them get a bit fiddly) - any cache maintenance *must* be completed
> before calling SWIOTLB. As for mark_clean, IIRC that was an IA-64 thing,
> and appears to be entirely dead now.
Thanks, Robin. If I understand the failure mode correctly: with the flush
deferred past swiotlb_sync_single_for_cpu(), the bounce-buffer memcpy would
run before the queued cache maintenance has completed, so the CPU could
read stale data, or a later invalidate could throw away the freshly copied
lines.

Personally, I would prefer an approach like the one below: leave the
bounce-buffer cases unbatched, since they are already slow because the
hardware forces a memcpy anyway, and optimizing them would make the code
quite messy.
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 550a1a13148d..a4840f7e8722 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -423,8 +423,11 @@ void dma_direct_sync_sg_for_cpu(struct device *dev,
         for_each_sg(sgl, sg, nents, i) {
                 phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-                if (!dev_is_dma_coherent(dev))
+                if (!dev_is_dma_coherent(dev)) {
                         arch_sync_dma_for_cpu(paddr, sg->length, dir);
+                        if (unlikely(dev->dma_io_tlb_mem))
+                                arch_sync_dma_flush();
+                }
 
                 swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);
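
To make the intent concrete, here is roughly what the function would look
like with that change applied. This is a sketch pieced together from the
code quoted earlier in this thread (the dev->dma_io_tlb_mem test as the
"can this device bounce" check comes from the hunk above), not taken from
an applied tree:

void dma_direct_sync_sg_for_cpu(struct device *dev,
                struct scatterlist *sgl, int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));

                if (!dev_is_dma_coherent(dev)) {
                        arch_sync_dma_for_cpu(paddr, sg->length, dir);
                        /*
                         * If a SWIOTLB pool exists, this entry may have
                         * been bounced: complete the cache maintenance
                         * now, before the bounce-buffer memcpy below
                         * touches the page.
                         */
                        if (unlikely(dev->dma_io_tlb_mem))
                                arch_sync_dma_flush();
                }

                swiotlb_sync_single_for_cpu(dev, paddr, sg->length, dir);

                if (dir == DMA_FROM_DEVICE)
                        arch_dma_mark_clean(paddr, sg->length);
        }

        if (!dev_is_dma_coherent(dev)) {
                arch_sync_dma_flush();  /* single flush, common case */
                arch_sync_dma_for_cpu_all();
        }
}

Devices without a SWIOTLB pool still get exactly one deferred flush per
scatterlist, while devices that might bounce fall back to per-entry
flushing, matching the old behaviour.
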
I'd like to hear your view on this, Robin, as well as Leon's and Marek's.
Thanks
Barry