Why doesn't dma_alloc_coherent return a direct-mapped vaddr?

Arnd Bergmann arnd at arndb.de
Fri Jul 22 04:06:35 PDT 2022


On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me at linux.beauty> wrote:
>  ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd at arndb.de> wrote ---
>  >
>  > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
>  > uncacheable will cause data corruption somewhere as well: The direct i/o
>  > code expects normal page cache pages, but these are clearly not.
>
> direct I/O just bypasses the page cache, so I think you mean "normal pages"?

All normal memory available to user space is in the page cache. What you bypass
with direct I/O is just the copy into another page cache page.

>  > Also, the coherent DMA API is not actually meant for transferring large
>  > amounts of data.
>
>  Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contigous
> and they also worked fine.

Those two interfaces just return a 'struct page', so if you convert them into
a kernel pointer or map them into user space, you get a cacheable mapping.
Is that what you do? If so, then your device appears to be cache coherent
with the CPU, and you can just mark it as coherent in the devicetree.
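Marking the device coherent in the devicetree is a one-property change. A minimal sketch of what that looks like (the node name, compatible string, and addresses are made up for illustration; only the `dma-coherent` property is the point):

```dts
dsp@40000000 {
        compatible = "vendor,example-dsp";  /* hypothetical device node */
        reg = <0x40000000 0x100000>;
        /* Tells the kernel this DMA master is cache coherent with the
         * CPU, so dma_alloc_coherent() can hand out cacheable mappings. */
        dma-coherent;
};
```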

>  > My guess is that what you are doing here is to use the
>  > coherent API to map a large buffer uncached and then try to access the
>  > uncached data in user space, which is inherently slow. Using direct I/o
>  > appears to solve the problem by not actually using the uncached mapping
>  > when sending the data to another device, but this is not the right approach.
>
> My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about
> 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory).
>
>  > Do you have an IOMMU, scatter/gather support or similar to back the
>  > device?
>
> No. My misc char device is simply a pseudo device and has no real hardware.
> Our DSP will write raw data to this rmem, but that is another story; we can ignore it here.

It is the DSP that I'm talking about here; this is what makes all the
difference.
If the DSP is cache coherent and you mark it that way in DT, then everything
just becomes fast, and you don't have to use direct I/O. If the DSP is not
cache coherent, but you can program it to write into arbitrary memory page
cache pages allocated from user space, then you can use the streaming
mapping interface that does explicit cache management. This is of course
not as fast as coherent hardware, but it also allows accessing the data
through the CPU cache later.
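The streaming-mapping flow described above can be sketched roughly as follows. This is only an outline of the kernel-side steps under the stated assumptions (a driver that receives a user-space buffer address); the function name is hypothetical, error unwinding is omitted, and the device is assumed to write into memory, hence DMA_FROM_DEVICE:

```c
/* Rough sketch: pin user pages, build a scatterlist, map it for DMA.
 * On a non-coherent system, dma_map_sg()/dma_sync_sg_for_cpu() perform
 * the explicit cache maintenance so the pages stay cacheable for the CPU. */
static int map_user_buf_for_dsp(struct device *dev, unsigned long uaddr,
				size_t len, struct sg_table *sgt,
				struct page **pages)
{
	int nr_pages = DIV_ROUND_UP(len + offset_in_page(uaddr), PAGE_SIZE);
	int pinned, ret;

	/* Pin the user pages so they cannot be swapped out or migrated */
	pinned = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (pinned < nr_pages)
		return -EFAULT;

	/* Build a scatterlist over the physically discontiguous pages */
	ret = sg_alloc_table_from_pages(sgt, pages, pinned,
					offset_in_page(uaddr), len,
					GFP_KERNEL);
	if (ret)
		return ret;

	/* Hand the buffer to the device; flushes/invalidates caches as needed */
	if (!dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE))
		return -EIO;

	/* ... start the DSP transfer and wait for completion ... */

	/* Hand ownership back to the CPU before reading the data */
	dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
	return 0;
}
```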

>  > I think the only way to do what you want to achieve in a way
>  > that is both safe and efficient would be to use normal page cache pages
>  > allocated from user space, ideally using hugepage mappings, and then
>  > mapping those into the device using the streaming DMA API to assign
>  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
>  > and dma_sync_sg_for_{device,cpu}.
>
> Thanks for your advice, but unfortunately, the DSP can only write to
> contiguous physical memory (it doesn't know about the MMU), and pages
> allocated from userspace are not contiguous in physical memory.

Usually a DSP can run user-provided software, so you could pass it a
scatter-gather list for the output data in addition to the buffer that
it uses for its code and intermediate
buffers. If the goal is to store this data in a file, you can even go as far
as calling mmap() on the file, and then letting the driver get the page
cache pages backing the file mapping, and then relying on the normal
file system writeback to store the data to disk.

         Arnd



More information about the linux-arm-kernel mailing list