Why doesn't dma_alloc_coherent return a direct-mapped vaddr?

Li Chen me at linux.beauty
Fri Jul 22 03:31:07 PDT 2022


 ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd at arndb.de> wrote --- 
 > On Fri, Jul 22, 2022 at 10:19 AM Li Chen <me at linux.beauty> wrote:
 > >  ---- On Fri, 22 Jul 2022 14:50:17 +0800  Arnd Bergmann <arnd at arndb.de> wrote ---
 > >  > On Fri, Jul 22, 2022 at 4:57 AM Li Chen <me at linux.beauty> wrote:
 > >  > >  ---- On Thu, 21 Jul 2022 15:06:57 +0800  Arnd Bergmann <arnd at arndb.de> wrote ---
 > >  > >  > in between.
 > >  > >
 > >  > > Thanks for your answer! My device is a misc character device, just like
 > >  > > https://lwn.net/ml/linux-kernel/20220711122459.13773-5-me@linux.beauty/
 > >  > > IIUC, its dma_addr is always the same as its phys addr. If I want to allocate from
 > >  > > reserved memory and then mmap it to userspace with vm_insert_pages, are
 > >  > > cma_alloc/dma_alloc_contiguous/dma_alloc_from_contiguous better choices?
 > >  >
 > >  > In the driver, you should only ever use dma_alloc_coherent() for getting
 > >  > a coherent DMA buffer, the other functions are just the implementation
 > >  > details behind that.
 > >  >
 > >  > To map this buffer to user space, your mmap() function should call
 > >  > dma_mmap_coherent(), which in turn does the correct translation
 > >  > from device specific dma_addr_t values into pages and uses the
 > >  > correct caching attributes.
 > >
 > > Yeah, dma_mmap_coherent() is best if I don't care about direct IO.
 > > But if we need **direct I/O**, dma_mmap_coherent() cannot be used because it uses
 > > remap_pfn_range internally, which will set vma to be VM_IO and VM_PFNMAP,
 > > so I think I still have to go back to get struct page from rmem and use
 > > vm_insert_pages to insert pages into vma, right?
 > 
 > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
 > uncacheable will cause data corruption somewhere as well: The direct i/o
 > code expects normal page cache pages, but these are clearly not.

direct I/O just bypasses the page cache, so I think you mean "normal pages"?
At least in my hundreds of attempts on a 512M rmem region, the data doesn't get corrupted: the crc32 of the resulting file
is always correct after direct I/O.
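For reference, the mmap path I mean is roughly the sketch below (simplified, not my actual driver; rmem_phys/rmem_size are placeholder globals for the reserved region, which has to be reserved without "no-map" so struct pages exist for it). Because vm_insert_pages() installs real struct pages instead of raw PFNs, the VMA does not become VM_PFNMAP and the get_user_pages() step of direct I/O can still resolve the pages:

#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/slab.h>

static phys_addr_t rmem_phys;   /* base of the reserved region (placeholder) */
static size_t rmem_size;        /* size of the reserved region (placeholder) */

static int rmem_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long nr = vma_pages(vma);
        unsigned long num = nr;
        struct page **pages;
        unsigned long i;
        int ret;

        if (nr > rmem_size >> PAGE_SHIFT)
                return -EINVAL;

        pages = kvmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* struct pages of the reserved region; only valid without "no-map" */
        for (i = 0; i < nr; i++)
                pages[i] = pfn_to_page(PHYS_PFN(rmem_phys) + i);

        /*
         * Inserts normal struct pages, so the VMA ends up VM_MIXEDMAP rather
         * than VM_PFNMAP and direct I/O on the mapping keeps working.
         */
        ret = vm_insert_pages(vma, vma->vm_start, pages, &num);

        kvfree(pages);
        return ret;
}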
 
 > Also, the coherent DMA API is not actually meant for transferring large
 > amounts of data. 

Agree, that's why I also tried the CMA APIs like cma_alloc()/dma_alloc_from_contiguous(), and they also worked fine.
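For example, the allocation side is roughly the sketch below (it assumes the reserved region is a "shared-dma-pool" attached to the device as its CMA area, and that this code is built in, since cma_alloc()/dma_alloc_from_contiguous() are not necessarily exported to loadable modules):

#include <linux/dma-map-ops.h>
#include <linux/mm.h>

static struct page *rmem_alloc_contig(struct device *dev, size_t size)
{
        size_t count = PAGE_ALIGN(size) >> PAGE_SHIFT;

        /* physically contiguous pages from the device's CMA area */
        return dma_alloc_from_contiguous(dev, count, get_order(size), false);
}

static void rmem_free_contig(struct device *dev, struct page *page, size_t size)
{
        dma_release_from_contiguous(dev, page, PAGE_ALIGN(size) >> PAGE_SHIFT);
}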

 > My guess is that what you are doing here is to use the
 > coherent API to map a large buffer uncached and then try to access the
 > uncached data in user space, which is inherently slow. Using direct I/o
 > appears to solve the problem by not actually using the uncached mapping
 > when sending the data to another device, but this is not the right approach.

My use case is to do direct I/O from this reserved memory to an NVMe drive, and the throughput is good, about
2.3 GB/s, which is almost the same as fio's direct I/O sequential write result (of course, fio uses cached, non-reserved memory).
 
 > Do you have an IOMMU, scatter/gather support or similar to back the
 > device? 

No. My misc char device is simply a pseudo device and has no real hardware behind it.
Our DSP will write raw data into this rmem, but that is another story; we can ignore it here.

 > I think the only way to safely do what you want to achieve in a way
 > that is both safe and efficient would be to use normal page cache pages
 > allocated from user space, ideally using hugepage mappings, and then
 > mapping those into the device using the streaming DMA API to assign
 > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 > and dma_sync_sg_for_{device,cpu}.

Thanks for your advice, but unfortunately the DSP can only write to physically contiguous memory (it has no MMU),
and pages allocated from user space are not physically contiguous.
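(For completeness, my understanding of the streaming path you describe is roughly the sketch below, with illustrative names and using the sg_table helpers; it still yields a scatterlist of non-contiguous pages rather than the single physically contiguous block the DSP needs.)

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int pin_and_map_user_buf(struct device *dev, unsigned long uaddr,
                                size_t len, struct sg_table *sgt,
                                struct page ***pagesp, int *nr_pinned)
{
        unsigned long nr = DIV_ROUND_UP(offset_in_page(uaddr) + len, PAGE_SIZE);
        struct page **pages;
        int pinned, ret;

        pages = kvmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* pin the user pages so they cannot be migrated during DMA */
        pinned = get_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
        if (pinned < 0) {
                kvfree(pages);
                return pinned;
        }
        if (pinned != nr) {
                ret = -EFAULT;
                goto err_put;
        }

        ret = sg_alloc_table_from_pages(sgt, pages, pinned,
                                        offset_in_page(uaddr), len, GFP_KERNEL);
        if (ret)
                goto err_put;

        ret = dma_map_sgtable(dev, sgt, DMA_FROM_DEVICE, 0);
        if (ret)
                goto err_free_table;

        /* bracket each transfer with dma_sync_sgtable_for_device()/_for_cpu() */
        *pagesp = pages;
        *nr_pinned = pinned;
        return 0;

err_free_table:
        sg_free_table(sgt);
err_put:
        while (pinned--)
                put_page(pages[pinned]);
        kvfree(pages);
        return ret;
}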

Regards,
Li



