Why doesn't dma_alloc_coherent return a direct-mapped vaddr?

Li Chen me at linux.beauty
Sun Jul 24 19:50:02 PDT 2022


Hi Arnd,
 ---- On Fri, 22 Jul 2022 20:06:35 +0900  Arnd Bergmann <arnd at arndb.de> wrote --- 
 > On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me at linux.beauty> wrote:
 > >  ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd at arndb.de> wrote ---
 > >  >
 > >  > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
 > >  > uncacheable will cause data corruption somewhere as well: The direct i/o
 > >  > code expects normal page cache pages, but these are clearly not.
 > >
 > > direct I/O just bypasses page cache, so I think you want to say "normal pages"?
 > 
 > All normal memory available to user space is in the page cache. 

Just want to make sure: does "all normal memory available to user space" include memory that comes
from functions like malloc? If so, I think those pages are not in the page cache. malloc will invoke mmap, and then:
sys_mmap()
└→ do_mmap_pgoff()
   └→ mmap_region()
      ├→ generic_file_mmap()      // file-backed mapping path
      └→ vma_set_anonymous(vma);  // anonymous mapping path (no file)

IIUC, the mmap done by malloc sets the vma to be anonymous, and those pages are not "page cache pages" because
they don't have a file as the backing store.
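
For example, a large malloc() ends up as something like this anonymous mapping (user-space sketch,
no file descriptor involved, so nothing for the page cache to track):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 << 20;	/* 64 MiB, well above glibc's mmap threshold */

	/* A large malloc() boils down to roughly this anonymous mmap: */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* These pages are anonymous: no file backing, so not page cache pages. */
	memset(buf, 0, len);

	munmap(buf, len);
	return 0;
}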

Please correct me if I am missing something.

 > What you bypass with direct I/O is just the copy into another page cache page.

 > >  > Also, the coherent DMA API is not actually meant for transferring large
 > >  > amounts of data.
 > >
 > >  Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contiguous
 > > and they also worked fine.
 > 
 > Those two interfaces just return a 'struct page', so if you convert them into
 > a kernel pointer or map them into user space, you get a cacheable mapping.
 > Is that what you do? 

Yes.
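
To be concrete, the driver side is roughly this simplified sketch (names like rmem_alloc/rmem_mmap are
made up here and error handling is trimmed); since we never touch vma->vm_page_prot, the user mapping
stays cacheable:

#include <linux/dma-map-ops.h>	/* dma_alloc_from_contiguous() */
#include <linux/mm.h>
#include <linux/sizes.h>

static struct page *rmem_pages;
static size_t rmem_size = SZ_64M;	/* example size */

/* Grab a physically contiguous buffer from CMA for the DSP. */
static int rmem_alloc(struct device *dev)
{
	rmem_pages = dma_alloc_from_contiguous(dev, rmem_size >> PAGE_SHIFT,
					       get_order(rmem_size), false);
	return rmem_pages ? 0 : -ENOMEM;
}

/* mmap() handler of the misc device: map the CMA pages into user space.
 * vma->vm_page_prot is left untouched, so the mapping is cacheable. */
static int rmem_mmap(struct file *file, struct vm_area_struct *vma)
{
	return remap_pfn_range(vma, vma->vm_start, page_to_pfn(rmem_pages),
			       vma->vm_end - vma->vm_start, vma->vm_page_prot);
}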

 > If so, then your device appears to be cache coherent
 > with the CPU, and you can just mark it as coherent in the devicetree.

Our DSP is not a cache-coherent device. There is no CCI to manage cache coherence on any of our SoCs, so none of
our peripherals are cache coherent. Given that, it's not a good idea to use the CMA alloc API to allocate
cached pages, right?
 
 > >  > My guess is that what you are doing here is to use the
 > >  > coherent API to map a large buffer uncached and then try to access the
 > >  > uncached data in user space, which is inherently slow. Using direct I/o
 > >  > appears to solve the problem by not actually using the uncached mapping
 > >  > when sending the data to another device, but this is not the right approach.
 > >
 > > My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about
 > > 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory).
 > >
 > >  > Do you have an IOMMU, scatter/gather support or similar to back the
 > >  > device?
 > >
 > > No. My misc char device is simply a pseudo device and has no real hardware.
 > > Our DSP will write raw data to this rmem, but that is another story; we can ignore it here.
 > 
 > It is the DSP that I'm talking about here, this is what makes all the
 > difference.
 > If the DSP is cache coherent and you mark it that way in DT, then everything
 > just becomes fast, and you don't have to use direct I/O. If the DSP is not
 > cache coherent, but you can program it to write into arbitrary memory page
 > cache pages allocated from user space, then you can use the streaming
 > mapping interface that does explicit cache management. This is of course
 > not as fast as coherent hardware, but it also allows accessing the data
 > through the CPU cache later.
 
I'm afraid buffered I/O also cannot meet our throughput requirement. From fio results on our NVMe, buffered I/O can only
reach around 500 MB/s, while direct I/O reaches 2.3 GB/s. fio shows the CPU at nearly 100% during buffered I/O, and
perf shows copy_from_user and spinlocks taking most of the CPU time, so CPU performance is the bottleneck.
I think even if the cache is involved, buffered I/O throughput still cannot get much faster; we simply have too much
raw data to write and read.
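
For reference, the user-space side of our direct I/O path looks roughly like this (sketch only; /dev/my_rmem
and the sizes are placeholders, and O_DIRECT requires the buffer, offset and length to be properly aligned):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64 << 20;

	/* Map the reserved memory exposed by our misc char device. */
	int rmem_fd = open("/dev/my_rmem", O_RDWR);
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 rmem_fd, 0);

	/* O_DIRECT bypasses the page cache, so there is no copy of the
	 * data into page cache pages on the write path. */
	int out_fd = open("/mnt/nvme/out.bin",
			  O_WRONLY | O_CREAT | O_DIRECT, 0644);

	if (buf != MAP_FAILED && out_fd >= 0)
		write(out_fd, buf, len);

	return 0;
}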

 > >  > I think the only way to safely do what you want to achieve in a way
 > >  > that is both safe and efficient would be to use normal page cache pages
 > >  > allocated from user space, ideally using hugepage mappings, and then
 > >  > mapping those into the device using the streaming DMA API to assign
 > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 > >  > and dma_sync_sg_for_{device,cpu}.
 > >
 > > Thanks for your advice, but unfortunately, dsp can only write to contiguous
 > > physical memory(it doesn't know MMU), and pages allocated from
 > > userspace are not contiguous on physical memory.
 > 
 > Usually what you can do with a DSP is that it can run user-provided
 > software, so if you can pass it a scatter-gather list for the output data
 > in addition to the buffer that it uses for its code and intermediate
 > buffers. If the goal is to store this data in a file, you can even go as far
 > as calling mmap() on the file, and then letting the driver get the page
 > cache pages backing the file mapping, and then relying on the normal
 > file system writeback to store the data to disk.

Our DSP doesn't support scatter-gather lists.
What does "intermediate buffers" mean? 
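
Just to check that I understand the streaming approach you describe, it would be something like the sketch
below (our DSP cannot consume such a scatter-gather list today, so this is only to confirm my understanding;
stream_user_buf is a made-up name):

#include <linux/dma-mapping.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Pin a user buffer and hand it to a non-coherent DMA master via the
 * streaming DMA API, with explicit cache management. */
static int stream_user_buf(struct device *dev, unsigned long uaddr, size_t len)
{
	unsigned int nr_pages = DIV_ROUND_UP(len, PAGE_SIZE);
	struct page **pages;
	struct sg_table sgt;
	int pinned, nents, ret = 0;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	pinned = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (pinned != nr_pages) {
		ret = -EFAULT;
		goto out_put;
	}

	ret = sg_alloc_table_from_pages(&sgt, pages, nr_pages, 0, len, GFP_KERNEL);
	if (ret)
		goto out_put;

	nents = dma_map_sg(dev, sgt.sgl, sgt.orig_nents, DMA_FROM_DEVICE);
	if (!nents) {
		ret = -EIO;
		goto out_sg;
	}

	/* ... program the device to write into the sg list and wait ... */

	/* Hand the buffer back to the CPU so its caches see the device's writes. */
	dma_sync_sg_for_cpu(dev, sgt.sgl, sgt.orig_nents, DMA_FROM_DEVICE);

	dma_unmap_sg(dev, sgt.sgl, sgt.orig_nents, DMA_FROM_DEVICE);
out_sg:
	sg_free_table(&sgt);
out_put:
	while (pinned > 0)
		put_page(pages[--pinned]);
	kvfree(pages);
	return ret;
}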

Regards,
Li


