Why doesn't dma_alloc_coherent return a direct-mapped vaddr?

Li Chen me at linux.beauty
Mon Jul 25 23:50:57 PDT 2022


 ---- On Mon, 25 Jul 2022 20:45:10 +0900  Arnd Bergmann <arnd at arndb.de> wrote --- 
 > On Mon, Jul 25, 2022 at 1:06 PM Li Chen <me at linux.beauty> wrote:
 > >  ---- On Mon, 25 Jul 2022 16:03:30 +0900  Arnd Bergmann <arnd at arndb.de> wrote ---
 > >  > - Using cached mappings from anywhere, and then flushing the caches
 > >  >    during ownership transfers with dma_map_sg()/dma_unmap_sg()/
 > >  >    dma_sync_sg_for_cpu()/dma_sync_sg_for_device().
 > >
 > > We just set up the phy addr and other configuration, then send a write command to the DSP instead of
 > > using the kernel dmaengine API.
 > > So, our Linux DSP driver doesn't know whether the DSP uses a DMA controller or something else to transfer
 > > data. The userspace app will query before writing to the file, and we invalidate the cache at "query" time
 > > if the memory region is cacheable memory.
 > >
 > > To learn how the cache affects buffered I/O throughput, I tried allocating cached mappings via the CMA API dma_alloc_contiguous,
 > > then mmap'ed this region to userspace via vm_insert_pages, so the page frames are still cacheable.
 > > But the throughput is still as low as with dma_alloc_coherent non-cached memory, both around 300-500 MB/s, much
 > > slower than direct I/O throughput.
 > > Doesn't that seem weird?
 > 
 > I'm not sure what is actually meant to happen when you have both cacheable
 > and uncached mappings for the same data; this is not all that well defined, and
 > it may be that you end up just getting uncached data in the end. You clearly
 > either get a cache miss here, or stale data, and either way is not good.
 
It's not the same data mapped both cacheable and uncached.
I swapped the kernel image between the two tests (one kernel driver uses dma_alloc_contiguous, the other uses dma_alloc_coherent),
so the pages should be either cacheable or non-cacheable, never both at once.
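For reference, the cached-mapping test driver does roughly the following (a minimal sketch rather than the real code; my_dev stands in for the driver's struct device, and the error unwinding is abbreviated):

static int my_mmap_cached(struct file *file, struct vm_area_struct *vma)
{
	size_t size = vma->vm_end - vma->vm_start;
	unsigned long i, nr = size >> PAGE_SHIFT, left = nr;
	struct page *first, **pages;
	int ret;

	/* cacheable, physically contiguous pages from CMA */
	first = dma_alloc_contiguous(my_dev, size, GFP_KERNEL);
	if (!first)
		return -ENOMEM;

	pages = kmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages) {
		dma_free_contiguous(my_dev, first, size);
		return -ENOMEM;
	}
	for (i = 0; i < nr; i++)
		pages[i] = first + i;

	/* map the same cacheable page frames into userspace */
	ret = vm_insert_pages(vma, vma->vm_start, pages, &left);

	kfree(pages);
	return ret;
}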

 > >  > copy_from_user() is particularly slow on uncached data. What throughput do you
 > >  > get if you mmap() /dev/null and write that to a file?
 > >
 > > mmap of /dev/null returns "No such device", but I do have this device:
 > > # ls -l /dev/null
 > > crw-rw-rw-    1 root     root        1,   3 Nov 14 17:34 /dev/null
 > >
 > > Per https://stackoverflow.com/a/40300651/6949852, /dev/null is not allowed to be mmap-ed.
 > 
 > Sorry, I meant /dev/zero (or any other normal memory really).
 
mmap'ing /dev/zero and then doing a buffered-I/O write to a file still gives very slow throughput.
Are the pages backing this zero mapping cacheable?

From its mmap implementation:

static int mmap_zero(struct file *file, struct vm_area_struct *vma)
{
#ifndef CONFIG_MMU
	return -ENOSYS;
#endif
	if (vma->vm_flags & VM_SHARED)
		return shmem_zero_setup(vma);
	vma_set_anonymous(vma);
	return 0;
}

I tried both MAP_PRIVATE and MAP_SHARED; both still give slow throughput, with no
notable difference.
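For reference, this is roughly the test I ran (simplified, with no error checking; the output path and buffer size here are just examples):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define BUF_SIZE (256UL << 20)		/* 256 MiB */

int main(void)
{
	int zfd = open("/dev/zero", O_RDWR);
	int out = open("/tmp/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE, zfd, 0);	/* also tried MAP_SHARED */
	struct timespec t0, t1;
	double secs;

	memset(buf, 0xab, BUF_SIZE);		/* fault the pages in first */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	write(out, buf, BUF_SIZE);		/* buffered I/O, no O_DIRECT */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", BUF_SIZE / secs / 1e6);
	return 0;
}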

 > >  > >  > >  > I think the only way to safely do what you want to achieve in a way
 > >  > >  > >  > that is both safe and efficient would be to use normal page cache pages
 > >  > >  > >  > allocated from user space, ideally using hugepage mappings, and then
 > >  > >  > >  > mapping those into the device using the streaming DMA API to assign
 > >  > >  > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
 > >  > >  > >  > and dma_sync_sg_for_{device,cpu}.
 > >  > >  > >
 > >  > >  > > Thanks for your advice, but unfortunately, the DSP can only write to contiguous
 > >  > >  > > physical memory (it doesn't know about the MMU), and pages allocated from
 > >  > >  > > userspace are not physically contiguous.
 > >  > >  >
 > >  > >  > Usually what you can do with a DSP is that it can run user-provided
 > >  > >  > software, so you can pass it a scatter-gather list for the output data
 > >  > >  > in addition to the buffer that it uses for its code and intermediate
 > >  > >  > buffers. If the goal is to store this data in a file, you can even go as far
 > >  > >  > as calling mmap() on the file, and then letting the driver get the page
 > >  > >  > cache pages backing the file mapping, and then relying on the normal
 > >  > >  > file system writeback to store the data to disk.
 > >  > >
 > >  > > Our DSP doesn't support scatter-gather lists.
 > >  > > What does "intermediate buffers" mean?
 > >  >
 > >  > I meant using a statically allocated memory area at a fixed location for
 > >  > whatever the DSP does internally, and then copying it to the page cache
 > >  > page that gets written to disk using the DSP to avoid any extra copies
 > >  > on the CPU side.
 > >  >
 > >  > This obviously involves changes to the interface that the DSP program
 > >  > uses.
 > >
 > > Looks promising, but our DSP doesn't support sg lists :-(
 > > So it seems impossible in our case. Do you know of any open-source
 > > DSP driver that uses such a DMA sg-list and/or copies its data to "page cache" pages?
 > 
 > I don't recall any other driver that uses the page cache to write into
 > a file-backed
 > mapping, but you can search the kernel sources for drivers that use
 > pin_user_pages() or a related function to get access to the user address space
 > and extract a page number of that to pass into a hardware buffer.

So, IIUC, this solution consists of the following steps:
step 1. Allocate normal memory from userspace, e.g. with malloc.
step 2. Find the pages backing the malloc'ed memory, pin them with a pin_user_pages* function, then pass this virtually contiguous but physically non-contiguous memory (so sg is needed) to the DSP for writing.
step 3. Unpin with unpin_user_pages*, then let the filesystem's writeback queue write the anonymous "page cache" back to files on disk.

Are these steps all right?
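
In other words, something like this rough sketch on the kernel side (the names such as pin_and_map_user_buf are made up, and error handling is omitted)?

static int pin_and_map_user_buf(struct device *dev, unsigned long user_va,
				size_t len, struct sg_table *sgt,
				struct page ***pagesp, long *npagesp)
{
	long npages = DIV_ROUND_UP(len, PAGE_SIZE);
	struct page **pages;
	long pinned;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);

	/* step 2: pin the (physically scattered) pages behind the malloc'ed buffer */
	pinned = pin_user_pages_fast(user_va, npages,
				     FOLL_WRITE | FOLL_LONGTERM, pages);

	/* build an sg list and hand ownership to the DMA master */
	sg_alloc_table_from_pages(sgt, pages, pinned, 0, len, GFP_KERNEL);
	dma_map_sgtable(dev, sgt, DMA_FROM_DEVICE, 0);

	*pagesp = pages;
	*npagesp = pinned;
	return 0;
}

static void unpin_and_unmap_user_buf(struct device *dev, struct sg_table *sgt,
				     struct page **pages, long npages)
{
	/* hand ownership back to the CPU once the DSP has finished writing */
	dma_unmap_sgtable(dev, sgt, DMA_FROM_DEVICE, 0);
	sg_free_table(sgt);

	/* step 3: mark the pages dirty and unpin; writeback does the rest */
	unpin_user_pages_dirty_lock(pages, npages, true);
	kvfree(pages);
}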

If they are, I have some noob questions:
for step 2: how can I find the page frames allocated by malloc? Do I need to hook brk/mmap to track them?
for step 2: if sg is not supported (we don't drive the DSP's DMA controller directly, we just send it a command, a physical address, etc.), is there any other way to do it? This is important; otherwise
                   I still have to reserve physically contiguous memory for the DSP to write into.
for step 3: I haven't seen a trivial way to do it. These are still anonymous "page cache" pages with no file as backing store (of course swap is their backing store, but
                  we want them written back to real files instead of swap), so it's a little tricky: how do I make the target files the backing store for these anonymous "page cache" pages?
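
Also for step 3: would it instead work to mmap() the target file itself from userspace (as you suggested earlier), so that the pages the driver pins are already file-backed page cache pages rather than anonymous memory? Roughly (just a sketch, no error handling; the path and size are made up):

	int fd = open("/data/out.bin", O_RDWR | O_CREAT, 0644);
	ftruncate(fd, BUF_SIZE);		/* size the file before mapping it */
	void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);	/* file-backed page cache pages */
	/*
	 * pass buf to the driver, which pins these pages and points the DSP
	 * at them; after unpinning them dirty, normal writeback flushes the
	 * data to /data/out.bin without an extra CPU copy
	 */

Is that the right direction?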

Regards,
Li


