Why dma_alloc_coherent don't return direct mapped vaddr?
Li Chen
me at linux.beauty
Mon Jul 25 04:06:02 PDT 2022
---- On Mon, 25 Jul 2022 16:03:30 +0900 Arnd Bergmann <arnd at arndb.de> wrote ---
> On Mon, Jul 25, 2022 at 4:50 AM Li Chen <me at linux.beauty> wrote:
> > ---- On Fri, 22 Jul 2022 20:06:35 +0900 Arnd Bergmann <arnd at arndb.de> wrote ---
> > > On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me at linux.beauty> wrote:
> > > > ---- On Fri, 22 Jul 2022 17:06:36 +0800 Arnd Bergmann <arnd at arndb.de> wrote ---
> > > > >
> > > > > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
> > > > > uncacheable will cause data corruption somewhere as well: The direct i/o
> > > > > code expects normal page cache pages, but these are clearly not.
> > > >
> > > > direct I/O just bypasses page cache, so I think you want to say "normal pages"?
> > >
> > > All normal memory available to user space is in the page cache.
> >
> > Just want to make sure that "all normal memory available to user space" come from functions
> > like malloc? If so, I think they are not in the page cache. malloc will invoke mmap, then:
> > sys_mmap()
> > └→ do_mmap_pgoff()
> > └→ mmap_region()
> > └→ generic_file_mmap() // file mapping, then
> > └→ vma_set_anonymous(vma); // anon vma path
> >
> > IIUC, mmap coming from malloc set vma to be anonymous, and are not "page cache pages" because they
> > don't have files as the backing stores.
> >
> > Please correct me if something I am missing.
>
> I think both anonymous user space pages and file backed pages are commonly
> considered 'page cache'. Anonymous memory is eventually backed by swap space,
> which is similar to but not the same here.
>
> When I wrote 'page cache', I meant both of these, as opposed to memory allocated
> by a kernel driver.
Gotcha.
> > > What you bypass with direct I/O is just the copy into another page cache page.
> >
> > > > > Also, the coherent DMA API is not actually meant for transferring large
> > > > > amounts of data.
> > > >
> > > > Agree, that's why I also tried cma API like cma_alloc/dma_alloc_from_contiguous
> > > > and they also worked fine.
> > >
> > > Those two interfaces just return a 'struct page', so if you convert them into
> > > a kernel pointer or map them into user space, you get a cacheable mapping.
> > > Is that what you do?
> >
> > Yes.
> >
> > > If so, then your device appears to be cache coherent
> > > with the CPU, and you can just mark it as coherent in the devicetree.
> >
> > Our DSP is not a cache coherent device, there is no CCI to manage cache coherence on all our SoCs, so
> > all of our peripherals are not cache coherent devices, so it's not a good idea to use cma alloc api to allocate
> > cached pages, right?
>
> Using CMA or not is not the problem here, what you have to do for correctness
> is to use the same mapping type in every place that maps the pages into
> a page table. The two options you have are:
>
> - Using uncached mappings from dma_alloc_coherent() in combination
> with dma_mmap_coherent(). You cannot use direct I/O on these, and
> any access through a pointer is slow.
Yes, very slow, around 300-500MB/s.
> - Using cached mappings from anywhere, and then flushing the caches
> during ownership transfers with dma_map_sg()/dma_unmap_sg()/
> dma_sync_sg_for_cpu()/dma_sync_sg_for_device().
We only set up the physical address and other configuration, then send a write command to the DSP;
we do not use the kernel dmaengine API.
So our Linux DSP driver does not know whether the DSP uses a DMA controller or something else to transfer
the data. The userspace app queries the driver before writing to the file, and on that "query" we invalidate
the cache if the memory region is cacheable.

To learn how caching affects buffered I/O throughput, I allocated cacheable pages via the CMA API dma_alloc_contiguous(),
then mapped the region into userspace via vm_insert_pages(), so they are still cacheable page frames.
But the throughput is still as low as with uncached dma_alloc_coherent() memory, both around 300-500MB/s, much
slower than the direct I/O throughput.
That seems odd to me.
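
For reference, the two userspace write paths being compared look roughly like the sketch below. It is only an illustration, not our actual test code: the name write_buf(), the 4096-byte alignment, the fill pattern and the fallback to buffered I/O when the filesystem rejects O_DIRECT are all assumptions of mine, and the buffer here is ordinary heap memory rather than our reserved region.

```c
#define _GNU_SOURCE             /* needed for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 4096              /* assumed to satisfy the O_DIRECT alignment rules */

/*
 * Write len bytes from an aligned buffer to path.
 * use_direct != 0 asks for O_DIRECT, i.e. no copy into the page cache.
 * If the filesystem refuses O_DIRECT (EINVAL) we fall back to buffered I/O.
 * Returns the number of bytes written, or -1 on error.
 */
static ssize_t write_buf(const char *path, size_t len, int use_direct)
{
    int flags = O_WRONLY | O_CREAT | O_TRUNC;
    ssize_t total = 0;
    void *buf = NULL;
    int fd;

    if (len % ALIGN != 0)       /* O_DIRECT wants aligned transfer sizes */
        return -1;
    if (posix_memalign(&buf, ALIGN, len) != 0)
        return -1;
    memset(buf, 0xA5, len);

    fd = open(path, flags | (use_direct ? O_DIRECT : 0), 0600);
    if (fd < 0 && use_direct && errno == EINVAL) {
        use_direct = 0;         /* O_DIRECT not supported here */
        fd = open(path, flags, 0600);
    }
    if (fd < 0) {
        free(buf);
        return -1;
    }

    while ((size_t)total < len) {
        ssize_t n = write(fd, (char *)buf + total, len - total);

        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && errno == EINVAL && use_direct) {
            /* some filesystems only reject O_DIRECT at write time */
            fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_DIRECT);
            use_direct = 0;
            continue;
        }
        if (n <= 0) {
            total = -1;
            break;
        }
        total += n;
    }

    close(fd);
    free(buf);
    return total;
}
```

Calling write_buf(path, len, 0) goes through copy_from_user() into the page cache, while write_buf(path, len, 1) hands the user pages to the block layer directly, which is the path that reaches about 2.3GB/s for us.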
> > > > > My guess is that what you are doing here is to use the
> > > > > coherent API to map a large buffer uncached and then try to access the
> > > > > uncached data in user space, which is inherently slow. Using direct I/o
> > > > > appears to solve the problem by not actually using the uncached mapping
> > > > > when sending the data to another device, but this is not the right approach.
> > > >
> > > > My case is to do direct IO from this reserved memory to NVME, and the throughput is good, about
> > > > 2.3GB/s, which is almost the same as fio's direct I/O seq write result (Of course, fio uses cached and non-reserved memory).
> > > >
> > > > > Do you have an IOMMU, scatter/gather support or similar to back the
> > > > > device?
> > > >
> > > > No. My misc char device is simply a pseudo device and has no real hardware.
> > > > Our dsp will write raw data to this rmem, but that is another story, we can ignore it here.
> > >
> > > It is the DSP that I'm talking about here, this is what makes all the
> > > difference.
> > > If the DSP is cache coherent and you mark it that way in DT, then everything
> > > just becomes fast, and you don't have to use direct I/O. If the DSP is not
> > > cache coherent, but you can program it to write into arbitrary memory page
> > > cache pages allocated from user space, then you can use the streaming
> > > mapping interface that does explicit cache management. This is of course
> > > not as fast as coherent hardware, but it also allows accessing the data
> > > through the CPU cache later.
> >
> > I'm afraid buffered IO also cannot meet our throughput. From FIO results on our NVME, buffered I/O can only
> > reach around 500MB/s, while direct I/O can reach 2.3GB/s. FIO tells me CPU is nearly 100% when doing buffered I/O, so
> > I use perf to monitor functions and find copy_from_user and spinlock hog most CPU, our CPU performance is the bottleneck.
> > I think even if the cache is involved, buffered I/O throughput also cannot get much faster, we have too much
> > raw data to write and read.
>
> copy_from_user() is particularly slow on uncached data. What throughput do you
> get if you mmap() /dev/null and write that to a file?
mmap() on /dev/null fails with "No such device" (ENODEV), although the device node exists:
# ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Nov 14 17:34 /dev/null
Per https://stackoverflow.com/a/40300651/6949852, /dev/null cannot be mmap-ed.
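
A minimal sketch of what I tried, plus a possible substitute. I am assuming here that a private mapping of /dev/zero is an acceptable stand-in for the cached source buffer you have in mind; that is my guess, not something confirmed in this thread. The helper names try_mmap_dev() and write_zero_mapping() are made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Try to mmap() len bytes of the device node at path.
 * Returns 0 on success, otherwise the errno from open() or mmap().
 */
static int try_mmap_dev(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    int err = 0;
    void *p;

    if (fd < 0)
        return errno;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        err = errno;
    else
        munmap(p, len);
    close(fd);
    return err;
}

/*
 * Map len bytes of /dev/zero (ordinary cacheable memory) and write them
 * to out_path with buffered I/O. Returns bytes written, or -1 on error.
 */
static ssize_t write_zero_mapping(const char *out_path, size_t len)
{
    ssize_t total = 0;
    char *src;
    int out;
    int zfd = open("/dev/zero", O_RDWR);

    if (zfd < 0)
        return -1;
    src = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, zfd, 0);
    close(zfd);                 /* the mapping stays valid after close() */
    if (src == MAP_FAILED)
        return -1;

    out = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        munmap(src, len);
        return -1;
    }
    while ((size_t)total < len) {
        ssize_t n = write(out, src + total, len - total);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) {
            total = -1;
            break;
        }
        total += n;
    }
    close(out);
    munmap(src, len);
    return total;
}
```

try_mmap_dev("/dev/null", 4096) is the call that gives me ENODEV. Timing write_zero_mapping() against the same write from our reserved region should show how much of the buffered I/O cost comes from the source memory being uncached.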
> > > > > I think the only way to safely do what you want to achieve in way
> > > > > that is both safe and efficient would be to use normal page cache pages
> > > > > allocated from user space, ideally using hugepage mappings, and then
> > > > > mapping those into the device using the streaming DMA API to assign
> > > > > them to the DMA master with get_user_pages_fast()/dma_map_sg()
> > > > > and dma_sync_sg_for_{device,cpu}.
> > > >
> > > > Thanks for your advice, but unfortunately, dsp can only write to contiguous
> > > > physical memory(it doesn't know MMU), and pages allocated from
> > > > userspace are not contiguous on physical memory.
> > >
> > > Usually what you can do with a DSP is that it can run user-provided
> > > software, so if you can pass it a scatter-gather list for the output data
> > > in addition to the buffer that it uses for its code and intermediate
> > > buffers. If the goal is to store this data in a file, you can even go as far
> > > as calling mmap() on the file, and then letting the driver get the page
> > > cache pages backing the file mapping, and then relying on the normal
> > > file system writeback to store the data to disk.
> >
> > Our DSP doesn't support scatter-gather lists.
> > What does "intermediate buffers" mean?
>
> I meant using a statically allocated memory area at a fixed location for
> whatever the DSP does internally, and then copying it to the page cache
> page that gets written to disk using the DSP to avoid any extra copies
> on the CPU side.
>
> This obviously involves changes to the interface that the DSP program
> uses.
Looks promising, but our DSP doesn't support sg lists :-(
So it seems this is not possible in our case. Do you know of any open-source
DSP driver that uses such a DMA sg-list and/or copies its data to a "page cache page"?
Thanks.
Regards,
Li
More information about the linux-arm-kernel mailing list