Why doesn't dma_alloc_coherent return a direct-mapped vaddr?

Arnd Bergmann arnd at arndb.de
Mon Jul 25 00:03:30 PDT 2022


On Mon, Jul 25, 2022 at 4:50 AM Li Chen <me at linux.beauty> wrote:
>  ---- On Fri, 22 Jul 2022 20:06:35 +0900  Arnd Bergmann <arnd at arndb.de> wrote ---
>  > On Fri, Jul 22, 2022 at 12:31 PM Li Chen <me at linux.beauty> wrote:
>  > >  ---- On Fri, 22 Jul 2022 17:06:36 +0800  Arnd Bergmann <arnd at arndb.de> wrote ---
>  > >  >
>  > >  > I'm not entirely sure, but I suspect that direct I/O on pages that are mapped
>  > >  > uncacheable will cause data corruption somewhere as well: the direct I/O
>  > >  > code expects normal page cache pages, but these are clearly not.
>  > >
>  > > direct I/O just bypasses the page cache, so I think you want to say "normal pages"?
>  >
>  > All normal memory available to user space is in the page cache.
>
> Just want to make sure: does "all normal memory available to user space" mean memory coming from
> functions like malloc? If so, I think those pages are not in the page cache. malloc will invoke mmap, then:
> sys_mmap()
> └→ do_mmap_pgoff()
>    └→ mmap_region()
>       └→ generic_file_mmap() // file mapping, then
>       └→ vma_set_anonymous(vma); // anon vma path
>
> IIUC, mmap calls coming from malloc set the vma to be anonymous, and those pages are not "page cache
> pages" because they don't have files as their backing store.
>
> Please correct me if I am missing something.

I think both anonymous user space pages and file-backed pages are commonly
considered 'page cache'. Anonymous memory is eventually backed by swap space,
which is similar to, but not quite the same as, a file backing.

When I wrote 'page cache', I meant both of these, as opposed to memory allocated
by a kernel driver.

>  > What you bypass with direct I/O is just the copy into another page cache page.
>
>  > >  > Also, the coherent DMA API is not actually meant for transferring large
>  > >  > amounts of data.
>  > >
>  > >  Agreed, that's why I also tried CMA APIs like cma_alloc()/dma_alloc_from_contiguous(),
>  > > and they also worked fine.
>  >
>  > Those two interfaces just return a 'struct page', so if you convert them into
>  > a kernel pointer or map them into user space, you get a cacheable mapping.
>  > Is that what you do?
>
> Yes.
>
>  > If so, then your device appears to be cache coherent
>  > with the CPU, and you can just mark it as coherent in the devicetree.
>
> Our DSP is not a cache-coherent device; there is no CCI to manage cache coherence on any of our SoCs,
> so none of our peripherals are cache coherent. So it's not a good idea to use the CMA alloc API to
> allocate cached pages, right?

Using CMA or not is not the problem here; what you have to do for correctness
is to use the same mapping type in every place that maps the pages into
a page table. The two options you have are the following, with a sketch
of the first one after the list:

- Using uncached mappings from dma_alloc_coherent() in combination
  with dma_mmap_coherent(). You cannot use direct I/O on these, and
  any access through a pointer is slow.

- Using cached mappings from anywhere, and then flushing the caches
  during ownership transfers with dma_map_sg()/dma_unmap_sg()/
  dma_sync_sg_for_cpu()/dma_sync_sg_for_device().
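
As a minimal sketch of the first option (the my_dev structure and
my_mmap() are hypothetical; only dma_alloc_coherent()/dma_mmap_coherent()
are the real API), the driver's mmap() file operation could look like:

#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* hypothetical per-device state, filled in at probe time */
struct my_dev {
        struct device *dev;
        void *cpu_addr;         /* from dma_alloc_coherent() */
        dma_addr_t dma_handle;
        size_t size;
};

static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct my_dev *md = file->private_data;

        /*
         * Give user space the same uncached mapping type that the
         * kernel mapping uses, which is the consistency requirement
         * described above.
         */
        return dma_mmap_coherent(md->dev, vma, md->cpu_addr,
                                 md->dma_handle, md->size);
}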

>  > >  > My guess is that what you are doing here is to use the
>  > >  > coherent API to map a large buffer uncached and then try to access the
>  > >  > uncached data in user space, which is inherently slow. Using direct I/O
>  > >  > appears to solve the problem by not actually using the uncached mapping
>  > >  > when sending the data to another device, but this is not the right approach.
>  > >
>  > > My case is to do direct I/O from this reserved memory to NVMe, and the throughput is good, about
>  > > 2.3 GB/s, which is almost the same as fio's direct I/O sequential write result (of course, fio uses cached, non-reserved memory).
>  > >
>  > >  > Do you have an IOMMU, scatter/gather support or similar to back the
>  > >  > device?
>  > >
>  > > No. My misc char device is simply a pseudo device and has no real hardware.
>  > > Our DSP will write raw data to this rmem, but that is another story; we can ignore it here.
>  >
>  > It is the DSP that I'm talking about here; this is what makes all the
>  > difference.
>  > If the DSP is cache coherent and you mark it that way in DT, then everything
>  > just becomes fast, and you don't have to use direct I/O. If the DSP is not
>  > cache coherent, but you can program it to write into arbitrary page cache
>  > pages allocated from user space, then you can use the streaming
>  > mapping interface that does explicit cache management. This is of course
>  > not as fast as coherent hardware, but it also allows accessing the data
>  > through the CPU cache later.
>
> I'm afraid buffered I/O also cannot meet our throughput requirements. From fio results on our NVMe,
> buffered I/O can only reach around 500 MB/s, while direct I/O can reach 2.3 GB/s. fio tells me the CPU
> is nearly 100% busy when doing buffered I/O, so I used perf to profile it and found that copy_from_user
> and spinlocks hog most of the CPU; CPU performance is the bottleneck. I think even with the cache
> involved, buffered I/O throughput cannot get much faster; we have too much raw data to write and read.

copy_from_user() is particularly slow on uncached data. What throughput do you
get if you mmap() /dev/zero and write that to a file?
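
For reference, a rough user space sketch of that measurement; the output
file name, the buffer size and the total size are arbitrary:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define BUF_SIZE (64UL << 20)   /* 64 MiB per write() */
#define TOTAL    (2UL << 30)    /* 2 GiB in total */

int main(void)
{
        /* zero pages are normal cached memory, unlike an uncached
         * dma_alloc_coherent() mapping */
        int zfd = open("/dev/zero", O_RDONLY);
        void *buf = mmap(NULL, BUF_SIZE, PROT_READ, MAP_PRIVATE, zfd, 0);
        int out = open("out.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct timespec t0, t1;
        size_t done = 0;

        if (zfd < 0 || buf == MAP_FAILED || out < 0)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        while (done < TOTAL) {
                ssize_t n = write(out, buf, BUF_SIZE);
                if (n <= 0)
                        return 1;
                done += n;
        }
        fsync(out);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.1f MB/s\n", done /
               ((t1.tv_sec - t0.tv_sec) +
                (t1.tv_nsec - t0.tv_nsec) / 1e9) / 1e6);
        return 0;
}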

>  > >  > I think the only way to do what you want to achieve in a way
>  > >  > that is both safe and efficient would be to use normal page cache pages
>  > >  > allocated from user space, ideally using hugepage mappings, and then
>  > >  > mapping those into the device using the streaming DMA API to assign
>  > >  > them to the DMA master with get_user_pages_fast()/dma_map_sg()
>  > >  > and dma_sync_sg_for_{device,cpu}.
>  > >
>  > > Thanks for your advice, but unfortunately, the DSP can only write to contiguous
>  > > physical memory (it doesn't know about the MMU), and pages allocated from
>  > > userspace are not physically contiguous.
>  >
>  > Usually a DSP can run user-provided software, so you may be able to pass
>  > it a scatter-gather list for the output data in addition to the buffer
>  > that it uses for its code and intermediate
>  > buffers. If the goal is to store this data in a file, you can even go as far
>  > as calling mmap() on the file, and then letting the driver get the page
>  > cache pages backing the file mapping, and then relying on the normal
>  > file system writeback to store the data to disk.
>
> Our DSP doesn't support scatter-gather lists.
> What does "intermediate buffers" mean?

I meant using a statically allocated memory area at a fixed location for
whatever the DSP does internally, and then using the DSP itself to copy
the result into the page cache page that gets written to disk, to avoid
any extra copies on the CPU side.

This obviously involves changes to the interface that the DSP program
uses.
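
For reference, the kernel-side part of the flow described earlier (pinning
user page cache pages and handing them to the device) could look roughly
like the sketch below; dsp_map_user_buf() and its arguments are invented,
only get_user_pages_fast(), sg_alloc_table_from_pages() and dma_map_sg()
are real interfaces:

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

static int dsp_map_user_buf(struct device *dev, unsigned long uaddr,
                            size_t len, struct page **pages,
                            struct sg_table *sgt)
{
        unsigned int npages = len >> PAGE_SHIFT;
        int ret;

        /* pin the user pages so the device can DMA into them */
        ret = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages);
        if (ret != npages)
                return -EFAULT;

        /* describe the pinned pages as a scatterlist */
        ret = sg_alloc_table_from_pages(sgt, pages, npages, 0, len,
                                        GFP_KERNEL);
        if (ret)
                return ret;

        /* transfer ownership to the device, flushing CPU caches */
        if (dma_map_sg(dev, sgt->sgl, sgt->orig_nents,
                       DMA_FROM_DEVICE) == 0)
                return -EIO;

        return 0;
}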

        Arnd


