[PATCH v7 00/17] Provide a new two step DMA mapping API
Chuck Lever
chuck.lever at oracle.com
Mon Mar 31 07:46:40 PDT 2025
On 3/21/25 8:41 PM, Jason Gunthorpe wrote:
> On Fri, Mar 21, 2025 at 12:52:30AM +0100, Marek Szyprowski wrote:
>>> Christoph's vision was to make a performance DMA API path that could
>>> be used to implement any scatterlist-like data structure very
>>> efficiently without having to teach the DMA API about all sorts of
>>> scatterlist-like things.
>>
>> Thanks for explaining one more motivation behind this patchset!
>
> Sure, no problem.
>
> To close the loop on the bigger picture here..
>
> When you put the parts together:
>
> 1) dma_map_sg is the only API that is both performant and fully
> functional
>
> 2) scatterlist is a horrible leaky design and badly misused all over
> the place. When Logan added SG_DMA_BUS_ADDRESS it became quite
> clear that any significant changes to scatterlist are infeasible,
> or at least we'd break a huge number of untestable legacy drivers
> in the process.
>
> 3) We really want to do full featured performance DMA *without* a
> struct page. This requires changing scatterlist, inventing a new
> scatterlist v2 and DMA map for it, or this idea here of a flexible
> lower level DMA API entry point.
>
> Matthew has been talking about struct-pageless for a long time now
> from the block/mm direction using folio & memdesc and this is
> meeting his work from the other end of the stack by starting to
> build a way to do DMA on future struct pageless things. This is
> going to be a huge multi-year project but small parts like this need
> to be solved and agreed to make progress.
>
> 4) In the immediate moment we still have problems in VFIO, RDMA, and
> DRM managing P2P transfers because dma_map_resource/page() don't
> properly work, and we don't have struct pages to use
> dma_map_sg(). Hacks around the DMA API have been in the kernel for
> a long time now, we want to see a properly architected solution.

The in-kernel NFS stack, for example, already has a mechanism for
receiving and sending RPC messages using arrays of bio_vecs. The stack
can use bio_vecs natively for communicating with both the page cache and
the kernel socket API.

But NFS's RPC/RDMA transport still has to convert these pages into a
scatterlist so that they can be mapped and then handed to the RDMA core.
Instead, having a DMA mapping API that can take an array of bio_vecs
directly (and then, a similar API within the RDMA core) would make
NFS/RDMA a lot more CPU-efficient.

The lack of a bio_vec DMA mapping API has held up a full conversion of
the in-kernel NFS stack to use folios. That's the reason I tried my
own hand at adding a bio_vec DMA mapping API last summer.

Leon and Christoph have provided a clean step in the right direction
and it looks to me like they have thought carefully about next steps.
Robin pointed out some areas that might be lacking in v7, but IMHO
there is a plan to address many of these areas in subsequent work. I
don't see a reason not to proceed with this first step.
--
Chuck Lever