[PATCH v1 00/17] Provide a new two step DMA mapping API
Christoph Hellwig
hch at lst.de
Mon Nov 4 01:58:31 PST 2024
On Thu, Oct 31, 2024 at 09:17:45PM +0000, Robin Murphy wrote:
> The hilarious amount of work that iommu_dma_map_sg() does is pretty much
> entirely for the benefit of v4l2 and dma-buf importers who *depend* on
> being able to linearise a scatterlist in DMA address space. TBH I doubt
> there are many actual scatter-gather-capable devices with significant
> enough limitations to meaningfully benefit from DMA segment combining these
> days - I've often thought that by now it might be a good idea to turn that
> behaviour off by default and add an attribute for callers to explicitly
> request it.
Even when devices are not limited, they often perform significantly better
when IOVA space is not completely fragmented. While the dma_map_sg code
is a bit gross because it has to deal with unaligned segments, the
coalescing itself is often a big win.
Note that dma_map_sg also has two other very useful features: batching
of the iotlb flushing, and support for P2P, which to be efficient also
requires batching the lookups.
>> This uniqueness has been a long standing pain point as the scatterlist API
>> is mandatory, but expensive to use.
>
> Huh? When and where has anything ever called it mandatory? Nobody's getting
> sent to DMA jail for open-coding:
You don't get sent to jail. But you do not get batched iotlb sync, you
don't get properly working P2P, and you don't get IOVA coalescing.
>> Several approaches have been explored to expand the DMA API with additional
>> scatterlist-like structures (BIO, rlist), instead split up the DMA API
>> to allow callers to bring their own data structure.
>
> And this line of reasoning is still "2 + 2 = Thursday" - what is to say
> those two notions in any way related? We literally already have one generic
> DMA operation which doesn't operate on struct page, yet needed nothing
> "split up" to be possible.
Yeah, I don't really get the struct page argument. In fact if we look
at the nitty-gritty details of dma_map_page it doesn't really need a
page at all. I've been looking at cleaning some of this up and providing
a dma_map_phys/paddr, which would be quite handy in a few places - not
because we don't have a struct page for the memory, but because converting
to/from it all the time is not very efficient.
>> 2. VFIO PCI live migration code is building a very large "page list"
>> for the device. Instead of allocating a scatter list entry per allocated
>> page it can just allocate an array of 'struct page *', saving a large
>> amount of memory.
>
> VFIO already assumes a coherent device with (realistically) an IOMMU which
> it explicitly manages - why is it even pretending to need a generic DMA
> API?
AFAIK that isn't really vfio as we know it, but the control device for
live migration. Leon or Jason might fill in more.
The point is that quite a few devices have these page list based APIs
(RDMA where mlx5 comes from, NVMe with PRPs, AHCI, GPUs).
>
>> 3. NVMe PCI demonstrates how a BIO can be converted to a HW scatter
>> list without having to allocate then populate an intermediate SG table.
>
> As above, given that a bio_vec still deals in struct pages, that could
> seemingly already be done by just mapping the pages, so how is it proving
> any benefit of a fragile new interface?
Because we only need to preallocate the tiny constant sized dma_iova_state
as part of the request instead of an additional scatterlist that requires
sizeof(struct page *) + sizeof(dma_addr_t) + 3 * sizeof(unsigned int)
per segment, including a memory allocation per I/O for that.
> My big concern here is that a thin and vaguely-defined wrapper around the
> IOMMU API is itself a step which smells strongly of "abuse and design
> mistake", given that the basic notion of allocating DMA addresses in
> advance clearly cannot generalise. Thus it really demands some considered
> justification beyond "We must do something; This is something; Therefore we
> must do this." to be convincing.
At least for the block code we have a nice little core wrapper that is
very easy to use, and provides a great reduction of memory use and
allocations. The HMM use case I'll let others talk about.
More information about the Linux-nvme
mailing list