[RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
Christoph Hellwig
hch at lst.de
Wed Mar 6 06:44:16 PST 2024
On Tue, Mar 05, 2024 at 02:29:35PM +0200, Leon Romanovsky wrote:
> > > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > > allocate it as one chunk and some sort of offset calculations to know
> > > which part of IOVA to map.
> >
> > I don't follow this part at all - at *some* point, something must know a
> > range of memory addresses involved in a DMA transfer, so that's where it
> > should map that range for DMA.
>
> In all presented cases in this series, the overall DMA size is known in
> advance.  In the RDMA case, it is known when the user registers the memory;
> in VFIO, when live migration happens; and in NVMe, when the BIO is created.
>
> So once we have allocated the IOVA, we will need to link ranges, which is
> the same as map but without the IOVA allocation.
But above you say:
"These advanced DMA mapping APIs are needed to calculate IOVA size to
allocate it as one chunk and some sort of offset calculations to know
which part of IOVA to map."
This suggests you need helpers to calculate the len and offset, and I
can't see where that would ever make sense.  The total transfer size
should just be passed in by the callers, which already know it, and
there should be no offset.
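
To spell that out: with a two-step interface, the caller side should not
need more than something like the sketch below, where dma_iova_alloc()
and dma_iova_link() are made-up placeholder names (not the API from this
series) and error unwinding is omitted.  The caller already has the
total length, and the per-segment offset is just a running counter, not
something a helper needs to compute:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Illustration only: dma_iova_alloc()/dma_iova_link() are made-up
 * names standing in for a two-step allocate-then-link interface. */
static int map_transfer(struct device *dev, struct scatterlist *sgl,
			int nents, size_t total_len, dma_addr_t *out_iova)
{
	struct scatterlist *sg;
	dma_addr_t iova;
	size_t offset = 0;
	int i, ret;

	/* Step 1: one IOVA chunk for the whole transfer, sized by the
	 * total length the caller already knows. */
	iova = dma_iova_alloc(dev, total_len);
	if (iova == DMA_MAPPING_ERROR)
		return -ENOMEM;

	/* Step 2: "map without IOVA allocation" - link each segment at
	 * a running offset into the chunk. */
	for_each_sg(sgl, sg, nents, i) {
		ret = dma_iova_link(dev, iova + offset,
				    sg_phys(sg), sg->length);
		if (ret)
			return ret;
		offset += sg->length;
	}

	*out_iova = iova;
	return 0;
}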
> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> >
> > From a brief look, this is clearly an awkward reinvention of the IOMMU API.
> > If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address
> > spaces then they can and should use the IOMMU API. Perhaps there's room for
> > some quality-of-life additions to the IOMMU API to help with common usage
> > patterns, but the generic DMA mapping API is absolutely not the place for
> > it.
>
> DMA mapping gives a nice abstraction over the IOMMU, and allows us to have
> the same flow for IOMMU and non-IOMMU cases without duplicating code, while
> you suggest teaching almost every part of the kernel about the IOMMU.
Except that the flows are fundamentally different for the "can coalesce"
vs "can't coalesce" case. In the former we have one dma_addr_t range,
and in the latter as many as there are input vectors (this is ignoring
the weird iommu merging case where we coalesce some but not all
segments, but I'd rather not have that in a new API).
So if we want to be able to handle these cases efficiently, we need
two APIs in the driver and a good framework to switch between them.
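
To be concrete about what I mean by two paths in the driver, roughly
the shape below, where setup_single_range()/setup_per_segment() are
placeholders for the driver-specific descriptor building (e.g. a single
data pointer vs. a full per-segment list):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Placeholders for the driver-specific descriptor setup. */
static void setup_single_range(dma_addr_t addr, unsigned int len);
static void setup_per_segment(dma_addr_t addr, unsigned int len);

static int map_and_build(struct device *dev, struct sg_table *sgt)
{
	struct scatterlist *sg;
	int i, ret;

	ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
	if (ret)
		return ret;

	if (sgt->nents == 1) {
		/* "can coalesce": one contiguous dma_addr_t range */
		setup_single_range(sg_dma_address(sgt->sgl),
				   sg_dma_len(sgt->sgl));
	} else {
		/* "can't coalesce": one range per input vector */
		for_each_sgtable_dma_sg(sgt, sg, i)
			setup_per_segment(sg_dma_address(sg),
					  sg_dma_len(sg));
	}
	return 0;
}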
Robin makes a point here in that the IOMMU API handles the can-coalesce
case, and he's right: that's exactly how the IOMMU API works.
I'd still prefer to wrap it with DMA-layer helpers to handle things like
swiotlb and maybe Xen grant tables, and to avoid the type confusion
between dma_addr_t and the untyped iova in the iommu layer, but
whether to have this layer or not is probably worth a discussion.
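
Something as thin as the sketch below would already keep the dma_addr_t
typing on the driver side while the iommu layer keeps working on untyped
IOVAs.  Just a sketch: dma_link_range() is a made-up name here, and the
swiotlb/Xen handling is reduced to a comment:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/iommu.h>

/* Drivers only ever see dma_addr_t; the IOMMU API underneath works on
 * an unsigned long IOVA.  dma_link_range() is a made-up name. */
static dma_addr_t dma_link_range(struct iommu_domain *domain,
				 dma_addr_t iova, phys_addr_t paddr,
				 size_t size)
{
	/*
	 * This is where swiotlb bounce buffering (and e.g. Xen grant
	 * table handling) would be hidden from the caller.
	 */
	if (iommu_map(domain, (unsigned long)iova, paddr, size,
		      IOMMU_READ | IOMMU_WRITE, GFP_KERNEL))
		return DMA_MAPPING_ERROR;

	return iova;
}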
>
> In this series, we changed RDMA, VFIO and NVMe, and in all cases we
> removed more code than we added.  From what I saw, VDPA and virtio-blk
> will benefit from the proposed API too.
>
> Even in this RFC, where Chaitanya did a partial job and didn't convert
> the whole driver, the gain is pretty obvious:
> https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u
>
I have no idea how that nvme patch is even supposed to work.  It removes
the PRP path in nvme-pci, which is not only the most common I/O path
but also actually required for the admin queue, as NVMe doesn't support
SGLs for the admin queue.
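
For reference, the format selection the driver has to keep around looks
roughly like the sketch below.  This is simplified and not the actual
nvme-pci code (the real driver also looks at segment counts, alignment
and controller quirks):

#include <linux/types.h>

/* Simplified sketch of the descriptor-format choice in an NVMe PCIe
 * driver, not the actual nvme-pci logic.  The admin queue (qid 0) must
 * use PRPs because the spec doesn't allow SGLs for admin commands; I/O
 * queues may use SGLs only if the controller advertises support. */
struct example_queue {
	u16 qid;			/* 0 == admin queue */
	bool ctrl_supports_sgls;
};

static bool example_use_sgl(struct example_queue *q)
{
	if (!q->qid)			/* admin queue: PRPs only */
		return false;
	return q->ctrl_supports_sgls;	/* otherwise fall back to PRPs */
}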