[PATCH] nvme: uring_cmd specific request_queue for SGLs

Keith Busch kbusch at kernel.org
Wed Jun 25 15:08:28 PDT 2025


On Wed, Jun 25, 2025 at 08:09:15AM +0200, Christoph Hellwig wrote:
> > User space passthrough IO commands are committed to using the SGL
> > transfer type if the device supports it. The virt_boundary_mask is a
> > PRP-specific constraint, and this limit causes kernel bounce buffers to
> > be used when a user vector could have been handled directly. Avoiding
> > unnecessary copies is important for uring_cmd usage as this is a high
> > performance interface.
> 
> Not really any more high performance than the normal I/O path.

Right, that's why I said "a" performance path, not "the" performance
path.

If you send a readv/writev with a similar iovec to an O_DIRECT block
device, it will just be split at the gapped virt boundaries, but the
user memory is still used directly without bouncing. We can't split
passthrough requests though, so it'd be preferable to use the iovec in
a single command if the hardware supports it rather than bouncing it.
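
Roughly what I mean, as a userspace sketch (the device path and a 512b
logical block size are assumptions here): the iovec below ends its
first segment mid-page, which trips a PAGE_SIZE - 1 virt boundary.
Through readv() the block layer can split this into two bios; a
passthrough command can't be split, so today it bounces instead.

#define _GNU_SOURCE	/* O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	long pgsz = sysconf(_SC_PAGESIZE);
	struct iovec iov[2];
	void *a, *b;
	int fd;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0)
		return 1;
	if (posix_memalign(&a, pgsz, pgsz) ||
	    posix_memalign(&b, pgsz, pgsz))
		return 1;

	/* first segment ends at offset 512, mid-page: a "gap"
	 * against the virt_boundary_mask */
	iov[0].iov_base = a;
	iov[0].iov_len  = 512;
	iov[1].iov_base = b;
	iov[1].iov_len  = 512;

	return readv(fd, iov, 2) < 0;
}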
 
> > For devices that support SGL, create a new request_queue that drops the
> > virt_boundary_mask so that vectored user requests can be used with
> > zero-copy performance. Normal read/write will still use the old boundary
> > mask, as we can't be sure that forcing all IO to use SGL over PRP won't
> > cause unexpected regressions for some devices.
> 
> Note that this directly conflicts with the new DMA API.  There we do
> rely on the virt boundary to guarantee that the IOMMU path can always
> coalesce the entire request into a single IOVA mapping.  We could still
> do it for the direct mapping path, where it makes a difference, but
> we really should do that everywhere, i.e. revisit the default
> sgl_threshold and see if we could reduce it to 2 * PAGE_SIZE or so,
> so that we'd only use PRPs for the simple path where we can trivially
> do the virt_boundary check right in NVMe.

Sure, that sounds okay if you mean 2 * NVME_CTRL_PAGE_SIZE.
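
Something like the below is what I'd picture for that threshold (just
a sketch of the decision, not the driver's actual code; it leans on
the existing nvme_ctrl_sgl_supported() and blk_rq_payload_bytes()
helpers):

static bool nvme_use_sgl(struct nvme_ctrl *ctrl, struct request *req)
{
	if (!nvme_ctrl_sgl_supported(ctrl))
		return false;

	/*
	 * The simple case: the payload is small enough that PRP1/PRP2
	 * can cover it without a PRP list (offset permitting), so the
	 * virt_boundary check is trivial and PRP is fine. Everything
	 * larger goes SGL.
	 */
	if (blk_rq_payload_bytes(req) <= 2 * NVME_CTRL_PAGE_SIZE)
		return false;

	return true;
}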

It looks straightforward to add merging while we iterate the direct
mapping results if they return mergeable IOVAs, but I think we'd have
to commit to using SGL over PRP for everything but the simple case and
drop the PRP-imposed virt boundary. The downside might be that we'd
lose the IOVA pre-allocation optimization (dma_iova_try_alloc) you
have going on, but I'm not sure how important that is. Could the
direct mapping get too fragmented to consistently produce contiguous
IOVAs in this path?
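
For the merging itself, I'm picturing something like this (made-up
types and helper name, just to show the shape of it): grow the current
descriptor whenever the next segment's mapping starts exactly where
the previous one ended, and only emit a new descriptor on a gap.

struct seg {
	dma_addr_t addr;
	u32 len;
};

/* coalesce adjacent mappings into the fewest SGL-sized segments */
static int merge_segs(struct seg *out, const struct seg *in, int nr)
{
	int i, n = 0;

	out[0] = in[0];
	for (i = 1; i < nr; i++) {
		if (out[n].addr + out[n].len == in[i].addr)
			out[n].len += in[i].len;	/* contiguous IOVA */
		else
			out[++n] = in[i];		/* gap: new segment */
	}
	return n + 1;
}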


