[PATCH v4 00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

Alex Williamson alex.williamson at redhat.com
Tue May 8 14:40:39 PDT 2018


On Tue, 8 May 2018 17:25:24 -0400
Don Dutile <ddutile at redhat.com> wrote:

> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas <helgaas at kernel.org> wrote:
> >   
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:  
> >>> Hi Everyone,
> >>>
> >>> Here's v4 of our series to introduce P2P based copy offload to NVMe
> >>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> >>> is here:
> >>>
> >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> >>> ...  
> >>  
> >>> Logan Gunthorpe (14):
> >>>    PCI/P2PDMA: Support peer-to-peer memory
> >>>    PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >>>    PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >>>    PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >>>    docs-rst: Add a new directory for PCI documentation
> >>>    PCI/P2PDMA: Add P2P DMA driver writer's documentation
> >>>    block: Introduce PCI P2P flags for request and request queue
> >>>    IB/core: Ensure we map P2P memory correctly in
> >>>      rdma_rw_ctx_[init|destroy]()
> >>>    nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >>>    nvme-pci: Add support for P2P memory in requests
> >>>    nvme-pci: Add a quirk for a pseudo CMB
> >>>    nvmet: Introduce helper functions to allocate and free request SGLs
> >>>    nvmet-rdma: Use new SGL alloc/free helper for requests
> >>>    nvmet: Optionally use PCI P2P memory
> >>>
> >>>   Documentation/ABI/testing/sysfs-bus-pci    |  25 +
> >>>   Documentation/PCI/index.rst                |  14 +
> >>>   Documentation/driver-api/index.rst         |   2 +-
> >>>   Documentation/driver-api/pci/index.rst     |  20 +
> >>>   Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
> >>>   Documentation/driver-api/{ => pci}/pci.rst |   0
> >>>   Documentation/index.rst                    |   3 +-
> >>>   block/blk-core.c                           |   3 +
> >>>   drivers/infiniband/core/rw.c               |  13 +-
> >>>   drivers/nvme/host/core.c                   |   4 +
> >>>   drivers/nvme/host/nvme.h                   |   8 +
> >>>   drivers/nvme/host/pci.c                    | 118 +++--
> >>>   drivers/nvme/target/configfs.c             |  67 +++
> >>>   drivers/nvme/target/core.c                 | 143 ++++-
> >>>   drivers/nvme/target/io-cmd.c               |   3 +
> >>>   drivers/nvme/target/nvmet.h                |  15 +
> >>>   drivers/nvme/target/rdma.c                 |  22 +-
> >>>   drivers/pci/Kconfig                        |  26 +
> >>>   drivers/pci/Makefile                       |   1 +
> >>>   drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
> >>>   drivers/pci/pci.c                          |   6 +
> >>>   include/linux/blk_types.h                  |  18 +-
> >>>   include/linux/blkdev.h                     |   3 +
> >>>   include/linux/memremap.h                   |  19 +
> >>>   include/linux/pci-p2pdma.h                 | 118 +++++
> >>>   include/linux/pci.h                        |   4 +
> >>>   26 files changed, 1579 insertions(+), 56 deletions(-)
> >>>   create mode 100644 Documentation/PCI/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/index.rst
> >>>   create mode 100644 Documentation/driver-api/pci/p2pdma.rst
> >>>   rename Documentation/driver-api/{ => pci}/pci.rst (100%)
> >>>   create mode 100644 drivers/pci/p2pdma.c
> >>>   create mode 100644 include/linux/pci-p2pdma.h  
> >>
> >> How do you envision merging this?  There's a big chunk in drivers/pci, but
> >> really no opportunity for conflicts there, and there's significant stuff in
> >> block and nvme that I don't really want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> >> merge it elsewhere?  
> > 
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option and only custom kernels or distros optimized
> > for this sort of support would enable that build option.  I'm more than
> > a little dubious though that we're not going to have a wave of distros
> > enabling this only to get user complaints that they can no longer make
> > effective use of their devices for assignment due to the resulting span
> > of the IOMMU groups.  Nor is there any sort of compromise: configure
> > the kernel for p2p or device assignment, not both.  Is this really such
> > a unique feature that distro users aren't going to be asking for both
> > features?  Thanks,
> > 
> > Alex  
> At least 1/2 the cases presented to me by existing customers want it in a tunable kernel,
> and tunable between two points, if the hw allows it to be 'contained' in that manner, which
> a (layer of) switch(ing) provides.
> To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... I'm not a configfs aficionado to say which is best),
> method to make two points p2p dma capable.

That's not what's done here AIUI.  There are also some complications to
making IOMMU groups dynamic: for instance, could a downstream endpoint
already be in use by a userspace tool while ACS is being twiddled in
sysfs?  Probably the easiest solution would be that all devices
affected by the ACS change are soft unplugged before and re-added after
the ACS change.  Note that "affected" is not necessarily only the
downstream devices if the downstream port at which we're playing with
ACS is part of a multifunction device.  Thanks,

Alex
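
To illustrate the soft unplug and re-add idea above, here is a minimal
Python sketch. It is not part of the posted series. It only builds a list
of sysfs writes and performs none of them. The topology map, the device
addresses, and the function names are hypothetical. Treating sibling
functions of a multifunction port, and everything below them, as affected
is a conservative assumption and not the kernel's actual grouping logic.
The per-device "remove" and bus-wide "rescan" files are the existing PCI
sysfs attributes. The ACS change itself is left as a placeholder, since
no sysfs knob for it is assumed to exist.

```python
from pathlib import PurePosixPath

SYSFS_PCI = PurePosixPath("/sys/bus/pci")


def slot_of(bdf):
    # "0000:02:00.1" -> "0000:02:00" (domain:bus:device without the function)
    return bdf.rsplit(".", 1)[0]


def subtree(root, children):
    # Post-order walk: every device below `root`, leaves first, `root` excluded.
    found = []
    for child in children.get(root, []):
        found.extend(subtree(child, children))
        found.append(child)
    return found


def affected_devices(port, children):
    # Everything below the port whose ACS setting changes ...
    affected = subtree(port, children)
    # ... plus sibling functions of a multifunction port and what sits below
    # them (assumption: they are treated as affected too).
    known = set(children) | {c for kids in children.values() for c in kids}
    for sibling in sorted(known):
        if sibling != port and slot_of(sibling) == slot_of(port):
            affected.extend(subtree(sibling, children))
            affected.append(sibling)
    return affected


def plan_acs_change(port, children):
    # Returns (sysfs path, value) pairs in order; nothing is written here.
    steps = [(str(SYSFS_PCI / "devices" / bdf / "remove"), "1")
             for bdf in affected_devices(port, children)]
    # Placeholder only: how ACS would be changed is outside this sketch.
    steps.append((f"<change ACS on {port}>", ""))
    steps.append((str(SYSFS_PCI / "rescan"), "1"))
    return steps


if __name__ == "__main__":
    # Hypothetical topology: bridge address -> devices directly below it.
    topology = {
        "0000:02:00.0": ["0000:03:00.0"],  # downstream port with one endpoint
        "0000:02:00.1": ["0000:04:00.0"],  # sibling function with an endpoint
    }
    for path, value in plan_acs_change("0000:02:00.0", topology):
        print(path, value)
```

The plan removes leaves before the bridges above them and ends with a
single bus rescan, so the re-added devices would be regrouped under the
new ACS setting.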


