[RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

Jason Gunthorpe jgg at ziepe.ca
Wed Dec 7 07:07:11 PST 2022


On Wed, Dec 07, 2022 at 02:52:03PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 07, 2022 at 09:34:14AM -0400, Jason Gunthorpe wrote:
> > The VFIO design assumes that the "vfio migration driver" will talk to
> > both functions under the hood, and I don't see a fundamental problem
> > with this beyond it being awkward with the driver core.
> 
> And while that is a fine concept per se, the current incarnation of
> that is fundamentally broken, as it is centered around the controlled
> VM.  Which really can't work.

I don't see why you keep saying this. It is centered around the struct
vfio_device object in the kernel, which is definitely NOT the VM.

The struct vfio_device is the handle for the hypervisor to control
the physical assigned device - and it is the hypervisor that controls
the migration.

We do not need the hypervisor userspace to have a handle to the hidden
controlling function. It provides no additional functionality,
security, or insight into what qemu needs to do. Keeping that
relationship abstracted inside the kernel is a reasonable choice and
is not "fundamentally broken".

> > Even the basic assumption that there would be a controlling/controlled
> > relationship is not universally true. The mdev type drivers, and 
> > SIOV-like devices are unlikely to have that. Once you can use PASID
> > the reasons to split things at the HW level go away, and a VF could
> > certainly self-migrate.
> 
> Even then you need a controlling and a controlled entity.  The
> controlling entity even in SIOV remains a PCIe function.  The
> controlled entity might just be a bunch of hardware resources and
> a PASID.  Making it important again that all migration is driven
> by the controlling entity.

If they are the same driver implementing vfio_device you may be able
to claim they conceptually exist, but it is pretty artificial to draw
this kind of distinction inside a single driver.

> Also the whole concept that only VFIO can do live migration is
> a little bogus.  With checkpoint and restart it absolutely
> does make sense to live migrate a container, and with that
> the hardware interface (e.g. nvme controller) assigned to it.

I agree people may want to do this, but it is very unclear how SRIOV
live migration can help do this.

SRIOV live migration is all about not disturbing the kernel driver,
assuming it is the same kernel driver on both sides. If you have two
different kernels there is nothing worth migrating. There isn't even
an assurance the DMA API will have IOMMU mapped the same objects to
the same IOVAs, so you have to re-establish your admin queue, IO
queues, etc. after migration anyhow.

Let alone how to solve the security problems of allowing userspace to
load arbitrary FW blobs into a device with potentially insecure DMA
access.

At that point it isn't really the same kind of migration.

Jason


