[RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.

Max Gurtovoy mgurtovoy at nvidia.com
Tue Dec 6 18:30:20 PST 2022


On 12/6/2022 9:15 PM, Jason Gunthorpe wrote:
> On Tue, Dec 06, 2022 at 05:55:03PM +0100, Christoph Hellwig wrote:
>> On Tue, Dec 06, 2022 at 11:51:23AM -0400, Jason Gunthorpe wrote:
>>> That is a big deviation from where VFIO is right now, the controlled
>>> function is the one with the VFIO driver, it should be the one that
>>> drives the migration uAPI components.
>> Well, that is one way to see it, but I think the more natural
>> way to deal with it is to drive everyting from the controlling
>> function, because that is by definition much more in control.
> Sure, the controlling function should (and does in mlx5) drive
> everything here.
>
> What the kernel is doing is providing the abstraction to link the
> controlling function to the VFIO device in a general way.
>
> We don't want to just punt this problem to user space and say 'good
> luck finding the right cdev for migration control'. If the kernel
> struggles to link them then userspace will not fare better on its own.
>
> Especially, we do not want every VFIO device to have its own crazy way
> for userspace to link the controlling/controlled functions
> together. This is something the kernel has to abstract away.
>
> So, IMHO, we must assume the kernel is aware of the relationship,
> whatever algorithm it uses to become aware.
>
> It just means the issue is doing the necessary cross-subsystem
> locking.

Agree. This is not the first time we have cross-subsystem interactions 
in the kernel, right ?

>
> That combined with the fact they really are two halfs of the same
> thing - operations on the controlling function have to be sequenced
> with operations on the VFIO device - makes me prefer the single uAPI.
>
>> More importantly any sane design will have easy ways to list and
>> manipulate all the controlled functions from the controlling
>> functions, while getting from the controlled function to the
>> controlling one is extremely awkward, as anything that can be
>> used for that is by definition and information leak.
> We have spend some time looking at this for mlx5. It is a hard
> problem. The VFIO driver cannot operate the device, eg it cannot do
> MMIO and things, so it is limited to items in the PCI config space to
> figure out the device identity.

Christoph,

I'm not sure how awkward is for migration driver to ask the controlling 
device driver to operate a migration action.

The controlling device driver can expose limited API for that matter.

I don't see why is it wrong to submit some admin commands between 
subsystems - we already have a way to send admin commands from linux cli 
or from nvmet drivers to an NVMe device.

Also the concept of primary controller that control it's secondary 
controllers is already in the SPEC so we can use it. It's not introduced 
in this RFC but we're here to fix it.

In our case the primary controller is the PF and the secondary 
controllers are the VFs.

If we'll think of  a concept of new optional "Migration Management" 
admin commands that will be supported by primary controllers to manage 
the migration process of its secondary controllers - we can at least 
solve the initial step of migrating VFs by its parent PF.

>
>> It seems mlx5 just gets away with that by saying controlled
>> functions are always VFs, and the controlling function is a PF, but
>> that will break down very easily,
> Yes, that is part of the current mlx5 model. It is not inherent to the
> device design, but the problems with arbitary nesting were hard enough
> they were not tackled at this point.

Agree. There are a lot of challenges in the non-nested migration that we 
can just limit the NVMe SPEC to it and extend later on - like we did in 
other features such as copy-Offload.

Most of the infrastructure work for LM was done in the VFIO subsystem in 
the last year so we can now focus on the Architecture aspects of NVMe.

>
> Jason



More information about the Linux-nvme mailing list