[PATCH RFC 00/11] nvmet: Add NVMe target mdev/vfio driver

Damien Le Moal dlemoal at kernel.org
Wed Mar 12 22:32:14 PDT 2025


On 3/13/25 14:18, Mike Christie wrote:
> The following patches were made over Linus's tree. They implement
> a virtual PCI NVMe device using mdev/vfio. The device can be used
> by QEMU and in the guest will look like a normal old local PCI
> NVMe drive.
> 
> They are based on Maxim Levitsky's mdev patches:
> 
> https://lore.kernel.org/lkml/20190506125752.GA5288@lst.de/t/
> 
> but instead of trying to export a physical NVMe device to a guest, they
> are only focused on exporting a virtual device using the nvmet layer.
> 
> Why another driver when we have so many? Performance.
> =====================================================
> Without any tuning, and with major locks still in the main IO path, 4K
> IOPS for a single controller with a single namespace are higher than the
> kernel vhost-scsi driver and SPDK's userspace vhost-scsi/blk when using a
> lower number of queues/cpus/jobs. At just 2 queues, we are able to hit
> 1M IOPS:
> 
> Note: the nvme mdev values below have the shadow doorbell enabled (a
> sketch of that mechanism follows the table).
> 
> numjobs   mdev    vhost-scsi  vhost-scsi-usr  vhost-blk-usr
> 1         518K    198K        332K            301K
> 2         1037K   363K        609K            664K
> 4         974K    633K        1369K           1383K
> 8         813K    1788K       1358K           1363K
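> 
> For background, the shadow doorbell lets the guest driver skip most MMIO
> doorbell writes: it stores the new tail in a shared shadow buffer and
> only touches the real (trapping) doorbell register when the device's
> event index asks for it. A minimal sketch of that check, mirroring
> nvme_dbbuf_need_event() in drivers/nvme/host/pci.c:
> 
>   #include <linux/types.h>
> 
>   /*
>    * Ring the real (MMIO) doorbell only if the event index advertised by
>    * the device falls between the old and new doorbell values.
>    */
>   static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)
>   {
>       return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);
>   }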
> 
> However, by default we can't scale. But if mdev is tuned to pre-pin guest
> pages (this requires patches to the vfio layer to support it), it also
> performs better at both lower and higher numbers of queues/cpus/jobs,
> reaching 2.3M IOPS with only 4 cpus/queues used (a rough pinning sketch
> follows the table):
> 
> numjobs   mdev
> 1         505K
> 2         1037K
> 4         2375K
> 8         2162K
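> 
> For illustration only (not code from this patchset), a rough sketch of
> what pre-pinning one queue's guest pages through the vfio API could look
> like; struct nvmet_mdev_queue and nvmet_mdev_prepin_queue() are made-up
> names and the error handling is minimal:
> 
>   #include <linux/vfio.h>
>   #include <linux/iommu.h>
>   #include <linux/mm.h>
> 
>   /* Hypothetical per-queue state, just for this sketch. */
>   struct nvmet_mdev_queue {
>       struct vfio_device *vdev;
>       dma_addr_t iova;        /* guest IOVA of the queue buffer */
>       int npages;             /* pages backing the queue */
>       struct page **pages;    /* preallocated array of npages entries */
>   };
> 
>   /*
>    * Pin the guest pages backing one queue up front, instead of pinning
>    * and unpinning around every IO.
>    */
>   static int nvmet_mdev_prepin_queue(struct nvmet_mdev_queue *q)
>   {
>       int ret;
> 
>       ret = vfio_pin_pages(q->vdev, q->iova, q->npages,
>                            IOMMU_READ | IOMMU_WRITE, q->pages);
>       return ret < 0 ? ret : 0;
>   }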
> 
> If we agree on a new virtual NVMe driver being ok, why mdev vs vhost?
> =====================================================================
> The problem with a vhost nvme is:
> 
> 2.1. If we do a full vhost nvmet solution, it will require new guest
> drivers that present NVMe interfaces to userspace and then speak the
> vhost spec on the backend, like vhost-scsi does.
> 
> I don't want to implement a Windows or even a Linux NVMe vhost driver.
> I don't think anyone wants the extra headache.
> 
> 2.2. We can do a hybrid approach where the device looks like a normal old
> local NVMe drive to the guest, so the guest's native NVMe driver is used.
> However, QEMU would then need a vhost nvme module that handles virtual
> PCI memory accesses instead of vhost virtqueues, plus a vhost nvme kernel
> or user driver to process IO.
> 
> So this is not as much extra code as the first option, since we don't
> have to worry about the guest, but it is still extra QEMU code.
> 
> 3. The mdev based solution does not have these drawbacks: it looks like a
> normal old local NVMe drive to the guest and can use QEMU's existing vfio
> layer, so it only requires the kernel driver.
> 
> Why not a new blk driver or why not vdpa blk?
> =============================================
> Applications want standardized interfaces for things like persistent
> reservations. They have to support them with SCSI and NVMe already
> and don't want to have to support a new virtio block interface.
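> 
> As a concrete example of that standardized interface: on Linux the block
> layer already exposes persistent reservations through the same ioctls
> (linux/pr.h) whether the disk is SCSI or NVMe. A small userspace sketch
> (the device path is just an example):
> 
>   #include <fcntl.h>
>   #include <unistd.h>
>   #include <sys/ioctl.h>
>   #include <linux/pr.h>
> 
>   /* Register a PR key; the same call works for SCSI and NVMe disks. */
>   static int pr_register(const char *dev, __u64 key)
>   {
>       struct pr_registration reg = { .old_key = 0, .new_key = key };
>       int fd, ret;
> 
>       fd = open(dev, O_RDWR);
>       if (fd < 0)
>           return -1;
>       ret = ioctl(fd, IOC_PR_REGISTER, &reg);
>       close(fd);
>       return ret;
>   }
> 
>   /* e.g. pr_register("/dev/nvme0n1", 0xabcd); */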
> 
> Also, the nvmet-mdev-pci driver in this patchset can perform as well as
> SPDK vhost-blk, so that approach no longer has the perf advantage it
> used to.
> 
> Status
> ======
> This patchset is RFC quality only. You can discover a drive and do IO,
> but it's not stable. There are several TODO items mentioned in the last
> patch. However, I think the patches are at the point where I wanted to
> get some feedback on whether this is even acceptable, because the last
> time they were posted some people did not like how they hooked into
> drivers/nvme/host (this has been fixed in this posting). There are some
> other open issues:
> 
> 1. Should the driver integrate with pci-epf (the drivers work very
> differently but could share some code)?

Will have a look.

> 
> 2. Should it try to fit into the existing configfs interface or implement
> its own like pci-epf did? I made an attempt at this but it feels wrong.

Note that the configfs interface for pci-epf is provided by the PCI
endpoint infrastructure; it is not all implemented by the driver alone.
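
For reference, an endpoint function driver only returns its own
config_group from the add_cfs() hook in struct pci_epf_ops and the
endpoint core wires it into configfs. A rough sketch, with made-up names
and the attributes omitted:

  #include <linux/module.h>
  #include <linux/slab.h>
  #include <linux/configfs.h>
  #include <linux/pci-epf.h>

  /* Hypothetical function-specific configfs group type (no attributes). */
  static const struct config_item_type nvmet_epf_group_type = {
      .ct_owner = THIS_MODULE,
  };

  static struct config_group *nvmet_epf_add_cfs(struct pci_epf *epf,
                                                struct config_group *group)
  {
      struct config_group *cfs_group;

      cfs_group = kzalloc(sizeof(*cfs_group), GFP_KERNEL);
      if (!cfs_group)
          return NULL;

      /* The endpoint core links this group under the function's configfs
       * directory; driver attributes would hang off the group type. */
      config_group_init_type_name(cfs_group, "nvmet", &nvmet_epf_group_type);
      return cfs_group;
  }

  static const struct pci_epf_ops nvmet_epf_ops = {
      /* bind/unbind omitted in this sketch */
      .add_cfs = nvmet_epf_add_cfs,
  };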

-- 
Damien Le Moal
Western Digital Research


