nvme/pcie hot plug results in /dev name change
Ming Lei
ming.lei at redhat.com
Tue Jan 31 18:33:15 PST 2023
On Tue, Jan 31, 2023 at 09:38:47AM -0700, Keith Busch wrote:
> On Sun, Jan 29, 2023 at 06:28:05PM +0800, Ming Lei wrote:
> > On Fri, Jan 20, 2023 at 11:01:53PM -0800, Christoph Hellwig wrote:
> > > On Fri, Jan 20, 2023 at 02:42:23PM -0700, Keith Busch wrote:
> > > > That is correct. We don't know the identity of the device at the point
> > > > we have to assign it an instance number, so the hot added one will just
> > > > get the first available unique number. If you need a consistent name, we
> > > > have the persistent naming rules that should create those links in
> > > > /dev/disk/by-id/.
> > >
> > > Note that this is a bit of a problem for a file system or stacking
> > > driver that handles failing drives (e.g. btrfs or md raid): it holds
> > > on to the "old" device file and then fails to find the new one. I had
> > > a customer complaint about that as well :)
> > >
> > > The first hack was to force-run the multipath code, which can keep the
> > > node alive. That works, but is really ugly, especially when dealing
> > > with corner cases such as overlapping nsids between different
> > > controllers.
> > >
> > > In the long run I think we'll need to:
> > > - send a notification to the holder if a device is hot removed from
> > > the block layer so that it can clean up
> >
> > When the disk is deleted, a notification is already sent to userspace
> > via a udev/kobject uevent, so the user can umount the original FS, or
> > DM/MD userspace tools can handle the device removal.
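BTW, a holder does not even have to go through udev; the raw kernel
uevent is available over netlink. A minimal untested sketch of watching
for the remove event (the "/nvme" substring filter is just for
illustration):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    int main(void)
    {
        struct sockaddr_nl addr = {
            .nl_family = AF_NETLINK,
            .nl_groups = 1,    /* kernel uevent multicast group */
        };
        char buf[4096];
        int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_KOBJECT_UEVENT);

        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)))
            return 1;

        for (;;) {
            ssize_t len = recv(fd, buf, sizeof(buf) - 1, 0);

            if (len <= 0)
                continue;
            buf[len] = '\0';
            /* kernel uevents look like "remove@/devices/.../nvme0n1" */
            if (!strncmp(buf, "remove@", 7) && strstr(buf, "/nvme"))
                printf("block device gone: %s\n", buf + 7);
        }
    }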
> >
> > > - make the upper layers look for the replugged device
> > >
> > > I've been working on some of this for a while but haven't made much
> > > progress due to other commitments.
> >
> > Persistent block device naming is supposed to be handled by userspace,
> > e.g. via udev rules.
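For example, the udev rules shipped with systemd (60-persistent-storage.rules)
already create stable /dev/disk/by-id symlinks, so userspace can resolve the
current node from the stable name. A minimal untested sketch (the by-id path
below is made up):

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* hypothetical by-id link; real ones encode model/serial */
        const char *stable = "/dev/disk/by-id/nvme-EXAMPLE_MODEL_SN123";
        char node[PATH_MAX];

        if (!realpath(stable, node)) {
            perror("realpath");
            return 1;
        }
        /* prints the current node, e.g. /dev/nvme1n1, whatever
         * instance number the hot-added device ended up with */
        printf("%s -> %s\n", stable, node);
        return 0;
    }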
>
> Come to think of it, I actually have heard many complaints about this behavior.
> Requiring user space to deal with the teardown and restore of their open files
> and mount points on a transient link loss can be inconvenient. Example use cases
> are firmware activation requiring a Subsystem Reset, or a PCIe error
> containment event. Those cause the links to bounce, which can trigger hot plug
> events on some platforms.

If an IO error is returned to the FS, I guess an umount may have to be done,
since it might be a metadata IO. But if userspace has a persistent device name,
it is easy for userspace to handle the umount and re-setup.
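With a stable name, the re-setup side can simply wait for the link to come
back once the device is re-added. Another untested sketch (polling for
simplicity; inotify or the uevent above would be better, and the by-id path
is again made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* wait for the persistent link to reappear, then reopen it */
    static int reopen_by_id(const char *stable)
    {
        int fd;

        while ((fd = open(stable, O_RDONLY)) < 0)
            sleep(1);
        return fd;
    }

    int main(void)
    {
        /* hypothetical by-id link, same caveat as above */
        int fd = reopen_by_id("/dev/disk/by-id/nvme-EXAMPLE_MODEL_SN123");

        printf("device is back, fd=%d\n", fd);
        close(fd);
        return 0;
    }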
The above isn't unique to nvme; it is just easier for nvme-pci to
handle timeouts/errors by removing the device, IMO.
> The native nvme multipath looks like it could be leveraged to improve that
> user experience if we wanted to make that layer an option for non-multipath
> devices.
Can you share the basic idea? Will nvme multipath hold the IO error and
not propagate it to the upper layer until the new device is probed? What if
the new device is probed late, and the IO has already timed out with an
error returned to the upper layer?
Thanks,
Ming