[PATCHv3] nvme-mpath: delete disk after last connection
Hannes Reinecke
hare at suse.de
Mon May 10 14:01:56 BST 2021
On 5/10/21 8:23 AM, Christoph Hellwig wrote:
> On Fri, May 07, 2021 at 07:02:52PM +0200, Hannes Reinecke wrote:
>> On 5/7/21 8:46 AM, Christoph Hellwig wrote:
>>> On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
>>>> PCI and fabrics have different defaults; for PCI the device goes away if
>>>> the last path (ie the controller) goes away, for fabrics it doesn't if the
>>>> device is mounted.
>>>
>>> Err, no. For fabrics we reconnect for a while, but otherwise the
>>> behavior is the same right now.
>>>
>> No, that is not the case.
>>
>> When a PCI NVMe device with CMIC=0 is removed (via PCI hotplug, say), the
>> nvme device is completely removed, irrespective of whether it's mounted or
>> not.
>> When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie the
>> ns_head) will _stay_ when mounted (as the refcount is not zero).
>
> Yes. But that has nothing to do with fabrics as you claimed above, but
> with whether the subsystem supports multiple controllers (and thus
> shared namespaces) or not.
>
It's still broken, though.
I've set up a testbed to demonstrate what I mean.
I have created a qemu instance with 3 NVMe devices: one for booting and
two for MD RAID.
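For reference, the relevant part of the qemu setup looks roughly like
this; the ids (rp90, subsys3, nvme-3, nvme-rp90) and the serial number
match the hotplug commands and output below, while the image path and
chassis number are just placeholders:

  # third (RAID member) NVMe device; the other two are defined analogously
  -device pcie-root-port,id=rp90,chassis=3 \
  -device nvme-subsys,id=subsys3 \
  -drive if=none,id=nvme-3,file=/path/to/nvme-3.img \
  -device nvme,id=nvme-rp90,serial=SLESNVME3,bus=rp90,subsys=subsys3 \
  -device nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1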
After boot, MD RAID says this:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
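The array itself is a plain two-leg RAID1, created along these lines
(assumed invocation; device names as in the mdstat output above):

  # mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/nvme1n1 /dev/nvme2n1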
Now I detach the PCI device for controller /dev/nvme3 (which provides
the namespace /dev/nvme2n1):
(qemu) device_del nvme-rp90
(qemu) [ 183.512585] pcieport 0000:00:09.0: pciehp: Slot(0-2):
Attention button pressed
[ 183.515462] pcieport 0000:00:09.0: pciehp: Slot(0-2): Powering off
due to button press
And validate that the device is gone:
# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:08.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:09.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface
Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6
port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller
(rev 02)
01:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express
Controller (rev 02)
02:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express
Controller (rev 02)
Checking MD I still get:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
I.e. MD hasn't noticed _anything_, even though the device is physically
not present anymore.
And to make matters worse, 'nvme list' says:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl                           1         17.18 GB / 17.18 GB        512 B + 0 B      1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl                           1         4.29 GB / 4.29 GB          512 B + 0 B      1.0
/dev/nvme2n1     |U b��||U                                                     -1        0.00 B / 0.00 B            1 B + 0 B        ���||U
which arguably is a bug in itself, as we shouldn't display weird strings
here. But no matter.
Now I'm reattaching the PCI device:
(qemu) device_add nvme,bus=rp90,id=nvme-rp90,subsys=subsys3
[ 49.261163] pcieport 0000:00:09.0: pciehp: Slot(0-2): Attention
button pressed
[ 49.263915] pcieport 0000:00:09.0: pciehp: Slot(0-2) Powering on due
to button press
[ 49.267188] pcieport 0000:00:09.0: pciehp: Slot(0-2): Card present
[ 49.269505] pcieport 0000:00:09.0: pciehp: Slot(0-2): Link Up
[ 49.406035] pci 0000:03:00.0: [1b36:0010] type 00 class 0x010802
[ 49.411585] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[ 49.417627] pci 0000:03:00.0: BAR 0: assigned [mem
0xc1000000-0xc1003fff 64bit]
[ 49.421057] pcieport 0000:00:09.0: PCI bridge to [bus 03]
[ 49.424071] pcieport 0000:00:09.0: bridge window [io 0x6000-0x6fff]
[ 49.428157] pcieport 0000:00:09.0: bridge window [mem
0xc1000000-0xc11fffff]
[ 49.431379] pcieport 0000:00:09.0: bridge window [mem
0x804000000-0x805ffffff 64bit pref]
[ 49.436591] nvme nvme3: pci function 0000:03:00.0
[ 49.438303] nvme 0000:03:00.0: enabling device (0000 -> 0002)
[ 49.446746] nvme nvme3: 1/0/0 default/read/poll queues
(qemu) device_add nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1
[ 64.781295] nvme nvme3: rescanning namespaces.
[ 64.806720] block nvme2n1: no available path - failing I/O
And I'm ending up with _4_ namespaces:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl                           1         17.18 GB / 17.18 GB        512 B + 0 B      1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl                           1         4.29 GB / 4.29 GB          512 B + 0 B      1.0
/dev/nvme2n1     SLESNVME3            QEMU NVMe Ctrl                           -1        0.00 B / 0.00 B            1 B + 0 B        1.0
/dev/nvme2n2     SLESNVME3            QEMU NVMe Ctrl                           1         4.29 GB / 4.29 GB          512 B + 0 B      1.0
and MD is still referencing the original one:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
When doing I/O, MD will finally figure out that something is amiss:
[ 152.636007] block nvme2n1: no available path - failing I/O
[ 152.641562] block nvme2n1: no available path - failing I/O
[ 152.645454] md: super_written gets error=-5
[ 152.648799] md/raid1:md1: Disk failure on nvme2n1, disabling device.
[ 152.648799] md/raid1:md1: Operation continuing on 1 devices.
But we're left with the problem that the re-attached namespace now has a
different device name (/dev/nvme2n2 vs /dev/nvme2n1), so MD cannot
reattach the device seamlessly and needs manual intervention.
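The manual intervention boils down to something like this (sketch only;
device names as in the listing above): drop the failed leg and re-add
the namespace under its new name, after which MD resyncs onto it:

  # mdadm /dev/md1 --fail /dev/nvme2n1 --remove /dev/nvme2n1
  # mdadm /dev/md1 --add /dev/nvme2n2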
Note: this has been tested with the latest nvme-5.13, and the situation
has improved somewhat compared to previous versions, in that MD is now
able to recover after manual intervention.
But we still end up with a namespace with the wrong name.
Cheers,
Hannes