[PATCHv3] nvme-mpath: delete disk after last connection
Hannes Reinecke
hare at suse.de
Mon May 10 14:01:56 BST 2021
On 5/10/21 8:23 AM, Christoph Hellwig wrote:
> On Fri, May 07, 2021 at 07:02:52PM +0200, Hannes Reinecke wrote:
>> On 5/7/21 8:46 AM, Christoph Hellwig wrote:
>>> On Thu, May 06, 2021 at 05:54:29PM +0200, Hannes Reinecke wrote:
>>>> PCI and fabrics have different defaults; for PCI the device goes away if
>>>> the last path (ie the controller) goes away, for fabrics it doesn't if the
>>>> device is mounted.
>>>
>>> Err, no. For fabrics we reconnect for a while, but otherwise the
>>> behavior is the same right now.
>>>
>> No, that is not the case.
>>
>> When a PCI NVMe device with CMIC=0 is removed (via PCI hotplug, say), the
>> nvme device is completely removed, irrespective of whether it's mounted or
>> not.
>> When the _same_ PCI device with CMIC=1 is removed, the nvme device (ie the
>> ns_head) will _stay_ when mounted (as the refcount is not zero).
>
> Yes. But that has nothing to do with fabrics as you claimed above, but
> with whether the subsystem supports multiple controllers (and thus
> shared namespaces) or not.
>
It's still broken, though.
I've set up a testbed to demonstrate what I mean.
I have created a qemu instance with 3 NVMe devices: one for booting and
two for MD RAID.
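For reference, the relevant part of the qemu setup looks roughly like
this; the ids (rp90, subsys3, nvme-3, nvme-rp90) and the serial number
match the hotplug commands and output below, while the image path and
chassis number are just placeholders:

  # third (RAID member) NVMe device; the other two are defined analogously
  -device pcie-root-port,id=rp90,chassis=3 \
  -device nvme-subsys,id=subsys3 \
  -drive if=none,id=nvme-3,file=/path/to/nvme-3.img \
  -device nvme,id=nvme-rp90,serial=SLESNVME3,bus=rp90,subsys=subsys3 \
  -device nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1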
After boot, MD RAID says this:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
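The array itself is a plain two-leg RAID1, created along these lines
(assumed invocation; device names as in the mdstat output above):

  # mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/nvme1n1 /dev/nvme2n1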
Now I detach the PCI device for controller /dev/nvme3 (which provides
the namespace /dev/nvme2n1):
(qemu) device_del nvme-rp90
(qemu) [ 183.512585] pcieport 0000:00:09.0: pciehp: Slot(0-2):
Attention button pressed
[ 183.515462] pcieport 0000:00:09.0: pciehp: Slot(0-2): Powering off
due to button press
And validate that the device is gone:
# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM
Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:07.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:08.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:09.0 PCI bridge: Red Hat, Inc. QEMU PCIe Root port
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface
Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6
port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller
(rev 02)
01:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express
Controller (rev 02)
02:00.0 Non-Volatile memory controller: Red Hat, Inc. QEMU NVM Express
Controller (rev 02)
Checking MD I still get:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
I.e. MD hasn't noticed _anything_, even though the device is physically
not present anymore.
And to make matters worse, 'nvme list' says:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl                           1         17.18 GB / 17.18 GB        512 B + 0 B      1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl                           1         4.29 GB / 4.29 GB          512 B + 0 B      1.0
/dev/nvme2n1     |U b��||U                                                     -1        0.00 B / 0.00 B            1 B + 0 B        ���||U
which arguably is a bug in itself, as we shouldn't display weird strings
here. But no matter.
Now I'm reattaching the PCI device:
(qemu) device_add nvme,bus=rp90,id=nvme-rp90,subsys=subsys3
[ 49.261163] pcieport 0000:00:09.0: pciehp: Slot(0-2): Attention
button pressed
[ 49.263915] pcieport 0000:00:09.0: pciehp: Slot(0-2) Powering on due
to button press
[ 49.267188] pcieport 0000:00:09.0: pciehp: Slot(0-2): Card present
[ 49.269505] pcieport 0000:00:09.0: pciehp: Slot(0-2): Link Up
[ 49.406035] pci 0000:03:00.0: [1b36:0010] type 00 class 0x010802
[ 49.411585] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[ 49.417627] pci 0000:03:00.0: BAR 0: assigned [mem
0xc1000000-0xc1003fff 64bit]
[ 49.421057] pcieport 0000:00:09.0: PCI bridge to [bus 03]
[ 49.424071] pcieport 0000:00:09.0: bridge window [io 0x6000-0x6fff]
[ 49.428157] pcieport 0000:00:09.0: bridge window [mem
0xc1000000-0xc11fffff]
[ 49.431379] pcieport 0000:00:09.0: bridge window [mem
0x804000000-0x805ffffff 64bit pref]
[ 49.436591] nvme nvme3: pci function 0000:03:00.0
[ 49.438303] nvme 0000:03:00.0: enabling device (0000 -> 0002)
[ 49.446746] nvme nvme3: 1/0/0 default/read/poll queues
(qemu) device_add nvme-ns,bus=nvme-rp90,drive=nvme-3,nsid=1
[ 64.781295] nvme nvme3: rescanning namespaces.
[ 64.806720] block nvme2n1: no available path - failing I/O
And I'm ending up with _4_ namespaces:
# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     SLESNVME1            QEMU NVMe Ctrl                           1         17.18 GB / 17.18 GB        512 B + 0 B      1.0
/dev/nvme1n1     SLESNVME2            QEMU NVMe Ctrl                           1         4.29 GB / 4.29 GB          512 B + 0 B      1.0
/dev/nvme2n1     SLESNVME3            QEMU NVMe Ctrl                           -1        0.00 B / 0.00 B            1 B + 0 B        1.0
/dev/nvme2n2     SLESNVME3            QEMU NVMe Ctrl                           1         4.29 GB / 4.29 GB          512 B + 0 B      1.0
and MD is still referencing the original one:
# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 nvme2n1[1] nvme1n1[0]
      4189184 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
When doing I/O, MD will finally figure out that something is amiss:
[ 152.636007] block nvme2n1: no available path - failing I/O
[ 152.641562] block nvme2n1: no available path - failing I/O
[ 152.645454] md: super_written gets error=-5
[ 152.648799] md/raid1:md1: Disk failure on nvme2n1, disabling device.
[ 152.648799] md/raid1:md1: Operation continuing on 1 devices.
But we're left with the problem that the re-attached namespace now has a
different device name (/dev/nvme2n2 vs /dev/nvme2n1), so MD cannot
reattach the device seamlessly and needs manual intervention.
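The manual intervention boils down to something like this (sketch only;
device names as in the listing above): drop the failed leg and re-add
the namespace under its new name, after which MD resyncs onto it:

  # mdadm /dev/md1 --fail /dev/nvme2n1 --remove /dev/nvme2n1
  # mdadm /dev/md1 --add /dev/nvme2n2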
Note: this has been tested with the latest nvme-5.13, and the situation
has improved somewhat compared to previous versions, in that MD is now
able to recover after manual intervention.
But we still end up with a namespace with the wrong name.
Cheers,
Hannes