[PATCHv3] nvme-mpath: delete disk after last connection

Hannes Reinecke hare at suse.de
Thu May 6 07:13:51 BST 2021


On 5/5/21 10:40 PM, Sagi Grimberg wrote:
> 
>>>>>> As stated in the v3 review this is an incompatible change.  We'll 
>>>>>> need
>>>>>> the queue_if_no_path attribute first, and default it to on to keep
>>>>>> compatibility.
>>>>>>
>>>>>
>>>>> That is what I tried the last time, but the direction I got was to
>>>>> treat both NVMe-PCI and NVMe-oF identically:
>>>>> (https://lore.kernel.org/linux-nvme/34e5c178-8bc4-68d3-8374-fbc1b451b6e8@grimberg.me/)
>>>>>
>>>>
>>>> Yes, I'm not sure I understand your comment, Christoph. This
>>>> addresses an issue with mdraid where hot unplug+replug does not
>>>> restore the device to the raid group (PCI and fabrics alike),
>>>> whereas before multipath this used to work.
>>>>
>>>>
>>>> queue_if_no_path is a dm-multipath feature, so I'm not entirely clear
>>>> what the concern is. mdraid on nvme (PCI/fabrics) used to work a
>>>> certain way; with the introduction of nvme-mpath that behavior was
>>>> broken (as far as I understand from Hannes).
>>>>
>>>> My thinking is that if we want queue_if_no_path functionality in nvme
>>>> mpath, we should state it explicitly as a proper attribute, so that
>>>> people who actually need it will use it and mdraid will function
>>>> correctly again. Also, queue_if_no_path really only applies when all
>>>> paths are gone in the sense of being completely removed; it does not
>>>> apply to controller reset/reconnect.
>>>>
>>>> I agree we should probably have a queue_if_no_path attribute on the
>>>> mpath device, but it doesn't sound right to default it to true given
>>>> that it breaks mdraid stacking on top of it.
>>>
>>> If you want "queue_if_no_path" behavior, can't you just set really high
>>> reconnect_delay and ctrl_loss_tmo values? That prevents the path from
>>> being deleted while it is unreachable and restarts I/O on the existing
>>> path once the connection is re-established.
>>>
>> Precisely my thinking.
>> We _could_ add a queue_if_no_path attribute, but we can also achieve the
>> same behaviour by setting the ctrl_loss_tmo value to infinity.
>> Provided we can change it on the fly, though; if not, that's easily
>> fixed.
>>
>> In fact, that's what we recommend to our customers to avoid the bug
>> fixed by this patch.
> 
> You can change ctrl_loss_tmo on the fly. How does that address the
> issue? Is the original issue that ctrl_loss_tmo expires for fabrics, or
> PCI unplug (to which ctrl_loss_tmo does not apply)?

Yes. It becomes particularly noticeable with TCP fabrics, where the link
can go down for an extended time.
The system will try to reconnect until ctrl_loss_tmo kicks in; if the
link only gets re-established after that time, your system is hosed.
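
For reference, this is roughly the workaround we currently recommend to
customers (address, port, NQN and controller name below are just
placeholders, and the on-the-fly knob assumes a kernel that exposes
ctrl_loss_tmo in sysfs):

  # connect with an effectively infinite ctrl_loss_tmo so the path is
  # never torn down while the target is unreachable (-1 = retry forever)
  nvme connect --transport=tcp --traddr=192.168.1.10 --trsvcid=4420 \
      --nqn=nqn.2021-05.io.example:subsys1 \
      --ctrl-loss-tmo=-1 --reconnect-delay=10

  # or raise it on the fly for an already connected controller
  echo -1 > /sys/class/nvme/nvme0/ctrl_loss_tmo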

With this patch I/O is still killed, but at least you can then 
re-establish the connection by just calling

nvme connect

and the nvme device will be reconnected such that you can call

mdadm --re-add

to resync the device.
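
Concretely, the recovery sequence with this patch applied would look
roughly like the following (device, array and connection parameters are
examples only):

  # re-create the path once the target is reachable again
  nvme connect --transport=tcp --traddr=192.168.1.10 --trsvcid=4420 \
      --nqn=nqn.2021-05.io.example:subsys1

  # re-add the returned namespace to the degraded array and let it resync
  mdadm /dev/md0 --re-add /dev/nvme0n1
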
With the current implementation you are out of luck, as I/O is still
pending on the disconnected original nvme device and you have no way to
flush it. Consequently you can't detach it from the MD array, and, again,
your system is hosed.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


