[RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param

Nilay Shroff nilay at linux.ibm.com
Tue Apr 29 00:15:49 PDT 2025



On 4/29/25 12:31 PM, Hannes Reinecke wrote:
> On 4/29/25 08:24, Nilay Shroff wrote:
>>
>>
>> On 4/29/25 11:19 AM, Hannes Reinecke wrote:
>>> On 4/28/25 09:39, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>>>> Currently, a multipath head disk node is not created for single-ported
>>>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>>>> these cases can help transparently handle transient PCIe link failures.
>>>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>>>> making it difficult to tolerate such link failures. To address this,
>>>>>> this commit introduces the nvme_core module parameter multipath_head_always.
>>>>>>
>>>>>> When this param is set to true, it forces the creation of a multipath
>>>>>> head node regardless of the NVMe disk or namespace type. This option
>>>>>> therefore allows the delayed removal of the head node even for single-
>>>>>> ported NVMe disks and private namespaces, and thus helps transparently
>>>>>> handle transient PCIe link failures.
>>>>>>
>>>>>> By default multipath_head_always is set to false, thus preserving the
>>>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>>>> also implicitly enable nvme_core.multipath.
>>>>>>
>>>>>> Signed-off-by: Nilay Shroff <nilay at linux.ibm.com>
>>>>>> ---
>>>>>>     drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>>>     1 file changed, 63 insertions(+), 7 deletions(-)
>>>>>>
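(As a rough illustration of what the proposed knob boils down to, here is a
minimal sketch, not the actual diff: multipath_head_always is the parameter
from the patch, while 'multipath' below stands in for the existing
nvme_core.multipath knob and the helper name and ns_is_shared argument are
made up for this example.)

#include <linux/module.h>

/*
 * Illustrative sketch only, not the actual patch.
 */
static bool multipath = true;		/* stand-in for the existing nvme_core.multipath */
static bool multipath_head_always;	/* new knob proposed by this patch */
module_param(multipath_head_always, bool, 0444);
MODULE_PARM_DESC(multipath_head_always,
	"force a multipath head node for single-ported/private namespaces");

static bool demo_head_disk_wanted(bool ns_is_shared)
{
	/* per the description above, setting the knob implies multipath */
	if (multipath_head_always)
		return true;
	/* otherwise keep today's behaviour: only shared namespaces get one */
	return multipath && ns_is_shared;
}

(With such a knob, something like 'modprobe nvme_core multipath_head_always=Y'
would then force a head disk node for every namespace.)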
>>>>> I really would model this according to dm-multipath, where we have the
>>>>> 'fail_if_no_path' flag.
>>>>> This can be set for PCIe devices to retain the current behaviour
>>>>> (which we need for things like 'md' on top of NVMe) whenever this
>>>>> flag is set.
>>>>>
>>>> Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
>>>> under the head disk node is _NOT_ configured (or delayed_removal_secs
>>>> is set to zero), the internal flag "fail_if_no_path" is set to true,
>>>> and in the other case, when "delayed_removal_secs" is set to a
>>>> non-zero value, we set "fail_if_no_path" to false. Is that correct?
>>>>
>>> Don't make it overly complicated.
>>> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
>>> mapped onto delayed_removal_secs; if the value is '0' then the head
>>> disk is immediately removed (the 'fail_if_no_path' case), and if it's
>>> -1 it is never removed (the 'queue_if_no_path' case).
>>>
>> Yes, if the value of delayed_removal_secs is 0 then the head is immediately
>> removed. However, if delayed_removal_secs is anything but zero
>> (i.e. greater than zero, as delayed_removal_secs is unsigned) then the head
>> is removed only after delayed_removal_secs has elapsed and the disk
>> couldn't recover from the transient link failure in that time. We never
>> pin the head node indefinitely.
>>
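(To make the mapping above concrete, a minimal sketch of the delayed-removal
idea; the struct, field, and helper names are made up for illustration, and
initialisation of the work item is omitted.)

#include <linux/jiffies.h>
#include <linux/workqueue.h>

/* Illustrative only: names are made up; INIT_DELAYED_WORK() etc. omitted. */
struct demo_ns_head {
	unsigned int delayed_removal_secs;	/* sysfs-tunable, 0 = remove at once */
	struct delayed_work remove_work;	/* tears down the head disk node */
};

/* Called when the last path to the head node disappears. */
static void demo_last_path_gone(struct demo_ns_head *head)
{
	if (!head->delayed_removal_secs) {
		/* the 'fail_if_no_path' behaviour: remove the head right away */
		schedule_delayed_work(&head->remove_work, 0);
		return;
	}
	/* keep the head node around so a transient PCIe link failure can heal */
	schedule_delayed_work(&head->remove_work,
			      head->delayed_removal_secs * HZ);
}

/* Called if a path reappears before the timer fires. */
static void demo_path_back(struct demo_ns_head *head)
{
	cancel_delayed_work_sync(&head->remove_work);
}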
>>> Question, though: How does it interact with the existing 'ctrl_loss_tmo'? Both describe essentially the same situation...
>>>
>> The delayed_removal_secs is modeled for NVMe PCIe adapters. So it really
>> doesn't interact or interfere with ctrl_loss_tmo, which is a fabrics
>> controller option.
>>
> Not so sure here.
> You _could_ expand the scope for ctrl_loss_tmo to PCI, too;
> as most PCI devices will only ever have one controller, 'ctrl_loss_tmo'
> will be identical to 'delayed_removal_secs'.
> 
> So I guess my question is: is there value for fabrics in controlling
> the lifetime of struct ns_head independently of the lifetime of the
> controller?
> 
The ctrl_loss_tmo option doesn't actually control the lifetime of the
ns_head. Rather, ctrl_loss_tmo allows fabric I/O commands to fail fast so
that they don't get stuck while the host NVMe-oF controller is in the
reconnecting state. A user may not want to wait that long while the
fabric controller is reconnecting after it loses its connection to the
target. Typically, the default reconnect timeout is 10 minutes, which is
far longer than the expected 30-second timeout for any I/O command to
fail.
You may find more details in commit 8c4dfea97f15 ("nvme-fabrics:
reject I/O to offline device"), which implements ctrl_loss_tmo.
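
(A minimal sketch of the ctrl_loss_tmo idea, for comparison; the struct and
helper below are made up and this is not the actual fabrics code.)

#include <linux/ktime.h>

/*
 * While a fabrics controller is reconnecting, I/O is requeued; once more
 * than ctrl_loss_tmo seconds have passed since the connection was lost,
 * the controller is torn down and pending I/O fails fast instead of
 * hanging for the full (default 10 minute) reconnect window.
 */
struct demo_fabrics_ctrl {
	int ctrl_loss_tmo;		/* seconds; negative means retry forever */
	time64_t connection_lost;	/* when the link to the target dropped */
};

static bool demo_ctrl_loss_expired(struct demo_fabrics_ctrl *ctrl)
{
	if (ctrl->ctrl_loss_tmo < 0)
		return false;	/* keep reconnecting indefinitely */

	return ktime_get_real_seconds() - ctrl->connection_lost >
	       ctrl->ctrl_loss_tmo;
}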

Thanks,
--Nilay


