[PATCH 3/3] nvme-multipath: stuck partition scan on inaccessible paths

Sagi Grimberg sagi at grimberg.me
Thu Sep 12 03:29:01 PDT 2024




On 11/09/2024 19:10, Hannes Reinecke wrote:
> On 9/11/24 17:20, Sagi Grimberg wrote:
>>
>>
>>
>> On 11/09/2024 15:46, Hannes Reinecke wrote:
>>> When a path is switched to 'inaccessible' during the partition scan
>>> triggered via device_add_disk() and we only have one path, the system
>>> will be stuck, as nvme_available_path() will always return 'true'.
>>> I/O will never complete and the system hangs in device_add_disk():
>>>
>>>      [<0>] folio_wait_bit_common+0x12a/0x310
>>>      [<0>] filemap_read_folio+0x97/0xd0
>>>      [<0>] do_read_cache_folio+0x108/0x390
>>>      [<0>] read_part_sector+0x31/0xa0
>>>      [<0>] read_lba+0xc5/0x160
>>>      [<0>] efi_partition+0xd9/0x8f0
>>>      [<0>] bdev_disk_changed+0x23d/0x6d0
>>>      [<0>] blkdev_get_whole+0x78/0xc0
>>>      [<0>] bdev_open+0x2c6/0x3b0
>>>      [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>      [<0>] disk_scan_partitions+0x5d/0x100
>>>      [<0>] device_add_disk+0x402/0x420
>>>      [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>      [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>      [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>      [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>      [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>
>>> This patch introduces a flag NVME_NSHEAD_FAIL_ON_LAST_PATH which
>>> causes nvme_available_path() to return NULL, and with that I/O to be
>>> failed, if the last remaining path is unavailable. But we also need
>>> to requeue all pending I/Os whenever the ANA state changes, as now
>>> even inaccessible ANA states influence I/O behaviour.
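
(The requeue part is not visible in the hunk below; I assume it amounts to
kicking head->requeue_work after the ANA state update so that parked bios are
re-evaluated against nvme_available_path(). A purely hypothetical sketch --
nvme_mpath_kick_requeue() is a made-up name, only head->requeue_work and
kblockd_schedule_work() are taken from the existing driver:

    /* Hypothetical sketch, not taken from the posted patch: after updating
     * ns->ana_state, kick the head's requeue work so bios parked on
     * head->requeue_list are re-evaluated, even when the new ANA state is
     * not live.
     */
    static void nvme_mpath_kick_requeue(struct nvme_ns *ns)
    {
        struct nvme_ns_head *head = ns->head;

        if (head && head->disk)
            kblockd_schedule_work(&head->requeue_work);
    }
)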
>>>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>>   drivers/nvme/host/multipath.c | 15 +++++++++++++++
>>>   drivers/nvme/host/nvme.h      |  1 +
>>>   2 files changed, 16 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index f72c5a6a2d8e..0373b4043eea 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -422,6 +422,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>       struct nvme_ns *ns;
>>>       if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> +        return NULL;
>>> +
>>> +    if (test_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags) &&
>>> +        list_is_singular(&head->list))
>>>           return NULL;
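
For reference, with this hunk applied the function would read roughly as
follows. The path-scanning loop is paraphrased from the current upstream
driver, not from the patch, and note that returning NULL from a bool
function simply evaluates to false:

    static bool nvme_available_path(struct nvme_ns_head *head)
    {
        struct nvme_ns *ns;

        if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
            return NULL;

        /* Added by this patch: with the flag set and only one path left,
         * report "no available path" so parked I/O is failed instead of
         * being requeued forever.
         */
        if (test_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags) &&
            list_is_singular(&head->list))
            return NULL;

        list_for_each_entry_rcu(ns, &head->list, siblings) {
            if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
                continue;
            switch (nvme_ctrl_state(ns->ctrl)) {
            case NVME_CTRL_LIVE:
            case NVME_CTRL_RESETTING:
            case NVME_CTRL_CONNECTING:
                return true;
            default:
                break;
            }
        }
        return false;
    }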
>>
>> This is wrong too, Hannes.
>>
>> If you added a live path, and when the partition scan triggers the
>> path becomes inaccessible but then, after anatt, becomes optimized
>> again, you just ended up without discovering any partitions...
>
> But we will be getting an ANA state change AEN in that case, seeing
> that we have enabled ANA state change events and the spec states that
> we should be getting them (NVMe 1.3, section 8.21.4 'Host ANA
> Transition Operation' explicitly has an 'or' between points a) and b) ...).

That is too late; you already failed the I/O...

> Which is what we rely upon anyway, otherwise we would have had to 
> check for INACCESSIBLE in nvme_update_ana_state, and not just for 
> CHANGE as we do now.

I am referring to the fact that you fail the I/O instead of simply
queueing it for a future requeue.
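
To make that concrete, nvme_ns_head_submit_bio() currently does roughly the
following when no usable path is found (paraphrased from the upstream
driver): the bio is either parked for a future requeue or failed outright,
depending on what nvme_available_path() reports:

    } else if (nvme_available_path(head)) {
        dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");

        /* park the bio; it is resubmitted once a path comes back */
        spin_lock_irq(&head->requeue_lock);
        bio_list_add(&head->requeue_list, bio);
        spin_unlock_irq(&head->requeue_lock);
    } else {
        dev_warn_ratelimited(dev, "no available path - failing I/O\n");

        /* this is the failure the new flag opts into */
        bio_io_error(bio);
    }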

> Plus the situation will be exactly the same if one uses 'persistent 
> loss' instead of 'inaccessible'...

That is a side point that needs to be addressed separately.

>
> And incidentally, the same issue happens when you disable the 
> namespace in nvmet just after the namespace got scanned, but before 
> device_add_disk() is called. So really it's not about ANA, but rather 
> about any path-related error here.

Right, and that is semantically a transient condition. You are making a
decision here to FAIL the I/O, and that needs a strong justification.
We can't just sporadically fail I/O...

The way I see it, for as long as there is a possibility for an I/O to be
executed in the future (i.e. a path exists), we queue up the I/O. IMO,
if you want to fail it earlier than that, fast_io_fail_tmo is designed
to address exactly this.
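
For reference, fast_io_fail_tmo is wired up in the core roughly as follows
(paraphrased from memory): a delayed work armed when the controller starts
reconnecting sets NVME_CTRL_FAILFAST_EXPIRED once the timeout expires and
kicks the requeue lists, so nvme_available_path() stops counting that path
and any parked I/O is failed through the branch shown above:

    static void nvme_failfast_work(struct work_struct *work)
    {
        struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
                struct nvme_ctrl, failfast_work);

        /* only relevant while we are still trying to reconnect */
        if (nvme_ctrl_state(ctrl) != NVME_CTRL_CONNECTING)
            return;

        set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
        dev_info(ctrl->device, "failfast expired\n");
        /* wake up parked bios so they are failed instead of requeued */
        nvme_kick_requeue_lists(ctrl);
    }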


