[PATCH 3/3] nvme-multipath: stuck partition scan on inaccessible paths

Hannes Reinecke hare at suse.de
Thu Sep 12 07:30:03 PDT 2024


On 9/12/24 12:29, Sagi Grimberg wrote:
> 
> 
> 
> On 11/09/2024 19:10, Hannes Reinecke wrote:
>> On 9/11/24 17:20, Sagi Grimberg wrote:
>>>
>>>
>>>
>>> On 11/09/2024 15:46, Hannes Reinecke wrote:
>>>> When a path is switched to 'inaccessible' during the partition scan
>>>> triggered via device_add_disk(), and we only have this one path, the
>>>> system will be stuck as nvme_available_path() will always return
>>>> 'true'. So I/O is never completed and the system hangs in
>>>> device_add_disk():
>>>>
>>>>      [<0>] folio_wait_bit_common+0x12a/0x310
>>>>      [<0>] filemap_read_folio+0x97/0xd0
>>>>      [<0>] do_read_cache_folio+0x108/0x390
>>>>      [<0>] read_part_sector+0x31/0xa0
>>>>      [<0>] read_lba+0xc5/0x160
>>>>      [<0>] efi_partition+0xd9/0x8f0
>>>>      [<0>] bdev_disk_changed+0x23d/0x6d0
>>>>      [<0>] blkdev_get_whole+0x78/0xc0
>>>>      [<0>] bdev_open+0x2c6/0x3b0
>>>>      [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>>      [<0>] disk_scan_partitions+0x5d/0x100
>>>>      [<0>] device_add_disk+0x402/0x420
>>>>      [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>>      [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>>      [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>>      [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>>      [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>
>>>> This patch introduces a flag NVME_NSHEAD_FAIL_ON_LAST_PATH which
>>>> causes nvme_available_path() to return NULL, and with that I/O to
>>>> be failed, if the last remaining path is unavailable. But we then
>>>> also need to requeue all pending I/O whenever the ANA state has
>>>> changed, as now even inaccessible ANA states influence I/O
>>>> behaviour.
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>> ---
>>>>   drivers/nvme/host/multipath.c | 15 +++++++++++++++
>>>>   drivers/nvme/host/nvme.h      |  1 +
>>>>   2 files changed, 16 insertions(+)
>>>>
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index f72c5a6a2d8e..0373b4043eea 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -422,6 +422,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>>       struct nvme_ns *ns;
>>>>       if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> +        return NULL;
>>>> +
>>>> +    if (test_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags) &&
>>>> +        list_is_singular(&head->list))
>>>>           return NULL;
>>>
>>> This is wrong too, Hannes.
>>>
>>> If you added a live path, and while the partition scan runs the path
>>> becomes inaccessible but then, after ANATT, becomes optimized again,
>>> you just ended up without discovering any partitions...
>>
>> But we will be getting an ANA state change AEN in that case, seeing 
>> that we have enabled ANA state change events and the spec states that 
>> we should be getting them (NVMe 1.3, section 8.21.4 Host ANA 
>> Transition Operation explicitly has an 'or' between point a) and b) ...).
> 
> That is too late, you already failed the IO...
> 
Which wouldn't matter if we were able to properly rescan the namespace
once we get an ANA event. But device_add_disk() completely ignores I/O
errors, so no chance there ...
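
(For context, this is roughly the decision point in
nvme_ns_head_submit_bio() that produces the hang above; paraphrased
from memory, so the exact code in the tree this patch is against may
differ in detail:

	ns = nvme_find_path(head);
	if (likely(ns)) {
		bio_set_dev(bio, ns->disk->part0);
		submit_bio_noacct(bio);
	} else if (nvme_available_path(head)) {
		/*
		 * No usable path right now, but the path list is not
		 * empty and the controller is still LIVE, so park the
		 * bio for a later requeue.  With a single path stuck
		 * in an inaccessible ANA state this branch is taken
		 * every time, and the partition scan read never
		 * completes.
		 */
		spin_lock_irq(&head->requeue_lock);
		bio_list_add(&head->requeue_list, bio);
		spin_unlock_irq(&head->requeue_lock);
	} else {
		/* No path can ever serve this I/O again: fail it. */
		bio_io_error(bio);
	}

With the proposed NVME_NSHEAD_FAIL_ON_LAST_PATH flag set and a singular
path list, nvme_available_path() returns NULL, so the bio takes the
bio_io_error() leg instead of being parked.)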

>> Which is what we rely upon anyway, otherwise we would have had to 
>> check for INACCESSIBLE in nvme_update_ana_state, and not just for 
>> CHANGE as we do now.
> 
> I am referring to the fact that you fail the I/O instead of simply
> queueing it for a future requeue.
> 
Yes.

>> Plus the situation will be exactly the same if one uses 'persistent 
>> loss' instead of 'inaccessible'...
> 
> That is a side point, that needs to be addressed separately.
> 
Not sure if I agree, but anyway...

>>
>> And incidentally, the same issue happens when you disable the 
>> namespace in nvmet just after the namespace got scanned, but before 
>> device_add_disk() is called. So really it's not about ANA, but rather 
>> about any path-related error here.
> 
> Right, which is semantically a transient condition. You are making a
> decision here to FAIL the I/O, and that needs a strong justification.
> We can't just sporadically fail I/O...
> 
> The way I see it, as long as there is a possibility for an I/O to be
> executed in the future (i.e. a path exists), we queue up the I/O. IMO
> if you want to fail it earlier, fast_io_fail_tmo is designed to
> address that.

So we'd need a timer (or to modify the ANATT timer) to terminate all
pending I/O on inaccessible paths, and requeue/keep it until then.
Okay, let's see how this pans out.
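
Maybe something along these lines (completely untested sketch; the
'fail_work' member and the two function names are made up for
illustration, only the requeue_list handling mirrors what multipath.c
already does today):

	static void nvme_mpath_fail_work(struct work_struct *work)
	{
		struct nvme_ns_head *head =
			container_of(to_delayed_work(work),
				     struct nvme_ns_head, fail_work);
		struct bio *bio, *next;

		/* A path came back within ANATT: leave the I/O queued. */
		if (nvme_available_path(head))
			return;

		/* Still no usable path: terminate everything we parked. */
		spin_lock_irq(&head->requeue_lock);
		next = bio_list_get(&head->requeue_list);
		spin_unlock_irq(&head->requeue_lock);

		while ((bio = next) != NULL) {
			next = bio->bi_next;
			bio->bi_next = NULL;
			bio_io_error(bio);
		}
	}

	/* Armed from the ANA handler when the last path goes inaccessible;
	 * ANATT is in seconds, fall back to a minute if it is zero. */
	static void nvme_mpath_arm_fail_timer(struct nvme_ns_head *head,
					      u8 anatt)
	{
		unsigned long timeout = (anatt ? anatt : 60) * HZ;

		queue_delayed_work(nvme_wq, &head->fail_work, timeout);
	}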

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



