[PATCH 3/3] nvme-multipath: stuck partition scan on inaccessible paths
Hannes Reinecke
hare at suse.de
Thu Sep 12 07:30:03 PDT 2024
On 9/12/24 12:29, Sagi Grimberg wrote:
>
>
>
> On 11/09/2024 19:10, Hannes Reinecke wrote:
>> On 9/11/24 17:20, Sagi Grimberg wrote:
>>>
>>>
>>>
>>> On 11/09/2024 15:46, Hannes Reinecke wrote:
>>>> When a path is switched to 'inaccessible' during the partition scan
>>>> triggered via device_add_disk(), and we only have one path, the
>>>> system will be stuck as nvme_available_path() will always return
>>>> 'true'. So the I/O will never complete and the system hangs
>>>> in device_add_disk():
>>>>
>>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>>> [<0>] filemap_read_folio+0x97/0xd0
>>>> [<0>] do_read_cache_folio+0x108/0x390
>>>> [<0>] read_part_sector+0x31/0xa0
>>>> [<0>] read_lba+0xc5/0x160
>>>> [<0>] efi_partition+0xd9/0x8f0
>>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>>> [<0>] blkdev_get_whole+0x78/0xc0
>>>> [<0>] bdev_open+0x2c6/0x3b0
>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>> [<0>] disk_scan_partitions+0x5d/0x100
>>>> [<0>] device_add_disk+0x402/0x420
>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>
>>>> This patch introduces a flag NVME_NSHEAD_FAIL_ON_LAST_PATH which
>>>> causes nvme_available_path() to return NULL, and with that I/O
>>>> to be failed, once the last remaining path is unavailable. But
>>>> we also need to requeue all pending I/Os whenever the ANA state
>>>> has changed, as now even inaccessible ANA states influence I/O
>>>> behaviour.
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>> ---
>>>> drivers/nvme/host/multipath.c | 15 +++++++++++++++
>>>> drivers/nvme/host/nvme.h | 1 +
>>>> 2 files changed, 16 insertions(+)
>>>>
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index f72c5a6a2d8e..0373b4043eea 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -422,6 +422,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>>  	struct nvme_ns *ns;
>>>>  	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> +		return NULL;
>>>> +
>>>> +	if (test_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags) &&
>>>> +	    list_is_singular(&head->list))
>>>>  		return NULL;
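
For readers skimming the diff: with this hunk applied the top of
nvme_available_path() reads as below. The 'return NULL' from a bool
function is exactly as in the posted hunk, and the rest of the function
is unchanged and elided here.

static bool nvme_available_path(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;

	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
		return NULL;

	if (test_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags) &&
	    list_is_singular(&head->list))
		return NULL;

	/* ... existing per-path loop follows ... */
}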
>>>
>>> This is wrong too Hannes.
>>>
>>> If you added a live path, and when partition scan triggers your path
>>> becomes
>>> inaccessible, but then after anatt it becomes optimized again. You
>>> just ended up without discovering partitions...
>>
>> But we will be getting an ANA state change AEN in that case, seeing
>> that we have enabled ANA state change events and the spec states that
>> we should be getting them (NVMe 1.3, section 8.21.4 'Host ANA
>> Transition Operation' explicitly has an 'or' between points a) and b) ...).
>
> That is too late, you already failed the IO...
>
Which wouldn't matter if we were able to properly rescan the namespace
once we get an ANA event. But device_add_disk() completely ignores I/O
errors, so no chance there ...
>> Which is what we rely upon anyway; otherwise we would have had to
>> check for INACCESSIBLE in nvme_update_ana_state, and not just for
>> CHANGE as we do now.
>
> I am referring to the fact that you fail the io and not simply queueing
> it for future requeue.
>
Yes.
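For anyone following along: the fail-vs-requeue decision Sagi refers to
is made in nvme_ns_head_submit_bio(). Roughly paraphrased, not the
verbatim upstream code:

	ns = nvme_find_path(head);
	if (likely(ns)) {
		/* dispatch the bio down the selected path */
	} else if (nvme_available_path(head)) {
		/*
		 * No usable path right now, but one may come back:
		 * park the bio on head->requeue_list for a later retry.
		 */
	} else {
		/* no path is expected to ever return: fail the I/O */
		bio_io_error(bio);
	}

So once nvme_available_path() stops reporting an available path, pending
and new I/O is failed instead of being requeued.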
>> Plus the situation will be exactly the same if one uses 'persistent
>> loss' instead of 'inaccessible'...
>
> That is a side point, that needs to be addressed separately.
>
Not sure if I agree, but anyway...
>>
>> And incidentally, the same issue happens when you disable the
>> namespace in nvmet just after the namespace got scanned, but before
>> device_add_disk() is called. So really it's not about ANA, but rather
>> about any path-related error here.
>
> Right, which is transient semantically. You are making a decision here
> to FAIL the IO, and it needs a strong justification. We can't
> just sporadically fail IO...
>
> The way that I see it, for as long as there is a possibility for an IO
> to be executed in the future (i.e. a path exists) we queue up
> the IO. IMO if you want to fail it before, fast_io_fail_tmo is designed
> to address it.
So we'd need a timer (or to modify the ANATT timer) to terminate all
pending I/O on inaccessible paths, and to requeue/keep it until then.
Okay, let's see how this pans out.
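
Something along these lines, perhaps. This is only a rough sketch of the
idea, not a tested patch; the 'fail_work' member, nvme_head_fail_work()
and 'anatt_secs' are made-up names for illustration:

/*
 * Rough sketch only: arm a delayed work when the last path goes
 * inaccessible and let it expire after the ANATT interval.
 * 'fail_work', nvme_head_fail_work() and 'anatt_secs' are hypothetical
 * names, not existing kernel symbols.
 */
static void nvme_head_fail_work(struct work_struct *work)
{
	struct nvme_ns_head *head =
		container_of(to_delayed_work(work),
			     struct nvme_ns_head, fail_work);

	/* stop requeueing and flush out the pending I/O */
	set_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags);
	kblockd_schedule_work(&head->requeue_work);
}

	/* when the last path transitions to an inaccessible ANA state: */
	schedule_delayed_work(&head->fail_work, anatt_secs * HZ);

	/* when a path becomes live again (ANA AEN, rescan): */
	cancel_delayed_work_sync(&head->fail_work);
	clear_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags);

That would keep I/O queued while a recovery within ANATT is still
possible, and only fail it once the timer expires.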
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich