[PATCH 3/3] nvme-multipath: stuck partition scan on inaccessible paths
Sagi Grimberg
sagi at grimberg.me
Thu Sep 12 03:29:01 PDT 2024
On 11/09/2024 19:10, Hannes Reinecke wrote:
> On 9/11/24 17:20, Sagi Grimberg wrote:
>>
>> On 11/09/2024 15:46, Hannes Reinecke wrote:
>>> When a path is switched to 'inaccessible' during the partition scan
>>> triggered via device_add_disk(), and we only have one path, the
>>> system will get stuck, as nvme_available_path() will always return
>>> 'true'. The I/O will never complete and the system hangs in
>>> device_add_disk():
>>>
>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>> [<0>] filemap_read_folio+0x97/0xd0
>>> [<0>] do_read_cache_folio+0x108/0x390
>>> [<0>] read_part_sector+0x31/0xa0
>>> [<0>] read_lba+0xc5/0x160
>>> [<0>] efi_partition+0xd9/0x8f0
>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>> [<0>] blkdev_get_whole+0x78/0xc0
>>> [<0>] bdev_open+0x2c6/0x3b0
>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>> [<0>] disk_scan_partitions+0x5d/0x100
>>> [<0>] device_add_disk+0x402/0x420
>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>
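For context: before this series, nvme_available_path() decides based on
controller state alone and never looks at the ANA state, so a single
connected-but-inaccessible path still counts as available and the
partition-scan read stays parked indefinitely. Roughly, paraphrasing the
mainline logic (not this series' exact code):

static bool nvme_available_path(struct nvme_ns_head *head)
{
        struct nvme_ns *ns;

        if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
                return false;

        list_for_each_entry_srcu(ns, &head->list, siblings,
                                 srcu_read_lock_held(&head->srcu)) {
                if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
                        continue;
                switch (nvme_ctrl_state(ns->ctrl)) {
                case NVME_CTRL_LIVE:
                case NVME_CTRL_RESETTING:
                case NVME_CTRL_CONNECTING:
                        /*
                         * An ANA-inaccessible path on a live controller
                         * still ends up here, i.e. counts as available.
                         */
                        return true;
                default:
                        break;
                }
        }
        return false;
}
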
>>> This patch introduces a flag, NVME_NSHEAD_FAIL_ON_LAST_PATH, which
>>> causes nvme_available_path() to return NULL when only a single path
>>> is left, so that I/O is failed if that last path is unavailable.
>>> But we also need to requeue all pending I/Os whenever the ANA state
>>> changes, as now even inaccessible ANA states influence I/O
>>> behaviour.
>>>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>> drivers/nvme/host/multipath.c | 15 +++++++++++++++
>>> drivers/nvme/host/nvme.h | 1 +
>>> 2 files changed, 16 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index f72c5a6a2d8e..0373b4043eea 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -422,6 +422,10 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>> struct nvme_ns *ns;
>>> if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> + return NULL;
>>> +
>>> + if (test_bit(NVME_NSHEAD_FAIL_ON_LAST_PATH, &head->flags) &&
>>> + list_is_singular(&head->list))
>>> return NULL;
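
For readers following the "requeue all pending I/Os" part of the
description (not shown in the quoted hunk): pending multipath I/O sits on
head->requeue_list and is resubmitted by head->requeue_work, so "requeue"
here means kicking that work item after an ANA update. The existing
worker looks roughly like this (paraphrased, not part of this patch):

static void nvme_requeue_work(struct work_struct *work)
{
        struct nvme_ns_head *head =
                container_of(work, struct nvme_ns_head, requeue_work);
        struct bio *bio, *next;

        spin_lock_irq(&head->requeue_lock);
        next = bio_list_get(&head->requeue_list);
        spin_unlock_irq(&head->requeue_lock);

        while ((bio = next) != NULL) {
                next = bio->bi_next;
                bio->bi_next = NULL;
                /* resubmission re-enters path selection from the top */
                submit_bio_noacct(bio);
        }
}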
>>
>> This is wrong too, Hannes.
>>
>> If you added a live path, and while the partition scan runs the path
>> becomes inaccessible but then after ANATT it becomes optimized again,
>> you just ended up without discovering partitions...
>
> But we will be getting an ANA state change AEN in that case, seeing
> that we have enabled ANA state change events and the spec states that
> we should be getting them (NVMe 1.3, section 8.21.4 Host ANA
> Transition Operation explicitly has an 'or' between point a) and b) ...).
That is too late, you already failed the I/O...
> Which is what we rely upon anyway, otherwise we would have had to
> check for INACCESSIBLE in nvme_update_ana_state, and not just for
> CHANGE as we do now.
I am referring to the fact that you fail the I/O instead of simply
queueing it for a future requeue.
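To make the queue-versus-fail distinction concrete, this is roughly the
no-path branch in nvme_ns_head_submit_bio() (paraphrased excerpt):

        ns = nvme_find_path(head);
        if (likely(ns)) {
                /* ... remap the bio and submit it down the chosen path ... */
        } else if (nvme_available_path(head)) {
                /*
                 * Some path may still come back: park the bio,
                 * requeue_work will resubmit it later.
                 */
                spin_lock_irq(&head->requeue_lock);
                bio_list_add(&head->requeue_list, bio);
                spin_unlock_irq(&head->requeue_lock);
        } else {
                /* nothing will ever be able to serve this bio: fail it */
                bio_io_error(bio);
        }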
> Plus the situation will be exactly the same if one uses 'persistent
> loss' instead of 'inaccessible'...
That is a side point, that needs to be addressed separately.
>
> And incidentally, the same issue happens when you disable the
> namespace in nvmet just after the namespace got scanned, but before
> device_add_disk() is called. So really it's not about ANA, but rather
> about any path-related error here.
Right, which is semantically transient. You are making a decision here
to FAIL the I/O, and it needs a strong justification. We cannot
just sporadically fail I/O...
The way that I see it, for as long as there is a possibility for an I/O
to be executed in the future (i.e. a path exists), we queue up the I/O.
IMO if you want to fail it earlier, fast_io_fail_tmo is designed to
address that.
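
For reference, fast_io_fail_tmo already feeds into exactly this check:
once the fabrics failfast timer fires while a controller is still
reconnecting, NVME_CTRL_FAILFAST_EXPIRED gets set and
nvme_available_path() stops counting that controller, so parked I/O can
then be failed. Roughly (paraphrased from mainline):

static void nvme_failfast_work(struct work_struct *work)
{
        struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
                        struct nvme_ctrl, failfast_work);

        if (nvme_ctrl_state(ctrl) != NVME_CTRL_CONNECTING)
                return;

        /* after fast_io_fail_tmo this controller no longer counts as a path */
        set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
        dev_info(ctrl->device, "failfast expired\n");
        /* wake up the requeue workers so parked bios get re-evaluated */
        nvme_kick_requeue_lists(ctrl);
}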