[PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces
Sagi Grimberg
sagi at grimberg.me
Thu Sep 5 00:06:29 PDT 2024
On 04/09/2024 11:59, Hannes Reinecke wrote:
> On 9/4/24 10:20, Hannes Reinecke wrote:
>> On 9/3/24 21:38, Sagi Grimberg wrote:
>>>
>>>
>>>
>>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>>> During repetitive namespace remapping operations (i.e. removing a
>>>> namespace and provisioning a different namespace with the same NSID)
>>>> on the target, the namespace might have changed between the time the
>>>> initial scan was performed and the time the partition scan was invoked
>>>> by device_add_disk() in nvme_mpath_set_live(). We then end up with a
>>>> stuck scanning process:
>>>>
>>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>>> [<0>] filemap_read_folio+0x97/0xd0
>>>> [<0>] do_read_cache_folio+0x108/0x390
>>>> [<0>] read_part_sector+0x31/0xa0
>>>> [<0>] read_lba+0xc5/0x160
>>>> [<0>] efi_partition+0xd9/0x8f0
>>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>>> [<0>] blkdev_get_whole+0x78/0xc0
>>>> [<0>] bdev_open+0x2c6/0x3b0
>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>> [<0>] disk_scan_partitions+0x5d/0x100
>>>> [<0>] device_add_disk+0x402/0x420
>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>
>>>> This happens when we have several paths, some of which are
>>>> inaccessible, and the active paths are removed first. Then
>>>> nvme_find_path() will requeue I/O in the ns_head (as paths are still
>>>> present), but the requeue list is never triggered because all
>>>> remaining paths are inactive.
>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>>> and requeues I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>>> the last path has been removed, to properly terminate pending I/O.
>>>>
>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>> ---
>>>> drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/multipath.c
>>>> b/drivers/nvme/host/multipath.c
>>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct
>>>> nvme_ns_head *head)
>>>> inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>> {
>>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> + return NULL;
>>>> +
>>>> switch (READ_ONCE(head->subsys->iopolicy)) {
>>>> case NVME_IOPOLICY_QD:
>>>> return nvme_queue_depth_path(head);
>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct
>>>> nvme_ns_head *head)
>>>> {
>>>> struct nvme_ns *ns;
>>>> + if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>> + return false;
>>>> +
>>>> list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>> continue;
>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct
>>>> nvme_ns_head *head)
>>>> {
>>>> if (!head->disk)
>>>> return;
>>>> - kblockd_schedule_work(&head->requeue_work);
>>>> - if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>> + if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>> nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>> del_gendisk(head->disk);
>>>> }
>>>> + /*
>>>> + * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>>> + * to allow multipath to fail all I/O.
>>>> + */
>>>> + kblockd_schedule_work(&head->requeue_work);
>>>
>>> Not sure how this helps given that you don't wait for srcu to
>>> synchronize
>>> before you kick the requeue.
>>>
>> It certainly is helping in my test case. But having a synchronize_srcu
>> here is probably not a bad idea.
>>
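For reference, an untested sketch of where such a synchronize_srcu() could
sit, assuming head->srcu is the SRCU instance the submission path reads
under:

void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
{
	if (!head->disk)
		return;
	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
		nvme_cdev_del(&head->cdev, &head->cdev_device);
		del_gendisk(head->disk);
	}
	/*
	 * Wait for submitters that may still have observed
	 * NVME_NSHEAD_DISK_LIVE as set before kicking the requeue work,
	 * so the requeued bios are failed rather than parked again.
	 */
	synchronize_srcu(&head->srcu);
	kblockd_schedule_work(&head->requeue_work);
}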
>>>> }
>>>> void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>>
>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE? In the last posting
>>> you mentioned that ns_remove is stuck on srcu_synchronize? Can you
>>> explain why nvme_find_path is able to find a path given that it
>>> already cleared NVME_NS_READY? Or is it nvme_available_path that is
>>> missing a check? Maybe we can do with checking NVME_NS_READY instead?
>>
>> Turned out that the reasoning in the previous revision wasn't quite
>> correct; since then I have seen several test runs where the above stack
>> trace was the _only_ one in the system, so the stall in removing
>> namespaces is more a side-effect. The ns_head was still visible
>> in sysfs while in that state, with exactly one path left:
>>
>> # ls /sys/block
>> nvme0c4n1 nvme0c4n3 nvme0n1 nvme0n3 nvme0c4n2 nvme0c4n5 nvme0n2
>> nvme0n5
>>
>> (whereas there had been 6 controllers with 6 namespaces).
>> So we fail to trigger a requeue to restart I/O on the stuck scanning
>> process; the actual path state really doesn't matter as we never get
>> this far.
>> This can happen when the partition scan triggered by
>> device_add_disk() (from one controller) interleaves with
>> nvme_ns_remove() from another controller. Both processes are running
>> lockless wrt the ns_head at that time, so if the partition scan
>> issues I/O after the schedule_work in nvme_mpath_shutdown_disk():
>>
>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>> {
>> if (!head->disk)
>> return;
>> kblockd_schedule_work(&head->requeue_work);
>> if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>> nvme_cdev_del(&head->cdev, &head->cdev_device);
>> del_gendisk(head->disk);
>> }
>> }
>>
>> _and_ that last path happens to be an 'inaccessible' one, I/O will be
>> requeued in the ns_head but never restarted, leading to a hung process.
>> Note that I/O might also be triggered by userspace (e.g. udev); the
>> device node is still present at that time. And that's also what I see
>> in my test runs; occasionally I get additional stuck udev processes:
>> [<0>] __folio_lock+0x114/0x1f0
>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
>> [<0>] blkdev_flush_mapping+0x45/0xe0
>> [<0>] blkdev_put_whole+0x2e/0x40
>> [<0>] bdev_release+0x129/0x1b0
>> [<0>] blkdev_release+0xd/0x20
>> [<0>] __fput+0xf7/0x2d0
>> also waiting on I/O.
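For context, the requeue decision that parks this I/O is taken in
nvme_ns_head_submit_bio(); abbreviated sketch, details vary between kernel
versions:

static void nvme_ns_head_submit_bio(struct bio *bio)
{
	struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
	struct nvme_ns *ns;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);
	if (likely(ns)) {
		bio_set_dev(bio, ns->disk->part0);
		submit_bio_noacct(bio);
	} else if (nvme_available_path(head)) {
		/* paths exist but none is usable right now: park the bio */
		spin_lock_irq(&head->requeue_lock);
		bio_list_add(&head->requeue_list, bio);
		spin_unlock_irq(&head->requeue_lock);
	} else {
		/* no path left at all: fail the I/O */
		bio_io_error(bio);
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
}

With only an inaccessible path left, nvme_available_path() still returns
true, so the bio is parked on requeue_list, and nothing kicks
requeue_work again afterwards.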
>>
>> You might be right that checking for NS_READY might be sufficient; I'll
>> check. But we definitely need to requeue I/O after we have called
>> del_gendisk().
>>
> Turns out that we don't check NVME_NS_READY in all places; we would
> need this patch:
>
> diff --git a/drivers/nvme/host/multipath.c
> b/drivers/nvme/host/multipath.c
> index c9d23b1b8efc..d8a6f51896fd 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct
> nvme_ns_head *head)
> list_for_each_entry_rcu(ns, &head->list, siblings) {
> if (test_bit(NVME_CTRL_FAILFAST_EXPIRED,
> &ns->ctrl->flags))
> continue;
> + if (!test_bit(NVME_NS_READY, &ns->flags))
> + continue;
> switch (nvme_ctrl_state(ns->ctrl)) {
> case NVME_CTRL_LIVE:
> case NVME_CTRL_RESETTING:
>
> in addition to moving of kblockd_schedule.
>
> So what do you prefer, checking NVME_NSHEAD_DISK_LIVE or NVME_NS_READY?
Well, NVME_NS_READY is cleared in nvme_ns_remove (which is fine) but also
in nvme_mpath_revalidate_paths() if the capacity changed. So, theoretically,
if the last remaining path changed its capacity, it will make
nvme_available_path() fail, and in turn mpath I/O will fail and not be added
to the requeue_list, which would be wrong. So I think that
nvme_available_path() should check NVME_NSHEAD_DISK_LIVE, but
nvme_find_path() is fine in its current form.
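I.e., keep that first hunk from your patch, so that nvme_available_path()
ends up looking roughly like this (untested sketch, context abbreviated):

static bool nvme_available_path(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;

	/* once the head disk is gone, fail I/O instead of requeueing it */
	if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
		return false;

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
			continue;
		switch (nvme_ctrl_state(ns->ctrl)) {
		case NVME_CTRL_LIVE:
		case NVME_CTRL_RESETTING:
		case NVME_CTRL_CONNECTING:
			return true;
		default:
			break;
		}
	}
	return false;
}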