[PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces

Wed Sep 4 01:59:20 PDT 2024

On 9/4/24 10:20, Hannes Reinecke wrote:
> On 9/3/24 21:38, Sagi Grimberg wrote:
>>
>>
>>
>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>> During repetitive namespace remapping operations (ie removing a 
>>> namespace and
>>> provision a different namespace with the same NSID) on the target the
>>> namespace might have changed between the time the initial scan
>>> was performed, and partition scan was invoked by device_add_disk()
>>> in nvme_mpath_set_live(). We then end up with a stuck scanning process:
>>>
>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>> [<0>] filemap_read_folio+0x97/0xd0
>>> [<0>] do_read_cache_folio+0x108/0x390
>>> [<0>] read_part_sector+0x31/0xa0
>>> [<0>] read_lba+0xc5/0x160
>>> [<0>] efi_partition+0xd9/0x8f0
>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>> [<0>] blkdev_get_whole+0x78/0xc0
>>> [<0>] bdev_open+0x2c6/0x3b0
>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>> [<0>] disk_scan_partitions+0x5d/0x100
>>> [<0>] device_add_disk+0x402/0x420
>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>
>>> This happens when we have several paths, some of which are inaccessible,
>>> and the active paths are removed first. Then nvme_find_path() will 
>>> requeue
>>> I/O in the ns_head (as paths are present), but the requeue list is never
>>> triggered as all remaining paths are inactive.
>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>> and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>> the last path has been removed to properly terminate pending I/O.
>>>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/multipath.c 
>>> b/drivers/nvme/host/multipath.c
>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct 
>>> nvme_ns_head *head)
>>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>   {
>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> +        return NULL;
>>> +
>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>       case NVME_IOPOLICY_QD:
>>>           return nvme_queue_depth_path(head);
>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct 
>>> nvme_ns_head *head)
>>>   {
>>>       struct nvme_ns *ns;
>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>> +        return NULL;
>>> +
>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>               continue;
>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct 
>>> nvme_ns_head *head)
>>>   {
>>>       if (!head->disk)
>>>           return;
>>> -    kblockd_schedule_work(&head->requeue_work);
>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>           del_gendisk(head->disk);
>>>       }
>>> +    /*
>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>> +     * to allow multipath to fail all I/O.
>>> +     */
>>> +    kblockd_schedule_work(&head->requeue_work);
>>
>> Not sure how this helps given that you don't wait for srcu to synchronize
>> before you kick the requeue.
>>
> It certainly is helping in my testcase. But having a synchronize_srcu 
> here is probably not a bad idea.
> 
>>>   }
>>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>
>> Why do you need to clear NVME_NSHEAD_DISK_LIVE ? In the last posting 
>> you mentioned that ns_remove is stuck on srcu_synchronize? Can you 
>> explain why nvme_find_path is able to find a path given that it is 
>> already cleared NVME_NS_READY ? oris it nvme_available_path that is 
>> missing a check? Maybe can do with checking NVME_NS_READY instead?
> 
> Turned out that the reasoning in the previous revision wasn't quite 
> correct; since then I have seen several test-run where the above stack
> trace was the _only_ one in the system, so the stall in removing 
> namespaces is more a side-effect. The ns_head was still visible
> in sysfs while in that state, with exactly one path left:
> 
> # ls /sys/block
> nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5  nvme0n2 
> nvme0n5
> 
> (whereas there had been 6 controllers with 6 namespaces).
> So we fail to trigger a requeue to restart I/O on the stuck scanning
> process; the actual path state really don't matter as never get this far.
> This can happen when the partition scan triggered by device_add_disk() 
> (from one controller) interleaves with nvme_ns_remove() from another 
> controller. Both processes are running lockless wrt to ns_head at that
> time, so if the partition scan issues I/O after the schedule_work
> in nvme_mpath_shutdown_disk():
> 
> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
> {
>      if (!head->disk)
>          return;
>      kblockd_schedule_work(&head->requeue_work);
>      if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>          nvme_cdev_del(&head->cdev, &head->cdev_device);
>          del_gendisk(head->disk);
>      }
> }
> 
> _and_ that last path happens to be an 'inaccessible' one, I/O will be 
> requeued in the ns_head but never restarted, leaving to a hung process.
> Note, that I/O might also be triggered by userspace (eg udev); the 
> device node is still present at that time. And that's also what I see
> in my test runs; occasionally I get additional stuck udev processes:
> [<0>] __folio_lock+0x114/0x1f0
> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
> [<0>] blkdev_flush_mapping+0x45/0xe0
> [<0>] blkdev_put_whole+0x2e/0x40
> [<0>] bdev_release+0x129/0x1b0
> [<0>] blkdev_release+0xd/0x20
> [<0>] __fput+0xf7/0x2d0
> also waiting on I/O.
> 
> You might be right checking for NS_READY might be sufficient, I'll be
> checking. But we definitely need to requeue I/O after we called 
> del_gendisk().
> 
Turns out that we don't check NVME_NS_READY in all places; we would need 
this patch:

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c9d23b1b8efc..d8a6f51896fd 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -424,6 +424,8 @@ static bool nvme_available_path(struct nvme_ns_head 
*head)
         list_for_each_entry_rcu(ns, &head->list, siblings) {
                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
                         continue;
+               if (test_bit(NVME_NS_READY, &ns->flags))
+                       continue;
                 switch (nvme_ctrl_state(ns->ctrl)) {
                 case NVME_CTRL_LIVE:
                 case NVME_CTRL_RESETTING:

in addition to moving of kblockd_schedule.

So what do you prefer, checking NVME_NS_HEAD_LIVE or NVME_NS_READY?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich