[PATCH 2/2] nvme-multipath: fix I/O stall when remapping namespaces

Sagi Grimberg sagi at grimberg.me
Thu Sep 5 02:06:05 PDT 2024




On 05/09/2024 11:39, Hannes Reinecke wrote:
> On 9/5/24 09:06, Sagi Grimberg wrote:
>>
>>
>>
>> On 04/09/2024 11:59, Hannes Reinecke wrote:
>>> On 9/4/24 10:20, Hannes Reinecke wrote:
>>>> On 9/3/24 21:38, Sagi Grimberg wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 03/09/2024 21:03, Hannes Reinecke wrote:
>>>>>> During repetitive namespace remapping operations (ie. removing a
>>>>>> namespace and provisioning a different namespace with the same NSID)
>>>>>> on the target, the namespace might have changed between the time the
>>>>>> initial scan was performed and the partition scan was invoked by
>>>>>> device_add_disk() in nvme_mpath_set_live(). We then end up with a
>>>>>> stuck scanning process:
>>>>>>
>>>>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>>>>> [<0>] filemap_read_folio+0x97/0xd0
>>>>>> [<0>] do_read_cache_folio+0x108/0x390
>>>>>> [<0>] read_part_sector+0x31/0xa0
>>>>>> [<0>] read_lba+0xc5/0x160
>>>>>> [<0>] efi_partition+0xd9/0x8f0
>>>>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>>>>> [<0>] blkdev_get_whole+0x78/0xc0
>>>>>> [<0>] bdev_open+0x2c6/0x3b0
>>>>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>>>>> [<0>] disk_scan_partitions+0x5d/0x100
>>>>>> [<0>] device_add_disk+0x402/0x420
>>>>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>>>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>>>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>>>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>>>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>>>>
>>>>>> This happens when we have several paths, some of which are
>>>>>> inaccessible, and the active paths are removed first. Then
>>>>>> nvme_find_path() will requeue I/O in the ns_head (as paths are
>>>>>> still present), but the requeue list is never triggered as all
>>>>>> remaining paths are inactive.
>>>>>> This patch checks for NVME_NSHEAD_DISK_LIVE when selecting a path,
>>>>>> and requeues I/O after NVME_NSHEAD_DISK_LIVE has been cleared once
>>>>>> the last path has been removed, to properly terminate pending I/O.
>>>>>>
>>>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>>>> ---
>>>>>>   drivers/nvme/host/multipath.c | 14 ++++++++++++--
>>>>>>   1 file changed, 12 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>>>> index c9d23b1b8efc..1b1deb0450ab 100644
>>>>>> --- a/drivers/nvme/host/multipath.c
>>>>>> +++ b/drivers/nvme/host/multipath.c
>>>>>> @@ -407,6 +407,9 @@ static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
>>>>>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>>>>>   {
>>>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>>>> +        return NULL;
>>>>>> +
>>>>>>       switch (READ_ONCE(head->subsys->iopolicy)) {
>>>>>>       case NVME_IOPOLICY_QD:
>>>>>>           return nvme_queue_depth_path(head);
>>>>>> @@ -421,6 +424,9 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>>>>   {
>>>>>>       struct nvme_ns *ns;
>>>>>> +    if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
>>>>>> +        return false;
>>>>>> +
>>>>>>       list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>>>>           if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>>>>               continue;
>>>>>> @@ -967,11 +973,15 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>>>>>   {
>>>>>>       if (!head->disk)
>>>>>>           return;
>>>>>> -    kblockd_schedule_work(&head->requeue_work);
>>>>>> -    if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>>> +    if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>>>           nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>>>>           del_gendisk(head->disk);
>>>>>>       }
>>>>>> +    /*
>>>>>> +     * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
>>>>>> +     * to allow multipath to fail all I/O.
>>>>>> +     */
>>>>>> +    kblockd_schedule_work(&head->requeue_work);
>>>>>
>>>>> Not sure how this helps given that you don't wait for srcu to
>>>>> synchronize before you kick the requeue.
>>>>>
>>>> It certainly is helping in my testcase. But having a
>>>> synchronize_srcu here is probably not a bad idea.
>>>>
>>>>>>   }
>>>>>>   void nvme_mpath_remove_disk(struct nvme_ns_head *head)
>>>>>
>>>>> Why do you need to clear NVME_NSHEAD_DISK_LIVE? In the last posting
>>>>> you mentioned that ns_remove is stuck on srcu_synchronize. Can you
>>>>> explain why nvme_find_path() is able to find a path given that
>>>>> NVME_NS_READY has already been cleared? Or is it
>>>>> nvme_available_path() that is missing a check? Maybe we can do with
>>>>> checking NVME_NS_READY instead?
>>>>
>>>> Turned out that the reasoning in the previous revision wasn't quite
>>>> correct; since then I have seen several test-runs where the above
>>>> stack trace was the _only_ one in the system, so the stall in removing
>>>> namespaces is more a side-effect. The ns_head was still visible
>>>> in sysfs while in that state, with exactly one path left:
>>>>
>>>> # ls /sys/block
>>>> nvme0c4n1  nvme0c4n3  nvme0n1  nvme0n3  nvme0c4n2  nvme0c4n5
>>>> nvme0n2  nvme0n5
>>>>
>>>> (whereas there had been 6 controllers with 6 namespaces).
>>>> So we fail to trigger a requeue to restart I/O for the stuck scanning
>>>> process; the actual path states really don't matter as we never get
>>>> this far.
>>>> This can happen when the partition scan triggered by device_add_disk()
>>>> (from one controller) interleaves with nvme_ns_remove() from another
>>>> controller. Both processes are running lockless wrt the ns_head at
>>>> that time, so if the partition scan issues I/O after the
>>>> schedule_work in nvme_mpath_shutdown_disk():
>>>>
>>>> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
>>>> {
>>>>      if (!head->disk)
>>>>          return;
>>>>      kblockd_schedule_work(&head->requeue_work);
>>>>      if (test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
>>>>          nvme_cdev_del(&head->cdev, &head->cdev_device);
>>>>          del_gendisk(head->disk);
>>>>      }
>>>> }
>>>>
>>>> _and_ that last path happens to be an 'inaccessible' one, I/O will
>>>> be requeued in the ns_head but never restarted, leading to a hung
>>>> process.
>>>> Note that I/O might also be triggered by userspace (eg. udev); the
>>>> device node is still present at that time. And that's also what I see
>>>> in my test runs; occasionally I get additional stuck udev processes:
>>>> [<0>] __folio_lock+0x114/0x1f0
>>>> [<0>] truncate_inode_pages_range+0x3c0/0x3e0
>>>> [<0>] blkdev_flush_mapping+0x45/0xe0
>>>> [<0>] blkdev_put_whole+0x2e/0x40
>>>> [<0>] bdev_release+0x129/0x1b0
>>>> [<0>] blkdev_release+0xd/0x20
>>>> [<0>] __fput+0xf7/0x2d0
>>>> also waiting on I/O.
>>>>
>>>> You might be right that checking for NS_READY is sufficient; I'll
>>>> check. But we definitely need to requeue I/O after we have called
>>>> del_gendisk().
>>>>
>>> Turns out that we don't check NVME_NS_READY in all places; we would 
>>> need this patch:
>>>
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index c9d23b1b8efc..d8a6f51896fd 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -424,6 +424,8 @@ static bool nvme_available_path(struct nvme_ns_head *head)
>>>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>>>                 if (test_bit(NVME_CTRL_FAILFAST_EXPIRED, &ns->ctrl->flags))
>>>                         continue;
>>> +               if (!test_bit(NVME_NS_READY, &ns->flags))
>>> +                       continue;
>>>                 switch (nvme_ctrl_state(ns->ctrl)) {
>>>                 case NVME_CTRL_LIVE:
>>>                 case NVME_CTRL_RESETTING:
>>>
>>> in addition to moving of kblockd_schedule.
>>>
>>> So what do you prefer, checking NVME_NSHEAD_DISK_LIVE or NVME_NS_READY?
>>
>> Well, NVME_NS_READY is cleared in nvme_ns_remove() (which is fine) but
>> also in nvme_mpath_revalidate_paths() if the capacity changed. So,
>> theoretically, if the last remaining path changed its capacity, that
>> would make nvme_available_path() fail, and in turn mpath I/O would fail
>> instead of being added to the requeue_list, which would be wrong. So I
>> think that nvme_available_path() should check NVME_NSHEAD_DISK_LIVE,
>> but nvme_find_path() is fine in its current form.
>
> Weelll ... not quite.
> I never get this far, as the only place where nvme_ns_remove() can be
> called is from the scanning process, yet this is stuck waiting for I/O.
>
> My testcase then simulated a namespace replacement by setting a new 
> UUID, and then enabling the namespace again. The target will return
> PATH ERROR during that time, causing I/O to be retried (while disabled)
> or failed over to the same path (as it's the only one left).
>
> Of course we could argue that the testcase is invalid, and the target
> should return INVALID_NS if the UUID changed.
> But one could also argue that retrying on the same path is pointless,
> seeing that we'll always get the same error.
> Hmm?

Hannes, I was referring to the general case where we have more than one
path.

What I said is: if you drop the NSHEAD_DISK_LIVE check in
nvme_find_path() (it already checks NVME_NS_READY) but have
nvme_available_path() check NSHEAD_DISK_LIVE, then both your test case
and the general case with more than a single path should work.


