Deadlock on failure to read NVMe namespace

Sagi Grimberg sagi at grimberg.me
Tue Oct 19 13:13:06 PDT 2021


>>>>>> 481:~ # cat /proc/15761/stack
>>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>>> c481:~ # cat /proc/14965/stack
>>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>>> [<0>] process_one_work+0x231/0x420
>>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>>> [<0>] kthread+0x11a/0x140
>>>>>> [<0>] ret_from_fork+0x22/0x30
>>
>> ...
>>
>>> I think this sequence is familiar and was addressed by a fix from Anton
>>> (CC'd) which still has some pending review comments.
>>>
>>> Can you look up and try:
>>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
>>
>> Actually, I see the trace is coming from nvme_alloc_ns, not the ANA
>> update path, so that patch is unlikely to address the issue.
>>
>> Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
>> take the scan_lock anymore. IIRC the reason it needed the
>> scan_lock in the first place was that entries were added to
>> ctrl->namespaces and the list was then sorted in scan_work
>> (taking the namespaces_rwsem twice).
>>
>> But now that ctrl->namespaces is always sorted, and accessed with
>> namespaces_rwsem, I think that scan_lock is no longer needed
>> here and namespaces_rwsem is sufficient.
>>
> ... which was precisely what my initial patch did.
> While it worked in the sense that 'nvme disconnect' completed, we did
> not terminate the outstanding I/O, as no current path is set, and
> hence this:
> 
>      down_read(&ctrl->namespaces_rwsem);
>      list_for_each_entry(ns, &ctrl->namespaces, list)
>          if (nvme_mpath_clear_current_path(ns))
>              kblockd_schedule_work(&ns->head->requeue_work);
>      up_read(&ctrl->namespaces_rwsem);
> 
> doesn't do anything, in particular does _not_ flush the requeue work.

So you get hung I/O? Why does it need to flush it?
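[Editor's note: for readers following along, the point of contention above
can be made concrete. The quoted loop only schedules the per-head
requeue_work; flushing it as well would look roughly like the sketch
below. This is a hedged illustration built from the symbols in the quoted
snippet (ctrl, ns, namespaces_rwsem, head->requeue_work), not a tested
kernel patch, and whether flushing from this context is safe is exactly
the question the thread is debating.]

```c
/*
 * Hedged sketch only: clear the current paths and schedule the
 * requeue work as in the quoted snippet, then flush the work items
 * so outstanding I/O is actually dispositioned before disconnect
 * proceeds.  Symbol names are taken from the quoted code; this has
 * not been compiled against any kernel tree.
 */
static void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		if (nvme_mpath_clear_current_path(ns))
			kblockd_schedule_work(&ns->head->requeue_work);
	up_read(&ctrl->namespaces_rwsem);

	/*
	 * Second pass to flush: flush_work() waits for the scheduled
	 * work to finish, so any bios sitting on the requeue list are
	 * either failed over or failed outright before we return.  If
	 * the requeue work itself ever needed namespaces_rwsem for
	 * write, flushing under the read lock could deadlock, which is
	 * why this is only a sketch.
	 */
	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		flush_work(&ns->head->requeue_work);
	up_read(&ctrl->namespaces_rwsem);
}
```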



More information about the Linux-nvme mailing list