Deadlock on failure to read NVMe namespace

Hannes Reinecke hare at suse.de
Tue Oct 19 08:41:12 PDT 2021


On 10/19/21 5:06 PM, Sagi Grimberg wrote:
> 
> 
> On 10/19/21 5:27 PM, Sagi Grimberg wrote:
>>
>>>>> c481:~ # cat /proc/15761/stack
>>>>> [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
>>>>> [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
>>>>> [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
>>>>> [<0>] nvme_sysfs_delete+0x42/0x60 [nvme_core]
>>>>> [<0>] kernfs_fop_write_iter+0x12f/0x1a0
>>>>> [<0>] new_sync_write+0x122/0x1b0
>>>>> [<0>] vfs_write+0x1eb/0x250
>>>>> [<0>] ksys_write+0xa1/0xe0
>>>>> [<0>] do_syscall_64+0x3a/0x80
>>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>>> c481:~ # cat /proc/14965/stack
>>>>> [<0>] do_read_cache_page+0x49b/0x790
>>>>> [<0>] read_part_sector+0x39/0xe0
>>>>> [<0>] read_lba+0xf9/0x1d0
>>>>> [<0>] efi_partition+0xf1/0x7f0
>>>>> [<0>] bdev_disk_changed+0x1ee/0x550
>>>>> [<0>] blkdev_get_whole+0x81/0x90
>>>>> [<0>] blkdev_get_by_dev+0x128/0x2e0
>>>>> [<0>] device_add_disk+0x377/0x3c0
>>>>> [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
>>>>> [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
>>>>> [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
>>>>> [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
>>>>> [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
>>>>> [<0>] process_one_work+0x231/0x420
>>>>> [<0>] worker_thread+0x2d/0x3f0
>>>>> [<0>] kthread+0x11a/0x140
>>>>> [<0>] ret_from_fork+0x22/0x30
> 
> ...
> 
>> I think this sequence is familiar and was addressed by a fix from Anton
>> (CC'd) which still has some pending review comments.
>>
>> Can you look up and try:
>> [PATCH] nvme/mpath: fix hang when disk goes live over reconnect
> 
> Actually, I see the trace is coming from nvme_alloc_ns, not the ANA
> update path, so that is unlikely to address the issue.
> 
> Looking at nvme_mpath_clear_ctrl_paths, I don't think it should
> take the scan_lock anymore. IIRC the reason why it needed the
> scan_lock in the first place was that namespaces were added to
> ctrl->namespaces and the list was then sorted in scan_work (taking
> the namespaces_rwsem twice).
> 
> But now that ctrl->namespaces is always sorted, and accessed with
> namespaces_rwsem, I think that scan_lock is no longer needed
> here and namespaces_rwsem is sufficient.
> 
... which was precisely what my initial patch did.
While it worked in the sense that 'nvme disconnect' completed, we did
not terminate the outstanding I/O, as no current path is set and hence
this:

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		if (nvme_mpath_clear_current_path(ns))
			kblockd_schedule_work(&ns->head->requeue_work);
	up_read(&ctrl->namespaces_rwsem);

doesn't do anything; in particular, it does _not_ flush the requeue work.
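
Just to illustrate the point (a rough sketch only, not the actual patch,
and it glosses over whether flushing under namespaces_rwsem is safe
here): one would have to kick the requeue work unconditionally and then
wait for it, roughly like

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list) {
		/* clear the path unconditionally, not only when it was current */
		nvme_mpath_clear_current_path(ns);
		/* kick the requeue work even though no current path is set ... */
		kblockd_schedule_work(&ns->head->requeue_work);
		/* ... and wait for it, so requeued I/O is dealt with before
		 * teardown continues */
		flush_work(&ns->head->requeue_work);
	}
	up_read(&ctrl->namespaces_rwsem);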

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


