[PATCH 3/3] nvme: 'nvme disconnect' hangs after remapping namespaces

Hannes Reinecke hare at suse.de
Sun Sep 8 23:59:20 PDT 2024


On 9/9/24 08:22, Hannes Reinecke wrote:
> On 9/8/24 09:21, Sagi Grimberg wrote:
>>
>>
>>
>> On 06/09/2024 13:16, Hannes Reinecke wrote:
>>> During repetitive namespace map and unmap operations on the target
>>> (disabling the namespace, changing the UUID, enabling it again),
>>> the initial scan hangs because the target keeps returning
>>> PATH_ERROR and the I/O is retried indefinitely:
>>>
>>> [<0>] folio_wait_bit_common+0x12a/0x310
>>> [<0>] filemap_read_folio+0x97/0xd0
>>> [<0>] do_read_cache_folio+0x108/0x390
>>> [<0>] read_part_sector+0x31/0xa0
>>> [<0>] read_lba+0xc5/0x160
>>> [<0>] efi_partition+0xd9/0x8f0
>>> [<0>] bdev_disk_changed+0x23d/0x6d0
>>> [<0>] blkdev_get_whole+0x78/0xc0
>>> [<0>] bdev_open+0x2c6/0x3b0
>>> [<0>] bdev_file_open_by_dev+0xcb/0x120
>>> [<0>] disk_scan_partitions+0x5d/0x100
>>> [<0>] device_add_disk+0x402/0x420
>>> [<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
>>> [<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
>>> [<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
>>> [<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
>>> [<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
>>>
>>> Calling 'nvme disconnect' on controllers with these namespaces
>>> will hang as the disconnect operation tries to flush scan_work:
>>>
>>> [<0>] __flush_work+0x389/0x4b0
>>> [<0>] nvme_remove_namespaces+0x4b/0x130 [nvme_core]
>>> [<0>] nvme_do_delete_ctrl+0x72/0x90 [nvme_core]
>>> [<0>] nvme_delete_ctrl_sync+0x2e/0x40 [nvme_core]
>>> [<0>] nvme_sysfs_delete+0x35/0x40 [nvme_core]
>>> [<0>] kernfs_fop_write_iter+0x13d/0x1b0
>>> [<0>] vfs_write+0x404/0x510
>>>
>>> before the namespaces are removed.
>>>
>>> This patch sets the 'failfast_expired' bit on the controller so
>>> that all pending I/O is failed and the disconnect process can
>>> complete.
>>
>> I don't know if I agree with this approach. Seems too indirect.
>> Can you please explain (with tracing) what is preventing scan_work
>> from completing? The controller state should be DELETING, so perhaps
>> we are missing a check for it somewhere?
>>
>> nvme_remove_namespaces() at the point you are injecting here is designed
>> to allow any writeback to complete (if possible).
> 
> Looks like we're missing a call to nvme_kick_requeue_lists() when
> setting the state to DELETING, so the requeue work is never triggered
> and the I/O stays stuck. I'll check.
> 
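Concretely, the kick I had in mind was along these lines (sketch only;
the nvme_delete_ctrl() body is paraphrased from memory, and whether
this is the right call site is exactly the open question):

int nvme_delete_ctrl(struct nvme_ctrl *ctrl)
{
	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
		return -EBUSY;
	/* added: re-run nvme_find_path() for bios parked on the requeue lists */
	nvme_kick_requeue_lists(ctrl);
	if (!queue_work(nvme_delete_wq, &ctrl->delete_work))
		return -EBUSY;
	return 0;
}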
Turns out that's not sufficient.
The problem is that nvmet always returns PATH_ERROR while the
namespace is disabled, so the I/O _continues_ to be retried:
nvme_find_path() calls nvme_path_is_disabled(), which _deliberately_
keeps forwarding I/O even when the controller is in DELETING (see the
sketch further down):

  /*
   * We don't treat NVME_CTRL_DELETING as a disabled path as I/O should
   * still be able to complete assuming that the controller is connected.
   * Otherwise it will fail immediately and return to the requeue list.
   */

with no way out, and a stuck process. We can of course try to fail
over to another path, but if this is the last path we're stuck
forever. Seems like we need a 'fail_if_no_path' mechanism here (rough
idea sketched further below)...
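
For context, the check in question is roughly the following
(paraphrased from drivers/nvme/host/multipath.c; details differ
between kernel versions):

static bool nvme_path_is_disabled(struct nvme_ns *ns)
{
	enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl);

	/* DELETING is deliberately not treated as disabled, see the
	 * comment quoted above */
	if (state != NVME_CTRL_LIVE && state != NVME_CTRL_DELETING)
		return true;

	if (test_bit(NVME_NS_ANA_PENDING, &ns->flags) ||
	    !test_bit(NVME_NS_READY, &ns->flags))
		return true;

	return false;
}

So a path on a DELETING controller is still considered usable, the
target keeps answering with PATH_ERROR, and nvme_failover_req() puts
the bios straight back onto the requeue list.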

Hmm.
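
Purely as a sketch of what such a mechanism could look like (untested,
and the helper name nvme_all_paths_deleting() is made up for
illustration):

/*
 * True if every sibling path of this ns_head belongs to a controller
 * that is being torn down, i.e. no path is left that could ever
 * complete the I/O.
 */
static bool nvme_all_paths_deleting(struct nvme_ns_head *head)
{
	struct nvme_ns *ns;
	bool has_path = false;

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		enum nvme_ctrl_state state = nvme_ctrl_state(ns->ctrl);

		if (state != NVME_CTRL_DELETING &&
		    state != NVME_CTRL_DELETING_NOIO)
			return false;
		has_path = true;
	}
	return has_path;
}

nvme_failover_req() (or the requeue work) could then complete the
request with BLK_STS_IOERR instead of requeueing when this returns
true, so the last path failing over during disconnect doesn't spin
forever.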

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



