[PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs

Chao Leng lengchao at huawei.com
Wed Mar 17 02:55:57 GMT 2021



On 2021/3/17 7:51, Sagi Grimberg wrote:
> 
>>>> These patches on their own are correct because they fixed a controller reset
>>>> regression.
>>>>
>>>> When we reset/teardown a controller, we must freeze and quiesce the namespaces
>>>> request queues to make sure that we safely stop inflight I/O submissions.
>>>> Freeze is mandatory because if our hctx map changed between reconnects,
>>>> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
>>>> if it still has pending submissions (that are still quiesced) it will hang.
>>>> This is what the above patches fixed.
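
For illustration, a minimal sketch of the teardown ordering described above;
the wrapper function is hypothetical, only the nvme core helpers are real:

/* Hypothetical skeleton of the teardown ordering described above. */
static void example_teardown_io_queues(struct nvme_ctrl *ctrl)
{
        nvme_start_freeze(ctrl);        /* block new submissions from entering */
        nvme_stop_queues(ctrl);         /* quiesce: stop dispatching entered requests */

        /* ... tear down transport queues, cancel/complete inflight requests ... */

        /*
         * Without the freeze, blk_mq_update_nr_hw_queues() on the next
         * reconnect would try to freeze the queues itself and could hang
         * behind submissions that are still parked by the quiesce.
         */
}
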
>>>>
>>>> However, by freezing the namespaces request queues, and only unfreezing them
>>>> when we successfully reconnect, inflight submissions that are running
>>>> concurrently can now block while holding the nshead srcu until either we
>>>> successfully reconnect or ctrl_loss_tmo expires (or the user explicitly
>>>> disconnects).
>>>>
>>>> This caused a deadlock [1] when a different controller (different path on the
>>>> same subsystem) became live (i.e. optimized/non-optimized). This is because
>>>> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
>>>> in order to make sure that current_path is visible to future (re)submissions.
>>>> However the srcu lock is taken by a blocked submission on a frozen request
>>>> queue, and we have a deadlock.
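
To make the dependency concrete, a simplified (not literal) sketch of the two
sides; the real code is nvme_ns_head_submit_bio() and nvme_mpath_set_live():

/* Submission side: the nshead srcu is held across queue entry, so a frozen
 * ns->queue blocks this reader while it holds the srcu.
 */
blk_qc_t example_ns_head_submit(struct nvme_ns_head *head, struct bio *bio)
{
        int srcu_idx = srcu_read_lock(&head->srcu);
        struct nvme_ns *ns = nvme_find_path(head);
        blk_qc_t ret = BLK_QC_T_NONE;

        if (ns) {
                /* route the bio to ns->queue (details omitted) */
                ret = submit_bio_noacct(bio);   /* blocks in blk_queue_enter()
                                                   if ns->queue is frozen */
        }
        srcu_read_unlock(&head->srcu, srcu_idx);
        return ret;
}

/* ANA update side: cannot finish while the reader above is stuck. */
void example_mpath_set_live(struct nvme_ns *ns)
{
        /* ... mark the path live ... */
        synchronize_srcu(&ns->head->srcu);      /* waits on the blocked reader */
        /* ... kick the requeue lists ... */
}
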
>>>>
>>>> For multipath, we obviously cannot allow that as we want to failover I/O asap.
>>>> However for non-mpath, we do not want to fail I/O (at least until controller
>>>> FASTFAIL expires, and that is disabled by default btw).
>>>>
>>>> This creates an asymmetry in how the driver should behave in the presence or
>>>> absence of multipath.
>>>>
>>>> [1]:
>>>> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
>>>> Call Trace:
>>>>    __schedule+0x293/0x730
>>>>    schedule+0x33/0xa0
>>>>    schedule_timeout+0x1d3/0x2f0
>>>>    wait_for_completion+0xba/0x140
>>>>    __synchronize_srcu.part.21+0x91/0xc0
>>>>    synchronize_srcu_expedited+0x27/0x30
>>>>    synchronize_srcu+0xce/0xe0
>>>>    nvme_mpath_set_live+0x64/0x130 [nvme_core]
>>>>    nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
>>>>    nvme_update_ana_state+0xcd/0xe0 [nvme_core]
>>>>    nvme_parse_ana_log+0xa1/0x180 [nvme_core]
>>>>    nvme_read_ana_log+0x76/0x100 [nvme_core]
>>>>    nvme_mpath_init+0x122/0x180 [nvme_core]
>>>>    nvme_init_identify+0x80e/0xe20 [nvme_core]
>>>>    nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
>>>>    nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
>>>>
>>>>
>>>> In order to fix this, we recognize that a driver needs to behave differently in
>>>> error recovery for mpath and non-mpath scenarios, expose this awareness with a
>>>> new helper nvme_ctrl_is_mpath, and use it to decide what needs to be done.
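
The patches themselves are not quoted here, so the following is only a guess at
how a transport might consume such a helper; both branch bodies are assumptions:

bool nvme_ctrl_is_mpath(struct nvme_ctrl *ctrl);        /* proposed by the RFC */

static void example_error_recovery_policy(struct nvme_ctrl *ctrl)
{
        if (nvme_ctrl_is_mpath(ctrl)) {
                /* mpath: do not hold submitters on frozen queues,
                 * let I/O fail over to another path asap */
        } else {
                /* non-mpath: keep queues frozen/quiesced across the
                 * reconnect so I/O is not failed prematurely */
        }
}
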
>>>
>>> Christoph, Keith,
>>>
>>> Any thoughts on this? The RFC part is getting the transport driver to
>>> behave differently for mpath vs. non-mpath.
>>
>> Will it work if nvme mpath used the request NOWAIT flag for its submit_bio()
>> call, and added the bio to the requeue_list if blk_queue_enter() fails? I
>> think that looks like another way to resolve the deadlock, but we need
>> the block layer to return a failed status to the original caller.
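
Roughly, that suggestion would look like the sketch below; the failure check is
exactly the missing piece (today a REQ_NOWAIT bio that fails blk_queue_enter()
is completed with BLK_STS_AGAIN instead of being handed back), so the hook is
hypothetical:

static void example_nowait_submit(struct nvme_ns_head *head, struct bio *bio)
{
        bio->bi_opf |= REQ_NOWAIT;              /* never sleep in blk_queue_enter() */
        submit_bio_noacct(bio);

        if (bio_failed_nowait(bio)) {           /* hypothetical hook, see caveat above */
                /* park the bio instead of blocking while holding the srcu */
                spin_lock_irq(&head->requeue_lock);
                bio_list_add(&head->requeue_list, bio);
                spin_unlock_irq(&head->requeue_lock);
        }
}
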
> 
> But who would kick the requeue list? And that would make near-tag-exhaust performance stink...
Moving nvme_start_freeze from nvme_rdma_teardown_io_queues to
nvme_rdma_configure_io_queues can fix it. It also avoids I/O hanging for a long
time if the reconnection fails.
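
Something along the lines of the sketch below, based on the current shape of
nvme_rdma_configure_io_queues(); the exact placement and error handling are a
guess and most of the function is omitted:

static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
{
        /* ... allocate and start the I/O queues as today ... */

        if (!new) {
                nvme_start_freeze(&ctrl->ctrl); /* moved here from teardown */
                nvme_start_queues(&ctrl->ctrl);
                if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
                        /* inflight I/O never drained; abort this reconnect */
                        return -ENODEV;
                }
                blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
                                           ctrl->ctrl.queue_count - 1);
                nvme_unfreeze(&ctrl->ctrl);
        }
        return 0;
}
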


