[PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs

Chao Leng lengchao at huawei.com
Tue Mar 16 03:24:32 GMT 2021


Does the problem still exist with the latest version?

We also found similar deadlocks in older versions.
However, with the latest code, a submission does not block grabbing the nshead
srcu when the ctrl is frozen.
Related patches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead
I am not sure whether these address the same problem.
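
To illustrate the point, a condensed sketch (not the actual kernel code; names
and bodies are trimmed): after the removal of direct_make_request() in the
commits above, the multipath device submits to the bottom namespace queue via
submit_bio_noacct(), which only queues the bio on current->bio_list, so any
wait on a frozen path queue happens after the nshead srcu read lock has
already been dropped.

/* condensed sketch of drivers/nvme/host/multipath.c, not the literal code */
blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
{
	struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
	int srcu_idx = srcu_read_lock(&head->srcu);
	struct nvme_ns *ns = nvme_find_path(head);

	if (ns) {
		/* redirect the bio to the selected path (ns->queue) */
		bio_set_dev(bio, ns->disk->part0);
		/*
		 * current->bio_list is non-NULL here, so this only adds the
		 * bio to the list; blk_queue_enter() on ns->queue is not
		 * called yet.
		 */
		submit_bio_noacct(bio);
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
	/*
	 * The queued bio is processed after we return, so a wait on a frozen
	 * ns->queue no longer pins the srcu read section.
	 */
	return BLK_QC_T_NONE;
}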


On 2021/3/16 6:27, Sagi Grimberg wrote:
> The below patches caused a regression in a multipath setup:
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> 
> These patches on their own are correct because they fixed a controller reset
> regression.
> 
> When we reset/teardown a controller, we must freeze and quiesce the namespaces
> request queues to make sure that we safely stop inflight I/O submissions.
> Freeze is mandatory because if our hctx map changed between reconnects,
> blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
> if it still has pending submissions (that are still quiesced) it will hang.
> This is what the above patches fixed.
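
For context, the rough shape of the teardown ordering those two patches
enforce, as a condensed sketch (not the literal nvme-tcp/nvme-rdma code, which
lives in the drivers' teardown_io_queues helpers):

static void teardown_io_queues(struct nvme_ctrl *ctrl)
{
	nvme_start_freeze(ctrl);	/* new submissions wait in blk_queue_enter() */
	nvme_stop_queues(ctrl);		/* quiesce: no further ->queue_rq() calls */
	/* cancel and complete all inflight requests, free transport queues */
}
/*
 * On reconnect, blk_mq_update_nr_hw_queues() + nvme_wait_freeze() +
 * nvme_unfreeze() are only reached if the controller actually comes back.
 */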
> 
> However, by freezing the namespaces request queues, and only unfreezing them
> when we successfully reconnect, inflight submissions that are running
> concurrently can now block grabbing the nshead srcu until either we successfully
> reconnect or ctrl_loss_tmo expired (or the user explicitly disconnected).
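
Condensed, the blocked submitter described in that paragraph looks like this
(a sketch of the scenario only, not literal code):

	/* nvme_ns_head_submit_bio() */
	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);
	/*
	 * ns->queue is frozen: the submission waits in blk_queue_enter()
	 * with the srcu read section still open...
	 */
	srcu_read_unlock(&head->srcu, srcu_idx);	/* ...so this is not reached */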
> 
> This caused a deadlock [1] when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is because
> nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O
> in order to make sure that current_path is visible to future (re)submissions.
> However the srcu lock is taken by a blocked submission on a frozen request
> queue, and we have a deadlock.
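
The nvme_mpath_set_live() side, trimmed to the relevant calls (a sketch of the
upstream function at the time, not a verbatim quote):

static void nvme_mpath_set_live(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	/* ... add the head disk on the first live path, refresh current_path ... */

	/*
	 * Wait for in-flight submitters so the updated path selection is
	 * visible before requeued I/O is kicked -- this is where the trace
	 * in [1] below is stuck.
	 */
	synchronize_srcu(&head->srcu);
	kblockd_schedule_work(&head->requeue_work);
}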
> 
> For multipath, we obviously cannot allow that as we want to failover I/O asap.
> However for non-mpath, we do not want to fail I/O (at least until controller
> FASTFAIL expires, and that is disabled by default btw).
> 
> This creates a non-symmetric behavior of how the driver should behave in the
> presence or absence of multipath.
> 
> [1]:
> Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
> Call Trace:
>   __schedule+0x293/0x730
>   schedule+0x33/0xa0
>   schedule_timeout+0x1d3/0x2f0
>   wait_for_completion+0xba/0x140
>   __synchronize_srcu.part.21+0x91/0xc0
>   synchronize_srcu_expedited+0x27/0x30
>   synchronize_srcu+0xce/0xe0
>   nvme_mpath_set_live+0x64/0x130 [nvme_core]
>   nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
>   nvme_update_ana_state+0xcd/0xe0 [nvme_core]
>   nvme_parse_ana_log+0xa1/0x180 [nvme_core]
>   nvme_read_ana_log+0x76/0x100 [nvme_core]
>   nvme_mpath_init+0x122/0x180 [nvme_core]
>   nvme_init_identify+0x80e/0xe20 [nvme_core]
>   nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
>   nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
> 
> 
> In order to fix this, we recognize the different behavior a driver needs to take
> in error recovery scenarios for mpath and non-mpath scenarios and expose this
> awareness with a new helper nvme_ctrl_is_mpath and use that to know what needs
> to be done.
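
The patches themselves are not quoted in this cover letter; a minimal sketch
of what such a helper could look like is below. The field and flag consulted
here are an assumption on my part, not necessarily what the series does.

/* sketch only -- the CMIC check is an assumption, see note above */
static inline bool nvme_ctrl_is_mpath(struct nvme_ctrl *ctrl)
{
	return IS_ENABLED(CONFIG_NVME_MULTIPATH) &&
	       ctrl->subsys &&
	       (ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL);
}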
> 
> Sagi Grimberg (3):
>    nvme: introduce nvme_ctrl_is_mpath helper
>    nvme-tcp: fix possible hang when trying to set a live path during I/O
>    nvme-rdma: fix possible hang when trying to set a live path during I/O
> 
>   drivers/nvme/host/multipath.c |  5 +++--
>   drivers/nvme/host/nvme.h      | 15 +++++++++++++++
>   drivers/nvme/host/rdma.c      | 29 +++++++++++++++++------------
>   drivers/nvme/host/tcp.c       | 30 +++++++++++++++++-------------
>   4 files changed, 52 insertions(+), 27 deletions(-)
> 