[PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs
Sagi Grimberg
sagi at grimberg.me
Mon Mar 15 22:27:11 GMT 2021
The patches below caused a regression in a multipath setup:
Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
These patches are correct on their own, as they fixed a controller reset
regression.
When we reset/teardown a controller, we must freeze and quiesce the namespaces'
request queues to make sure that we safely stop inflight I/O submissions.
Freeze is mandatory because if our hctx map changed between reconnects,
blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and
if it still has pending submissions (which are still quiesced) it will hang.
This is what the above patches fixed.
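For reference, a minimal sketch of the ordering involved (simplified from
what the tcp and rdma drivers do; the nvme_xxx_* names are illustrative,
error handling and transport specifics are omitted):

	static void nvme_xxx_teardown_io_queues(struct nvme_ctrl *ctrl)
	{
		/*
		 * Start the freeze before quiescing, so that a later
		 * blk_mq_update_nr_hw_queues cannot race against
		 * submissions entering the queues we are tearing down.
		 */
		nvme_start_freeze(ctrl);
		/* quiesce: stop the hctxs from dispatching submissions */
		nvme_stop_queues(ctrl);
		/* ... drain and destroy the transport I/O queues ... */
	}

	static void nvme_xxx_reconnect_io_queues(struct nvme_ctrl *ctrl)
	{
		/* ... re-establish the transport I/O queues ... */
		nvme_start_queues(ctrl);
		/* safe only because the queues were frozen in teardown */
		nvme_wait_freeze(ctrl);
		blk_mq_update_nr_hw_queues(ctrl->tagset,
				ctrl->queue_count - 1);
		nvme_unfreeze(ctrl);
	}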
However, by freezing the namespaces' request queues, and only unfreezing them
when we successfully reconnect, inflight submissions that are running
concurrently can now block while holding the nshead srcu until either we
successfully reconnect or ctrl_loss_tmo expires (or the user explicitly
disconnects).
This caused a deadlock [1] when a different controller (a different path on
the same subsystem) became live (i.e. transitioned to an ANA optimized or
non-optimized state). This is because nvme_mpath_set_live needs to
synchronize the nshead srcu before requeueing I/O, in order to make sure that
current_path is visible to future (re)submissions. However, the srcu read
side is held by a submission blocked on a frozen request queue, and we have
a deadlock.
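Schematically, the two parties of the deadlock (a simplified call-chain
sketch; the function names match the nvme core, the annotations are mine):

	/* side A: submission path, inside the nshead srcu read side */
	nvme_ns_head_submit_bio()
	  srcu_read_lock(&head->srcu)
	  ...
	  submit_bio_noacct(bio)
	    blk_queue_enter()	/* blocks on the frozen request queue */

	/* side B: ANA update arriving on a different, live path */
	nvme_mpath_set_live()
	  ...
	  synchronize_srcu(&ns->head->srcu)	/* waits forever for side A */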
For multipath, we obviously cannot allow that, as we want to fail over I/O
as soon as possible. However, for non-mpath we do not want to fail I/O (at
least not until controller FASTFAIL expires, and that is disabled by default
btw). This creates an asymmetry in how the driver needs to behave in the
presence or absence of multipath.
[1]:
Workqueue: nvme-wq nvme_tcp_reconnect_ctrl_work [nvme_tcp]
Call Trace:
__schedule+0x293/0x730
schedule+0x33/0xa0
schedule_timeout+0x1d3/0x2f0
wait_for_completion+0xba/0x140
__synchronize_srcu.part.21+0x91/0xc0
synchronize_srcu_expedited+0x27/0x30
synchronize_srcu+0xce/0xe0
nvme_mpath_set_live+0x64/0x130 [nvme_core]
nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
nvme_update_ana_state+0xcd/0xe0 [nvme_core]
nvme_parse_ana_log+0xa1/0x180 [nvme_core]
nvme_read_ana_log+0x76/0x100 [nvme_core]
nvme_mpath_init+0x122/0x180 [nvme_core]
nvme_init_identify+0x80e/0xe20 [nvme_core]
nvme_tcp_setup_ctrl+0x359/0x660 [nvme_tcp]
nvme_tcp_reconnect_ctrl_work+0x24/0x70 [nvme_tcp]
In order to fix this, we recognize that the driver needs to behave
differently in error recovery for mpath and non-mpath scenarios, expose this
distinction with a new helper, nvme_ctrl_is_mpath, and use it to decide what
needs to be done.
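For reference, the helper could look something like the below (an
illustrative sketch only; see patch 1 for the actual condition):

	static inline bool nvme_ctrl_is_mpath(struct nvme_ctrl *ctrl)
	{
		/*
		 * Illustrative check: multipath support is compiled in and
		 * the subsystem may be served by multiple controllers.
		 */
		return IS_ENABLED(CONFIG_NVME_MULTIPATH) &&
			ctrl->subsys &&
			(ctrl->subsys->cmic & NVME_CTRL_CMIC_MULTI_CTRL);
	}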
Sagi Grimberg (3):
nvme: introduce nvme_ctrl_is_mpath helper
nvme-tcp: fix possible hang when trying to set a live path during I/O
nvme-rdma: fix possible hang when trying to set a live path during I/O
drivers/nvme/host/multipath.c | 5 +++--
drivers/nvme/host/nvme.h | 15 +++++++++++++++
drivers/nvme/host/rdma.c | 29 +++++++++++++++++------------
drivers/nvme/host/tcp.c | 30 +++++++++++++++++-------------
4 files changed, 52 insertions(+), 27 deletions(-)
--
2.27.0