[PATCH v3 0/9] fix possible controller reset hangs in nvme-tcp/nvme-rdma

Thu Aug 20 01:36:42 EDT 2020

When a controller reset runs during I/O, we may hang if the controller
suddenly becomes unresponsive during the reset and/or reconnection
stages. There are two causes: the timeout handler did not fail inflight
commands properly, and the controller reset sequence could not be
aborted once the controller became unresponsive (so we could never
recover, even if the controller later became responsive again).

This set fixes nvme-tcp and nvme-rdma for exactly the same scenarios.
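
As a rough illustration of the timeout handling described above, here is
a minimal userspace sketch. It is not code from the patches: every
model_* name is hypothetical, and the LIVE branch (schedule error
recovery and re-arm the timer) is an assumption about the intended
behavior. The non-LIVE branch reflects the rule that a timed out request
is cancelled right away, since nothing else may ever complete it while
the controller is being reset, reconnected or deleted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration, not the kernel's definitions. */
enum model_ctrl_state {
	MODEL_CTRL_NEW,
	MODEL_CTRL_LIVE,
	MODEL_CTRL_RESETTING,
	MODEL_CTRL_CONNECTING,
	MODEL_CTRL_DELETING,
};

enum model_eh_return {
	MODEL_EH_DONE,		/* the handler dealt with the request itself */
	MODEL_EH_RESET_TIMER,	/* re-arm the timer, something else completes it */
};

struct model_ctrl {
	enum model_ctrl_state state;
	bool recovery_scheduled;
};

struct model_request {
	bool cancelled;
};

/*
 * Decide what to do with a timed out request.
 *
 * Non-LIVE: cancel the request immediately so that a teardown or setup
 * sequence waiting on it cannot block forever.
 * LIVE (assumed): hand over to error recovery, which is expected to
 * complete the request later.
 */
static enum model_eh_return
model_timeout(struct model_ctrl *ctrl, struct model_request *rq)
{
	assert(ctrl != NULL && rq != NULL);

	if (ctrl->state != MODEL_CTRL_LIVE) {
		rq->cancelled = true;
		return MODEL_EH_DONE;
	}

	ctrl->recovery_scheduled = true;
	return MODEL_EH_RESET_TIMER;
}
```

Calling model_timeout() on a controller in MODEL_CTRL_RESETTING cancels
the request and returns MODEL_EH_DONE; on a MODEL_CTRL_LIVE controller it
only marks recovery as scheduled and leaves the request untouched.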

Changes from v2:
- move NVME_CTRL_NEW state check in __nvme_check_ready to a separate patch
- various comment phrasing fixes
- fixed change log descriptions
- changed the "nvme-tcp/nvme-rdma: fix timeout handler" patches to restore
  cancellation of timed out requests in all non-LIVE states, as the
  request is going to be cancelled anyway. The change now purely fixes
  how we serialize and fence against error recovery (as pointed out by
  James).

Changes from v1:
- added patches 3,6 to protect against possible (but rare) double
  completions for timed out requests.
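
To make the double completion hazard concrete, here is a small userspace
sketch. This cover letter does not spell out how the patches guard
against it, so the sketch swaps in a plain C11 atomic flag purely for
illustration; it is not necessarily the mechanism the series uses, and
all model_* names are hypothetical. The point is only that when two
paths (say, the timeout handler and a teardown sequence) both try to
complete the same request, exactly one of them must win.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request model, not a kernel structure. */
struct model_timed_request {
	atomic_flag completed;	/* set by whichever path completes first */
	int completions;	/* how many times completion actually ran */
};

static void model_request_init(struct model_timed_request *rq)
{
	assert(rq != NULL);
	atomic_flag_clear(&rq->completed);
	rq->completions = 0;
}

/*
 * Complete the request at most once. Returns true for the caller that
 * performed the completion and false for any later caller.
 */
static bool model_complete_once(struct model_timed_request *rq)
{
	assert(rq != NULL);

	/* test_and_set returns the previous value: true means someone won. */
	if (atomic_flag_test_and_set(&rq->completed))
		return false;

	rq->completions++;
	return true;
}
```

After model_request_init(), the first model_complete_once() call returns
true and any subsequent call returns false, so the completion counter
stays at 1.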

Sagi Grimberg (9):
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvme-fabrics: allow to queue requests for live queues
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-tcp: serialize controller teardown sequences
  nvme-tcp: fix timeout handler
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-rdma: serialize controller teardown sequences
  nvme-rdma: fix timeout handler
  nvme-rdma: fix reset hang if controller died in the middle of a reset

 drivers/nvme/host/core.c    |  3 +-
 drivers/nvme/host/fabrics.c | 13 +++---
 drivers/nvme/host/nvme.h    |  2 +-
 drivers/nvme/host/rdma.c    | 68 +++++++++++++++++++++++--------
 drivers/nvme/host/tcp.c     | 80 ++++++++++++++++++++++++++-----------
 5 files changed, 119 insertions(+), 47 deletions(-)

-- 
2.25.1