[PATCH 0/6] fix possible controller reset hangs in nvme-tcp/nvme-rdma

Mon Aug 3 02:58:46 EDT 2020

When a controller reset runs during I/O, we may hang if the controller
suddenly becomes unresponsive during the reset and/or the reconnection
stages. This happens because the timeout handler does not fail inflight
commands properly, and because we cannot abort the controller reset
sequence when the controller becomes unresponsive (hence we can never
recover, even if the controller becomes responsive again).

This set fixes nvme-tcp and nvme-rdma for exactly the same scenarios.

Patch 1 fixes a case where commands were prevented from being queued
for a live queue, which caused them to be mistakenly requeued forever
while we are either resetting or connecting to a controller.

Patches 2,4,6 address the case when a controller stops responding when
we are in the middle of a connection establishment stage (tcp and rdma).
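
To illustrate the idea behind patches 2,4,6, here is a minimal
userspace sketch, not the actual driver code. It assumes the freeze
wait reports whether it timed out (patch 2) and that the connect path
gives up when it did. All names below (ctrl_model, wait_freeze_timeout,
configure_io_queues) are hypothetical stand-ins, and "polls" replace
real time.

```c
#include <errno.h>

/* Hypothetical model of a controller, not a kernel structure. */
struct ctrl_model {
	int polls_until_frozen;	/* negative: queues never freeze (dead controller) */
};

/*
 * Assumed semantics: return the remaining budget if the freeze
 * completed in time, or 0 if the wait timed out.
 */
static int wait_freeze_timeout(const struct ctrl_model *c, int timeout)
{
	if (c->polls_until_frozen < 0 || c->polls_until_frozen >= timeout)
		return 0;
	return timeout - c->polls_until_frozen;
}

/* Connect-path step: abort instead of waiting forever. */
static int configure_io_queues(const struct ctrl_model *c, int timeout)
{
	if (!wait_freeze_timeout(c, timeout))
		return -ENODEV;	/* controller died mid-connect, bail out */
	return 0;
}
```

With an unbounded wait, a controller that never responds would block
the connect path indefinitely; the bounded wait turns that into an
error the caller can act on.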

Patches 3,5 rework the timeout handler to fail commands (and allow them
to either requeue or fail) in case the controller is not responsive when
we are in the middle of reset (teardown) or establishment (connect
sequence).
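
The timeout handler behavior described above can be sketched roughly
as follows. This is a simplified userspace model under my own
assumptions, not the driver code: the types, the state names, and the
helpers (fail_request, handle_timeout) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-ins, not the kernel's types. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING };
enum eh_result { EH_DONE, EH_RESET_TIMER };

#define STATUS_HOST_ABORTED (-1)	/* illustrative status value */

struct ctrl {
	enum ctrl_state state;
	bool recovery_scheduled;
};

struct request {
	bool completed;
	int status;
};

/* Complete the request with an error so upper layers can requeue or fail it. */
static void fail_request(struct request *rq)
{
	if (!rq->completed) {
		rq->completed = true;
		rq->status = STATUS_HOST_ABORTED;
	}
}

static enum eh_result handle_timeout(struct ctrl *c, struct request *rq)
{
	if (c->state != CTRL_LIVE) {
		/*
		 * Teardown or connect is in progress, and nothing else
		 * will complete this command, so fail it right here.
		 */
		fail_request(rq);
		return EH_DONE;
	}

	/* Live controller: kick error recovery and let it clean up. */
	c->recovery_scheduled = true;
	return EH_RESET_TIMER;
}
```

The point is that a timeout outside of the live state must complete
the command itself; otherwise the reset or connect sequence waiting on
that command can never make progress.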

James, please have a look at patch 1; it relates to the discussions
we had recently. We still keep the admin commands with a guard, but
that would be addressed in a follow-up set.

Sagi Grimberg (6):
  nvme-fabrics: allow to queue requests for live queues
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-tcp: fix timeout handler
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: fix reset hang if controller died in the middle of a reset

 drivers/nvme/host/core.c    |  3 +-
 drivers/nvme/host/fabrics.c | 13 +++---
 drivers/nvme/host/nvme.h    |  2 +-
 drivers/nvme/host/rdma.c    | 78 ++++++++++++++++++++++++++---------
 drivers/nvme/host/tcp.c     | 81 +++++++++++++++++++++++++++----------
 5 files changed, 130 insertions(+), 47 deletions(-)

-- 
2.25.1