[PATCH v2 8/8] nvme-rdma: fix reset hang if controller died in the middle of a reset
Christoph Hellwig
hch at lst.de
Fri Aug 14 02:53:46 EDT 2020
On Thu, Aug 06, 2020 at 12:11:27PM -0700, Sagi Grimberg wrote:
> If the controller becomes unresponsive in the middle of a reset, we
> will hang because we are waiting for the freeze to complete, but that
> cannot happen since we have commands that are inflight holding the
> q_usage_counter, and we can't blindly fail requests that times out.
>
> So give a timeout and if we cannot wait for queue freeze before
> unfreezing, fail and have the error handling take care how to
> proceed (either schedule a reconnect of remove the controller).
>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> ---
> drivers/nvme/host/rdma.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 30b401fcc06a..4ca53b864636 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -976,7 +976,13 @@ static int nvme_rdma_configure_io_queues(struct nvme_rdma_ctrl *ctrl, bool new)
>
> if (!new) {
> nvme_start_queues(&ctrl->ctrl);
> - nvme_wait_freeze(&ctrl->ctrl);
> + if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
> + /* if we timed out waiting for freeze we are
> + * likely stuck, fail just to be safe
> + */
/*
* If we timed out waiting for freeze we are likely to
* be stuck. Fail the controller initialization just
* to be safe.
*/
More information about the Linux-nvme
mailing list