Another weird deadlock with nvme-tcp
Sagi Grimberg
sagi at grimberg.me
Sun Oct 31 04:55:51 PDT 2021
> Hi Sagi,
Hey Hannes, thanks for reporting.
> and I've run into another weird deadlock; this time it's nvme-tcp not
> flushing timed out commands when deleting the controller:
>
> [ 1685.982355] nvme nvme0: Removing ctrl: NQN
> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
> [ 1688.533746] nvme nvme0: queue 2: timeout request 0x72 type 4
So in this case, nvme_tcp_timeout() should complete the request
itself, as ctrl->state is for sure not LIVE while the controller is
being removed. The request should then be completed with
NVME_SC_HOST_ABORTED_CMD - worth checking.
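
The timeout decision described above can be sketched as a small
userspace model. This is not the driver code: every model_* name is a
hypothetical stand-in, the status value 0x371 is assumed to match
NVME_SC_HOST_ABORTED_CMD in include/linux/nvme.h, and queue stopping
and locking are left out entirely.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the controller states relevant here. */
enum model_ctrl_state {
	MODEL_CTRL_LIVE,
	MODEL_CTRL_RESETTING,
	MODEL_CTRL_CONNECTING,
	MODEL_CTRL_DELETING,
};

/* Hypothetical stand-ins for the block layer timeout return codes. */
enum model_eh_ret {
	MODEL_EH_DONE,		/* handler completed the request itself */
	MODEL_EH_RESET_TIMER,	/* error recovery will complete it later */
};

/* Assumed to mirror NVME_SC_HOST_ABORTED_CMD. */
#define MODEL_SC_HOST_ABORTED_CMD 0x371

struct model_request {
	bool completed;
	unsigned int status;
};

/*
 * Model of the expected behavior: when the controller is not LIVE
 * (for example while it is being deleted), the timed out request is
 * completed right away so that teardown is not blocked. When it is
 * LIVE, normal error recovery is kicked off and the timer is re-armed.
 */
enum model_eh_ret model_timeout(enum model_ctrl_state state,
				struct model_request *rq,
				bool *recovery_started)
{
	if (state != MODEL_CTRL_LIVE) {
		if (!rq->completed) {
			rq->status = MODEL_SC_HOST_ABORTED_CMD;
			rq->completed = true;
		}
		return MODEL_EH_DONE;
	}

	*recovery_started = true;
	return MODEL_EH_RESET_TIMER;
}
```

In the log above the controller is being removed, so the model takes
the first branch: the request is completed with the aborted status and
no error recovery is started.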
Also, this means that in the completion path
nvme_decide_disposition() is expected to return FAILOVER, since
REQ_NVME_MPATH is set and the status should make nvme_is_path_error()
evaluate to true - worth checking.
> [ 1688.533781] nvme nvme0: failed to send request -104
In this case (-EPIPE and -ECONNRESET; -104 is -ECONNRESET), nvme-tcp
will complete the command with NVME_SC_HOST_PATH_ERROR, which is also
a path error, so the same behavior should happen - worth checking.
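
The path-error reasoning above can be checked with a small userspace
model. It is a sketch under assumptions, not the kernel code: the
model_* names are hypothetical, the status values are assumed to match
include/linux/nvme.h (0x370 for NVME_SC_HOST_PATH_ERROR, 0x371 for
NVME_SC_HOST_ABORTED_CMD, 0x4000 for NVME_SC_DNR), and the decision
order is assumed to follow nvme_decide_disposition() as it looked
around the time of this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's status definitions. */
#define MODEL_SC_HOST_PATH_ERROR	0x370
#define MODEL_SC_HOST_ABORTED_CMD	0x371
#define MODEL_SC_DNR			0x4000

enum model_disposition {
	MODEL_COMPLETE,
	MODEL_RETRY,
	MODEL_FAILOVER,
};

/* A status is path related when its status code type field is 3. */
bool model_is_path_error(uint16_t status)
{
	return (status & 0x700) == 0x300;
}

/*
 * Model of the completion decision: success, no-retry requests, DNR
 * and exhausted retries complete; multipath requests fail over on a
 * path error or a dying queue; everything else is retried.
 */
enum model_disposition model_decide_disposition(uint16_t status,
						bool mpath, bool noretry,
						bool queue_dying,
						int retries, int max_retries)
{
	if (status == 0)
		return MODEL_COMPLETE;

	if (noretry || (status & MODEL_SC_DNR) || retries >= max_retries)
		return MODEL_COMPLETE;

	if (mpath) {
		if (model_is_path_error(status) || queue_dying)
			return MODEL_FAILOVER;
	} else {
		if (queue_dying)
			return MODEL_COMPLETE;
	}

	return MODEL_RETRY;
}
```

Under these assumptions both statuses mentioned in this reply fall in
the path-related range, so a multipath request carrying either of them
is failed over rather than completed with an error or retried on the
same path.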