Another weird deadlock with nvme-tcp

Sun Oct 31 04:55:51 PDT 2021

> Hi Sagi,

Hey Hannes, thanks for reporting.

> and I've run into another weird deadlock; this time it's nvme-tcp not
> flushing timed out commands when deleting the controller:
> 
> [ 1685.982355] nvme nvme0: Removing ctrl: NQN
> "nqn.2014-08.org.nvmexpress:uuid:62f37f51-0cc7-46d5-9865-4de22e81bd9d"
> [ 1688.533746] nvme nvme0: queue 2: timeout request 0x72 type 4

So in this case, nvme_tcp_timeout() should complete the request,
since ctrl->state is for sure not LIVE.

The request should then be completed with
NVME_SC_HOST_ABORTED_CMD - worth checking.
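
To spell out what I mean, here is a rough standalone model of the
decision I'd expect in the timeout handler. It is a sketch, not the
driver code: the enum, struct and function names are made up for
illustration, and 0x371 is assumed to be the value of
NVME_SC_HOST_ABORTED_CMD in include/linux/nvme.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the controller state and the blk-mq return codes. */
enum ctrl_state { CTRL_NEW, CTRL_LIVE, CTRL_RESETTING, CTRL_CONNECTING, CTRL_DELETING };
enum eh_action { EH_DONE, EH_RESET_TIMER };

/* Assumed value of NVME_SC_HOST_ABORTED_CMD. */
#define SC_HOST_ABORTED_CMD 0x371

/* Hypothetical, minimal view of a request: only what the model needs. */
struct model_req {
	bool completed;
	uint16_t status;
};

/*
 * Model of the expected timeout behaviour:
 * - controller not LIVE: nobody else will complete the request, so
 *   complete it here with a host-aborted status and report "done";
 * - controller LIVE: leave the request alone (the real driver would
 *   kick error recovery) and ask for the timer to be re-armed.
 */
static enum eh_action model_timeout(enum ctrl_state state, struct model_req *rq)
{
	if (state != CTRL_LIVE) {
		if (!rq->completed) {
			rq->status = SC_HOST_ABORTED_CMD;
			rq->completed = true;
		}
		return EH_DONE;
	}
	return EH_RESET_TIMER;
}
```

With the controller being removed (state DELETING in this model), the
request ends up completed with the aborted status instead of waiting
for another timeout.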

Also, this means that in the completion path, nvme_decide_disposition()
is expected to return FAILOVER, as REQ_NVME_MPATH is set and the
status should make nvme_is_path_error() evaluate to true - worth
checking.
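
For reference, the check I have in mind looks roughly like the model
below. This is a simplified sketch under assumptions, not a copy of the
core code: the function names and parameters are hypothetical, and the
status values (0x370, 0x371, DNR bit 0x4000) and the 0x700/0x300 mask
are what I believe include/linux/nvme.h uses, so please verify them
against your tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed status values; check against include/linux/nvme.h. */
#define SC_HOST_PATH_ERROR  0x370
#define SC_HOST_ABORTED_CMD 0x371
#define SC_DNR              0x4000

enum disposition { COMPLETE, RETRY, FAILOVER };

/* Path error: status code type 3h ("path related status"). */
static bool model_is_path_error(uint16_t status)
{
	return (status & 0x700) == 0x300;
}

/*
 * Simplified model of the disposition decision. All parameters are
 * hypothetical flattenings of state the kernel reads from the request
 * and its queue.
 */
static enum disposition model_decide(uint16_t status, bool mpath,
				     bool noretry, bool queue_dying,
				     unsigned int retries,
				     unsigned int max_retries)
{
	if (status == 0)
		return COMPLETE;

	/* Failfast, DNR or exhausted retries win over failover. */
	if (noretry || (status & SC_DNR) || retries >= max_retries)
		return COMPLETE;

	if (mpath) {
		if (model_is_path_error(status) || queue_dying)
			return FAILOVER;
	} else if (queue_dying) {
		return COMPLETE;
	}

	return RETRY;
}
```

Note that in this model a set DNR bit or a failfast request short-cuts
to COMPLETE before the multipath check is reached, so that is one more
thing to look at if the request does not fail over.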

> [ 1688.533781] nvme nvme0: failed to send request -104

In this case (-EPIPE and -ECONNRESET; -104 here is -ECONNRESET),
nvme-tcp will complete the command with NVME_SC_HOST_PATH_ERROR,
which is also a path error, so the same behavior should happen -
worth checking.