[PATCH] nvme-rdma: fix deadlock when delete ctrl due to reconnect fail

Sagi Grimberg sagi at grimberg.me
Tue Jul 28 12:27:26 EDT 2020


>>> The I/O will not fail. With native multipath or dm-multipath,
>>> nvme_rdma_queue_rq will return an I/O error, and multipath will then
>>> fail over to another path and retry the I/O, which is what we expect.
>>> Without multipath, nvme_rdma_queue_rq will return BLK_STS_RESOURCE,
>>> and the upper layer will requeue and retry. Admittedly there is a
>>> weakness: the I/O will be retried every BLK_MQ_RESOURCE_DELAY (3 ms)
>>> while reconnecting. Because a controller reset may take a long time,
>>> and NVMe over RoCE is mainly used with multipath software, we expect
>>> a controller reset to fail over to another path and retry the I/O, just
>>> like error recovery. Without multipath, we tolerate repeated
>>> I/O retries during error recovery or controller reset.
>>
>> I/O should not fail during reset, mpath or not, period.
> 
> except when marked as an internal io (one used for reconnect, or maybe 
> an ioctl) or marked for mpath.

I meant normal fs I/O, from the user perspective.
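
For readers following the thread, the dispatch behaviour proposed in the
quoted text at the top can be sketched roughly as below. This is only an
illustration of that description, not code from the patch or from the
driver: every sketch_* name is hypothetical, the enum merely stands in for
the kernel's blk_status_t values, and the only figure taken from the
thread is the 3 ms BLK_MQ_RESOURCE_DELAY.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's blk_status_t values. */
enum sketch_status {
    SKETCH_STS_OK,        /* request is dispatched normally */
    SKETCH_STS_IOERR,     /* request fails so multipath can fail over */
    SKETCH_STS_RESOURCE,  /* block layer should requeue and retry later */
};

/* Retry interval quoted in the thread: BLK_MQ_RESOURCE_DELAY (3 ms). */
#define SKETCH_RESOURCE_DELAY_MS 3

/* Hypothetical, simplified view of the state that drives the decision. */
struct sketch_request {
    bool ctrl_live;   /* controller is connected and usable */
    bool multipath;   /* native multipath or dm-multipath is in use */
};

/* Models the quoted proposal: what queue_rq would return for a request. */
enum sketch_status sketch_queue_rq(const struct sketch_request *rq)
{
    if (rq->ctrl_live)
        return SKETCH_STS_OK;
    if (rq->multipath)
        return SKETCH_STS_IOERR;      /* fail over to another path */
    return SKETCH_STS_RESOURCE;       /* requeue, retry after the delay */
}

/*
 * Rough number of requeue attempts one non-multipath request would see
 * if the controller stays unavailable for reconnect_ms milliseconds.
 */
unsigned int sketch_retry_count(unsigned int reconnect_ms)
{
    return reconnect_ms / SKETCH_RESOURCE_DELAY_MS;
}
```

Under these assumptions, a reconnect lasting 3000 ms gives
sketch_retry_count(3000) == 1000 requeues for a single request, which is
the "weakness" the quoted text concedes for the non-multipath case.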


