[PATCH] blk-flush: fix possible deadlock when processing nvme_timeout()
Ye Bin
yebin at huaweicloud.com
Mon Jun 8 04:39:23 PDT 2026
From: Ye Bin <yebin10 at huawei.com>
The following hang occurs when processing nvme_timeout():
[ 206.734601][ T8184] nvme nvme0: I/O tag 512 (1200) opcode 0x0 (I/O Cmd) QID 3 timeout, aborting req_op:FLUSH(2) size:0
[ 206.736112][ C0] nvme nvme0: Abort status: 0x0
[ 208.094637][ T8184] nvme nvme0: I/O tag 512 (1200) opcode 0x0 (I/O Cmd) QID 3 timeout, reset controller
[root at localhost ~]# cat /proc/8184/stack
[<0>] msleep+0x37/0x50
[<0>] blk_mq_tagset_wait_completed_request+0x6f/0xe0
[<0>] nvme_cancel_tagset+0x79/0xa0
[<0>] nvme_dev_disable+0x55c/0x7e0
[<0>] nvme_timeout+0x25b/0x1530
[<0>] blk_mq_handle_expired+0x210/0x2c0
[<0>] bt_iter+0x2bb/0x360
[<0>] blk_mq_queue_tag_busy_iter+0x9f8/0x1f30
[<0>] blk_mq_timeout_work+0x5dc/0x7d0
[<0>] process_one_work+0xa08/0x1d00
[<0>] worker_thread+0x698/0xeb0
[<0>] kthread+0x408/0x540
[<0>] ret_from_fork+0xa4d/0xdd0
[<0>] ret_from_fork_asm+0x1a/0x30
The above issue may happen as follows:
nvme_timeout // first timeout of the flush request (tag 512)
iod->aborted = 1;
abort_req = nvme_alloc_request(dev->ctrl.admin_q, &cmd,
BLK_MQ_REQ_NOWAIT, NVME_QID_ANY); // Abort tag 512 flush request
blk_execute_rq_nowait(abort_req->q, NULL, abort_req, 0, abort_endio);
// Returns without waiting for the abort request to complete
....
**** 'abort_req' has not completed ****
....
nvme_timeout // second timeout of the flush request (tag 512)
if (!nvmeq->qid || (iod->flags & IOD_ABORTED))
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
goto disable;
...
**** the flush request (tag 512) completes ****
nvme_try_complete_req
blk_mq_complete_request_remote(req);
WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
...
nvme_end_req(req);
blk_mq_end_request(req, status);
__blk_mq_end_request(rq, error);
if (rq->end_io)
rq->end_io(rq, error);
flush_end_io(rq, error);
// The timeout handler still holds a reference,
// so the request stays in the MQ_RQ_COMPLETE state
if (!refcount_dec_and_test(&flush_rq->ref))
fq->rq_status = error;
return;
**** the tag 512 flush request is in the MQ_RQ_COMPLETE state ****
disable:
nvme_dev_disable(dev, false);
nvme_cancel_tagset(&dev->ctrl);
blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request,
&dev->ctrl);
nvme_cancel_request
if (blk_mq_request_completed(req))
return true;
blk_mq_tagset_wait_completed_request(&dev->tagset);
while (true)
blk_mq_tagset_busy_iter(tagset,
blk_mq_tagset_count_completed_rqs, &count);
blk_mq_tagset_count_completed_rqs();
// the request is in the MQ_RQ_COMPLETE state
if (blk_mq_request_completed(rq)) // return true
(*count)++;
if (!count) // 'count' is never 0, so the loop never ends
break;
msleep(5);
This happens because the timeout handler holds a reference to the
request. flush_end_io() returns early when it does not drop the last
reference, so the flush request stays in the MQ_RQ_COMPLETE state.
As a result, nvme_dev_disable() loops forever in
blk_mq_tagset_wait_completed_request().
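The endless wait can be modeled in user space. The sketch below is only
an illustration under simplifying assumptions, not kernel code: the
names model_rq, rq_state, count_completed() and wait_completed() are
hypothetical, and the max_polls bound exists only so that the model
terminates where the kernel loop would not.

```c
#include <stdbool.h>

/* Simplified stand-ins for the block layer's request states. */
enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

/* Hypothetical, heavily reduced model of a request. */
struct model_rq {
	enum rq_state state;
	int ref;	/* models the request reference count */
};

/* Models blk_mq_tagset_count_completed_rqs(): count COMPLETE requests. */
static unsigned int count_completed(const struct model_rq *rqs, int n)
{
	unsigned int count = 0;

	for (int i = 0; i < n; i++)
		if (rqs[i].state == RQ_COMPLETE)
			count++;
	return count;
}

/*
 * Models the polling loop of blk_mq_tagset_wait_completed_request(),
 * but gives up after max_polls iterations. Returns true if the wait
 * finished, false if a request was still COMPLETE on every poll.
 */
static bool wait_completed(const struct model_rq *rqs, int n, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (!count_completed(rqs, n))
			return true;
		/* the kernel sleeps with msleep(5) here before polling again */
	}
	return false;
}
```

With one request left in RQ_COMPLETE and nothing able to change its
state, wait_completed() returns false for any max_polls, which
corresponds to the msleep() frame at the top of the stack above.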
To fix this, when the flush request has timed out and the only remaining
reference is held by the timeout handler, change the request state to
MQ_RQ_IDLE in advance. It is then safe to call
blk_mq_tagset_wait_completed_request() during timeout handling.
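The intended effect can be sketched with a small user-space model. This
is an assumption-laden illustration, not the kernel implementation:
model_rq, MODEL_TIMED_OUT and model_flush_end_io() are hypothetical
names, locking is omitted, and the apply_fix parameter exists only to
compare the behavior with and without the change.

```c
#include <stdbool.h>

/* Simplified stand-ins for the block layer's request states. */
enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

#define MODEL_TIMED_OUT 0x1u	/* stands in for RQF_TIMED_OUT */

/* Hypothetical, heavily reduced model of the flush request. */
struct model_rq {
	enum rq_state state;
	int ref;		/* models the request reference count */
	unsigned int flags;
};

/*
 * Models the early-return branch of flush_end_io(). Returns true if
 * this call dropped the last reference, false if another holder
 * (here: the timeout handler) still has one.
 */
static bool model_flush_end_io(struct model_rq *rq, bool apply_fix)
{
	if (--rq->ref == 0)
		return true;	/* last reference: normal completion continues */

	/*
	 * Proposed change: if the only remaining reference belongs to
	 * the timeout handler, stop reporting the request as completed.
	 */
	if (apply_fix && rq->ref == 1 && (rq->flags & MODEL_TIMED_OUT))
		rq->state = RQ_IDLE;
	return false;
}
```

Starting from state RQ_COMPLETE, ref 2 and MODEL_TIMED_OUT set, the
model leaves the request in RQ_COMPLETE without the change and moves it
to RQ_IDLE with it, so a later scan for completed requests no longer
counts it.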
Fixes: e1569a16180a ("nvme: do not restart the request timeout if we're resetting the controller")
Signed-off-by: Ye Bin <yebin10 at huawei.com>
---
block/blk-flush.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 403a46c86411..d12839b1fcb5 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -213,6 +213,18 @@ static enum rq_end_io_ret flush_end_io(struct request *flush_rq,
if (!req_ref_put_and_test(flush_rq)) {
fq->rq_status = error;
+
+ /*
+ * The timeout processing flow holds the reference count
+ * of flush_rq. If the last reference count is held by the
+ * timeout processing flow, the status of flush_rq must be
+ * changed to MQ_RQ_IDLE in advance. Otherwise, a deadlock
+ * occurs when blk_mq_tagset_wait_completed_request() is
+ * called in the timeout processing flow.
+ */
+ if (req_ref_read(flush_rq) == 1 &&
+ flush_rq->rq_flags & RQF_TIMED_OUT)
+ WRITE_ONCE(flush_rq->state, MQ_RQ_IDLE);
spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
return RQ_END_IO_NONE;
}
--
2.34.1