[PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Wed Jun 25 21:45:11 PDT 2025
On Jun 13, 2025 / 12:56, Ming Lei wrote:
> On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
> > Cc+: Ming,
> >
> > Hi Sagi, thanks for the background explanation and the suggestion.
> >
> > On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> > ...
> > > > This is a problem. We are flushing a work that is IO bound, which can
> > > > take a long time to complete. Up until now, we have deliberately
> > > > avoided introducing a dependency between reset forward progress and
> > > > the completion of scan work IO.
> > > >
> > > > I would like to keep it this way.
> > > >
> > > > BTW, this is not TCP specific.
> >
> > I see. The blktests test case nvme/063 is dedicated to the TCP transport,
> > which is why the issue was reported against TCP.
> >
> > >
> > > blk_mq_unquiesce_queue is still very much safe to call as many times as
> > > we want. The only thing that comes in the way is this pesky WARN. How
> > > about we make it go away and have a debug print instead?
> > >
> > > My preference would be to allow nvme to unquiesce queues that were not
> > > previously quiesced (just like it historically was) instead of having to
> > > block a controller reset until the scan_work is completed (which is admin
> > > I/O dependent, and may get stuck until the admin timeout, which can be
> > > changed by the user to 60 minutes or something arbitrarily long, btw).
> > >
> > > How about something like this patch instead:
> > > --
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index c2697db59109..74f3ad16e812 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> > >          bool run_queue = false;
> > >
> > >          spin_lock_irqsave(&q->queue_lock, flags);
> > > -        if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > > -                ;
> > > +        if (q->quiesce_depth <= 0) {
> > > +                printk(KERN_DEBUG
> > > +                       "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > > +                       q->disk ? q->disk->disk_name : "?");
> > >          } else if (!--q->quiesce_depth) {
> > >                  blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > >                  run_queue = true;
> > > --
> >
> > The WARN was introduced by commit e70feb8b3e68 ("blk-mq: support
> > concurrent queue quiesce/unquiesce"), which Ming authored. Ming, may I
> > ask for your comment on Sagi's suggestion?
>
> I think it is bad to use a standard block layer API in this unbalanced way;
> that is why the WARN_ON_ONCE() was added. We shouldn't encourage drivers to
> use APIs in this way.
>
> The question is why nvme has to unquiesce a non-quiesced queue?
Ming, thanks for the comment. I think the call trace in the commit message of
my patch answers your question: blk_mq_add_queue_tag_set() is called between
the blk_mq_quiesce_tagset() and blk_mq_unquiesce_tagset() calls. Let me show
the call trace again below.

Anyway, the WARN disappeared with the v6.16-rc1 kernel, so from my point of
view this problem does not have high priority at the moment.
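
For background, the quiesce_depth counting that this WARN protects can be
modelled in a few lines of plain C. The sketch below is only a simplified
userspace model, not the actual block layer code; struct request_queue,
quiesce() and unquiesce() here are stripped down to the depth counter alone:
--
#include <stdio.h>

struct request_queue {
        int quiesce_depth;      /* 0 means the queue is not quiesced */
};

/* Modelled after blk_mq_quiesce_queue(): each caller bumps the depth. */
static void quiesce(struct request_queue *q)
{
        q->quiesce_depth++;
}

/* Modelled after blk_mq_unquiesce_queue(): only a balanced call resumes I/O. */
static void unquiesce(struct request_queue *q)
{
        if (q->quiesce_depth <= 0) {
                fprintf(stderr, "WARN: unquiescing a non-quiesced queue\n");
                return;
        }
        if (--q->quiesce_depth == 0)
                printf("queue resumes dispatching\n");
}

int main(void)
{
        struct request_queue q = { 0 };

        quiesce(&q);    /* first caller, depth 0 -> 1 */
        quiesce(&q);    /* concurrent caller, depth 1 -> 2 */
        unquiesce(&q);  /* depth 2 -> 1, still quiesced */
        unquiesce(&q);  /* depth 1 -> 0, queue resumes */
        unquiesce(&q);  /* unbalanced call, hits the depth <= 0 check */
        return 0;
}
--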
Reset work                                  Scan work
------------------------------------------------------------------------
nvme_reset_ctrl_work()
  nvme_tcp_teardown_ctrl()
    nvme_tcp_teardown_io_queues()
      nvme_quiesce_io_queues()
        blk_mq_quiesce_tagset()
          list_for_each_entry() .. N queues
            blk_mq_quiesce_queue()
                                            nvme_scan_work()
                                              nvme_scan_ns_*()
                                                nvme_scan_ns()
                                                  nvme_alloc_ns()
                                                    blk_mq_alloc_disk()
                                                      __blk_mq_alloc_disk()
                                                        blk_mq_alloc_queue()
                                                          blk_mq_init_allocate_queue()
                                                            blk_mq_add_queue_tag_set()
                                                              list_add_tail() .. N+1 queues
  nvme_tcp_setup_ctrl()
    nvme_start_ctrl()
      nvme_unquiesce_io_queues()
        blk_mq_unquiesce_tagset()
          list_for_each_entry() .. N+1 queues
            blk_mq_unquiesce_queue()
              WARN_ON_ONCE(q->quiesce_depth <= 0)
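
To make the ordering above concrete, the race can also be modelled as a small
standalone C program. Again, this is only an illustrative sketch, not kernel
code; struct tagset, quiesce_tagset() and unquiesce_tagset() are invented
stand-ins for the tagset queue list and the blk_mq_{quiesce,unquiesce}_tagset()
helpers:
--
#include <stdio.h>

#define MAX_QUEUES 8

struct queue {
        int quiesce_depth;
};

struct tagset {
        struct queue *queues[MAX_QUEUES];
        int nr_queues;
};

/* Stands in for blk_mq_quiesce_tagset(): quiesce every queue on the list. */
static void quiesce_tagset(struct tagset *set)
{
        for (int i = 0; i < set->nr_queues; i++)
                set->queues[i]->quiesce_depth++;
}

/* Stands in for blk_mq_unquiesce_tagset(): unquiesce every queue on the list. */
static void unquiesce_tagset(struct tagset *set)
{
        for (int i = 0; i < set->nr_queues; i++) {
                struct queue *q = set->queues[i];

                if (q->quiesce_depth <= 0)
                        fprintf(stderr, "WARN: queue %d was never quiesced\n", i);
                else
                        q->quiesce_depth--;
        }
}

int main(void)
{
        struct queue q0 = { 0 }, q1 = { 0 }, qnew = { 0 };
        struct tagset set = { .queues = { &q0, &q1 }, .nr_queues = 2 };

        /* Reset work: N (= 2) queues are quiesced. */
        quiesce_tagset(&set);

        /* Scan work: a new namespace queue joins the tagset list. */
        set.queues[set.nr_queues++] = &qnew;

        /*
         * Reset work: the unquiesce pass now walks N + 1 queues and
         * warns on the one that was added after the quiesce pass.
         */
        unquiesce_tagset(&set);
        return 0;
}
--
Running the model prints a warning only for the queue added between the two
passes, which mirrors where the WARN fires in the trace above.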