[PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Wed Jun 25 21:45:11 PDT 2025
On Jun 13, 2025 / 12:56, Ming Lei wrote:
> On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
> > Cc+: Ming,
> >
> > Hi Sagi, thanks for the background explanation and the suggestion.
> >
> > On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
> > ...
> > > > This is a problem. We are flushing a work that is IO bound, which can
> > > > take a long time to complete. Up until now, we have deliberately
> > > > avoided introducing a dependency between reset forward progress and
> > > > the completion of scan work IO.
> > > >
> > > > I would like to keep it this way.
> > > >
> > > > BTW, this is not TCP specific.
> >
> > I see. The blktests test case nvme/063 is dedicated to the TCP transport,
> > which is why the issue was reported against TCP.
> >
> > >
> > > blk_mq_unquiesce_queue is still very much safe to call as many times as
> > > we want. The only thing that comes in the way is this pesky WARN. How
> > > about we make it go away and have a debug print instead?
> > >
> > > My preference would be to allow nvme to unquiesce queues that were not
> > > previously quiesced (just like it historically was) instead of having to
> > > block a controller reset until the scan_work is completed (which is admin
> > > I/O dependent, and may get stuck until the admin timeout, which can be
> > > changed by the user to 60 minutes or something arbitrarily long, btw).
> > >
> > > How about something like this patch instead:
> > > --
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index c2697db59109..74f3ad16e812 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
> > >          bool run_queue = false;
> > >
> > >          spin_lock_irqsave(&q->queue_lock, flags);
> > > -        if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
> > > -                ;
> > > +        if (q->quiesce_depth <= 0) {
> > > +                printk(KERN_DEBUG
> > > +                       "dev %s: unquiescing a non-quiesced queue, expected?\n",
> > > +                       q->disk ? q->disk->disk_name : "?");
> > >          } else if (!--q->quiesce_depth) {
> > >                  blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
> > >                  run_queue = true;
> > > --
> >
> > The WARN was introduced by commit e70feb8b3e68 ("blk-mq: support
> > concurrent queue quiesce/unquiesce"), which Ming authored. Ming, may I
> > ask for your comment on Sagi's suggestion?
>
> I think it is bad to use a standard block layer API in this unbalanced way;
> that is why the WARN_ON_ONCE() was added. We shouldn't encourage drivers to
> use APIs in this way.
>
> The question is why nvme has to unquiesce a non-quiesced queue?
Ming, thanks for the comment. I think the call trace in the commit message of
my patch answers your question: blk_mq_add_queue_tag_set() is called between
the blk_mq_quiesce_tagset() and blk_mq_unquiesce_tagset() calls. Let me show
the call trace again below.

Anyway, the WARN disappeared with the v6.16-rc1 kernel, so from my point of
view this problem does not have high priority at the moment.
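
For background, the quiesce_depth counting that this WARN protects can be
modelled in a few lines of plain C. The sketch below is only a simplified
userspace model, not the actual block layer code; struct request_queue,
quiesce() and unquiesce() here are stripped down to the depth counter alone:
--
#include <stdio.h>

struct request_queue {
        int quiesce_depth;      /* 0 means the queue is not quiesced */
};

/* Modelled after blk_mq_quiesce_queue(): each caller bumps the depth. */
static void quiesce(struct request_queue *q)
{
        q->quiesce_depth++;
}

/* Modelled after blk_mq_unquiesce_queue(): only a balanced call resumes I/O. */
static void unquiesce(struct request_queue *q)
{
        if (q->quiesce_depth <= 0) {
                fprintf(stderr, "WARN: unquiescing a non-quiesced queue\n");
                return;
        }
        if (--q->quiesce_depth == 0)
                printf("queue resumes dispatching\n");
}

int main(void)
{
        struct request_queue q = { 0 };

        quiesce(&q);    /* first caller, depth 0 -> 1 */
        quiesce(&q);    /* concurrent caller, depth 1 -> 2 */
        unquiesce(&q);  /* depth 2 -> 1, still quiesced */
        unquiesce(&q);  /* depth 1 -> 0, queue resumes */
        unquiesce(&q);  /* unbalanced call, hits the depth <= 0 check */
        return 0;
}
--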
Reset work                                  Scan work
------------------------------------------------------------------------
nvme_reset_ctrl_work()
  nvme_tcp_teardown_ctrl()
    nvme_tcp_teardown_io_queues()
      nvme_quiesce_io_queues()
        blk_mq_quiesce_tagset()
          list_for_each_entry() .. N queues
            blk_mq_quiesce_queue()
                                            nvme_scan_work()
                                              nvme_scan_ns_*()
                                                nvme_scan_ns()
                                                  nvme_alloc_ns()
                                                    blk_mq_alloc_disk()
                                                      __blk_mq_alloc_disk()
                                                        blk_mq_alloc_queue()
                                                          blk_mq_init_allocate_queue()
                                                            blk_mq_add_queue_tag_set()
                                                              list_add_tail() .. N+1 queues
  nvme_tcp_setup_ctrl()
    nvme_start_ctrl()
      nvme_unquiesce_io_queues()
        blk_mq_unquiesce_tagset()
          list_for_each_entry() .. N+1 queues
            blk_mq_unquiesce_queue()
              WARN_ON_ONCE(q->quiesce_depth <= 0)
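
To make the ordering above concrete, the race can also be modelled as a small
standalone C program. Again, this is only an illustrative sketch, not kernel
code; struct tagset, quiesce_tagset() and unquiesce_tagset() are invented
stand-ins for the tagset queue list and the blk_mq_{quiesce,unquiesce}_tagset()
helpers:
--
#include <stdio.h>

#define MAX_QUEUES 8

struct queue {
        int quiesce_depth;
};

struct tagset {
        struct queue *queues[MAX_QUEUES];
        int nr_queues;
};

/* Stands in for blk_mq_quiesce_tagset(): quiesce every queue on the list. */
static void quiesce_tagset(struct tagset *set)
{
        for (int i = 0; i < set->nr_queues; i++)
                set->queues[i]->quiesce_depth++;
}

/* Stands in for blk_mq_unquiesce_tagset(): unquiesce every queue on the list. */
static void unquiesce_tagset(struct tagset *set)
{
        for (int i = 0; i < set->nr_queues; i++) {
                struct queue *q = set->queues[i];

                if (q->quiesce_depth <= 0)
                        fprintf(stderr, "WARN: queue %d was never quiesced\n", i);
                else
                        q->quiesce_depth--;
        }
}

int main(void)
{
        struct queue q0 = { 0 }, q1 = { 0 }, qnew = { 0 };
        struct tagset set = { .queues = { &q0, &q1 }, .nr_queues = 2 };

        /* Reset work: N (= 2) queues are quiesced. */
        quiesce_tagset(&set);

        /* Scan work: a new namespace queue joins the tagset list. */
        set.queues[set.nr_queues++] = &qnew;

        /*
         * Reset work: the unquiesce pass now walks N + 1 queues and
         * warns on the one that was added after the quiesce pass.
         */
        unquiesce_tagset(&set);
        return 0;
}
--
Running the model prints a warning only for the queue added between the two
passes, which mirrors where the WARN fires in the trace above.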