[PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset

Sagi Grimberg sagi at grimberg.me
Sat Jun 28 03:24:05 PDT 2025



On 13/06/2025 7:56, Ming Lei wrote:
> On Wed, Jun 04, 2025 at 11:17:05AM +0000, Shinichiro Kawasaki wrote:
>> Cc+: Ming,
>>
>> Hi Sagi, thanks for the background explanation and the suggestion.
>>
>> On Jun 04, 2025 / 10:10, Sagi Grimberg wrote:
>> ...
>>>> This is a problem. We are flushing a work item that is I/O bound, which
>>>> can take a long time to complete.
>>>> Up until now, we have deliberately avoided introducing a dependency
>>>> between reset forward progress and the completion of scan work I/O.
>>>>
>>>> I would like to keep it this way.
>>>>
>>>> BTW, this is not TCP specific.
>> I see. The blktests test case nvme/063 is dedicated to the TCP transport,
>> which is why the issue was reported against TCP.
>>
>>> blk_mq_unquiesce_queue is still very much safe to call as many times as
>>> we want. The only thing that gets in the way is this pesky WARN. How about
>>> we make it go away and emit a debug print instead?
>>>
>>> My preference would be to allow nvme to unquiesce queues that were not
>>> previously quiesced (just like it historically did), instead of having to
>>> block a controller reset until the scan_work is completed (which is admin
>>> I/O dependent, and may get stuck until the admin timeout, which the user
>>> can change to 60 minutes or something arbitrarily long, btw).
>>>
>>> How about something like this patch instead:
>>> --
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index c2697db59109..74f3ad16e812 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -327,8 +327,10 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
>>>          bool run_queue = false;
>>>
>>>          spin_lock_irqsave(&q->queue_lock, flags);
>>> -       if (WARN_ON_ONCE(q->quiesce_depth <= 0)) {
>>> -               ;
>>> +       if (q->quiesce_depth <= 0) {
>>> +               printk(KERN_DEBUG
>>> +                      "dev %s: unquiescing a non-quiesced queue, expected?\n",
>>> +                      q->disk ? q->disk->disk_name : "?");
>>>          } else if (!--q->quiesce_depth) {
>>>                  blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
>>>                  run_queue = true;
>>> --
>> The WARN was introduced by commit e70feb8b3e68 ("blk-mq: support
>> concurrent queue quiesce/unquiesce"), which Ming authored. Ming, may I
>> ask for your comment on Sagi's suggestion?
> I think it is bad to use a standard block layer API in this unbalanced way;
> that is why the WARN_ON_ONCE() was added. We shouldn't encourage drivers to
> use APIs in this way.
>
> The question is: why does nvme have to unquiesce a non-quiesced queue?

It started before quiesce/unquiesce became an API that had to be balanced.

In this case, we are using the tagset quiesce/unquiesce interface, which
iterates over the request queues belonging to the tagset. The problem is
that, due to namespace scanning, we may add new namespaces (and their
request queues) after we quiesced the tagset. Then, when we call tagset
unquiesce, we have a new request queue that wasn't quiesced (it was added
after the quiesce).
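To make the ordering concrete, here is a rough timeline sketch of the race,
assuming the block layer's blk_mq_quiesce_tagset()/blk_mq_unquiesce_tagset()
helpers (the nvme-side names here are illustrative, not the exact call chain):
--
/*
 * reset path                          scan_work
 * ----------                          ---------
 * blk_mq_quiesce_tagset(set)
 *   quiesces every queue currently
 *   linked in set->tag_list
 *                                     nvme_alloc_ns()
 *                                       allocates a disk/request queue on
 *                                       the same tagset; the new queue
 *                                       joins set->tag_list with
 *                                       quiesce_depth == 0
 * blk_mq_unquiesce_tagset(set)
 *   walks set->tag_list, which now
 *   includes the new queue, and calls
 *   blk_mq_unquiesce_queue() on a
 *   queue with quiesce_depth == 0,
 *   tripping the WARN_ON_ONCE()
 */
--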

I don't mind having a BLK_FEAT_ALLOW_UNBALANCED_QUIESCE for the nvme request
queues if you don't want to blindly remove the WARN_ON.
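For illustration, a minimal sketch of what that opt-in could look like,
assuming it would be a new blk_features_t flag checked in
blk_mq_unquiesce_queue() (the flag name, bit value, and placement are
hypothetical):
--
/* include/linux/blkdev.h (hypothetical new feature flag) */
#define BLK_FEAT_ALLOW_UNBALANCED_QUIESCE \
        ((__force blk_features_t)(1u << 16))

/* block/blk-mq.c */
void blk_mq_unquiesce_queue(struct request_queue *q)
{
        unsigned long flags;
        bool run_queue = false;

        spin_lock_irqsave(&q->queue_lock, flags);
        if (q->quiesce_depth <= 0) {
                /* Only warn for drivers that did not opt in. */
                WARN_ON_ONCE(!(q->limits.features &
                               BLK_FEAT_ALLOW_UNBALANCED_QUIESCE));
        } else if (!--q->quiesce_depth) {
                blk_queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
                run_queue = true;
        }
        spin_unlock_irqrestore(&q->queue_lock, flags);

        /* Dispatch any requests that were held while quiesced. */
        if (run_queue)
                blk_mq_run_hw_queues(q, true);
}
--
nvme would then set the flag in the queue limits of its request queues, so
the unbalanced unquiesce stays silent for nvme while other drivers keep the
warning.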


