[PATCH 1/2] nvme-tcp: avoid race between nvme scan and reset

Tue Jun 3 03:58:33 PDT 2025

On 02/06/2025 7:35, Shin'ichiro Kawasaki wrote:
> When the nvme scan work and the nvme controller reset work race, the
> WARN below happens:
>
>    WARNING: CPU: 1 PID: 69 at block/blk-mq.c:330 blk_mq_unquiesce_queue+0x8f/0xb0
>
> The WARN can be recreated by repeating the blktests test case nvme/063 a
> few times [1]. Its cause is the new queue allocation for the tag set
> by the scan work between blk_mq_quiesce_tagset() and
> blk_mq_unquiesce_tagset() calls by the reset work.
>
> Reset work                     Scan work
> ------------------------------------------------------------------------
> nvme_reset_ctrl_work()
>   nvme_tcp_teardown_ctrl()
>    nvme_tcp_teardown_io_queues()
>     nvme_quiesce_io_queues()
>      blk_mq_quiesce_tagset()
>       list_for_each_entry()                                 .. N queues
>        blk_mq_quiesce_queue()
>                                 nvme_scan_work()
>                                  nvme_scan_ns_*()
>                                   nvme_scan_ns()
>                                    nvme_alloc_ns()
>                                     blk_mq_alloc_disk()
>                                      __blk_mq_alloc_disk()
>                                       blk_mq_alloc_queue()
>                                        blk_mq_init_allocate_queue()
>                                         blk_mq_add_queue_tag_set()
>                                          list_add_tail()    .. N+1 queues
>   nvme_tcp_setup_ctrl()
>    nvme_start_ctrl()
>     nvme_unquiesce_io_queues()
>      blk_mq_unquiesce_tagset()
>       list_for_each_entry()                                 .. N+1 queues
>        blk_mq_unquiesce_queue()
>         WARN_ON_ONCE(q->quiesce_depth <= 0)
>
> blk_mq_quiesce_queue() is not called for the new queue added by the scan
> work, while blk_mq_unquiesce_queue() is called for it. Hence the WARN.
>
> To suppress the WARN, avoid the race between the reset work and the scan
> work by flushing the scan work at the beginning of the reset work.
>
> Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1]
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki at wdc.com>
> ---
>   drivers/nvme/host/tcp.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index f6379aa33d77..14b9d67329b3 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2491,6 +2491,9 @@ static void nvme_reset_ctrl_work(struct work_struct *work)
>   		container_of(work, struct nvme_ctrl, reset_work);
>   	int ret;
>   
> +	/* prevent racing with ns scanning */
> +	flush_work(&ctrl->scan_work);

This is a problem. We are flushing a work that is IO bound, which can 
take a long time to complete.
Up until now, we have deliberately avoided introducing dependency 
between reset forward progress
and scan work IO to completely finish.

I would like to keep it this way.

BTW, this is not TCP specific.