[RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting

Ming Lei ming.lei at redhat.com
Tue Jun 6 08:14:56 PDT 2023


Hello Chunguang,

On Mon, May 29, 2023 at 06:59:22PM +0800, brookxu.cn wrote:
> From: Chunguang Xu <chunguang.xu at shopee.com>
> 
> We found that nvme_remove_namespaces() may hang in flush_work(&ctrl->scan_work)
> while removing ctrl. The root cause may be that the state of ctrl changes to
> NVME_CTRL_DELETING while removing ctrl, which interrupts nvme_tcp_error_recovery_work()/
> nvme_reset_ctrl_work()/nvme_tcp_reconnect_or_remove().  At this time, ctrl is

I haven't dug into the ctrl state checks in these error handlers yet, but error
handling is supposed to guarantee forward progress in any controller state.

Can you explain a bit how switching to DELETING interrupts the above
error handling and breaks the forward progress guarantee?

> frozen and the queue is quiesced. Since scan_work may continue to issue IOs to
> load the partition table, it gets blocked, which leads to nvme_tcp_error_recovery_work()
> hanging in flush_work(&ctrl->scan_work).
> 
> After analysis, we found that there are mainly two cases:
> 1. Since ctrl is frozen, scan_work hangs in __bio_queue_enter() while it issues
>    new IO to load the partition table.

Yeah, nvme freeze usage is fragile, and I suggested moving
nvme_start_freeze() from nvme_tcp_teardown_io_queues() to
nvme_tcp_configure_io_queues(), as in the change posted for rdma:

https://lore.kernel.org/linux-block/CAHj4cs-4gQHnp5aiekvJmb6o8qAcb6nLV61uOGFiisCzM49_dg@mail.gmail.com/T/#ma0d6bbfaa0c8c1be79738ff86a2fdcf7582e06b0
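
To illustrate the idea, here is a toy user-space model, not kernel code.
All toy_* names are made up for this sketch, and freeze_depth only stands
in for the nesting count that nvme_start_freeze()/nvme_unfreeze() have to
keep balanced. It assumes the reconnect attempt can fail before the
unfreeze is reached, and shows that the freeze then stays in place with the
old placement but not when the configure step owns both freeze and unfreeze:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a controller (hypothetical, for illustration only). */
struct toy_ctrl {
	int freeze_depth;	/* >0 means new I/O submitters would block */
	bool quiesced;		/* true means dispatch is stopped */
};

static void toy_start_freeze(struct toy_ctrl *c)
{
	c->freeze_depth++;
}

static void toy_unfreeze(struct toy_ctrl *c)
{
	assert(c->freeze_depth > 0);	/* unbalanced unfreeze is a bug */
	c->freeze_depth--;
}

/* Old placement: teardown freezes, only a successful configure unfreezes. */
static void toy_teardown_old(struct toy_ctrl *c)
{
	toy_start_freeze(c);
	c->quiesced = true;
}

static bool toy_configure_old(struct toy_ctrl *c, bool connect_ok)
{
	if (!connect_ok)
		return false;	/* freeze taken in teardown is left behind */
	c->quiesced = false;
	toy_unfreeze(c);
	return true;
}

/* Suggested placement: teardown only quiesces, configure owns the freeze. */
static void toy_teardown_new(struct toy_ctrl *c)
{
	c->quiesced = true;
}

static bool toy_configure_new(struct toy_ctrl *c, bool connect_ok)
{
	if (!connect_ok)
		return false;	/* nothing was frozen, nothing to undo */
	toy_start_freeze(c);
	c->quiesced = false;
	/* the real driver would wait for the freeze and update queues here */
	toy_unfreeze(c);
	return true;
}
```

In this model a failed reconnect leaves freeze_depth at 1 with the old
placement, which is the state where a new submitter such as scan_work would
block on queue entry, while the new placement leaves it at 0.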

> 2. Since the queue is quiesced, requeued timed-out IO may hang in the hctx->dispatch
>    queue, leaving scan_work waiting for IO completion.

That still looks like a problem in the related error handling code, which is
supposed to recover and eventually unquiesce the queue.


Thanks, 
Ming



