[RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting

Tue Jun 6 07:32:45 PDT 2023

Hi grimberg, I have read Ming's patch. It seems that Ming fixes a case
my patchset missed: Ming mainly fixes the hang when reconnect fails.
My patchset fixes the case where, while error recovery or reconnect is
in progress (max retries not yet reached), the user actively removes
the ctrl (nvme disconnect). This interrupts error recovery or
reconnect, but the ctrl is still frozen and the request queue is still
quiesced, so new IO and timed-out IOs cannot make progress. As a
result, nvme_remove_namespaces() hangs in flush_work(&ctrl->scan_work)
or blk_mq_freeze_queue_wait(), and new IO hangs in
__bio_queue_enter(). It seems that if the first patch adds the
following code, it may cover Ming's case as well:

static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
{
    /* If we are resetting/deleting then do nothing */
    if (ctrl->state != NVME_CTRL_CONNECTING) {
        WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
                     ctrl->state == NVME_CTRL_LIVE);
        return;
    }

    if (nvmf_should_reconnect(ctrl)) {
        dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
            ctrl->opts->reconnect_delay);
        queue_delayed_work(nvme_wq, &to_tcp_ctrl(ctrl)->connect_work,
            ctrl->opts->reconnect_delay * HZ);
    } else {
        dev_info(ctrl->device, "Removing controller...\n");
        nvme_delete_ctrl(ctrl);
+       nvme_ctrl_reconnect_exit(ctrl);
    }
}
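
To make the scenario above easier to follow, here is a rough user-space
toy model of it. This is only a sketch and not kernel code: every name
in it (toy_ctrl, toy_start_recovery, toy_recovery_exit, ...) is
hypothetical, and it assumes that the cleanup step simply undoes the
freeze and the quiesce, which is a simplification of whatever the real
exit path has to do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the controller states involved. */
enum toy_state { TOY_LIVE, TOY_CONNECTING, TOY_DELETING };

struct toy_ctrl {
    enum toy_state state;
    bool frozen;    /* models a frozen queue: new IO waits at queue entry */
    bool quiesced;  /* models a quiesced queue: requeued IO is not dispatched */
};

/* Error recovery begins: freeze and quiesce, then wait for reconnect. */
void toy_start_recovery(struct toy_ctrl *c)
{
    assert(c->state == TOY_LIVE);
    c->frozen = true;
    c->quiesced = true;
    c->state = TOY_CONNECTING;
}

/* The user disconnects while recovery is still in flight. */
void toy_user_delete(struct toy_ctrl *c)
{
    c->state = TOY_DELETING;
}

/*
 * Reconnect attempt. Returns true if it ran and restored the queues.
 * If the state is no longer CONNECTING it bails out early and leaves
 * the queues frozen and quiesced.
 */
bool toy_reconnect(struct toy_ctrl *c)
{
    if (c->state != TOY_CONNECTING)
        return false;
    c->frozen = false;
    c->quiesced = false;
    c->state = TOY_LIVE;
    return true;
}

/* Assumed cleanup on the interrupted path: undo freeze and quiesce. */
void toy_recovery_exit(struct toy_ctrl *c)
{
    c->frozen = false;
    c->quiesced = false;
}

/* IO issued by scan work can only complete if neither flag is set. */
bool toy_io_can_complete(const struct toy_ctrl *c)
{
    return !c->frozen && !c->quiesced;
}
```

In this model, calling toy_start_recovery(), then toy_user_delete(),
then toy_reconnect() leaves toy_io_can_complete() false, which
corresponds to the hang. Only after toy_recovery_exit() does it become
true again.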

Thanks.

On Tue, Jun 6, 2023 at 07:09, Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> > From: Chunguang Xu <chunguang.xu at shopee.com>
> >
> > We found that nvme_remove_namespaces() may hang in flush_work(&ctrl->scan_work)
> > while removing the ctrl. The root cause may be that the state of the ctrl changes to
> > NVME_CTRL_DELETING while removing the ctrl, which interrupts nvme_tcp_error_recovery_work()/
> > nvme_reset_ctrl_work()/nvme_tcp_reconnect_or_remove(). At this time, the ctrl is
> > frozen and the queue is quiesced. Since scan_work may continue to issue IOs to
> > load the partition table, it gets blocked, which leads to nvme_tcp_error_recovery_work()
> > hanging in flush_work(&ctrl->scan_work).
> >
> > After analysis, we found that there are mainly two cases:
> > 1. Since the ctrl is frozen, scan_work hangs in __bio_queue_enter() while it issues
> >     new IO to load the partition table.
> > 2. Since the queue is quiesced, requeued timed-out IO may hang in the hctx->dispatch
> >     queue, leaving scan_work waiting for IO completion.
>
> Hey, can you please look at the discussion with Ming's proposal in
> "nvme: add nvme_delete_dead_ctrl for avoiding io deadlock" ?
>
Hi grimberg, I have looked at Ming's patch, I think we may fix
> Looks the same to me.