nvme-tcp crashes the system when overloading the backend device.
Sagi Grimberg
sagi at grimberg.me
Wed Sep 1 07:47:12 PDT 2021
> Hi Sagi,
>
> I can reproduce this problem with any recent kernel.
> At least all these kernels I tested suffer from the problem: 5.10.40, 5.10.57, 5.14-rc4 as well as SuSE SLES15-SP2 with kernel 5.3.18-24.37-default.
> On the initiator I use Ubuntu 20.04 LTS with kernel 5.10.0-1019.
Thanks.
>> Is it possible to check if the R5 device has inflight commands? If not,
>> there is some race condition or misaccounting that prevents an orderly
>> shutdown of the queues.
>
> I will double check; however, I don't think that the underlying device is the problem.
> The exact same test passes with the nvmet-rdma target.
> It only fails with the nvmet-tcp target driver.
OK, that is useful information.
>
> As far as I can tell, I exhaust the budget in nvmet_tcp_io_work and requeue:
>
>         } while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);
>
>         /*
>          * Requeue the worker if idle deadline period is in progress or any
>          * ops activity was recorded during the do-while loop above.
>          */
>         if (nvmet_tcp_check_queue_deadline(queue, ops) || pending)
>                 queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
>
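For context, ops is accumulated inside that do-while loop by the receive and
send passes. As a rough sketch, paraphrased from the upstream driver (the
exact shape differs a bit between the kernel versions listed above), the loop
looks something like this:

        do {
                pending = false;

                /* process incoming PDUs, bounded by NVMET_TCP_RECV_BUDGET */
                ret = nvmet_tcp_try_recv(queue, NVMET_TCP_RECV_BUDGET, &ops);
                if (ret > 0)
                        pending = true;
                else if (ret < 0)
                        return;

                /* push out queued responses/data, bounded by NVMET_TCP_SEND_BUDGET */
                ret = nvmet_tcp_try_send(queue, NVMET_TCP_SEND_BUDGET, &ops);
                if (ret > 0)
                        pending = true;
                else if (ret < 0)
                        return;
        } while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);

So with a backend slow enough that work is always pending, each invocation can
exhaust the budget and requeue itself, which matches the "exhausted budget"
message you added below.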
> I added pr_info statements in the code to determine what is going on:
> 2021-09-01T07:15:26.944067-06:00 gold kernel: [ 5502.786914] nvmet_tcp: MARK exhausted budget: ret = 0, ops = 71
> 2021-09-01T07:15:26.944070-06:00 gold kernel: [ 5502.787455] nvmet: ctrl 49 keep-alive timer (15 seconds) expired!
> 2021-09-01T07:15:26.944072-06:00 gold kernel: [ 5502.787461] nvmet: ctrl 49 fatal error occurred!
>
> Shortly after, the routine nvmet_fatal_error_handler gets triggered:
> static void nvmet_fatal_error_handler(struct work_struct *work)
> {
>         struct nvmet_ctrl *ctrl =
>                 container_of(work, struct nvmet_ctrl, fatal_err_work);
>
>         pr_err("ctrl %d fatal error occurred!\n", ctrl->cntlid);
>         ctrl->ops->delete_ctrl(ctrl);
> }
>
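For reference, the path into that handler is the keep-alive work in the nvmet
core: when KATO expires without a keep-alive being processed, the controller
is marked fatal and fatal_err_work is scheduled. Roughly, paraphrased and
simplified (details may differ between versions):

        static void nvmet_keep_alive_timer(struct work_struct *work)
        {
                struct nvmet_ctrl *ctrl = container_of(to_delayed_work(work),
                                struct nvmet_ctrl, ka_work);

                pr_err("ctrl %d keep-alive timer (%d seconds) expired!\n",
                        ctrl->cntlid, ctrl->kato);

                nvmet_ctrl_fatal_error(ctrl);
        }

        void nvmet_ctrl_fatal_error(struct nvmet_ctrl *ctrl)
        {
                mutex_lock(&ctrl->lock);
                if (!(ctrl->csts & NVME_CSTS_CFS)) {
                        ctrl->csts |= NVME_CSTS_CFS;    /* controller fatal status */
                        schedule_work(&ctrl->fatal_err_work);
                }
                mutex_unlock(&ctrl->lock);
        }

That would be consistent with the log above: the queue is busy enough that the
host's keep-alive is not serviced before KATO expires, and the resulting fatal
error then tries to tear the controller down via delete_ctrl while io_work is
still requeueing itself.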
> Some of the nvmet_tcp_wq workers now keep running, and the number of workers keeps increasing.
> root 3686 3.3 0.0 0 0 ? I< 07:31 0:29 [kworker/11:0H-nvmet_tcp_wq]
> root 3689 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/25:0H-nvmet_tcp_wq]
> root 3695 12.0 0.0 0 0 ? I< 07:31 1:43 [kworker/55:3H-nvmet_tcp_wq]
> root 3699 5.0 0.0 0 0 ? I< 07:31 0:43 [kworker/38:1H-nvmet_tcp_wq]
> root 3704 11.5 0.0 0 0 ? I< 07:31 1:39 [kworker/21:0H-nvmet_tcp_wq]
> root 3708 12.1 0.0 0 0 ? I< 07:31 1:44 [kworker/31:0H-nvmet_tcp_wq]
>
> "nvmetcli clear" will no longer return after this and when you keep the initiators running the system eventually crashes.
>
OK, so maybe some more information can help. When you first reproduce this,
I would dump all the threads in the system to dmesg.
So, could you do the following:
1. reproduce the hang
2. nvmetcli clear
3. echo t > /proc/sysrq-trigger
and share the dmesg output with us?