nvme-tcp crashes the system when overloading the backend device.

Mark Ruijter mruijter at primelogic.nl
Wed Sep 1 07:36:03 PDT 2021


Hi Sagi,

I can reproduce this problem with any recent kernel.
At least all these kernels I tested suffer from the problem: 5.10.40, 5.10.57, 5.14-rc4 as well as SuSE SLES15-SP2 with kernel 5.3.18-24.37-default.
On the initiator I use Ubuntu 20.04 LTS with kernel 5.10.0-1019.

>> Is it possible to check if the R5 device has inflight commands? if not
>> there is some race condition or misaccounting that prevents an orderly
>> shutdown of the queues.

I will double check; however, I don't think that the underlying device is the problem.
The exact same test passes with the nvmet-rdma target. 
It only fails with the nvmet-tcp target driver.

As far as I can tell, I exhaust the budget in nvmet_tcp_io_work and requeue:

        } while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);

        /*
         * Requeue the worker if idle deadline period is in progress or any
         * ops activity was recorded during the do-while loop above.
         */
        if (nvmet_tcp_check_queue_deadline(queue, ops) || pending)
                queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);

I added pr_info statements in the code to determine what is going on:  
2021-09-01T07:15:26.944067-06:00 gold kernel: [ 5502.786914] nvmet_tcp: MARK exhausted budget: ret = 0, ops = 71
2021-09-01T07:15:26.944070-06:00 gold kernel: [ 5502.787455] nvmet: ctrl 49 keep-alive timer (15 seconds) expired!
2021-09-01T07:15:26.944072-06:00 gold kernel: [ 5502.787461] nvmet: ctrl 49 fatal error occurred!

Shortly after, the routine nvmet_fatal_error_handler gets triggered:
static void nvmet_fatal_error_handler(struct work_struct *work)
{
        struct nvmet_ctrl *ctrl =
                        container_of(work, struct nvmet_ctrl, fatal_err_work);

        pr_err("ctrl %d fatal error occurred!\n", ctrl->cntlid);
        ctrl->ops->delete_ctrl(ctrl);
}

Some of the nvmet_tcp_wq workers now keep running, and the number of workers keeps increasing:
root      3686  3.3  0.0      0     0 ?        I<   07:31   0:29 [kworker/11:0H-nvmet_tcp_wq]
root      3689 12.0  0.0      0     0 ?        I<   07:31   1:43 [kworker/25:0H-nvmet_tcp_wq]
root      3695 12.0  0.0      0     0 ?        I<   07:31   1:43 [kworker/55:3H-nvmet_tcp_wq]
root      3699  5.0  0.0      0     0 ?        I<   07:31   0:43 [kworker/38:1H-nvmet_tcp_wq]
root      3704 11.5  0.0      0     0 ?        I<   07:31   1:39 [kworker/21:0H-nvmet_tcp_wq]
root      3708 12.1  0.0      0     0 ?        I<   07:31   1:44 [kworker/31:0H-nvmet_tcp_wq]

"nvmetcli clear" no longer returns after this, and if you keep the initiators running, the system eventually crashes.

--Mark


On 01/09/2021, 14:49, "Sagi Grimberg" <sagi at grimberg.me> wrote:


    > Hi all,
    > 
    > I can consistently crash a system when I sufficiently overload the nvme-tcp target.
    > The easiest way to reproduce the problem is by creating a raid5.
    > 
    > While this R5 is resyncing, export it with the nvmet-tcp target driver and start a high queue-depth 4K random fio workload from the initiator.
    > At some point the target system will start logging these messages:
    > [ 2865.725069] nvmet: ctrl 238 keep-alive timer (15 seconds) expired!
    > [ 2865.725072] nvmet: ctrl 236 keep-alive timer (15 seconds) expired!
    > [ 2865.725075] nvmet: ctrl 238 fatal error occurred!
    > [ 2865.725076] nvmet: ctrl 236 fatal error occurred!
    > [ 2865.725080] nvmet: ctrl 237 keep-alive timer (15 seconds) expired!
    > [ 2865.725083] nvmet: ctrl 237 fatal error occurred!
    > [ 2865.725087] nvmet: ctrl 235 keep-alive timer (15 seconds) expired!
    > [ 2865.725094] nvmet: ctrl 235 fatal error occurred!

    This is the target not getting keep-alive commands from the initiator, 
    hence it triggers a controller removal. BTW, what kernel version is this?

    While tearing down the controller queues, the target waits for all the
    inflight commands to complete and drop the final reference on the
    queue. I'm wondering if something is preventing these inflight
    commands from getting completed by the backend R5 device.

    > 
    > Even when you stop all IO from the initiator some of the nvmet_tcp_wq workers will keep running forever.

    That is because the target is waiting for the commands to complete, but 
    they don't.

    > Eventually the system runs out of resources.
    > At some point the system will reach a workload of 2000+ and crash.
    > 
    > So far, I have been unable to determine why the number of nvmet_tcp_wq workers keeps increasing.
    > It must be that each stalled worker gets replaced by a new worker without the old one being terminated.

    Is it possible to check if the R5 device has inflight commands? If not, 
    there is some race condition or misaccounting that prevents an orderly 
    shutdown of the queues.


