nvme-tcp crashes the system when overloading the backend device.
Mark Ruijter
mruijter at primelogic.nl
Tue Aug 31 06:30:51 PDT 2021
Hi all,
I can consistently crash a system when I sufficiently overload the nvme-tcp target.
The easiest way to reproduce the problem is to create a raid5 array.
While this R5 is resyncing, export it with the nvmet-tcp target driver and start a high queue-depth 4K random fio workload from the initiator.
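Roughly, the setup looks like this; device names, the IP address and the NQN below are placeholders, and the fio parameters only matter insofar as they saturate the resyncing array:

# target side: create the raid5 and export it via nvmet-tcp (configfs)
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[bcde]
modprobe nvmet-tcp
cd /sys/kernel/config/nvmet
mkdir subsystems/testnqn
echo 1 > subsystems/testnqn/attr_allow_any_host
mkdir subsystems/testnqn/namespaces/1
echo /dev/md0 > subsystems/testnqn/namespaces/1/device_path
echo 1 > subsystems/testnqn/namespaces/1/enable
mkdir ports/1
echo tcp > ports/1/addr_trtype
echo ipv4 > ports/1/addr_adrfam
echo 192.0.2.10 > ports/1/addr_traddr
echo 4420 > ports/1/addr_trsvcid
ln -s /sys/kernel/config/nvmet/subsystems/testnqn ports/1/subsystems/testnqn

# initiator side: connect and hammer the namespace with 4K random reads
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n testnqn
fio --name=overload --filename=/dev/nvme1n1 --rw=randread --bs=4k \
    --iodepth=128 --numjobs=16 --ioengine=libaio --direct=1 \
    --time_based --runtime=600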
At some point the target system will start logging these messages:
[ 2865.725069] nvmet: ctrl 238 keep-alive timer (15 seconds) expired!
[ 2865.725072] nvmet: ctrl 236 keep-alive timer (15 seconds) expired!
[ 2865.725075] nvmet: ctrl 238 fatal error occurred!
[ 2865.725076] nvmet: ctrl 236 fatal error occurred!
[ 2865.725080] nvmet: ctrl 237 keep-alive timer (15 seconds) expired!
[ 2865.725083] nvmet: ctrl 237 fatal error occurred!
[ 2865.725087] nvmet: ctrl 235 keep-alive timer (15 seconds) expired!
[ 2865.725094] nvmet: ctrl 235 fatal error occurred!
Even when you stop all I/O from the initiator, some of the nvmet_tcp_wq workers keep running forever.
The load shown with "top" never returns to the normal idle level.
root 5669 1.1 0.0 0 0 ? D< 03:39 0:09 [kworker/22:2H+nvmet_tcp_wq]
root 5670 0.8 0.0 0 0 ? D< 03:39 0:06 [kworker/55:2H+nvmet_tcp_wq]
root 5676 0.2 0.0 0 0 ? D< 03:39 0:01 [kworker/29:2H+nvmet_tcp_wq]
root 5677 12.2 0.0 0 0 ? D< 03:39 1:35 [kworker/59:2H+nvmet_tcp_wq]
root 5679 5.7 0.0 0 0 ? D< 03:39 0:44 [kworker/27:2H+nvmet_tcp_wq]
root 5680 2.9 0.0 0 0 ? I< 03:39 0:23 [kworker/57:2H-nvmet_tcp_wq]
root 5681 1.0 0.0 0 0 ? D< 03:39 0:08 [kworker/60:2H+nvmet_tcp_wq]
root 5682 0.5 0.0 0 0 ? D< 03:39 0:04 [kworker/18:2H+nvmet_tcp_wq]
root 5683 5.8 0.0 0 0 ? D< 03:39 0:45 [kworker/54:2H+nvmet_tcp_wq]
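Where such a stuck worker is blocked can be inspected by dumping its kernel stack, e.g. for the first PID in the listing above, or by dumping all D-state tasks at once via sysrq:

cat /proc/5669/stack          # kernel stack of one stuck kworker
echo w > /proc/sysrq-trigger  # dump all uninterruptible (D state) tasks to dmesg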
The number of running nvmet_tcp_wq workers keeps increasing once you hit the problem:
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | tail -3
41114 ? D< 0:00 [kworker/25:21H+nvmet_tcp_wq]
41152 ? D< 0:00 [kworker/54:25H+nvmet_tcp_wq]
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvme | grep wq | wc -l
500
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvme | grep wq | wc -l
502
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
503
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
505
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
506
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
511
gold:/var/crash/2021-08-26-08:38 # ps ax | grep nvmet_tcp_wq | wc -l
661
Eventually the system runs out of resources.
At some point the load reaches 2000+ and the system crashes.
So far, I have been unable to determine why the number of nvmet_tcp_wq workers keeps increasing.
Presumably each worker that gets stuck is replaced by a new worker without the old one ever being terminated.
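If that is what happens, most of the accumulating workers should be sitting in uninterruptible sleep, which a quick count of the D-state kworkers would confirm (STAT is the third column of "ps ax"):

ps ax | grep '[n]vmet_tcp_wq' | awk '$3 ~ /^D/' | wc -l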
Thanks,
Mark Ruijter