nvme: nvme-tcp shutdown when remote is unreachable
Belanger, Martin
Martin.Belanger at dell.com
Tue Aug 31 06:43:49 PDT 2021
Hello linux-nvme community,
I ran into a 1-minute hang while trying to disconnect from a remote (TCP) controller while the network was down. In this particular case, the network had been down for less than the keep-alive timeout (KATO), so the kernel module had not yet detected the outage.
After further investigation, I found that during a disconnect we try to clear the NVME_CC_ENABLE bit (see nvme_disable_ctrl() in host/core.c). However, since the network is down, this write blocks until the default 1-minute timeout expires, so the disconnect operation itself blocks for 1 minute.
Interestingly, nvme_disable_ctrl() defines a 5 sec timeout for reading the status (NVME_REG_CSTS) from the controller. So we have a 1 minute timeout for writing to the controller, but only a 5 sec timeout for reading from the controller.
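As a rough illustration of the sequence described above, here is a small userspace model. It is not the kernel code. The struct, the enum, and model_disable_ctrl() are hypothetical names made up for this sketch, and the 60 s and 5 s values are simply the timeouts observed above. It only shows which timeout bounds the wait in each situation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CC_ENABLE_BIT (1u << 0) /* models the NVME_CC_ENABLE bit */

/* Hypothetical remote states used only by this model. */
enum remote_state {
	REMOTE_OK,             /* write completes, status flips promptly */
	REMOTE_UNREACHABLE,    /* the write to the controller never completes */
	REMOTE_STAYS_READY     /* write completes, status never flips */
};

/* Hypothetical model of the host-side state; not a kernel structure. */
struct disable_model {
	uint32_t cc;               /* shadow copy of the configuration register */
	unsigned write_timeout_s;  /* timeout on the write (observed: 60) */
	unsigned read_timeout_s;   /* timeout on the status poll (observed: 5) */
};

/*
 * Clears the enable bit in the shadow register and returns the worst-case
 * number of seconds the caller stays blocked for the given remote state.
 */
unsigned model_disable_ctrl(struct disable_model *m, enum remote_state st)
{
	assert(m != NULL);

	m->cc &= ~CC_ENABLE_BIT;

	switch (st) {
	case REMOTE_UNREACHABLE:
		/* Blocked in the write; the status poll is never reached. */
		return m->write_timeout_s;
	case REMOTE_STAYS_READY:
		/* Write succeeded; the status poll gives up at its own limit. */
		return m->read_timeout_s;
	case REMOTE_OK:
	default:
		return 0;
	}
}
```

With write_timeout_s = 60 and read_timeout_s = 5, the unreachable case returns 60: the shorter status timeout never gets a chance to apply, because the caller is still waiting on the write.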
My question is: On a disconnect, shouldn't we use a timeout shorter than the default 1 minute when writing to the controller? At the very least, shouldn't the timeout for writing to the controller be the same (5 sec) as the timeout for reading from the controller?
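One way to read the second suggestion is "never wait longer on the write than on the status read". A minimal sketch of that policy follows. disconnect_write_timeout_s() is a hypothetical helper for illustration only, not an existing kernel function, and the values passed in are assumptions taken from the numbers above.

```c
#include <assert.h>

/*
 * Hypothetical helper: choose the write timeout to use on disconnect.
 * default_s     - the default timeout for writes (1 minute in the case above)
 * status_poll_s - the timeout used when reading the status (5 sec above)
 * Returns the smaller of the two, so the write never waits longer than
 * the status read would.
 */
unsigned disconnect_write_timeout_s(unsigned default_s, unsigned status_poll_s)
{
	assert(default_s > 0 && status_poll_s > 0);

	return default_s < status_poll_s ? default_s : status_poll_s;
}
```

For example, disconnect_write_timeout_s(60, 5) yields 5, which would bound the disconnect at 5 seconds instead of 1 minute when the remote is unreachable. If the default were ever configured below the status timeout, the default would still be used.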
Thanks,
Martin Belanger
Engineering Technologist, Dell Inc.