nvme-fabrics: devices are uninterruptable
Belanger, Martin
Martin.Belanger at dell.com
Wed Jan 11 06:37:58 PST 2023
POSIX.1 specifies that certain functions such as read() or write() can act as cancellation points.
Ref: https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
Cancellation point functions can be forced to terminate before completion. Typically, delivering a signal to a process/thread causes such a function to return immediately with an error (e.g. -1) and with errno set to EINTR. For example, if a read() is blocked on a socket and the process/thread receives a signal, read() returns -1 and errno is set to EINTR. At this point the process/thread can either ignore errno==EINTR and resume the read() operation, or decide to exit() if the signal received is of a specific type such as SIGINT (CTRL-C) or SIGTERM. To do that, the process/thread can install a signal handler that caches the signal type received, so that when read() returns with errno==EINTR it can check which signal was received and act accordingly.
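To illustrate the pattern described above, here is a minimal sketch in C. It is not taken from any nvme tool: the function names (cache_signal, install_handler, read_until_told_to_stop, demo_interrupted_read) are hypothetical, and a pipe stands in for the socket. It assumes the handler is installed with sigaction() without SA_RESTART, since otherwise the kernel may restart read() rather than return EINTR.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Last signal seen by the handler; 0 means none so far. */
static volatile sig_atomic_t last_signal = 0;

/* Handler: only cache the signal number, decide what to do later. */
static void cache_signal(int signo)
{
    last_signal = signo;
}

/* Install cache_signal() for signo WITHOUT SA_RESTART, so that a
 * blocked read() returns -1/EINTR instead of being restarted. */
static int install_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = cache_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, NULL);
}

/* read() wrapper: resume after EINTR unless the cached signal is
 * SIGINT or SIGTERM, in which case give up with errno == EINTR. */
static ssize_t read_until_told_to_stop(int fd, void *buf, size_t len)
{
    for (;;) {
        ssize_t n = read(fd, buf, len);

        if (n >= 0 || errno != EINTR)
            return n;               /* data, EOF, or a real error */
        if (last_signal == SIGINT || last_signal == SIGTERM)
            return -1;              /* caller may now clean up and exit */
        /* any other signal: ignore EINTR and resume the read */
    }
}

/* Demo: block on an empty pipe while a child sends us SIGTERM.
 * Returns the signal that interrupted the read, or -1 otherwise.
 * (The 100 ms delay is assumed to be long enough for the parent
 * to reach read(); a robust program would close that race.) */
static int demo_interrupted_read(void)
{
    int fds[2];
    char byte;
    pid_t child;
    ssize_t n;
    int saved_errno;

    if (pipe(fds) != 0 || install_handler(SIGTERM) != 0)
        return -1;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        struct timespec delay = { 0, 100 * 1000 * 1000 };

        nanosleep(&delay, NULL);
        kill(getppid(), SIGTERM);
        _exit(0);
    }

    n = read_until_told_to_stop(fds[0], &byte, 1);
    saved_errno = errno;

    waitpid(child, NULL, 0);
    close(fds[0]);
    close(fds[1]);

    return (n == -1 && saved_errno == EINTR) ? (int)last_signal : -1;
}
```

Calling demo_interrupted_read() from main() should return SIGTERM when the read is interrupted as described. The same wrapper applied to a write() on /dev/nvme-fabrics would never get the chance to run its EINTR branch if, as reported below, the blocked call does not return when a signal arrives.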
The nvme driver does not seem to allow cancellation points. In other words, processes/threads blocked on a read()/write() associated with an nvme device (e.g. /dev/nvme-fabrics, /sys/class/nvme/nvme0/delete_controller) cannot be interrupted by signals. This can be problematic, especially in the following cases:
1) When scaling to a large number of connections (N), applications may be blocked on /dev/nvme-fabrics for long periods of time. Creating a connection to a controller is typically very fast (milliseconds). However, if connectivity is down (e.g. networking issues), it takes about 3 seconds for the kernel to return with an error message indicating that the connection has failed. Let's say we want to create N=100 connections while connectivity is down. Because /dev/nvme-fabrics only allows one connection request at a time, it will take 3 * N = 300 seconds (5 minutes) before all connection requests get processed by the kernel. If multiple processes/threads request connections in parallel, all but one of them will be blocked trying to write to /dev/nvme-fabrics. And there is no way to stop/cancel a process/thread once it is blocked on /dev/nvme-fabrics. Signals, including SIGKILL, have no effect whatsoever.
2) Similarly, deleting a controller by writing "1" to the "delete_controller" device while connectivity to that controller is down will block the calling process/thread for 1 minute (built-in timeout waiting for a response). While blocked, there is no way to terminate the process/thread. SIGINT (CTRL-C), SIGTERM, or even SIGKILL have no effect.
I wanted to ask the community whether there is a reason the nvme driver does not support POSIX cancellation points. I also wanted to know whether it would be possible to add support for them. Is there a downside to doing so?
Regards,
Martin