nvme-fabrics: devices are uninterruptable

Fri Jan 13 08:58:50 PST 2023

> On Wed, 2023-01-11 at 14:37 +0000, Belanger, Martin wrote:
> > POSIX.1 specifies that certain functions such as read() or write() can
> > act as cancellation points.
> >
> > Ref:
> > https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
> >
> > Cancellation point functions can be forced to terminate before
> > completion.
> 
> I think you are confusing things here. The page you mention is about pthreads.
> pthread cancellation points are points at which a
> pthread_cancel() call from another will interrupt a thread that is using
> PTHREAD_CANCEL_DEFERRED cancellability, and nothing more. The
> "cancellation point" logic applies *only* to the specific signal that is used for
> implementing pthread_cancel(). It has nothing to do with the cancellation of
> I/O requests. The spec says nothing about the semantics of cancelling I/O
> system calls. Usually the thread cancellation will occur either before entering or
> after returning from the system call, rather than interrupting it. The general
> semantics of signal delivery apply.
Hi Martin. Thanks for your response.

Agreed. I should not have used the "cancellation point" terminology. Instead, I could simply have said that many system calls fail with the EINTR error code if a signal arrives while the call is in progress. I've relied on this in dozens of projects in the past (e.g. when writing to sockets) and it has always worked flawlessly.
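
To illustrate the EINTR behaviour mentioned above, here is a minimal Python sketch (not part of the original thread) of a blocked read() on a pipe being abandoned when a signal arrives. One caveat: since Python 3.5 the interpreter automatically retries system calls that fail with EINTR, so the handler raises an exception to break out. In C the call would simply return -1 with errno set to EINTR, provided the handler was installed without SA_RESTART. The names read_with_deadline and Interrupted are made up for this example.

```python
import os
import signal


class Interrupted(Exception):
    """Raised by the signal handler to abandon a blocked system call."""


def _on_alarm(signum, frame):
    # Raising here stops Python from retrying the interrupted call.
    raise Interrupted()


def read_with_deadline(fd, nbytes, seconds):
    """Return the bytes read from fd, or None if SIGALRM arrived first.

    Must be called from the main thread, since Python only allows
    signal handlers to be installed there.
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return os.read(fd, nbytes)
    except Interrupted:
        return None
    finally:
        # Disarm the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)
```

With an empty pipe whose write end is still open, read_with_deadline(r, 16, 0.2) gives up after roughly 0.2 seconds instead of blocking forever. This is the behaviour one would like to see from a write() that is still waiting its turn on /dev/nvme-fabrics.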

The documentation says that a blocked write() may return a number of bytes less than the specified count if, among other things, the call was interrupted by a signal after it had transferred some, but before it had transferred all the requested bytes.

Similarly, read() can return with a number of bytes smaller than the requested number if read() was interrupted by a signal.
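
The short-count behaviour is the reason stream code usually wraps write() in a loop. Below is a minimal sketch, assuming an ordinary stream file descriptor such as a pipe or a socket; write_all is a hypothetical helper name. The pattern probably does not carry over to /dev/nvme-fabrics as-is, where each write() presumably carries one complete request.

```python
import os


def write_all(fd, data):
    """Write every byte of data to fd, resuming after short writes.

    os.write() returns the number of bytes actually written, which may
    be smaller than len(data). Python itself retries the call when it
    fails with EINTR before transferring anything.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # Continue from the first byte that has not been written yet.
        total += os.write(fd, view[total:])
    return total
```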

I understand that /dev/nvme-fabrics is a special kind of file. It's not a regular file and it's not a socket. However, it should be possible to interrupt a process that is blocked in write() on /dev/nvme-fabrics before any bytes have been written. Once bytes have actually been written and the kernel has started processing the connection request, I recognize that interrupting the write() is not desirable. However, if several processes/threads are blocked in write() on /dev/nvme-fabrics because another process is currently being served, and none of their bytes have been written yet, it is perfectly reasonable to let a signal interrupt the write() so that these processes can exit, no harm done.

Another thing I've been wondering about is why the kernel does not allow multiple connection requests in parallel. It should be possible for multiple processes to write commands to /dev/nvme-fabrics concurrently. I mean, each process needs to open() /dev/nvme-fabrics, which gives each of them its own file descriptor. They can then write() to or read() from that file descriptor independently of what other processes are doing. This would prevent processes from being blocked for long periods of time, as in the 100-connection example I mentioned earlier. In other words, 100 connection requests could be made in parallel, all of them timing out together after about 3 seconds (instead of taking 5 minutes in total).
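
The difference between the two models can be sketched with a small simulation. This is purely illustrative and does not touch the nvme driver: attempt_connection is a hypothetical stand-in that just sleeps for the timeout and reports failure, and the timeout is scaled down so the script finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def attempt_connection(timeout_s):
    """Stand-in for one connection attempt that fails after timeout_s."""
    time.sleep(timeout_s)
    return False


def run_attempts(count, timeout_s, parallel):
    """Return the elapsed seconds for `count` failing attempts."""
    start = time.monotonic()
    if parallel:
        # Every attempt waits out its timeout at the same time.
        with ThreadPoolExecutor(max_workers=count) as pool:
            list(pool.map(attempt_connection, [timeout_s] * count))
    else:
        # One attempt at a time: the total is roughly count * timeout_s.
        for _ in range(count):
            attempt_connection(timeout_s)
    return time.monotonic() - start


if __name__ == "__main__":
    n, t = 10, 0.05  # scaled down from N=100 and 3 seconds
    print(f"serial:   {run_attempts(n, t, parallel=False):.2f} s")
    print(f"parallel: {run_attempts(n, t, parallel=True):.2f} s")
```

The serial total grows linearly with the number of attempts (3 s * 100 = 300 s with the real figures), while the parallel total stays close to a single timeout.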

> 
> >  Typically, sending a signal to a process/thread will cause
> > cancellation point functions to return immediately with an error (e.g.
> > -1) and with errno set to EINTR. [...]
> >
> > The nvme driver does not seem to allow cancellation points. In other
> > words, processes/threads blocked on read()/write() associated with a
> > nvme device (e.g. /dev/nvme-fabrics,
> > /sys/class/nvme/nvme0/delete_controller) cannot be interrupted by
> > signals. This can be problematic especially for the following cases:
> 
> What you actually want to refer to is (I think) the section about "Interruption of
> system calls and library functions by signal handlers"
> in signal(7): "If  a  blocked  call to one of the following interfaces is interrupted
> by a signal handler, then [...] the call fails with the error EINTR: ... read(2),
> readv(2), write(2), writev(2), and ioctl(2) calls on 'slow' devices." Note that this
> paragraph goes on saying that "a (local) disk is not a slow device according to
> this definition; I/O operations on disk devices are not interrupted by signals." I
> assume the last sentence applies to NVMe disks, too. nvme-fabrics is a
> different topic; one could argue it should have socket-like semantics (and
> socket IO _is_ interrupted with EINTR, same man page section).

Exactly!

> 
> > 1) When scaling to a large number of connections (N), applications may
> > be blocked on /dev/nvme-fabrics for long periods of time.
> > Creating a connection to a controller is typically very fast (msec).
> > However, if connectivity is down (e.g. networking issues) it takes
> > about 3 seconds for the kernel to return with an error message
> > indicating that the connection has failed. Let's say we want to create
> > N=100 connections while connectivity is down. Because
> > /dev/nvme-fabrics only allows one connection request at a time, it
> > will take 3 * N = 300 seconds (5 minutes) before all connection
> > requests get processed by the kernel. If multiple processes/threads
> > request connections in parallel, they will all be blocked (except for
> > 1) trying to write to /dev/nvme-fabrics. And there is no way to
> > stop/cancel a process/thread once it is blocked on /dev/nvme-fabrics.
> > Signals, including SIGKILL, have no effect whatsoever.
> 
> I think that SIGKILL does have an effect; it will at least turn the affected process
> into a zombie. See above for nvme-fabrics.

I'll have to check that again. Last time I tried I did not see any effect with "kill -9".
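
One way to check is to look at the process state while it is blocked. On Linux, the third field of /proc/<pid>/stat is a one-letter state code, where D means uninterruptible sleep and Z means zombie. A task in plain uninterruptible sleep generally does not act on signals until the sleep ends, which would match a "kill -9" that appears to do nothing. A small sketch, with hypothetical helper names:

```python
def parse_state(stat_line):
    """Extract the one-letter state code from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or ')', so the split happens after the last ')'.
    """
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def process_state(pid):
    """Return the current state code of process `pid` (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())
```

Calling process_state() on the blocked process before and after sending SIGKILL would show whether it stays in D or moves to Z.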

> 
> 
> > 2) Similarly, deleting a controller by writing "1" to the
> > "delete_controller" device while connectivity to that controller is
> > down will block the calling process/thread for 1 minute (built-in
> > timeout waiting for a response). While blocked, there is no way to
> > terminate the process/thread. SIGINT (CTRL-C), SIGTERM, or even
> > SIGKILL have no effect.
> >
> > I wanted to ask the community if there is a reason for the nvme driver
> > to not support POSIX cancellation points? I also wanted to know
> > whether it would be possible to add support for it? Is there a
> > downside to doing so?
> 
> Repeat, this has nothing to do with cancellation points.
> 
> Martin