nvme-fabrics: devices are uninterruptable

Martin Wilck mwilck at suse.com
Fri Jan 13 03:26:21 PST 2023


On Wed, 2023-01-11 at 14:37 +0000, Belanger, Martin wrote:
> POSIX.1 specifies that certain functions such as read() or write()
> can act as cancellation points.
> 
> Ref:
> https://pubs.opengroup.org/onlinepubs/000095399/functions/xsh_chap02_09.html#tag_02_09_05_02
> 
> Cancellation point functions can be forced to terminate before
> completion.

I think you are confusing things here. The page you mention is about
pthreads. pthread cancellation points are points at which a
pthread_cancel() call from another thread will interrupt a thread that
is using PTHREAD_CANCEL_DEFERRED cancellability, and nothing more. The
"cancellation point" logic applies *only* to the specific signal that
is used for implementing pthread_cancel(). It has nothing to do with
the cancellation of I/O requests. The spec says nothing about the
semantics of cancelling I/O system calls. Usually the thread
cancellation will occur either before entering or after returning from
the system call, rather than interrupting it. The general semantics of
signal delivery apply.
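
To make the distinction concrete, here is a minimal sketch of what
thread cancellation does, assuming Linux with glibc. A worker thread
blocks in read() on an empty pipe and is then cancelled from another
thread. The names pipe_fds, returned_from_read, worker and
cancel_thread_blocked_in_read are made up for this illustration. A pipe
is used only because it is easy to block on; the sketch says nothing
about how an NVMe device behaves.

```c
#include <pthread.h>
#include <time.h>
#include <unistd.h>

static int pipe_fds[2];                 /* hypothetical demo pipe */
static volatile int returned_from_read; /* set only if read() returns */

static void *worker(void *arg)
{
    char buf;
    int oldtype;
    ssize_t n;

    (void)arg;
    /* Deferred cancellation is already the default; set it for clarity. */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &oldtype);

    /* read() is a cancellation point; the pipe is empty, so it blocks. */
    n = read(pipe_fds[0], &buf, 1);
    (void)n;

    /* Not reached if the thread is cancelled inside read(). */
    returned_from_read = 1;
    return NULL;
}

/*
 * Returns 1 if the worker was cancelled without read() ever returning
 * to it, 0 otherwise, and -1 if the setup failed.
 */
int cancel_thread_blocked_in_read(void)
{
    pthread_t tid;
    void *res = NULL;
    struct timespec delay = { 0, 100 * 1000 * 1000 }; /* 100 ms */

    if (pipe(pipe_fds) != 0)
        return -1;
    returned_from_read = 0;
    if (pthread_create(&tid, NULL, worker, NULL) != 0) {
        close(pipe_fds[0]);
        close(pipe_fds[1]);
        return -1;
    }

    nanosleep(&delay, NULL);  /* give the worker time to block in read() */
    pthread_cancel(tid);      /* request cancellation of the worker */
    pthread_join(tid, &res);  /* wait until the worker has terminated */

    close(pipe_fds[0]);
    close(pipe_fds[1]);
    return (res == PTHREAD_CANCELED && !returned_from_read) ? 1 : 0;
}
```

Under these assumptions the worker never sees read() fail with EINTR;
the thread is simply terminated and pthread_join() reports
PTHREAD_CANCELED. Older glibc versions need -pthread when linking.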

>  Typically, sending a signal to a process/thread will cause
> cancellation point functions to return immediately with an error
> (e.g. -1) and with errno set to EINTR. [...]
> 
> The nvme driver does not seem to allow cancellation points. In other
> words, processes/threads blocked on read()/write() associated with a
> nvme device (e.g. /dev/nvme-fabrics,
> /sys/class/nvme/nvme0/delete_controller) cannot be interrupted by
> signals. This can be problematic especially for the following cases: 

What you actually want to refer to is (I think) the section about
"Interruption of system calls and library functions by signal handlers"
in signal(7): "If a blocked call to one of the following interfaces
is interrupted by a signal handler, then [...] the call fails with the
error EINTR: ... read(2), readv(2), write(2), writev(2), and ioctl(2)
calls on 'slow' devices." Note that this paragraph goes on to say that
"a (local) disk is not a slow device according to this definition; I/O
operations on disk devices are not interrupted by signals." I assume
the last sentence applies to NVMe disks, too. nvme-fabrics is a
different topic; one could argue it should have socket-like semantics
(and socket IO _is_ interrupted with EINTR, same man page section).

> 1) When scaling to a large number of connections (N), applications
> may be blocked on /dev/nvme-fabrics for long periods of time.
> Creating a connection to a controller is typically very fast (msec).
> However, if connectivity is down (e.g. networking issues) it takes
> about 3 seconds for the kernel to return with an error message
> indicating that the connection has failed. Let's say we want to
> create N=100 connections while connectivity is down. Because
> /dev/nvme-fabrics only allows one connection request at a time, it
> will take 3 * N = 300 seconds (5 minutes) before all connection
> requests get processed by the kernel. If multiple processes/threads
> request connections in parallel, they will all be blocked (except for
> 1) trying to write to /dev/nvme-fabrics. And there is no way to
> stop/cancel a process/thread once it is blocked on /dev/nvme-fabrics.
> Signals, including SIGKILL, have no effect whatsoever.

I think that SIGKILL does have an effect; it will at least turn the
affected process into a zombie. See above for nvme-fabrics.
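
One way to check which state a blocked process is actually in is the
state field of /proc/<pid>/stat (see proc(5)), where 'D' means
uninterruptible sleep and 'Z' means zombie. A minimal, Linux-only
sketch; proc_state is a hypothetical helper name, not an existing API:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the state character from /proc/<pid>/stat, or '?' on failure. */
char proc_state(pid_t pid)
{
    char path[64];
    char line[512];
    char *p;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '?';
    if (fgets(line, sizeof(line), f) == NULL) {
        fclose(f);
        return '?';
    }
    fclose(f);

    /*
     * The line looks like "pid (comm) state ...". comm may itself
     * contain ')' characters, so search for the last one.
     */
    p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}
```

Calling proc_state() with the PID of the stuck writer before and after
sending SIGKILL would show whether the signal changed anything.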


> 2) Similarly, deleting a controller by writing "1" to the
> "delete_controller" device while connectivity to that controller is
> down will block the calling process/thread for 1 minute (built-in
> timeout waiting for a response). While blocked, there is no way to
> terminate the process/thread. SIGINT (CTRL-C), SIGTERM, or even
> SIGKILL have no effect.
> 
> I wanted to ask the community if there is a reason for the nvme
> driver to not support POSIX cancellation points? I also wanted to
> know whether it would be possible to add support for it? Is there a
> downside to doing so? 

To repeat, this has nothing to do with cancellation points.

Martin



