[PATCH] nvme: bound the freeze drain in passthrough commands
Chao S
coshi036 at gmail.com
Tue Jun 23 15:28:49 PDT 2026
Hi Keith, Christoph,
Both questions land on the same point, so one answer below.
Keith wrote:
> The IO timeout callbacks that nvme drivers provide are supposed to
> forcefully reclaim any IO no matter what state the device is in. Is
> that not happening for some reason?
Christoph wrote:
> Note that the blocked message itself is not a problem, but around
> this time we should have done a controller reset and fixed up the
> issue. Does that not happen for your test case?
It did happen. nvme_io_timeout was at the default of 30 seconds; here is
the relevant dmesg from the campaign that produced the hung-task report:
[ 44.220408] nvme nvme0: I/O tag 451 opcode 0x2 (Read) QID 1 timeout, aborting
[ 44.220798] nvme nvme0: I/O tag 819 opcode 0x1 (Write) QID 1 timeout, aborting
[ 44.220820] nvme nvme0: Abort status: 0x0
[ 44.221286] nvme nvme0: Abort status: 0x0
[ 45.591197] nvme nvme0: resetting controller
[ 46.307151] nvme nvme0: IO queues lost
[...100s of silence...]
[144.561596] INFO: task systemd-udevd:134 blocked for more than 123 seconds.
Timeout fires, abort is accepted, reset starts, reset reaches the
"IO queues lost" branch (drivers/nvme/host/pci.c). Then nvme_reset_work
itself blocks at
nvme_mark_namespaces_dead -> blk_mark_disk_dead -> blk_report_disk_dead
-> bdev_mark_dead(bdev, true) -> sync_blockdev -> folio_wait_writeback
i.e. the unconditional sync_blockdev in bdev_mark_dead's bare-bdev
else-branch (block/bdev.c) is itself waiting on the writeback that the
reset was supposed to drain. The reset_work kworker doesn't show in
this dmesg because it crosses the 123s threshold ~25s after the
snapshot console was cut.
So in this report, the IO timeout did its job, but the reset that the
timeout kicks off cannot complete, and nvme_passthru_start (which is
already in nvme_wait_freeze at this point) has no way to back out.
Two reasons I still think the bound in passthru_start is worth applying
on its own merit:
1. The reset path has several ways to fail to drain within
nvme_io_timeout: abort can be rejected, the admin tag for abort
can be unavailable, the controller can be wedged before abort
lands, an in-progress reset can outlast nvme_io_timeout, or (as
here) reset itself can block. Each leaves nvme_passthru_start
waiting forever, holding ctrl->scan_lock + subsys->lock + every
namespace's freeze ref, which then fans out on bd_disk->open_mutex
via any concurrent bdev_open/release or BLKRRPART.
2. The same pattern is already established in the tree. pci shutdown
(drivers/nvme/host/pci.c), nvme-tcp reset, nvme-rdma reset,
nvme-apple, and Daniel Wagner's 2021 nvme-fc series
(20210818120530.130501-1-dwagner at suse.de) all use
nvme_wait_freeze_timeout(NVME_IO_TIMEOUT) for exactly this reason.
nvme_passthru_start is the only userspace-reachable caller still
on the unbounded variant.
Christoph also wrote:
> So not blocking forever sounds useful, but this might break existing
> uses. I guess we could do it based on the O_NONBLOCK flag if people
> really cared.
NVME_IO_TIMEOUT is already the upper bound on how long any submitted I/O
can be synchronously waited on; a freeze drain that legitimately exceeds it
implies the controller isn't doing useful I/O anyway, and
nvme_core.io_timeout=N scales both knobs coherently. Drain on a
healthy system is sub-second.
Gating on O_NONBLOCK is a reasonable fallback if you'd rather keep
the old default, but the nvme ioctl path doesn't currently consult
fd flags, so it would be a new userspace contract. My concern with
the current default is that any of the scenarios above lets a
userspace ioctl wedge a kthread with two ctrl-wide mutexes held; the
+9-line bound prevents that without changing the success path.
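
To make the O_NONBLOCK alternative concrete, the policy could be modeled
roughly as below. This is a hypothetical sketch only: model_freeze_timeout_ms
and MODEL_WAIT_FOREVER are made-up names, and nothing here reflects how the
nvme ioctl path is actually wired today (as noted, it does not consult fd
flags).

```c
#include <assert.h>
#include <fcntl.h>

#define MODEL_WAIT_FOREVER (-1L)	/* hypothetical sentinel: no bound */

/*
 * Pick the freeze-drain bound from the flags the fd was opened with:
 * O_NONBLOCK callers get the I/O timeout as a bound, everyone else keeps
 * the old unbounded behaviour.
 */
static long model_freeze_timeout_ms(int open_flags, long io_timeout_ms)
{
	assert(io_timeout_ms > 0);
	return (open_flags & O_NONBLOCK) ? io_timeout_ms : MODEL_WAIT_FOREVER;
}
```

With this shape, only callers that opt in see a changed failure mode, at the
cost of leaving the default path able to wait forever.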
If the maintainer preference is the O_NONBLOCK gate instead, I'll
respin.
Thanks,
Chao
On Wed, May 27, 2026 at 11:46 AM Keith Busch <kbusch at kernel.org> wrote:
>
> On Wed, May 27, 2026 at 01:59:23AM -0400, Chao Shi wrote:
> > If a completion is silently dropped or the device hangs, the calling
> > task wedges with ctrl->scan_lock and ctrl->subsys->lock held, fanning
> > out into hung-task reports on any concurrent open/close/passthru on
> > the same controller:
>
> The IO timeout callbacks that nvme drivers provide are supposed to
> forcefully reclaim any IO no matter what state the device is in. Is that
> not happening for some reason?