[PATCH] nvme: bound the freeze drain in passthrough commands

Chao S coshi036 at gmail.com
Tue Jun 23 15:28:49 PDT 2026


Hi Keith, Christoph,

Both questions land on the same point, so one answer below.

Keith wrote:
> The IO timeout callbacks that nvme drivers provide are supposed to
> forcefully reclaim any IO no matter what state the device is in. Is
> that not happening for some reason?

Christoph wrote:
> Note that the blocked message itself is not a problem, but around
> this time we should have done a controller reset and fixed up the
> issue.  Does that not happen for your test case?

It did happen.  nvme_io_timeout was at its default of 30 seconds.  The
relevant dmesg from the campaign that produced the hung-task report:

  [ 44.220408] nvme nvme0: I/O tag 451 opcode 0x2 (Read) QID 1 timeout, aborting
  [ 44.220798] nvme nvme0: I/O tag 819 opcode 0x1 (Write) QID 1 timeout, aborting
  [ 44.220820] nvme nvme0: Abort status: 0x0
  [ 44.221286] nvme nvme0: Abort status: 0x0
  [ 45.591197] nvme nvme0: resetting controller
  [ 46.307151] nvme nvme0: IO queues lost
  [...100s of silence...]
  [144.561596] INFO: task systemd-udevd:134 blocked for more than 123 seconds.

Timeout fires, abort is accepted, reset starts, reset reaches the
"IO queues lost" branch (drivers/nvme/host/pci.c).  Then nvme_reset_work
itself blocks at

  nvme_mark_namespaces_dead -> blk_mark_disk_dead -> blk_report_disk_dead
    -> bdev_mark_dead(bdev, true) -> sync_blockdev -> folio_wait_writeback

i.e. the unconditional sync_blockdev in bdev_mark_dead's bare-bdev
else-branch (block/bdev.c) is itself waiting on the writeback that the
reset was supposed to drain.  The reset_work kworker doesn't show in
this dmesg because it crosses the 123s threshold ~25s after the
snapshot console was cut.

So in this report, the IO timeout did its job, but the reset that the
timeout kicks off cannot complete, and nvme_passthru_start (which is
already in nvme_wait_freeze at this point) has no way to back out.

Two reasons I still think the bound in passthru_start is worth applying
on its own merit:

1. The reset path has several ways to fail to drain within
   nvme_io_timeout: abort can be rejected, the admin tag for abort
   can be unavailable, the controller can be wedged before abort
   lands, an in-progress reset can outlast nvme_io_timeout, or (as
   here) reset itself can block.  Each leaves nvme_passthru_start
   waiting forever, holding ctrl->scan_lock + subsys->lock + every
   namespace's freeze ref, which then fans out on bd_disk->open_mutex
   via any concurrent bdev_open/release or BLKRRPART.

2. The same pattern is already established in the tree.  pci shutdown
   (drivers/nvme/host/pci.c), nvme-tcp reset, nvme-rdma reset,
   nvme-apple, and Daniel Wagner's 2021 nvme-fc series
   (20210818120530.130501-1-dwagner at suse.de) all use
   nvme_wait_freeze_timeout(NVME_IO_TIMEOUT) for exactly this reason.
   nvme_passthru_start is the only userspace-reachable caller still
   on the unbounded variant.

Christoph also wrote:
> So not blocking forever sounds useful, but this might break existing
> uses.  I guess we could do it based on the O_NONBLOCK flag if people
> really cared.

NVME_IO_TIMEOUT already bounds how long any submitted I/O can be
waited on synchronously; a freeze drain that legitimately exceeds it
implies the controller isn't doing useful I/O anyway, and
nvme_core.io_timeout=N scales both knobs coherently.  Drain on a
healthy system is sub-second.

Gating on O_NONBLOCK is a reasonable fallback if you'd rather keep
the old default, but the nvme ioctl path doesn't currently consult
fd flags, so it would be a new userspace contract.  My concern with
the current default is that any of the scenarios above lets a
userspace ioctl wedge a kthread with two ctrl-wide mutexes held; the
+9-line bound prevents that without changing the success path.

If the maintainer preference is the O_NONBLOCK gate instead, I'll
respin.

Thanks,
Chao

On Wed, May 27, 2026 at 11:46 AM Keith Busch <kbusch at kernel.org> wrote:
>
> On Wed, May 27, 2026 at 01:59:23AM -0400, Chao Shi wrote:
> > If a completion is silently dropped or the device hangs, the calling
> > task wedges with ctrl->scan_lock and ctrl->subsys->lock held, fanning
> > out into hung-task reports on any concurrent open/close/passthru on
> > the same controller:
>
> The IO timeout callbacks that nvme drivers provide are supposed to
> forcefully reclaim any IO no matter what state the device is in. Is that
> not happening for some reason?


