[PATCH v2] nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks

Hannes Reinecke hare at suse.de
Thu May 28 23:45:19 PDT 2026


On 5/28/26 17:24, Achkinazi, Igor wrote:
> When nvme_ns_head_submit_bio() remaps a bio from the multipath head to
> a per-path namespace, bio_set_dev() clears BIO_REMAPPED.  The remapped
> bio is then resubmitted through submit_bio_noacct() which calls
> bio_check_eod() because BIO_REMAPPED is not set.
> 
> This races with nvme_ns_remove() which zeroes the per-path capacity
> before synchronize_srcu():
> 
>    CPU 0 (IO submission)
>    ---------------------
>    srcu_read_lock()
>    nvme_find_path() -> ns
>      [NVME_NS_READY is set]
> 
>    CPU 1 (namespace removal)
>    -------------------------
>    clear_bit(NVME_NS_READY)
>    set_capacity(ns->disk, 0)
>    synchronize_srcu()  <- blocks
> 
>    CPU 0 (IO submission)
>    ---------------------
>    bio_set_dev(bio, ns->disk->part0)
>      [clears BIO_REMAPPED]
>    submit_bio_noacct(bio)
>      -> bio_check_eod() sees capacity=0
>      -> bio fails with IO error
> 
> The SRCU read lock prevents synchronize_srcu() from completing, but
> does not prevent set_capacity(0) from executing.  The bio fails the
> EOD check before it reaches the NVMe driver, so nvme_failover_req()
> never gets a chance to redirect it to another path.  IO errors
> are reported to the application despite another path being available.
> 
> On older kernels (before commit 0b64682e78f7 "block: skip unnecessary
> checks for split bio"), the same race was also reachable through split
> remainders resubmitted via submit_bio_noacct().
> 
> Observed during NVMe multipath failover testing at Dell on
> 5.14.0-570.23.1.el9_6.x86_64 (RHEL 9.6) and
> 6.4.0-150600.23.53-default (SLES 15.6).
> 
> Fix this by setting BIO_REMAPPED after bio_set_dev() in
> nvme_ns_head_submit_bio().  This skips bio_check_eod() on the per-path
> device; the EOD check already passed on the multipath head.
> 
> NVMe per-path namespace devices are always whole disks (bd_partno=0),
> so the blk_partition_remap() skip also gated by BIO_REMAPPED is a
> no-op.  The flag does not persist across failover and cannot go stale
> if the namespace geometry changes between attempts: nvme_failover_req()
> calls bio_set_dev() to redirect the bio back to the multipath head,
> which clears BIO_REMAPPED.  When nvme_requeue_work() resubmits through
> submit_bio_noacct(), bio_check_eod() runs normally against the current
> capacity.
> 
> Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for
> partition-remapped bios").
> 
> A broader solution that moves bio validation into the queue-entered
> context and eliminates the set_capacity(0) hack is being developed
> upstream; however, this minimal fix is suitable for backporting to
> stable kernels affected today.  That series is at:
> https://lore.kernel.org/linux-block/20260519172326.3462354-1-kbusch@meta.com/
> 
> Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev")
> Cc: stable at vger.kernel.org
> Signed-off-by: Igor Achkinazi <igor.achkinazi at dell.com>
> ---
> v2:
>    - Corrected race description: primary race is in the initial
>      submit_bio_noacct() call in nvme_ns_head_submit_bio(), not
>      only in split remainders (which are no longer affected on
>      current mainline since commit 0b64682e78f7)
>    - Dropped incorrect arguments about submit_bio_noacct_nocheck
>      export status and BIO_REMAPPED propagation to split clones
>    - Added analysis showing BIO_REMAPPED flag does not persist
>      across failover (nvme_failover_req clears it via bio_set_dev)
>    - Referenced upstream RFC series addressing the root cause
> 
>   drivers/nvme/host/multipath.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 263161cb8ac0..04f7c7e59945 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -511,6 +511,13 @@ static void nvme_ns_head_submit_bio(struct bio *bio)
>          ns = nvme_find_path(head);
>          if (likely(ns)) {
>                  bio_set_dev(bio, ns->disk->part0);
> +               /*
> +                * Skip bio_check_eod() when this bio enters
> +                * submit_bio_noacct() for the per-path device.
> +                * The EOD check already passed on the multipath head.
> +                */
> +               bio_set_flag(bio, BIO_REMAPPED);
>                  bio->bi_opf |= REQ_NVME_MPATH;
>                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>                                        bio->bi_iter.bi_sector);
> --
> 2.43.0
> 
> 
> 
... or you could introduce __bio_set_dev():

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 97d747320b35..5a2709adeea7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -518,15 +518,20 @@ static inline void blkcg_punt_bio_submit(struct bio *bio)
  }
  #endif /* CONFIG_BLK_CGROUP */

-static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
+static inline void __bio_set_dev(struct bio *bio, struct block_device *bdev)
  {
-       bio_clear_flag(bio, BIO_REMAPPED);
         if (bio->bi_bdev != bdev)
                 bio_clear_flag(bio, BIO_BPS_THROTTLED);
         bio->bi_bdev = bdev;
         bio_associate_blkg(bio);
  }

+static inline void bio_set_dev(struct bio *bio, struct block_device *bdev)
+{
+       bio_clear_flag(bio, BIO_REMAPPED);
+       __bio_set_dev(bio, bdev);
+}
+
  /*
   * BIO list management for use by remapping drivers (e.g. DM or MD) and loop.
   *

to avoid all this clear-and-set-flag dance.
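
For illustration only, here is a userspace sketch of the flag handling under discussion. It is not kernel code: every model_* name is a hypothetical stand-in, and the EOD gate is reduced to a capacity comparison. The call site for __bio_set_dev() is not shown in this mail, so the sketch assumes the caller would still set BIO_REMAPPED itself after retargeting the bio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only. The model_* names are hypothetical stand-ins. */
#define MODEL_BIO_REMAPPED (1u << 0)

struct model_bdev {
	uint64_t capacity;	/* sectors; 0 models set_capacity(disk, 0) */
};

struct model_bio {
	struct model_bdev *bdev;
	unsigned int flags;
	uint64_t sector;
	uint64_t nr_sectors;
};

/* Models the proposed __bio_set_dev(): retarget, leave the flag alone. */
static inline void model_bio_set_dev_keep(struct model_bio *bio,
					  struct model_bdev *bdev)
{
	bio->bdev = bdev;
}

/* Models bio_set_dev(): clear the flag, then retarget. */
static inline void model_bio_set_dev(struct model_bio *bio,
				     struct model_bdev *bdev)
{
	bio->flags &= ~MODEL_BIO_REMAPPED;
	model_bio_set_dev_keep(bio, bdev);
}

/* Models the EOD gate on resubmission: true if the bio would pass. */
static inline bool model_passes_eod(const struct model_bio *bio)
{
	uint64_t max = bio->bdev->capacity;

	if (bio->flags & MODEL_BIO_REMAPPED)
		return true;	/* check is skipped for remapped bios */
	if (bio->nr_sectors == 0)
		return true;
	return bio->nr_sectors <= max && bio->sector <= max - bio->nr_sectors;
}

static inline void model_selfcheck(void)
{
	struct model_bdev head = { .capacity = 100 };
	struct model_bdev path = { .capacity = 0 };	/* removal zeroed it */
	struct model_bio bio = { .bdev = &head, .sector = 10, .nr_sectors = 8 };

	/* Unpatched: bio_set_dev() alone, so the EOD check fails on the path. */
	model_bio_set_dev(&bio, &path);
	assert(!model_passes_eod(&bio));

	/* Posted patch: bio_set_dev() clears, then the caller sets the flag. */
	bio.flags |= MODEL_BIO_REMAPPED;
	assert(model_passes_eod(&bio));

	/* Suggested helper (assumed call site): retarget without the clear. */
	bio.bdev = &head;
	bio.flags = 0;
	model_bio_set_dev_keep(&bio, &path);
	bio.flags |= MODEL_BIO_REMAPPED;
	assert(model_passes_eod(&bio));

	/* Failover: bio_set_dev() back to the head drops the flag again. */
	model_bio_set_dev(&bio, &head);
	assert(!(bio.flags & MODEL_BIO_REMAPPED));
	assert(model_passes_eod(&bio));
}
```

Calling model_selfcheck() from a small main() exercises all four cases. In this model both variants leave the bio in the same state; the helper only drops the redundant clear.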

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


