[PATCH BUG FIX 2/2] nvme-multipath: clear BIO_QOS flags on requeue

Christoph Hellwig hch at lst.de
Sun Nov 23 22:25:52 PST 2025


On Sun, Nov 23, 2025 at 11:18:58AM -0800, Chaitanya Kulkarni wrote:
> When a bio goes through the rq_qos infrastructure on a path's request
> queue, it gets BIO_QOS_THROTTLED or BIO_QOS_MERGED flags set. These
> flags indicate that rq_qos_done_bio() should be called on completion
> to update rq_qos accounting.
> 
> During path failover in nvme_failover_req(), the bio's bi_bdev is
> redirected from the failed path's disk to the multipath head's disk
> via bio_set_dev(). However, the BIO_QOS flags are not cleared.
> 
> When the bio eventually completes (either successfully via a new path
> or with an error via bio_io_error()), rq_qos_done_bio() checks for
> these flags and calls __rq_qos_done_bio(q->rq_qos, bio), where q is
> obtained from the bio's current bi_bdev - which now points at the
> multipath head, so q is the head's queue rather than the original
> path's queue.
> 
> The multipath head's queue does not have rq_qos enabled (q->rq_qos is
> NULL), but the code assumes that if BIO_QOS_* flags are set, q->rq_qos
> must be valid. This assumption is documented in block/blk-rq-qos.h:
> 
>   "If a bio has BIO_QOS_xxx set, it implicitly implies that
>    q->rq_qos is present."
> 
> This breaks when a bio is moved between queues during NVMe multipath
> failover, leading to a NULL pointer dereference.
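> 
> For reference, a simplified sketch of that check (adapted from
> block/blk-rq-qos.h; exact details vary by kernel version):
> 
>     static inline void rq_qos_done_bio(struct bio *bio)
>     {
>             if (bio->bi_bdev && (bio_flagged(bio, BIO_QOS_THROTTLED) ||
>                                  bio_flagged(bio, BIO_QOS_MERGED))) {
>                     struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> 
>                     /*
>                      * BIO_QOS_xxx is taken to imply q->rq_qos != NULL,
>                      * so it is dereferenced without a check; on the
>                      * multipath head's queue rq_qos is NULL -> crash.
>                      */
>                     __rq_qos_done_bio(q->rq_qos, bio);
>             }
>     }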
> 
> Execution context timeline:
> 
>    (1) dd process context - submission path
>        [USER] dd process
>          [SYSCALL] write()
>            submit_bio()
>              nvme_ns_head_submit_bio() - path selection
>                blk_mq_submit_bio()      ### QOS FLAGS SET HERE
>        [USER] dd waits or returns
>        ==== I/O in flight on NVMe hardware ====
> 
>    (2) Interrupt context - completion path
>        [IRQ] NVMe completion interrupt
>          nvme_irq()
>            nvme_complete_rq()
>              nvme_failover_req()        ### BIO MOVED TO HEAD
>                spin_lock_irqsave() (atomic section)
>                bio_set_dev() changes bi_bdev
>                ### BUG: QOS flags NOT cleared
>                kblockd_schedule_work()
> 
>    (3) kblockd workqueue context - requeue path
>        [WQ] kblockd workqueue - kworker process
>          nvme_requeue_work()
>            submit_bio_noacct()
>              nvme_ns_head_submit_bio()
>                nvme_find_path() returns NULL
>                  bio_io_error()
>                    bio_endio()
>                      rq_qos_done_bio()  ### CRASH ###
> 
>    KERNEL PANIC / OOPS
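> 
> For reference, the failover step marked "BIO MOVED TO HEAD" above
> corresponds roughly to this loop in nvme_failover_req() (simplified;
> exact code varies by kernel version):
> 
>     spin_lock_irqsave(&ns->head->requeue_lock, flags);
>     for (bio = req->bio; bio; bio = bio->bi_next) {
>             /* redirect the bio to the multipath head's disk */
>             bio_set_dev(bio, ns->head->disk->part0);
>             if (bio->bi_opf & REQ_POLLED) {
>                     bio->bi_opf &= ~REQ_POLLED;
>                     bio->bi_cookie = BLK_QC_T_NONE;
>             }
>             bio->bi_opf &= ~REQ_NOWAIT;
>             /* BIO_QOS_THROTTLED/BIO_QOS_MERGED are left set - the bug */
>     }
>     blk_steal_bios(&ns->head->requeue_list, req);
>     spin_unlock_irqrestore(&ns->head->requeue_lock, flags);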
> 
> Crash from blktests nvme/058 (rapid namespace remapping):
> 
> [ 1339.636033] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 1339.641025] nvme nvme4: rescanning namespaces.
> [ 1339.642064] #PF: supervisor read access in kernel mode
> [ 1339.642067] #PF: error_code(0x0000) - not-present page
> [ 1339.642070] PGD 0 P4D 0
> [ 1339.642073] Oops: Oops: 0000 [#1] SMP NOPTI
> [ 1339.642078] CPU: 35 UID: 0 PID: 4579 Comm: kworker/35:2H
>                Tainted: G   O     N  6.17.0-rc3nvme+ #5 PREEMPT(voluntary)
> [ 1339.642084] Tainted: [O]=OOT_MODULE, [N]=TEST
> [ 1339.673446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> 	       BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 1339.682359] Workqueue: kblockd nvme_requeue_work [nvme_core]
> [ 1339.686613] RIP: 0010:__rq_qos_done_bio+0xd/0x40
> [ 1339.690161] Code: 75 dd 5b 5d 41 5c c3 cc cc cc cc 66 90 90 90 90 90 90 90
>                      90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 f5
> 		     53 48 89 fb <48> 8b 03 48 8b 40 30 48 85 c0 74 0b 48 89 ee
> 		     48 89 df ff d0 0f 1f
> [ 1339.703691] RSP: 0018:ffffc900066f3c90 EFLAGS: 00010202
> [ 1339.706844] RAX: ffff888148b9ef00 RBX: 0000000000000000 RCX: 0000000000000000
> [ 1339.711136] RDX: 00000000000001c0 RSI: ffff8882aaab8a80 RDI: 0000000000000000
> [ 1339.715691] RBP: ffff8882aaab8a80 R08: 0000000000000000 R09: 0000000000000000
> [ 1339.720472] R10: 0000000000000000 R11: fefefefefefefeff R12: ffff8882aa3b6010
> [ 1339.724650] R13: 0000000000000000 R14: ffff8882338bcef0 R15: ffff8882aa3b6020
> [ 1339.729029] FS:  0000000000000000(0000) GS:ffff88985c0cf000(0000) knlGS:0000000000000000
> [ 1339.734525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1339.738563] CR2: 0000000000000000 CR3: 0000000111045000 CR4: 0000000000350ef0
> [ 1339.742750] DR0: ffffffff845ccbec DR1: ffffffff845ccbed DR2: ffffffff845ccbee
> [ 1339.745630] DR3: ffffffff845ccbef DR6: 00000000ffff0ff0 DR7: 0000000000000600
> [ 1339.748488] Call Trace:
> [ 1339.749512]  <TASK>
> [ 1339.750449]  bio_endio+0x71/0x2e0
> [ 1339.751833]  nvme_ns_head_submit_bio+0x290/0x320 [nvme_core]
> [ 1339.754073]  __submit_bio+0x222/0x5e0
> [ 1339.755623]  ? rcu_is_watching+0xd/0x40
> [ 1339.757201]  ? submit_bio_noacct_nocheck+0x131/0x370
> [ 1339.759210]  submit_bio_noacct_nocheck+0x131/0x370
> [ 1339.761189]  ? submit_bio_noacct+0x20/0x620
> [ 1339.762849]  nvme_requeue_work+0x4b/0x60 [nvme_core]
> [ 1339.764828]  process_one_work+0x20e/0x630
> [ 1339.766528]  worker_thread+0x184/0x330
> [ 1339.768129]  ? __pfx_worker_thread+0x10/0x10
> [ 1339.769942]  kthread+0x10a/0x250
> [ 1339.771263]  ? __pfx_kthread+0x10/0x10
> [ 1339.772776]  ? __pfx_kthread+0x10/0x10
> [ 1339.774381]  ret_from_fork+0x273/0x2e0
> [ 1339.775948]  ? __pfx_kthread+0x10/0x10
> [ 1339.777504]  ret_from_fork_asm+0x1a/0x30
> [ 1339.779163]  </TASK>
> 
> Fix this by clearing both BIO_QOS_THROTTLED and BIO_QOS_MERGED flags
> when bios are redirected to the multipath head in nvme_failover_req().
> This is consistent with the existing code that clears REQ_POLLED and
> REQ_NOWAIT flags when the bio changes queues.
> 
> Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux at gmail.com>
> ---
>  drivers/nvme/host/multipath.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 3da980dc60d9..2535dba8ce1e 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -168,6 +168,16 @@ void nvme_failover_req(struct request *req)
>  		 * the flag to avoid spurious EAGAIN I/O failures.
>  		 */
>  		bio->bi_opf &= ~REQ_NOWAIT;
> +		/*
> +		 * BIO_QOS_THROTTLED and BIO_QOS_MERGED were set when the bio
> +		 * went through the path's request queue rq_qos infrastructure.
> +		 * The bio is now being redirected to the multipath head's
> +		 * queue which may not have rq_qos enabled, so these flags are
> +		 * no longer valid and must be cleared to prevent
> +		 * rq_qos_done_bio() from dereferencing a NULL q->rq_qos.
> +		 */
> +		bio_clear_flag(bio, BIO_QOS_THROTTLED);
> +		bio_clear_flag(bio, BIO_QOS_MERGED);

This really should go into blk_steal_bios instead.  As should the
existing nowait/polled fixups.
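
Something like this, perhaps (untested sketch of the idea only, so that
every caller of blk_steal_bios gets the cleanup when bios are detached
from the original request):

    void blk_steal_bios(struct bio_list *list, struct request *rq)
    {
            struct bio *bio;

            /*
             * The bios are leaving rq's queue and will be resubmitted
             * elsewhere; drop state tied to the original queue.  The
             * REQ_NOWAIT/REQ_POLLED fixups could move here as well.
             */
            for (bio = rq->bio; bio; bio = bio->bi_next) {
                    bio_clear_flag(bio, BIO_QOS_THROTTLED);
                    bio_clear_flag(bio, BIO_QOS_MERGED);
            }

            if (rq->bio) {
                    if (list->tail)
                            list->tail->bi_next = rq->bio;
                    else
                            list->head = rq->bio;
                    list->tail = rq->biotail;
                    rq->bio = NULL;
                    rq->biotail = NULL;
            }
            rq->__data_len = 0;
    }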



