[PATCHv4 0/2] nvme: fix system fault observed while shutting down controller

Mon Nov 4 22:12:07 PST 2024

This patch series addresses the system fault observed while shutting down
a fabric controller. We fixed it earlier [1]; however, it was later
realized that there is a better way to address it [2].

The first patch in the series reverts the changes implemented in [3],
so the keep-alive operation becomes asynchronous again, as it was
earlier.

The second patch in the series fixes the kernel crash observed while
shutting down a fabric controller. The system fault occurs because a
keep-alive request sneaks in while the fabric controller is shutting
down. We encounter the below intermittent kernel crash while running
blktests nvme/037:

dmesg output:
------------
run blktests nvme/037 at 2024-10-04 03:59:27
<snip>
nvme nvme1: new ctrl: "blktests-subsystem-5"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-5"
nvme nvme1: long keepalive RTT (54760 ms)
nvme nvme1: failed nvme_keep_alive_end_io error=4
BUG: Kernel NULL pointer dereference on read at 0x00000080
Faulting instruction address: 0xc00000000091c9f8
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
CPU: 28 UID: 0 PID: 338 Comm: kworker/u263:2 Kdump: loaded Not tainted 6.11.0+ #89
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Workqueue: nvme-wq nvme_keep_alive_work [nvme_core]
NIP:  c00000000091c9f8 LR: c00000000084150c CTR: 0000000000000004
<snip>
NIP [c00000000091c9f8] sbitmap_any_bit_set+0x68/0xb8
LR [c00000000084150c] blk_mq_do_dispatch_ctx+0xcc/0x280
Call Trace:
    autoremove_wake_function+0x0/0xbc (unreliable)
    __blk_mq_sched_dispatch_requests+0x114/0x24c
    blk_mq_sched_dispatch_requests+0x44/0x84
    blk_mq_run_hw_queue+0x140/0x220
    nvme_keep_alive_work+0xc8/0x19c [nvme_core]
    process_one_work+0x200/0x4e0
    worker_thread+0x340/0x504
    kthread+0x138/0x140
    start_kernel_thread+0x14/0x18

We realized that the above crash is a regression caused by the changes
implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into
nvme_uninit_ctrl()"). Ideally we should stop keep-alive at the very
beginning of the controller shutdown code path, or before destroying the
admin queue and freeing the admin tagset, so that keep-alive can't sneak
in or interfere with the shutdown operation. However, commit a54a93d0e359
removed the keep-alive stop operation from the beginning of the
controller shutdown code path, which created the possibility of
keep-alive sneaking in, interfering with the shutdown operation, and
causing the observed kernel crash.

To fix the observed crash, we decided to move nvme_stop_keep_alive() from
nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change ensures
that we don't make forward progress and delete the admin queue until the
keep-alive operation is finished (if it's in-flight) or cancelled. The
second patch in the series helps address the kernel crash.
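
The ordering argument above can be illustrated with a small user-space
model. This is only a sketch, not the driver code: every toy_* name and
field below is hypothetical, and the "work" is a plain function call
rather than a real delayed work item. It shows that a keep-alive worker
which runs after tagset removal only sees freed state when keep-alive
was not stopped first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; these are not the kernel's types. */
struct toy_tagset {
	int nr_tags;
};

struct toy_ctrl {
	struct toy_tagset *admin_tagset;	/* NULL once freed */
	bool ka_pending;			/* keep-alive work still queued */
};

/* Stand-in for stopping keep-alive: cancel the pending work. */
static void toy_stop_keep_alive(struct toy_ctrl *ctrl)
{
	ctrl->ka_pending = false;
}

/*
 * Stand-in for the keep-alive worker. Returns 0 if it was cancelled or
 * found a live tagset, and -1 where it would have used a freed tagset.
 */
static int toy_keep_alive_work(struct toy_ctrl *ctrl)
{
	if (!ctrl->ka_pending)
		return 0;
	if (!ctrl->admin_tagset)
		return -1;
	return 0;
}

/* Stand-in for admin tagset removal, with the ordering made selectable. */
static void toy_remove_admin_tag_set(struct toy_ctrl *ctrl, bool stop_ka_first)
{
	if (stop_ka_first)
		toy_stop_keep_alive(ctrl);
	free(ctrl->admin_tagset);
	ctrl->admin_tagset = NULL;
}

/* Run one shutdown and report what a late keep-alive worker would see. */
int toy_shutdown(bool stop_ka_first)
{
	struct toy_ctrl ctrl = { .admin_tagset = NULL, .ka_pending = true };

	ctrl.admin_tagset = malloc(sizeof(*ctrl.admin_tagset));
	assert(ctrl.admin_tagset != NULL);
	ctrl.admin_tagset->nr_tags = 1;

	toy_remove_admin_tag_set(&ctrl, stop_ka_first);
	return toy_keep_alive_work(&ctrl);
}
```

In this model toy_shutdown(true) returns 0 and toy_shutdown(false)
returns -1, which corresponds to the window that moving the stop into
tagset removal is meant to close.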

[1] https://lore.kernel.org/all/ZxFSkNI2p65ucTB5@kbusch-mbp.dhcp.thefacebook.com/
[2] https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
[3] https://lore.kernel.org/all/20241016030339.54029-3-nilay@linux.ibm.com/

Changes from v3:
    - Add a brief explanation to the commit log of the first patch
      describing why the commit is being reverted (Ming Lei)
Changes from v2:
    - Move nvme_stop_keep_alive() from nvme_uninit_ctrl() to
      nvme_remove_admin_tag_set() instead of adding it to
      nvme_stop_ctrl() which would help save one callsite of
      nvme_stop_keep_alive() (Ming Lei)
    - The third patch in the series isn't necessary if we avoid the full
      revert and squash the series to just one fixing commit (Keith
      Busch)
Changes from v1:
    - Update the commit log of the third patch to make the intent of the
      changes clear (Sagi Grimberg)

Nilay Shroff (2):
  Revert "nvme: make keep-alive synchronous operation"
  nvme-fabrics: fix kernel crash while shutting down controller

 drivers/nvme/host/core.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

-- 
2.45.2