[PATCH 0/2] nvme: system fault while shutting down fabric controller
Nilay Shroff
nilay at linux.ibm.com
Fri Oct 4 04:46:55 PDT 2024
We observed a kernel task hang and a kernel crash while shutting down an
NVMe fabric controller. Both issues were hit while running blktest
nvme/037, and this patch series addresses them.
We intermittently observe the below kernel task hang while running
blktest nvme/037. This test sets up an NVMeOF passthru controller using
a loop target, connects to it, and then immediately terminates/cleans up
the connection.
dmesg output:
-------------
run blktests nvme/037 at 2024-10-04 00:46:02
nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
nvme nvme1: D3 entry latency set to 10 seconds
nvme nvme1: creating 32 I/O queues.
nvme nvme1: new ctrl: "blktests-subsystem-1"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: resetting controller
INFO: task nvme:3082 blocked for more than 120 seconds.
Not tainted 6.11.0+ #89
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme state:D stack:0 pid:3082 tgid:3082 ppid:2983 flags:0x00042080
Call Trace:
0xc000000070f5bf90 (unreliable)
__switch_to+0x148/0x230
__schedule+0x260/0x6dc
schedule+0x40/0x100
blk_mq_freeze_queue_wait+0xa4/0xec
blk_mq_destroy_queue+0x68/0xac
nvme_remove_admin_tag_set+0x2c/0xb8 [nvme_core]
nvme_loop_destroy_admin_queue+0x68/0x88 [nvme_loop]
nvme_do_delete_ctrl+0x1e0/0x268 [nvme_core]
nvme_delete_ctrl_sync+0xd4/0x104 [nvme_core]
nvme_sysfs_delete+0x78/0x90 [nvme_core]
dev_attr_store+0x34/0x50
sysfs_kf_write+0x64/0x78
kernfs_fop_write_iter+0x1b0/0x290
vfs_write+0x3bc/0x4f8
ksys_write+0x84/0x140
system_call_exception+0x124/0x320
system_call_vectored_common+0x15c/0x2ec
As we can see from the above trace, the nvme task hangs indefinitely
while shutting down the loop controller. The task cannot make forward
progress because it is waiting for outstanding requests that never
complete.
The first patch in the series fixes the above hang by ensuring that,
while shutting down the nvme loop controller, any pending I/O that might
have been queued after the queue was quiesced is flushed to completion.
To that end, it adds the missing unquiesce of the admin and I/O queues in
the nvme loop driver just before the respective queue is destroyed, as
sketched below.
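For illustration, here is a minimal sketch of that idea against
drivers/nvme/target/loop.c. It shows the approach only, not the exact
hunk in the patch; nvme_unquiesce_admin_queue() and
nvme_unquiesce_io_queues() are existing nvme core helpers, but their
exact placement here is illustrative:

/* sketch only: unquiesce before tearing the admin queue down */
static void nvme_loop_destroy_admin_queue(struct nvme_loop_ctrl *ctrl)
{
	if (!test_and_clear_bit(NVME_LOOP_Q_LIVE, &ctrl->queues[0].flags))
		return;
	/*
	 * Requests may have been queued after the admin queue was
	 * quiesced; unquiesce it so they are flushed to completion
	 * before the queue is destroyed.
	 */
	nvme_unquiesce_admin_queue(&ctrl->ctrl);
	nvmet_sq_destroy(&ctrl->queues[0].nvme_sq);
	nvme_remove_admin_tag_set(&ctrl->ctrl);
}

The I/O queue teardown path gets the analogous
nvme_unquiesce_io_queues() call before its queues are destroyed.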
The second patch in the series fixes another issue, this one involving
the nvme keep-alive operation. A keep-alive request could potentially
sneak in while the fabric controller is shutting down. We encountered
the below intermittent kernel crash while running blktest nvme/037:
dmesg output:
------------
run blktests nvme/037 at 2024-10-04 03:59:27
<snip>
nvme nvme1: new ctrl: "blktests-subsystem-5"
nvme nvme1: Failed to configure AEN (cfg 300)
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-5"
nvme nvme1: long keepalive RTT (54760 ms)
nvme nvme1: failed nvme_keep_alive_end_io error=4
BUG: Kernel NULL pointer dereference on read at 0x00000080
Faulting instruction address: 0xc00000000091c9f8
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
CPU: 28 UID: 0 PID: 338 Comm: kworker/u263:2 Kdump: loaded Not tainted 6.11.0+ #89
Hardware name: IBM,9043-MRX POWER10 (architected) 0x800200 0xf000006 of:IBM,FW1060.00 (NM1060_028) hv:phyp pSeries
Workqueue: nvme-wq nvme_keep_alive_work [nvme_core]
NIP: c00000000091c9f8 LR: c00000000084150c CTR: 0000000000000004
<snip>
NIP [c00000000091c9f8] sbitmap_any_bit_set+0x68/0xb8
LR [c00000000084150c] blk_mq_do_dispatch_ctx+0xcc/0x280
Call Trace:
autoremove_wake_function+0x0/0xbc (unreliable)
__blk_mq_sched_dispatch_requests+0x114/0x24c
blk_mq_sched_dispatch_requests+0x44/0x84
blk_mq_run_hw_queue+0x140/0x220
nvme_keep_alive_work+0xc8/0x19c [nvme_core]
process_one_work+0x200/0x4e0
worker_thread+0x340/0x504
kthread+0x138/0x140
start_kernel_thread+0x14/0x18
The above crash occurred while shutting down the fabric/loop controller.
If an nvme keep-alive request sneaks in during controller shutdown and is
later flushed, nvme_keep_alive_end_io() is invoked asynchronously to
handle the end of the keep-alive operation. nvme_keep_alive_end_io()
decrements the admin queue usage ref counter; if the keep-alive was the
last/only request on the admin queue, the counter drops to zero. When
that happens, blk_mq_destroy_queue(), which may be running concurrently
on another cpu (as part of the controller shutdown path), makes forward
progress and the admin queue is freed. At the same time, however, the
nvme keep-alive worker on the first cpu has not yet returned from its
asynchronous blk-mq submission (blk_execute_rq_nowait()), so it may still
access admin queue resources that the controller shutdown path has
already released, which causes the observed symptom.
For instance, the following sequence of operations, running
simultaneously on two cpus, triggers this issue:
cpu0:
nvme_keep_alive_work()
->blk_execute_rq_nowait()
->blk_mq_run_hw_queue()
->blk_mq_sched_dispatch_requests()
->__blk_mq_sched_dispatch_requests()
->blk_mq_dispatch_rq_list()
->nvme_loop_queue_rq()
->nvme_fail_nonready_command() -- here keep-alive req fails because admin queue is shutting down
->nvme_complete_rq()
->nvme_end_req()
->blk_mq_end_request()
->__blk_mq_end_request()
->nvme_keep_alive_end_io() -- here we decrement admin-queue-usage-ref-counter
cpu1:
nvme_loop_delete_ctrl_host()
->nvme_loop_shutdown_ctrl()
->nvme_loop_destroy_admin_queue()
->nvme_remove_admin_tag_set()
->blk_mq_destroy_queue() -- here we wait until admin-queue-usage-ref-counter reaches zero
->blk_put_queue() -- here we destroy the queue once admin-queue-usage-ref-counter becomes zero
-- From here on we are not supposed to access admin queue
resources; however, the nvme keep-alive thread running on
cpu0 has not yet finished and may still access the admin
queue pointer, causing the observed crash.
So, prima facie, the above trace shows that the nvme keep-alive thread
running on one cpu races with the controller shutdown operation running
on another cpu.
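For reference, here is a simplified (and abridged) view of the current
asynchronous submission path in nvme_keep_alive_work(); the comments mark
where the admin queue usage reference is dropped from completion context
while the submitting worker may still be running the hardware queue:

/* drivers/nvme/host/core.c -- abridged for illustration */
static void nvme_keep_alive_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
			struct nvme_ctrl, ka_work);
	struct request *rq;

	rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd),
				  BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
	if (IS_ERR(rq)) {
		nvme_reset_ctrl(ctrl);
		return;
	}
	nvme_init_request(rq, &ctrl->ka_cmd);
	rq->timeout = ctrl->kato * HZ;
	/*
	 * Completion path: nvme_keep_alive_end_io() frees the request
	 * and thereby drops the admin queue usage reference ...
	 */
	rq->end_io = nvme_keep_alive_end_io;
	rq->end_io_data = ctrl;
	/*
	 * ... which may happen while this worker is still inside
	 * blk_mq_run_hw_queue(), i.e. after blk_mq_destroy_queue() on
	 * the shutdown path has already seen the counter hit zero.
	 */
	blk_execute_rq_nowait(rq, false);
}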
The second patch in the series addresses the above issue by making nvme
keep-alive a synchronous operation, so that the admin queue usage ref
counter is decremented only after the keep-alive command has finished and
returned its status. This also ensures that blk_mq_destroy_queue() does
not return until the keep-alive work has finished, making it safe to
destroy the queue.
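A minimal sketch of that direction is shown below; nvme_keep_alive_finish()
is an illustrative helper name for the status handling (reschedule the
keep-alive timer or escalate on error), not necessarily what the patch
calls it:

/* drivers/nvme/host/core.c -- sketch of the synchronous variant */
static void nvme_keep_alive_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
			struct nvme_ctrl, ka_work);
	struct request *rq;
	blk_status_t status;

	rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd),
				  BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
	if (IS_ERR(rq)) {
		nvme_reset_ctrl(ctrl);
		return;
	}
	nvme_init_request(rq, &ctrl->ka_cmd);
	rq->timeout = ctrl->kato * HZ;

	/* Block until the keep-alive command completes (or fails). */
	status = blk_execute_rq(rq, false);
	/* Illustrative helper: act on the returned status. */
	nvme_keep_alive_finish(rq, status, ctrl);
	/*
	 * The admin queue usage reference is dropped only here, after
	 * completion, so blk_mq_destroy_queue() on the shutdown path
	 * cannot make progress while this worker still touches the queue.
	 */
	blk_mq_free_request(rq);
}

Again, this is only a sketch of the approach described above; see the
actual patch for the real implementation.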
Moreover, the keep-alive command is lightweight and infrequent, so
handling it synchronously is reasonable from a performance perspective.
Since keep-alive is rare compared to other NVMe operations (such as reads
and writes), making it synchronous does not introduce significant
overhead.
Nilay Shroff (2):
nvme-loop: flush off pending I/O while shutting down loop controller
nvme: make keep-alive synchronous operation
drivers/nvme/host/core.c | 25 ++++++++++---------------
drivers/nvme/target/loop.c | 13 +++++++++++++
2 files changed, 23 insertions(+), 15 deletions(-)
--
2.45.2