[PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
Ewan D. Milne
emilne at redhat.com
Fri Apr 23 15:07:13 BST 2021
On Thu, 2021-04-22 at 17:22 +0200, mwilck at suse.com wrote:
> From: Martin Wilck <mwilck at suse.com>
>
> We have observed a few crashes in run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The
> broken structures look like this, and we actually see multiple of them
> attached to the same timer base:
>
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>   entry = {
>     next = 0xdead000000000122, // LIST_POISON2
>     pprev = 0x0
>   },
>   expires = 4296022933,
>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>   flags = 20
> }
>
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage
> side.
>
> I think this may happen if the rdma recovery_work starts, in this
> call chain:
>
> nvme_rdma_error_recovery_work()
>   /* this stops all sorts of activity for the controller, but not
>      the multipath-related work queue and timer */
>   nvme_rdma_reconnect_or_remove(ctrl)
>     => kicks reconnect_work
>
> work queue: reconnect_work
>
> nvme_rdma_reconnect_ctrl_work()
>   nvme_rdma_setup_ctrl()
>     nvme_rdma_configure_admin_queue()
>       nvme_init_identify()
>         nvme_mpath_init()
>           # this sets some fields of the timer_list without taking a lock
>           timer_setup()
>           nvme_read_ana_log()
>             mod_timer() or del_timer_sync()
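To make the suspected race concrete, here is a condensed sketch of the
re-initialisation step in the chain above (an illustration only, not the
actual driver code; the helper name is made up):

    /* Runs from reconnect_work; anatt_timer may still be pending from
     * the previous association, because nothing on this path called
     * nvme_mpath_stop() -> del_timer_sync() first.
     */
    static void sketch_mpath_reinit(struct nvme_ctrl *ctrl)
    {
            /* clears entry.pprev (the "pending" marker) and rewrites
             * function/flags of a timer that may still be linked into
             * the timer wheel, without detaching it first */
            timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);

            /* nvme_read_ana_log() then either re-arms it ... */
            mod_timer(&ctrl->anatt_timer, ctrl->anatt * HZ * 2 + jiffies);
            /* ... or, on the other branch, del_timer_sync()s it */
    }

Because timer_setup() makes the timer look "not pending" while it is
still linked into the wheel, the subsequent mod_timer() can enqueue the
same hlist_node a second time; if run_timer_softirq() walks that base on
another CPU meanwhile, it hits a half-updated entry, which would match
the corrupted timer_list shown above.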
>
> The same applies to TCP. The idea for the patch is based on the
> observation that nvme_rdma_reset_ctrl_work() calls
> nvme_stop_ctrl()->nvme_mpath_stop(), whereas
> nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
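The patch body isn't quoted in this reply, but the change the subject
line and this observation point to would look roughly like the sketch
below (the exact hook point is an assumption and may differ from the
actual diff):

    /* Sketch: stop the multipath machinery during error recovery,
     * mirroring what the reset path already gets via
     * nvme_stop_ctrl() -> nvme_mpath_stop().
     */
    static void error_recovery_sketch(struct nvme_rdma_ctrl *ctrl)
    {
            nvme_stop_keep_alive(&ctrl->ctrl);
            /* del_timer_sync()s the anatt timer and cancels ana_work */
            nvme_mpath_stop(&ctrl->ctrl);
            /* ... existing teardown, then
             * nvme_rdma_reconnect_or_remove(ctrl) ... */
    }

With the anatt timer reliably idle before reconnect_work runs, the later
timer_setup() in nvme_mpath_init() no longer races with a pending timer.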
>
> I admit that the root cause analysis isn't rock solid yet. In
> particular, I can't explain why we see LIST_POISON2 in the "next"
> pointer, which would indicate that the timer has been detached
> before; yet we find it linked to the timer base when the crash
> occurs.
>
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see
> above) and nvme_mpath_stop(), so the hypothesis that modifying an
> active timer causes the issue isn't far-fetched. I suspect that the
> LIST_POISON2 value may end up there in multiple steps.
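For context, the list poisoning itself is written by detach_timer() in
kernel/time/timer.c, which does essentially the following (timer wheel
code of the era in question; shown only to indicate where the poison
values come from):

    static void detach_timer(struct timer_list *timer, bool clear_pending)
    {
            struct hlist_node *entry = &timer->entry;

            debug_deactivate(timer);

            __hlist_del(entry);
            if (clear_pending)
                    entry->pprev = NULL;
            entry->next = LIST_POISON2;
    }

A poisoned next pointer together with pprev == NULL is the state a timer
is left in after a detach with clear_pending set, which fits the
multi-step picture: stale links written by the unlocked
re-initialisation could keep the wheel pointing at an entry that was
only detached later on.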
>
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
>
We have a couple of reports of this same crash, one when traversing the
timer list and one when removing the timer. Note LIST_POISON2 in RAX
(decoded below). I was also thinking that stopping the anatt timer
would resolve it. It's somewhat hard to reproduce, so it may take a
while to verify, but we'll test this patch out.
-Ewan
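The RAX value dead000000000200 in both exception frames is LIST_POISON2
on x86_64; the relevant definitions (as in include/linux/poison.h, with
CONFIG_ILLEGAL_POINTER_VALUE set to 0xdead000000000000 on x86_64) are:

    #ifdef CONFIG_ILLEGAL_POINTER_VALUE
    # define POISON_POINTER_DELTA _AC(CONFIG_ILLEGAL_POINTER_VALUE, UL)
    #else
    # define POISON_POINTER_DELTA 0
    #endif

    #define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
    #define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)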
crash> bt
PID: 0 TASK: ffffffffbae12780 CPU: 0 COMMAND: "swapper/0"
#0 [ffff922477a03cd8] machine_kexec at ffffffffb9c5bf3e
#1 [ffff922477a03d30] __crash_kexec at ffffffffb9d6072d
#2 [ffff922477a03df8] crash_kexec at ffffffffb9d6160d
#3 [ffff922477a03e10] oops_end at ffffffffb9c22d4d
#4 [ffff922477a03e30] general_protection at ffffffffba6011fe
[exception RIP: run_timer_softirq+294]
RIP: ffffffffb9d3ea36 RSP: ffff922477a03ee0 RFLAGS: 00010086
RAX: dead000000000200 RBX: ffff922477a1aa80 RCX: 0000000000000100
RDX: ffff922477a03ef0 RSI: 0000000000000002 RDI: ffff922477a1aa80
RBP: 0000000000000001 R8: ffffffffbae12780 R9: 0000000000000000
R10: 0000000000000310 R11: 000000000b5ad84c R12: 000000011dd60000
R13: ffff922477a03ef0 R14: ffff922472de0f80 R15: ffffffffbae05100
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff922477a03f68] __softirqentry_text_start at ffffffffba8000e4
#6 [ffff922477a03fc8] irq_exit at ffffffffb9cbc1d7
#7 [ffff922477a03fd8] smp_apic_timer_interrupt at ffffffffba6027e4
#8 [ffff922477a03ff0] apic_timer_interrupt at ffffffffba601d6f
--- <IRQ stack> ---
#9 [ffffffffbae03dd8] apic_timer_interrupt at ffffffffba601d6f
[exception RIP: native_safe_halt+14]
RIP: ffffffffba4d667e RSP: ffffffffbae03e80 RFLAGS: 00000246
RAX: ffffffffba4d62d0 RBX: 0000000000000000 RCX: 00000000000f4240
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000082
RBP: 0000000000000000 R8: 000266f562c6c886 R9: 0000000000000001
R10: 000000000000031f R11: ffff9224722cb400 R12: ffffffffffffffff
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffff13 CS: 0010 SS: 0018
#10 [ffffffffbae03e80] default_idle at ffffffffba4d62ec
#11 [ffffffffbae03ea8] do_idle at ffffffffb9cebbb7
#12 [ffffffffbae03ef0] cpu_startup_entry at ffffffffb9cebe0f
#13 [ffffffffbae03f10] start_kernel at ffffffffbb6b21e5
#14 [ffffffffbae03f50] secondary_startup_64 at ffffffffb9c000e7
(different crash)
PID: 69334 TASK: ffff8d02a652dd00 CPU: 0 COMMAND: "kworker/u2:2"
#0 [ffffa658418cfb28] machine_kexec at ffffffffb565bf3e
#1 [ffffa658418cfb80] __crash_kexec at ffffffffb576072d
#2 [ffffa658418cfc48] crash_kexec at ffffffffb576160d
#3 [ffffa658418cfc60] oops_end at ffffffffb5622d4d
#4 [ffffa658418cfc80] general_protection at ffffffffb60011fe
[exception RIP: detach_if_pending+58]
RIP: ffffffffb573cdfa RSP: ffffa658418cfd38 RFLAGS: 00010086
RAX: dead000000000200 RBX: ffff8d02aac74f80 RCX: 0000000000000000
RDX: ffffa65841d5bb68 RSI: ffff8d02afa1aa80 RDI: ffff8d02aac74f80
RBP: ffff8d02aac74f80 R8: ffff8d02a652dd00 R9: 0000000000000000
R10: 0000000002f41000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: ffff8d02a7621e80 R15: ffff8d02aac750a0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffffa658418cfd50] try_to_del_timer_sync at ffffffffb573cf4f
#6 [ffffa658418cfd78] del_timer_sync at ffffffffb573cfb5
#7 [ffffa658418cfd88] nvme_read_ana_log at ffffffffc01df657 [nvme_core]
#8 [ffffa658418cfdb8] nvme_mpath_init at ffffffffc01e0642 [nvme_core]
#9 [ffffa658418cfdd0] nvme_init_identify at ffffffffc01ddba8 [nvme_core]
#10 [ffffa658418cfe38] nvme_tcp_setup_ctrl at ffffffffc040e2e9 [nvme_tcp]
#11 [ffffa658418cfe80] nvme_tcp_reconnect_ctrl_work at ffffffffc040e4bf [nvme_tcp]
#12 [ffffa658418cfe98] process_one_work at ffffffffb56d3477
#13 [ffffa658418cfed8] worker_thread at ffffffffb56d3b40
#14 [ffffa658418cff10] kthread at ffffffffb56d9502
#15 [ffffa658418cff50] ret_from_fork at ffffffffb6000255