[PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue

Ewan D. Milne emilne at redhat.com
Fri Apr 23 15:07:13 BST 2021


On Thu, 2021-04-22 at 17:22 +0200, mwilck at suse.com wrote:
> From: Martin Wilck <mwilck at suse.com>
> 
> We have observed a few crashes in run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The
> broken structures look like this, and we actually see multiple ones
> attached to the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>   entry = {
>     next = 0xdead000000000122,  // LIST_POISON2
>     pprev = 0x0
>   },
>   expires = 4296022933,
>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>   flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage
> side.
> 
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
> 
> nvme_rdma_error_recovery_work()
>   /* this stops all sorts of activity for the controller, but not
>      the multipath-related work queue and timer */
>   nvme_rdma_reconnect_or_remove(ctrl)
>     => kicks reconnect_work
> 
> work queue: reconnect_work
> 
> nvme_rdma_reconnect_ctrl_work()
>   nvme_rdma_setup_ctrl()
>     nvme_rdma_configure_admin_queue()
>        nvme_init_identify()
>           nvme_mpath_init()
>              # this sets some fields of the timer_list
>              # without taking a lock
>              timer_setup()
>              nvme_read_ana_log()
>                  mod_timer() or del_timer_sync()
> 
> Similar for TCP. The idea for the patch is based on the observation
> that nvme_rdma_reset_ctrl_work() calls
> nvme_stop_ctrl()->nvme_mpath_stop(), whereas
> nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
> 
> I admit that the root cause analysis isn't rock solid yet. In
> particular, I can't explain why we see LIST_POISON2 in the "next"
> pointer, which would indicate that the timer has been detached before;
> yet we find it linked to the timer base when the crash occurs.
> 
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above)
> and nvme_mpath_stop(), so the hypothesis that modifying active timers
> may cause the issue isn't totally out of the question. I suspect that
> the LIST_POISON2 state may come to pass in multiple steps.
> 
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
> 

We have a couple of reports of this same crash, one when traversing the
timer list and one when removing the timer.  Note LIST_POISON2 in RAX.
I was also thinking that stopping the anatt timer would resolve it.

It's somewhat hard to reproduce, so it may take a while to verify,
but we'll test this patch out.

-Ewan

crash> bt
PID: 0      TASK: ffffffffbae12780  CPU: 0   COMMAND: "swapper/0"
 #0 [ffff922477a03cd8] machine_kexec at ffffffffb9c5bf3e
 #1 [ffff922477a03d30] __crash_kexec at ffffffffb9d6072d
 #2 [ffff922477a03df8] crash_kexec at ffffffffb9d6160d
 #3 [ffff922477a03e10] oops_end at ffffffffb9c22d4d
 #4 [ffff922477a03e30] general_protection at ffffffffba6011fe
    [exception RIP: run_timer_softirq+294]
    RIP: ffffffffb9d3ea36  RSP: ffff922477a03ee0  RFLAGS: 00010086
    RAX: dead000000000200  RBX: ffff922477a1aa80  RCX: 0000000000000100
    RDX: ffff922477a03ef0  RSI: 0000000000000002  RDI: ffff922477a1aa80
    RBP: 0000000000000001   R8: ffffffffbae12780   R9: 0000000000000000
    R10: 0000000000000310  R11: 000000000b5ad84c  R12: 000000011dd60000
    R13: ffff922477a03ef0  R14: ffff922472de0f80  R15: ffffffffbae05100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff922477a03f68] __softirqentry_text_start at ffffffffba8000e4
 #6 [ffff922477a03fc8] irq_exit at ffffffffb9cbc1d7
 #7 [ffff922477a03fd8] smp_apic_timer_interrupt at ffffffffba6027e4
 #8 [ffff922477a03ff0] apic_timer_interrupt at ffffffffba601d6f
--- <IRQ stack> ---
 #9 [ffffffffbae03dd8] apic_timer_interrupt at ffffffffba601d6f
    [exception RIP: native_safe_halt+14]
    RIP: ffffffffba4d667e  RSP: ffffffffbae03e80  RFLAGS: 00000246
    RAX: ffffffffba4d62d0  RBX: 0000000000000000  RCX: 00000000000f4240
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000082
    RBP: 0000000000000000   R8: 000266f562c6c886   R9: 0000000000000001
    R10: 000000000000031f  R11: ffff9224722cb400  R12: ffffffffffffffff
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff13  CS: 0010  SS: 0018
#10 [ffffffffbae03e80] default_idle at ffffffffba4d62ec
#11 [ffffffffbae03ea8] do_idle at ffffffffb9cebbb7
#12 [ffffffffbae03ef0] cpu_startup_entry at ffffffffb9cebe0f
#13 [ffffffffbae03f10] start_kernel at ffffffffbb6b21e5
#14 [ffffffffbae03f50] secondary_startup_64 at ffffffffb9c000e7

(different crash)


PID: 69334  TASK: ffff8d02a652dd00  CPU: 0   COMMAND: "kworker/u2:2"
 #0 [ffffa658418cfb28] machine_kexec at ffffffffb565bf3e
 #1 [ffffa658418cfb80] __crash_kexec at ffffffffb576072d
 #2 [ffffa658418cfc48] crash_kexec at ffffffffb576160d
 #3 [ffffa658418cfc60] oops_end at ffffffffb5622d4d
 #4 [ffffa658418cfc80] general_protection at ffffffffb60011fe
    [exception RIP: detach_if_pending+58]
    RIP: ffffffffb573cdfa  RSP: ffffa658418cfd38  RFLAGS: 00010086
    RAX: dead000000000200  RBX: ffff8d02aac74f80  RCX: 0000000000000000
    RDX: ffffa65841d5bb68  RSI: ffff8d02afa1aa80  RDI: ffff8d02aac74f80
    RBP: ffff8d02aac74f80   R8: ffff8d02a652dd00   R9: 0000000000000000
    R10: 0000000002f41000  R11: 0000000000000000  R12: 0000000000000001
    R13: 0000000000000000  R14: ffff8d02a7621e80  R15: ffff8d02aac750a0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa658418cfd50] try_to_del_timer_sync at ffffffffb573cf4f
 #6 [ffffa658418cfd78] del_timer_sync at ffffffffb573cfb5
 #7 [ffffa658418cfd88] nvme_read_ana_log at ffffffffc01df657 [nvme_core]
 #8 [ffffa658418cfdb8] nvme_mpath_init at ffffffffc01e0642 [nvme_core]
 #9 [ffffa658418cfdd0] nvme_init_identify at ffffffffc01ddba8 [nvme_core]
#10 [ffffa658418cfe38] nvme_tcp_setup_ctrl at ffffffffc040e2e9 [nvme_tcp]
#11 [ffffa658418cfe80] nvme_tcp_reconnect_ctrl_work at ffffffffc040e4bf [nvme_tcp]
#12 [ffffa658418cfe98] process_one_work at ffffffffb56d3477
#13 [ffffa658418cfed8] worker_thread at ffffffffb56d3b40
#14 [ffffa658418cff10] kthread at ffffffffb56d9502
#15 [ffffa658418cff50] ret_from_fork at ffffffffb6000255




More information about the Linux-nvme mailing list