[PATCH v3] nvme: rdma/tcp: fix list corruption with anatt timer

Tue Apr 27 10:33:04 BST 2021

On 4/27/21 10:52 AM, mwilck at suse.com wrote:
> From: Martin Wilck <mwilck at suse.com>
> 
> We have observed a few crashes run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we see actually multiple ones attached to
> the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>   entry = {
>     next = 0xdead000000000122,  // LIST_POISON2
>     pprev = 0x0
>   },
>   expires = 4296022933,
>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>   flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
> 
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
> 
> nvme_rdma_error_recovery_work()
>   /* this stops all sorts of activity for the controller, but not
>      the multipath-related work queue and timer */
>   nvme_rdma_reconnect_or_remove(ctrl)
>     => kicks reconnect_work
> 
> work queue: reconnect_work
> 
> nvme_rdma_reconnect_ctrl_work()
>   nvme_rdma_setup_ctrl()
>     nvme_rdma_configure_admin_queue()
>        nvme_init_identify()
>           nvme_mpath_init()
> 	     # this sets some fields of the timer_list without taking a lock
>              timer_setup()
>              nvme_read_ana_log()
> 	         mod_timer() or del_timer_sync()
> 
> Similar for TCP. The idea for the patch is based on the observation that
> nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
> whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
> 
> I admit that the root cause analysis isn't rock solid yet. In particular, I
> can't explain why we see LIST_POISON2 in the "next" pointer, which would
> indicate that the timer has been detached before; yet we find it linked to
> the timer base when the crash occurs.
> 
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
> nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
> the issue isn't totally out of sight. I suspect that the LIST_POISON2 may
> come to pass in multiple steps.
> 
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
> 
> Signed-off-by: Martin Wilck <mwilck at suse.com>
> Reviewed-by: Sagi Grimberg <sagi at grimberg.me>
> Reviewed-by: Chao Leng <lengchao at huawei.com>
> Cc: stable at vger.kernel.org
> 
As indicated in my previous mail, please change the description. We have
since established a actual reason (duplicate calls to add_timer()), so
please list it here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare at suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)