[PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
Martin Wilck
mwilck at suse.com
Mon Apr 26 10:34:14 BST 2021
On Sun, 2021-04-25 at 13:34 +0200, Hannes Reinecke wrote:
> On 4/23/21 3:38 PM, mwilck at suse.com wrote:
> > From: Martin Wilck <mwilck at suse.com>
> >
> > We have observed a few crashes in run_timer_softirq(), where a broken
> > timer_list struct belonging to an anatt_timer was encountered. The
> > broken structures look like this, and we actually see multiple of them
> > attached to the same timer base:
> >
> > crash> struct timer_list 0xffff92471bcfdc90
> > struct timer_list {
> >   entry = {
> >     next = 0xdead000000000122, // LIST_POISON2
> >     pprev = 0x0
> >   },
> >   expires = 4296022933,
> >   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
> >   flags = 20
> > }
> >
> > If such a timer is encountered in run_timer_softirq(), the kernel
> > crashes. The test scenario was an I/O load test with lots of NVMe
> > controllers, some of which were removed and re-added on the storage
> > side.
> >
> ...
>
> But isn't this the result of detach_timer()? I.e. this suspiciously
> looks like perfectly normal operation; if you look at expire_timers()
> we're first calling 'detach_timer()' before calling the timer function,
> i.e. every crash in the timer function would have this signature.
> And, incidentally, so would any timer function which does not crash.
>
> Sorry to kill your analysis ...
No problem, I realized this myself, and actually mentioned it in the
commit description. OTOH, this timer is only modified in very few
places, and all but nvme_mpath_init() use the proper APIs for modifying
or deleting timers, so the initialization of a (possibly still) running
timer is the only suspect, afaics.
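For reference, these are roughly the call sites touching the timer
(paraphrased from my reading of drivers/nvme/host/multipath.c, abridged,
so take the details with a grain of salt):

    /* nvme_read_ana_log() - (re-)arms the timer via the proper API */
    mod_timer(&ctrl->anatt_timer, ctrl->anatt * HZ * 2 + jiffies);

    /* nvme_mpath_stop() - tears it down via the proper API */
    del_timer_sync(&ctrl->anatt_timer);

    /* nvme_mpath_init() - runs on every (re)connect and simply
     * re-initializes the struct, even though an earlier instance of
     * the timer may still be pending: */
    timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);

Only the last one bypasses the synchronization that mod_timer() and
del_timer_sync() provide, which is why I keep coming back to it.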
My personal theory is that the corruption might happen in several
steps, the first step being timer_setup() modifying fields of a pending
timer. But I couldn't figure it out completely, and found it too hand-
wavy to mention in the commit description.
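To make that a bit more concrete: timer_setup() on an already pending
timer boils down to roughly this (paraphrased from do_init_timer() in
kernel/time/timer.c, details elided):

    /* paraphrased, not a verbatim quote of do_init_timer() */
    timer->entry.pprev = NULL;                     /* timer_pending() is now false */
    timer->function = func;
    timer->flags = flags | raw_smp_processor_id(); /* base/CPU bits rewritten */

If the timer is still queued on some base at that point, that base still
holds a pointer to the entry, but timer_pending() lies and the flags may
no longer point at the right base. A later mod_timer() can then enqueue
the same timer_list a second time, and the eventual detach_timer()
poisoning hits a node that is still reachable from the old base - which
would at least be consistent with seeing multiple poisoned entries on
one base. But as I said, I haven't been able to prove this sequence.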
> This doesn't mean that the patch isn't valid (in the sense that it
> resolve the issue), but we definitely will need to work on root cause
> analysis.
I'd be grateful for any help figuring out the missing bits.
Thanks,
Martin