[PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
Hannes Reinecke
hare at suse.de
Mon Apr 26 11:06:46 BST 2021
On 4/26/21 11:34 AM, Martin Wilck wrote:
> On Sun, 2021-04-25 at 13:34 +0200, Hannes Reinecke wrote:
>> On 4/23/21 3:38 PM, mwilck at suse.com wrote:
>>> From: Martin Wilck <mwilck at suse.com>
>>>
>>> We have observed a few crashes in run_timer_softirq(), where a broken
>>> timer_list struct belonging to an anatt_timer was encountered. The
>>> broken structures look like this, and we actually see multiple ones
>>> attached to the same timer base:
>>>
>>> crash> struct timer_list 0xffff92471bcfdc90
>>> struct timer_list {
>>>   entry = {
>>>     next = 0xdead000000000122, // LIST_POISON2
>>>     pprev = 0x0
>>>   },
>>>   expires = 4296022933,
>>>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>>>   flags = 20
>>> }
>>>
>>> If such a timer is encountered in run_timer_softirq(), the kernel
>>> crashes. The test scenario was an I/O load test with lots of NVMe
>>> controllers, some of which were removed and re-added on the storage
>>> side.
>>>
>> ...
>>
>> But isn't this the result of detach_timer()? I.e. this suspiciously
>> looks like perfectly normal operation; if you look at expire_timers()
>> we're first calling detach_timer() before calling the timer function,
>> i.e. every crash in the timer function would have this signature.
>> And, incidentally, so would any timer function which does not crash.
>>
>> Sorry to kill your analysis ...
>
> No problem, I realized this myself, and actually mentioned it in the
> commit description. OTOH, this timer is only modified in very few
> places, and all but nvme_mpath_init() use the proper APIs for modifying
> or deleting timers, so the initialization of a (possibly still) running
> timer is the only suspect, afaics.
>
> My personal theory is that the corruption might happen in several
> steps, the first step being timer_setup() modifying fields of a pending
> timer. But I couldn't figure it out completely, and found it too
> hand-wavy to mention in the commit description.
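
Mechanically, that would be enough to explain the corruption; a minimal
sketch of the hazard (simplified, not the actual nvme code, and the
do_init_timer() details are from memory):

  /* a timer from the previous association is still pending ...     */
  mod_timer(&ctrl->anatt_timer, jiffies + ctrl->anatt * HZ);

  /* ... when a reconnect runs nvme_mpath_init() again:             */
  timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
  /* this clears entry.pprev and rewrites timer->flags (which encode
   * the timer base) while the timer is still linked into the timer
   * wheel; a later re-arm, deletion or expiry then operates on an
   * inconsistent list and can blow up in run_timer_softirq().      */

The more interesting question, though, is how we end up re-initialising
(or freeing) a pending timer in the first place; see below.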
>
>> This doesn't mean that the patch isn't valid (in the sense that it
>> resolves the issue), but we definitely will need to work on root cause
>> analysis.
>
> I'd be grateful for any help figuring out the missing bits.
>
No, the problem lies in the peculiarities of NVMe error recovery.
The crucial point here is that during error recovery all I/O is aborted,
but after that the queue is _NOT_ stopped.
So it's perfectly legal to send I/O even if the controller is not fully
LIVE. And if _this_ I/O fails, we'll find ourselves in all sorts of
interesting situations.
I guess the problem lies with nvme_init_identify().
We're calling nvme_mpath_init() (which will start the ANATT timer), but
issue some more I/O afterwards.
If any of _that_ I/O fails, an error is returned to the caller (e.g.
nvme_tcp_configure_admin_queue()), _but_ the ANATT timer continues to
run. Then we'll be calling nvme_tcp_reconnect_or_remove() and delete the
controller, completely oblivious to the running ANATT timer.
Which will get quite unhappy once it triggers, as the controller
structure from which the timer was started is gone.
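Roughly, the failing sequence looks like this (simplified; the exact
call sites are from memory, so take them with a grain of salt):

  nvme_tcp_configure_admin_queue()
    nvme_init_identify()
      nvme_mpath_init()
        timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
        /* the timer may get armed while the ANA log is processed */
      ...
      /* some later admin command fails -> error returned upwards */
  nvme_tcp_reconnect_or_remove()
    /* schedules another reconnect or deletes the controller; either
     * way nobody calls nvme_mpath_stop() / del_timer_sync(), so the
     * pending anatt_timer survives and later fires on a controller
     * that is being torn down.                                     */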
So with this analysis an alternative fix would be to move
nvme_mpath_init() right to the end of nvme_init_identify(), such that
we'll never get into the situation where nvme_init_identify() fails yet
the ANATT timer is running.
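Something along these lines (an untested sketch only; the exact set and
order of calls in nvme_init_identify() is from memory):

  static int nvme_init_identify(struct nvme_ctrl *ctrl)
  {
          ...
          /* 1) issue all identify / get-log-page / set-features I/O
           *    that may fail, and propagate errors as we do today   */
          ...
          /* 2) only as the very last step, once nothing can fail any
           *    more, set up the multipath state and the ANATT timer */
          ret = nvme_mpath_init(ctrl, id);
          if (ret)
                  return ret;   /* error unwinding elided here */
          ...
  }

Calling nvme_mpath_stop() from the reconnect path, as this patch does,
would still be a sensible belt-and-braces measure, though.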
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)