[PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue

Hannes Reinecke hare at suse.de
Sun Apr 25 12:34:51 BST 2021


On 4/23/21 3:38 PM, mwilck at suse.com wrote:
> From: Martin Wilck <mwilck at suse.com>
> 
> We have observed a few crashes in run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we see actually multiple ones attached to
> the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>    entry = {
>      next = 0xdead000000000122,  // LIST_POISON2
>      pprev = 0x0
>    },
>    expires = 4296022933,
>    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>    flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
> 
...

But isn't this the result of detach_timer()? I.e. this looks suspiciously
like perfectly normal operation; if you look at expire_timers(), we're
first calling detach_timer() before calling the timer function, i.e.
every crash in the timer function would have this signature.
And, incidentally, so would any timer function which does not crash.
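
To illustrate the ordering, here is a simplified userspace model of what
expire_timers() and detach_timer() do. It is a sketch only, not the kernel
source: locking, the debug hooks and the clear_pending argument are left
out, the LIST_POISON2 value assumes x86_64 with the default poison delta
(matching the dump above), and demo_timeout() and
fired_timer_is_poisoned() are hypothetical names made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: x86_64 value of LIST_POISON2, as seen in the crash dump. */
#define LIST_POISON2 ((struct hlist_node *)(uintptr_t)0xdead000000000122ULL)

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

/* Reduced timer_list: 'entry' must stay the first member (see cast below). */
struct timer_list {
	struct hlist_node entry;
	void (*function)(struct timer_list *);
};

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink the timer, then poison 'next' and clear 'pprev'. */
static void detach_timer(struct timer_list *timer)
{
	struct hlist_node *entry = &timer->entry;

	*entry->pprev = entry->next;
	if (entry->next)
		entry->next->pprev = entry->pprev;
	entry->pprev = NULL;
	entry->next = LIST_POISON2;
}

/* The timer is detached first; only then is its function called. */
static void expire_timers(struct hlist_head *head)
{
	while (head->first) {
		/* Valid because 'entry' is the first member of timer_list. */
		struct timer_list *timer = (struct timer_list *)head->first;

		detach_timer(timer);
		timer->function(timer);
	}
}

static int seen_poisoned;

/* Hypothetical callback: records what the entry looks like when it runs. */
static void demo_timeout(struct timer_list *t)
{
	seen_poisoned = (t->entry.next == LIST_POISON2 &&
			 t->entry.pprev == NULL);
}

/* Hypothetical helper: returns 1 if the callback saw the poisoned entry. */
static int fired_timer_is_poisoned(void)
{
	struct hlist_head head = { NULL };
	struct timer_list t = { .function = demo_timeout };

	seen_poisoned = 0;
	hlist_add_head(&t.entry, &head);
	expire_timers(&head);
	return seen_poisoned && head.first == NULL;
}
```

In this model every timer whose function is running, or has just finished
running, has next == LIST_POISON2 and pprev == NULL, which is the same
pattern as in the dump.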

Sorry to kill your analysis ...

This doesn't mean that the patch isn't valid (in the sense that it
resolves the issue), but we definitely will need to work on root cause
analysis.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


