[PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue

Martin Wilck mwilck at suse.com
Mon Apr 26 16:18:37 BST 2021


On Mon, 2021-04-26 at 10:31 +0800, Chao Leng wrote:
> 
> 
> On 2021/4/25 19:34, Hannes Reinecke wrote:
> > On 4/23/21 3:38 PM, mwilck at suse.com wrote:
> > > 
> > > If such a timer is encountered in run_timer_softirq(), the kernel
> > > crashes. The test scenario was an I/O load test with lots of NVMe
> > > controllers, some of which were removed and re-added on the storage
> > > side.
> > > 
> > ...
> > 
> > But isn't this the result of detach_timer()? I.e. this suspiciously
> > looks like perfectly normal operation; if you look at expire_timers()
> > we're first calling 'detach_timer()' before calling the timer
> > function, i.e. every crash in the timer function would have this
> > signature.
> > And, incidentally, so would any timer function which does not crash.
> > 
> > Sorry to kill your analysis ...
> > 
> > This doesn't mean that the patch isn't valid (in the sense that it
> > resolves the issue), but we definitely will need to work on root cause
> > analysis.
> The process may be: 1. ana_work adds the timer; 2. error recovery
> occurs, and during reconnecting the timer is reinitialized and
> nvme_read_ana_log is called, which may add the timer again.
> The same timer is added twice, and the crash happens later.

Thanks a lot! I think you nailed it.

I had another look at my dump with this in mind, and indeed I was able
to prove that the timer of the same NVMe controller had been started
(at least) twice. This explains why it looked as if it had been cleanly
detached, yet was still running nonetheless.
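
To spell the race out (just a minimal sketch with placeholder names,
not the actual nvme code): the multipath init code runs on every
(re)connect, re-initializes a timer that may still be pending from the
previous connection, and then arms it again.

#include <linux/timer.h>
#include <linux/jiffies.h>

/* Sketch only -- placeholder names, not the real nvme structures. */
static struct timer_list anatt_timer;

static void anatt_timeout(struct timer_list *t)
{
	/* would reset the controller in the real driver */
}

/* runs on the initial connect *and* again from the reconnect work */
static void mpath_init_sketch(void)
{
	/*
	 * If anatt_timer is still pending from the previous connection,
	 * timer_setup() clears its pending state while the timer wheel
	 * still points at it ...
	 */
	timer_setup(&anatt_timer, anatt_timeout, 0);

	/*
	 * ... and the next mod_timer() (in the real code triggered by
	 * reading the ANA log) sees a "not pending" timer and enqueues
	 * it a second time, leaving a stale reference behind in the
	 * old timer_base list.
	 */
	mod_timer(&anatt_timer, jiffies + 120 * HZ);
}

Once that has happened, the crash can hit much later in
run_timer_softirq(), possibly while expiring a completely unrelated
timer on the corrupted list -- which matches what I saw in the dump.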

In particular, I saw this:

[ 4389.939928] nvme nvme36: queue_size 128 > ctrl sqsize 32, clamping down
[ 4389.962709] nvme nvme36: creating 4 I/O queues.
[ 4389.978462] nvme nvme36: starting error recovery
[ 4389.978509] nvme nvme36: failed to send request -32
[ 4389.978724] nvme nvme36: Failed to configure AEN (cfg 900)
[ 4389.978732] nvme nvme36: Successfully reconnected (1 attempt)
[ 4389.978739] nvme nvme36: failed to send request -32
[ 4389.991420] nvme nvme36: Reconnecting in 10 seconds...
...
[ 4441.399435] nvme nvme36: queue_size 128 > ctrl sqsize 32, clamping down
[ 4441.407647] nvme nvme36: creating 4 I/O queues.
[ 4441.445655] nvme nvme36: Successfully reconnected (1 attempt)
[ 4441.699665] nvme nvme36: starting error recovery
[ 4441.719493] nvme nvme36: Identify Descriptors failed (881)
[ 4441.735409] nvme nvme36: Reconnecting in 10 seconds...
...
[ 4510.891400] nvme nvme36: ANATT timeout, resetting controller.
...
[ 4517.035411] general protection fault: 0000 [#1] SMP PTI

The protection fault itself was caused by the anatt timer of another
nvme controller. But I could see the anatt_timer of nvme36 still linked
to an active timer_base, about to expire 44s in the future. Calculating
back, that pending timer must have been started at around 4441s (see
above), whereas the timer that expired a few seconds earlier must have
been started at around 4389.9s.
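
For completeness, the shape of the fix (again just a sketch, not the
literal diff): stop a possibly pending ANATT timer before the init path
can re-initialize and re-arm it. Calling nvme_mpath_stop() from the
reconnect work achieves that, as it does (among other things) a
del_timer_sync() on the anatt timer.

/* sketch of the fix idea, using the placeholder names from above */
static void reconnect_work_sketch(void)
{
	/*
	 * Remove a pending timer and wait for a running callback to
	 * finish before the init path touches the timer again, so the
	 * subsequent mod_timer() operates on a cleanly detached timer.
	 */
	del_timer_sync(&anatt_timer);

	mpath_init_sketch();
}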

Regards,
Martin




