[PATCH v3] nvme: rdma/tcp: fix list corruption with anatt timer

Martin Wilck mwilck at suse.com
Wed Apr 28 07:41:51 BST 2021


On Wed, 2021-04-28 at 08:35 +0200, Hannes Reinecke wrote:
> On 4/27/21 9:54 PM, Martin Wilck wrote:
> > On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
> > > On 4/27/21 6:25 PM, Christoph Hellwig wrote:
> > > > On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke
> > > > wrote:
> > > > > As indicated in my previous mail, please change the
> > > > > description.
> > > > > We have
> > > > > since established a actual reason (duplicate calls to
> > > > > add_timer()), so
> > > > > please list it here.
> > > > 
> > > > So what happens if the offending add_timer is changed to
> > > > mod_timer?
> > > > 
> > > I guess that should be fine, as the boilerplate said it can act
> > > as a safe version of add_timer.
> > > 
> > > But that would just solve the crash upon add_timer().
> > 
> > The code doesn't use add_timer(), only mod_timer() and
> > del_timer_sync(). And we didn't observe a crash upon add_timer().
> > What
> > we observed was that a timer had been enqueued multiple times, and
> > the
> > kernel crashes in expire_timers()->detach_timer(), when it
> > encounters
> > an already detached entry in the timer list.
> > 
> nvme_mpath_init() doesn't use add_timer, but it uses timer_setup().

Yes. The notion that this is wrong was the idea behind my patch.

> And
> calling that on an already pending timer is even worse :-)
> 
> And my point is that the anatt timer is not stopped at the end of
> nvme_init_identify() if any of the calls to
> 
> nvme_configure_apst()
> nvme_configure_timestamp()
> nvme_configure_directives()
> nvme_configure_acre()
> 
> returns with an error. If they do the controller is reset, causing
> eg nvme_tcp_configure_admin_queue() to be called, which will be
> calling timer_setup() with the original timer still running.
> If the (original) timer triggers _after_ that time we have the crash.

You are right. But afaics the problem doesn't have to originate in
these 4 function calls. The same applies even after
nvme_init_identify() has returned. Any error that would trigger the
error recovery work after the anatt timer has started would have this
effect.

Regards,
Martin





More information about the Linux-nvme mailing list