[PATCH v3] nvme: rdma/tcp: fix list corruption with anatt timer

Hannes Reinecke hare at suse.de
Wed Apr 28 07:35:15 BST 2021


On 4/27/21 9:54 PM, Martin Wilck wrote:
> On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
>> On 4/27/21 6:25 PM, Christoph Hellwig wrote:
>>> On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
>>>> As indicated in my previous mail, please change the description.
>>>> We have since established an actual reason (duplicate calls to
>>>> add_timer()), so please list it here.
>>>
>>> So what happens if the offending add_timer is changed to mod_timer?
>>>
>> I guess that should be fine, as the boilerplate said it can act
>> as a safe version of add_timer.
>>
>> But that would just solve the crash upon add_timer().
> 
> The code doesn't use add_timer(), only mod_timer() and
> del_timer_sync(). And we didn't observe a crash upon add_timer(). What
> we observed was that a timer had been enqueued multiple times, and the
> kernel crashes in expire_timers()->detach_timer(), when it encounters
> an already detached entry in the timer list.
> 
nvme_mpath_init() doesn't use add_timer(), but it does call timer_setup().
And calling that on an already pending timer is even worse :-)

And my point is that the anatt timer is not stopped at the end of
nvme_init_identify() if any of the calls to

nvme_configure_apst()
nvme_configure_timestamp()
nvme_configure_directives()
nvme_configure_acre()

returns an error. If one of them does, the controller is reset, which
causes e.g. nvme_tcp_configure_admin_queue() to be called, and that in
turn ends up calling timer_setup() while the original timer is still
pending. If the (original) timer fires _after_ that point, we get the crash.
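
To illustrate why re-initialising a pending timer is harmful, here is a
rough userspace sketch. It is a toy model, not kernel code: all toy_*
names are made up for this example, and the two "buckets" only loosely
stand in for timer wheel lists. It assumes, as in the kernel, that
"pending" is derived from the entry's back pointer and that timer_setup()
clears that pointer without unlinking the entry.

```c
/* Userspace toy model only; none of the toy_* names are kernel APIs. */
#include <assert.h>
#include <stddef.h>

struct toy_timer {
	struct toy_timer *next;   /* next entry in the same bucket */
	struct toy_timer **pprev; /* pointer that points at us; NULL = not pending */
};

struct toy_bucket {
	struct toy_timer *first;
};

/* Loosely modelled on timer_setup(): forgets any list membership. */
static void toy_timer_setup(struct toy_timer *t)
{
	t->next = NULL;
	t->pprev = NULL;
}

static int toy_timer_pending(const struct toy_timer *t)
{
	return t->pprev != NULL;
}

/* Unlink a pending entry from whatever bucket it is on. */
static void toy_detach(struct toy_timer *t)
{
	*t->pprev = t->next;
	if (t->next)
		t->next->pprev = t->pprev;
	t->next = NULL;
	t->pprev = NULL;
}

/* Loosely modelled on del_timer_sync(): unlink only if pending. */
static void toy_del_timer(struct toy_timer *t)
{
	if (toy_timer_pending(t))
		toy_detach(t);
}

/* Loosely modelled on mod_timer(): unlink if pending, then enqueue. */
static void toy_mod_timer(struct toy_timer *t, struct toy_bucket *b)
{
	if (toy_timer_pending(t))
		toy_detach(t);
	t->next = b->first;
	if (b->first)
		b->first->pprev = &t->next;
	b->first = t;
	t->pprev = &b->first;
}

/* Count how often timer t is reachable from bucket b. */
static int toy_count(const struct toy_bucket *b, const struct toy_timer *t)
{
	int n = 0;

	for (const struct toy_timer *p = b->first; p; p = p->next)
		n += (p == t);
	return n;
}

/* Returns the number of bucket entries referencing the timer after re-init. */
int toy_reinit(int stop_first)
{
	struct toy_bucket a = { NULL }, b = { NULL };
	struct toy_timer t;

	toy_timer_setup(&t);
	toy_mod_timer(&t, &a);     /* first init path arms the timer */
	if (stop_first)
		toy_del_timer(&t); /* stop it before re-initialising */
	toy_timer_setup(&t);       /* reset path initialises it again */
	toy_mod_timer(&t, &b);     /* ... and arms it again */
	return toy_count(&a, &t) + toy_count(&b, &t);
}
```

In this model toy_reinit(0) returns 2: the old bucket still points at the
timer because the second setup hid the fact that it was pending, so the
same entry is enqueued twice. toy_reinit(1) returns 1, as the timer is
unlinked before it is set up again.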

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare at suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)