[bug report] blktests nvme/022 lead kernel WARNING and NULL pointer

Hannes Reinecke hare at suse.de
Sat May 22 15:59:06 BST 2021


On 5/22/21 2:12 AM, Yi Zhang wrote:
> On Sat, May 22, 2021 at 2:19 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>>
>>
>>>>> What about this?
>>>
>>> Hi Hannes
>>> With this patch there is no WARNING/NULL pointer this time, but I still
>>> see the 'keep-alive timer expired' and reset failure issues. Here is the
>>> full log:
>>>
>>> # ./check nvme/022
>>> nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [failed]
>>>       runtime  10.646s  ...  11.087s
>>>       --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400
>>>       +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20 20:24:27.874250466 -0400
>>>       @@ -1,4 +1,5 @@
>>>        Running nvme/022
>>>        91fdba0d-f87b-4c25-b80f-db7be1418b9e
>>>        uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e
>>>       +ERROR: reset failed
>>>        Test complete
>>> # cat results/nodev/nvme/022.full
>>> Reset: Network dropped connection on reset
>>> NQN:blktests-subsystem-1 disconnected 1 controller(s)
>>>
>>> [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16
>>> [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [37353.161765] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
>>> [37353.175796] nvme nvme0: creating 128 I/O queues.
>>> [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1"
>>> [37354.216686] nvme nvme0: resetting controller
>>> [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
>>> [37363.276521] nvmet: ctrl 1 fatal error occurred!
>>> [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
>>>
>>> # ./check nvme/021
>>> nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
>>>       runtime  10.958s  ...  11.382s
>>> # dmesg
>>> [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26
>>> [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [38142.956621] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
>>> [38142.970524] nvme nvme0: creating 128 I/O queues.
>>> [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1"
>>> [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
>>> [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
>>> [38153.036018] nvmet: ctrl 1 fatal error occurred!
>>
>> I think the main reason is that 128 queues are being created, and the
>> keep-alive timer ends up expiring during that time because it is now
>> shorter (it used to be 15 seconds, now 5 by default).
>>
>> nvmet only stops the keep-alive timer when the controller is freed,
>> which is pretty late in the sequence. It needs to be this way, because
>> if we shut it down sooner, a host can die in the middle of a teardown
>> sequence and we still need to detect that and clean up ourselves. But
>> maybe we can mod the keep-alive timer for every queue we delete, just
>> in case the host is not deleting fast enough?
>>
>> Ming, does this solve the issue you are seeing?
> 
> Hi Sagi
> The issue was fixed by this patch. :)
> 
>> --
>> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
>> index 1853db38b682..f0715e9a4a9c 100644
>> --- a/drivers/nvme/target/core.c
>> +++ b/drivers/nvme/target/core.c
>> @@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq)
>>           percpu_ref_exit(&sq->ref);
>>
>>           if (ctrl) {
>> +               ctrl->cmd_seen = true;
>>                   nvmet_ctrl_put(ctrl);
>>                   sq->ctrl = NULL; /* allows reusing the queue later */
>>           }
>> --
>>
>> We probably need to rename cmd_seen to extend_tbkas (extend traffic
>> based keep-alive).
>>
> 
> 
Thanks for the confirmation.

I'll send a proper patchset.
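
For anyone following the thread, the idea behind the one-line change
quoted above can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not the nvmet code: the struct
and every model_* name are hypothetical, time is counted in abstract
whole seconds, and the timer is polled rather than scheduled as delayed
work. It shows a keep-alive timer that re-arms itself when activity was
recorded since it was last armed, and a queue-teardown step that
optionally records such activity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a target-side controller (not struct nvmet_ctrl). */
struct model_ctrl {
	unsigned int kato;     /* keep-alive timeout, in seconds */
	unsigned int deadline; /* time at which the keep-alive timer fires */
	bool cmd_seen;         /* activity recorded since the timer last fired */
	bool fatal;            /* set once the timer expires without activity */
};

static void model_ctrl_init(struct model_ctrl *c, unsigned int kato,
			    unsigned int now)
{
	assert(kato > 0);
	c->kato = kato;
	c->deadline = now + kato;
	c->cmd_seen = false;
	c->fatal = false;
}

/* Timer callback: re-arm if activity was seen, otherwise flag a fatal error. */
static void model_keep_alive_timer(struct model_ctrl *c, unsigned int now)
{
	bool seen = c->cmd_seen;

	c->cmd_seen = false;
	if (seen) {
		c->deadline = now + c->kato;
		return;
	}
	c->fatal = true;
}

/* Queue teardown; mark_activity models the proposed one-line change. */
static void model_sq_destroy(struct model_ctrl *c, bool mark_activity)
{
	if (mark_activity)
		c->cmd_seen = true;
}

/*
 * Tear down nr_queues queues, each taking secs_per_queue seconds.
 * Returns true if the controller never hit a keep-alive expiry.
 */
static bool model_teardown(unsigned int kato, unsigned int nr_queues,
			   unsigned int secs_per_queue, bool mark_activity)
{
	struct model_ctrl c;
	unsigned int now = 0;

	model_ctrl_init(&c, kato, now);
	for (unsigned int q = 0; q < nr_queues; q++) {
		for (unsigned int s = 0; s < secs_per_queue; s++) {
			now++;
			if (now >= c.deadline)
				model_keep_alive_timer(&c, now);
			if (c.fatal)
				return false;
		}
		model_sq_destroy(&c, mark_activity);
	}
	return true;
}
```

In this model, model_teardown(5, 10, 1, false) returns false because
the timer expires while queues are still being deleted, whereas
model_teardown(5, 10, 1, true) returns true. A host that stalls for
longer than the timeout is still caught: model_teardown(5, 2, 12, true)
returns false, which is the property Sagi wants to preserve.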

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer



More information about the Linux-nvme mailing list