[bug report] blktests nvme/022 lead kernel WARNING and NULL pointer

Yi Zhang yi.zhang at redhat.com
Sat May 22 01:12:22 BST 2021


On Sat, May 22, 2021 at 2:19 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> >>> What about this?
> >
> > Hi Hannes
> > With this patch, there is no WARNING/NULL pointer this time, but I still
> > see the 'keep-alive timer expired' and reset failure issues. Here is the
> > full log:
> >
> > # ./check nvme/022
> > nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [failed]
> >      runtime  10.646s  ...  11.087s
> >      --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400
> >      +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20 20:24:27.874250466 -0400
> >      @@ -1,4 +1,5 @@
> >       Running nvme/022
> >       91fdba0d-f87b-4c25-b80f-db7be1418b9e
> >       uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e
> >      +ERROR: reset failed
> >       Test complete
> > # cat results/nodev/nvme/022.full
> > Reset: Network dropped connection on reset
> > NQN:blktests-subsystem-1 disconnected 1 controller(s)
> >
> > [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16
> > [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [37353.161765] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> > [37353.175796] nvme nvme0: creating 128 I/O queues.
> > [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1"
> > [37354.216686] nvme nvme0: resetting controller
> > [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> > [37363.276521] nvmet: ctrl 1 fatal error occurred!
> > [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> >
> > # ./check nvme/021
> > nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
> >      runtime  10.958s  ...  11.382s
> > # dmesg
> > [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26
> > [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [38142.956621] nvmet: creating controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> > [38142.970524] nvme nvme0: creating 128 I/O queues.
> > [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1"
> > [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> > [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> > [38153.036018] nvmet: ctrl 1 fatal error occurred!
>
> I think that the main reason is that there are 128 queues that are being
> created, and during that time the keep alive timer ends up expiring as
> it is shorter (used to be 15 seconds, now 5 by default).
>
> nvmet only stops the keep-alive timer when the controller is freed,
> which is pretty late in the sequence. The problem is that it needs to
> be this way because if we shut it down sooner a host can die in the
> middle of a teardown sequence and we still need to detect that and
> clean up ourselves. But maybe we can mod the keep-alive timer for
> every queue we delete, just in the case the host is not deleting
> fast enough?
>
> Ming, does this solve the issue you are seeing?

Hi Sagi
The issue was fixed by this patch. :)

> --
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index 1853db38b682..f0715e9a4a9c 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq)
>          percpu_ref_exit(&sq->ref);
>
>          if (ctrl) {
> +               ctrl->cmd_seen = true;
>                  nvmet_ctrl_put(ctrl);
>                  sq->ctrl = NULL; /* allows reusing the queue later */
>          }
> --
>
> We probably need to rename cmd_seen to extend_tbkas (extend traffic
> based keep-alive).
>
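
For anyone following the thread, here is a minimal userspace sketch of why
the one-line change helps. It is only a model, not kernel code: the
model_* names and the one-tick-per-second clock are made up for
illustration, and it assumes the keep-alive timer handler re-arms itself
when cmd_seen is set and otherwise declares a fatal error. Each queue
teardown sets cmd_seen, so a slow but live teardown keeps extending the
timer, while a host that stalls for longer than kato is still detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a target controller; NOT the kernel's struct nvmet_ctrl. */
struct model_ctrl {
	unsigned int kato;     /* keep-alive timeout, in simulated seconds */
	unsigned int deadline; /* simulated time at which the timer fires next */
	bool cmd_seen;         /* traffic (or a queue teardown) seen since last expiry */
	bool fatal;            /* set once the keep-alive timer really expires */
};

static void model_ctrl_init(struct model_ctrl *ctrl, unsigned int kato)
{
	assert(ctrl != NULL);
	ctrl->kato = kato;
	ctrl->deadline = kato; /* the simulated clock starts at 0 */
	ctrl->cmd_seen = false;
	ctrl->fatal = false;
}

/* Models the patched queue teardown: mark activity on the controller. */
static void model_sq_destroy(struct model_ctrl *ctrl)
{
	ctrl->cmd_seen = true;
}

/* Assumed timer behaviour: re-arm if activity was seen, else fatal error. */
static void model_keep_alive_timer(struct model_ctrl *ctrl, unsigned int now)
{
	bool seen = ctrl->cmd_seen;

	ctrl->cmd_seen = false;
	if (seen) {
		ctrl->deadline = now + ctrl->kato;
		return;
	}
	ctrl->fatal = true;
}

/*
 * Tear down nr_queues queues, each taking secs_per_queue simulated seconds.
 * Returns true if the controller survives the whole teardown.
 */
static bool model_teardown(unsigned int kato, unsigned int nr_queues,
			   unsigned int secs_per_queue, bool patched)
{
	struct model_ctrl ctrl;
	unsigned int now = 0;

	model_ctrl_init(&ctrl, kato);
	for (unsigned int q = 0; q < nr_queues && !ctrl.fatal; q++) {
		for (unsigned int s = 0; s < secs_per_queue && !ctrl.fatal; s++) {
			now++;
			if (now >= ctrl.deadline)
				model_keep_alive_timer(&ctrl, now);
		}
		if (patched)
			model_sq_destroy(&ctrl);
	}
	return !ctrl.fatal;
}
```

With kato = 5 and 128 queues at one simulated second each, the unpatched
model hits the fatal path at the first expiry, while the patched one
survives. A host that needs 6 seconds for a single queue still trips the
timer even with the patch, which matches the requirement that a dead host
is detected during teardown.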


-- 
Best Regards,
  Yi Zhang



