[bug report] blktests nvme/022 lead kernel WARNING and NULL pointer

Fri May 21 19:19:13 BST 2021

>>> What about this?
> 
> Hi Hannes
> With this patch, no WARNING/NULL pointer this time, but still have
> 'keep-alive timer expired' and reset failure issue, here is the full
> log:
> 
> # ./check nvme/022
> nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [failed]
>      runtime  10.646s  ...  11.087s
>      --- tests/nvme/022.out 2021-05-20 20:16:31.384068807 -0400
>      +++ /root/blktests/results/nodev/nvme/022.out.bad 2021-05-20
> 20:24:27.874250466 -0400
>      @@ -1,4 +1,5 @@
>       Running nvme/022
>       91fdba0d-f87b-4c25-b80f-db7be1418b9e
>       uuid.91fdba0d-f87b-4c25-b80f-db7be1418b9e
>      +ERROR: reset failed
>       Test complete
> # cat results/nodev/nvme/022.full
> Reset: Network dropped connection on reset
> NQN:blktests-subsystem-1 disconnected 1 controller(s)
> 
> [37353.068448] run blktests nvme/022 at 2021-05-20 20:24:16
> [37353.146301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [37353.161765] nvmet: creating controller 1 for subsystem
> blktests-subsystem-1 for NQN
> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> [37353.175796] nvme nvme0: creating 128 I/O queues.
> [37353.189734] nvme nvme0: new ctrl: "blktests-subsystem-1"
> [37354.216686] nvme nvme0: resetting controller
> [37363.270607] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [37363.276521] nvmet: ctrl 1 fatal error occurred!
> [37363.281058] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> 
> # ./check nvme/021
> nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
>      runtime  10.958s  ...  11.382s
> # dmesg
> [38142.862881] run blktests nvme/021 at 2021-05-20 20:37:26
> [38142.941038] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [38142.956621] nvmet: creating controller 1 for subsystem
> blktests-subsystem-1 for NQN
> nqn.2014-08.org.nvmexpress:uuid:6a70d220-bfde-1000-03ce-ea40b8730904.
> [38142.970524] nvme nvme0: creating 128 I/O queues.
> [38142.984356] nvme nvme0: new ctrl: "blktests-subsystem-1"
> [38144.014601] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> [38153.030107] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
> [38153.036018] nvmet: ctrl 1 fatal error occurred!

I think the main reason is that 128 queues are being created, and the
keep-alive timer expires during that time because the timeout is now
shorter (the default used to be 15 seconds, it is now 5).

nvmet only stops the keep-alive timer when the controller is freed,
which is pretty late in the sequence. It needs to be this way, because
if we stopped it sooner a host could die in the middle of a teardown
sequence and we would still need to detect that and clean up
ourselves. But maybe we can mod the keep-alive timer for every queue
we delete, just in case the host is not deleting them fast enough?

Ming, does this solve the issue you are seeing?
--

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 1853db38b682..f0715e9a4a9c 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -804,6 +804,7 @@ void nvmet_sq_destroy(struct nvmet_sq *sq)
         percpu_ref_exit(&sq->ref);

         if (ctrl) {
+               ctrl->cmd_seen = true;
                 nvmet_ctrl_put(ctrl);
                 sq->ctrl = NULL; /* allows reusing the queue later */
         }
--

We probably need to rename cmd_seen to extend_tbkas (extend
traffic-based keep-alive).
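
To illustrate the idea, here is a minimal userspace sketch (not kernel
code) of how marking activity on every queue deletion could keep the
timer from expiring during a long teardown. It assumes the timer handler
re-arms itself for another timeout period whenever the flag was set
since it last fired, which is my reading of the traffic-based keep-alive
logic. All model_* names, the millisecond timeline and the evenly spaced
queue deletions are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a target controller (not struct nvmet_ctrl). */
struct model_ctrl {
	unsigned int kato_ms;	/* keep-alive timeout */
	unsigned int expires;	/* time at which the timer fires next */
	bool extend_tbkas;	/* activity seen since the timer last fired */
	bool fatal;		/* set once the keep-alive timer really expires */
};

static void model_ctrl_init(struct model_ctrl *c, unsigned int kato_ms)
{
	assert(kato_ms > 0);
	c->kato_ms = kato_ms;
	c->expires = kato_ms;
	c->extend_tbkas = false;
	c->fatal = false;
}

/* Timer handler: re-arm if activity was seen, otherwise declare a fatal error. */
static void model_ka_timer(struct model_ctrl *c)
{
	if (c->extend_tbkas) {
		c->extend_tbkas = false;
		c->expires += c->kato_ms;
		return;
	}
	c->fatal = true;
}

/* Advance the model clock to 'now', running every timer expiry that is due. */
static void model_advance(struct model_ctrl *c, unsigned int now)
{
	while (!c->fatal && c->expires <= now)
		model_ka_timer(c);
}

/*
 * Delete nr_queues queues, one every step_ms milliseconds. When
 * mark_on_destroy is true, each deletion sets the flag, mirroring the
 * one-line change in the patch above. Returns true if the controller
 * got through the whole teardown without a keep-alive expiry.
 */
bool model_teardown_survives(unsigned int kato_ms, unsigned int nr_queues,
			     unsigned int step_ms, bool mark_on_destroy)
{
	struct model_ctrl c;
	unsigned int i;

	model_ctrl_init(&c, kato_ms);
	for (i = 1; i <= nr_queues; i++) {
		model_advance(&c, i * step_ms);
		if (c.fatal)
			return false;
		if (mark_on_destroy)
			c.extend_tbkas = true;
	}
	return true;
}
```

With a 5 second timeout and 128 queues deleted 70 ms apart (about 9
seconds in total, roughly what the log shows),
model_teardown_survives(5000, 128, 70, false) returns false while the
same call with mark_on_destroy set to true returns true. A host that
stalls for longer than the timeout between two deletions is still
caught, e.g. model_teardown_survives(5000, 2, 6000, true) returns false.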