[PATCH blktests v3] nvme/046: test queue count changes on reconnect
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Tue Sep 13 03:57:44 PDT 2022
On Sep 13, 2022 / 08:57, Daniel Wagner wrote:
> The target is allowed to change the number of I/O queues. Test if the
> host is able to reconnect in this scenario.
>
> Signed-off-by: Daniel Wagner <dwagner at suse.de>
> ---
> I've send a fix for the bug reported in nvmet_subsys_attr_qid_max_show()
Thanks!
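By the way, I assume the queue count change is driven through the
subsystem's attr_qid_max attribute in configfs (the attribute behind
nvmet_subsys_attr_qid_max_show() mentioned above). A rough sketch of the
idea -- the paths, values and the reset trigger below are only my
guesses, not taken from the patch:

    # Sketch only -- paths and values are assumptions, not from the patch.
    subsys=/sys/kernel/config/nvmet/subsystems/blktests-subsystem-1

    # Lower the number of I/O queues the target will grant...
    echo 2 > "${subsys}/attr_qid_max"

    # ...then force the host to reconnect so it renegotiates the queue
    # count, e.g. by resetting the controller.
    echo 1 > /sys/class/nvme-fabrics/ctl/nvme0/reset_controller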
>
> Though I have found another problem with this test:
>
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:.
> nvme nvme0: creating 8 I/O queues.
> nvme nvme0: mapped 8/0/0 default/read/poll queues.
> nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 10.161.61.228:4420
> nvme nvme0: starting error recovery
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.0.0-rc2+ #26 Not tainted
> ------------------------------------------------------
> kworker/6:6/17471 is trying to acquire lock:
> ffff888103d76440 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
>
> but task is already holding lock:
> ffffc90002497de0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x4ec/0xa90
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
> process_one_work+0x565/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> -> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
> __flush_workqueue+0x105/0x850
> nvmet_rdma_cm_handler+0xf9f/0x174e [nvmet_rdma]
> cma_cm_event_handler+0x77/0x2d0 [rdma_cm]
> cma_ib_req_handler+0xbd4/0x23c0 [rdma_cm]
> cm_process_work+0x2f/0x210 [ib_cm]
> cm_req_handler+0xf73/0x2020 [ib_cm]
> cm_work_handler+0x6d6/0x37c0 [ib_cm]
> process_one_work+0x5b6/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
> __mutex_lock+0x12d/0xe40
> cma_ib_req_handler+0x956/0x23c0 [rdma_cm]
> cm_process_work+0x2f/0x210 [ib_cm]
> cm_req_handler+0xf73/0x2020 [ib_cm]
> cm_work_handler+0x6d6/0x37c0 [ib_cm]
> process_one_work+0x5b6/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
> __lock_acquire+0x1a9d/0x2d90
> lock_acquire+0x15d/0x3f0
> __mutex_lock+0x12d/0xe40
> rdma_destroy_id+0x17/0x20 [rdma_cm]
> nvmet_rdma_free_queue+0x54/0x180 [nvmet_rdma]
> nvmet_rdma_release_queue_work+0x2c/0x70 [nvmet_rdma]
> process_one_work+0x5b6/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> other info that might help us debug this:
>
> Chain exists of:
> &id_priv->handler_mutex --> (wq_completion)nvmet-wq --> (work_completion)(&queue->release_work)
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock((work_completion)(&queue->release_work));
> lock((wq_completion)nvmet-wq);
> lock((work_completion)(&queue->release_work));
> lock(&id_priv->handler_mutex);
I also observe this WARNING on my test machine with nvme_trtype=rdma. It
looks like a hidden rdma issue to me...
>
> And there is the question how to handle the different kernel logs output:
>
> c740:~/blktests # nvme_trtype=tcp ./check nvme/046
> nvme/046 (Test queue count changes on reconnect) [passed]
> runtime 32.750s ... 32.340s
> c740:~/blktests # nvme_trtype=rdma ./check nvme/046
> nvme/046 (Test queue count changes on reconnect) [failed]
> runtime 32.340s ... 24.716s
> something found in dmesg:
> [307171.053929] run blktests nvme/046 at 2022-09-13 08:32:07
> [307171.884448] rdma_rxe: loaded
> [307171.965507] eth0 speed is unknown, defaulting to 1000
> [307171.967049] eth0 speed is unknown, defaulting to 1000
> [307171.968671] eth0 speed is unknown, defaulting to 1000
> [307172.057714] infiniband eth0_rxe: set active
> [307172.058544] eth0 speed is unknown, defaulting to 1000
> [307172.058630] infiniband eth0_rxe: added eth0
> [307172.107627] eth0 speed is unknown, defaulting to 1000
> [307172.230771] nvmet: adding nsid 1 to subsystem blktests-feature-detect
> ...
> (See '/root/blktests/results/nodev/nvme/046.dmesg' for the entire message)
> c740:~/blktests # nvme_trtype=fc ./check nvme/046
> nvme/046 (Test queue count changes on reconnect) [failed]
> runtime 24.716s ... 87.454s
> --- tests/nvme/046.out 2022-09-09 16:23:22.926123227 +0200
> +++ /root/blktests/results/nodev/nvme/046.out.bad 2022-09-13 08:35:03.716097654 +0200
> @@ -1,3 +1,89 @@
> Running nvme/046
> -NQN:blktests-subsystem-1 disconnected 1 controller(s)
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> ...
> (Run 'diff -u tests/nvme/046.out /root/blktests/results/nodev/nvme/046.out.bad' to see the entire diff)
>
> The grep error is something I can fix in the test but I don't know how
> to handle the 'eth0 speed is unknown' kernel message which will make
> the test always fail. Is it possible to ignore them when parsing the
> kernel log for errors?
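As for the grep errors, I guess the state polling can simply tolerate the
controller sysfs node disappearing during reconnect. A rough sketch (the
helper name and loop are my own, not from the patch; a timeout would be
needed in practice):

    # Hypothetical helper: treat a missing state file as "not ready yet"
    # instead of letting grep print errors into the test output.
    nvmf_wait_for_state() {
            local state_file="/sys/class/nvme-fabrics/ctl/$1/state"
            local state="$2"

            until [[ -r "$state_file" ]] && grep -q "$state" "$state_file"; do
                    sleep 1
            done
    }

    # e.g. wait until the controller reports "live" again
    nvmf_wait_for_state nvme0 live
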
FYI, each blktests test case can define DMESG_FILTER to avoid failing on
specific keywords in dmesg. Test cases meta/011 and block/028 are
reference use cases. Having said that, I don't think 'eth0 speed is
unknown' in dmesg makes the test case fail. See _check_dmesg() in the
"check" script, which lists the keywords that blktests treats as errors.
I guess the WARNING above is probably the failure cause.
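If those messages ever do need to be suppressed, a per-test filter could
look roughly like this (the expression is just an example of the pattern;
block/028 shows the real usage):

    # Hypothetical filter: drop the harmless rxe "speed is unknown"
    # messages before blktests scans dmesg for failure keywords.
    DMESG_FILTER="grep -v 'speed is unknown'"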
--
Shin'ichiro Kawasaki