[PATCH blktests v3] nvme/046: test queue count changes on reconnect
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Tue Sep 13 03:57:44 PDT 2022
On Sep 13, 2022 / 08:57, Daniel Wagner wrote:
> The target is allowed to change the number of I/O queues. Test if the
> host is able to reconnect in this scenario.
>
> Signed-off-by: Daniel Wagner <dwagner at suse.de>
> ---
> I've send a fix for the bug reported in nvmet_subsys_attr_qid_max_show()
Thanks!
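By the way, I assume the queue count change is driven through the
subsystem's attr_qid_max attribute in configfs (the attribute behind
nvmet_subsys_attr_qid_max_show() mentioned above). A rough sketch of the
idea -- the paths, values and the reset trigger below are only my
guesses, not taken from the patch:

    # Sketch only -- paths and values are assumptions, not from the patch.
    subsys=/sys/kernel/config/nvmet/subsystems/blktests-subsystem-1

    # Lower the number of I/O queues the target will grant...
    echo 2 > "${subsys}/attr_qid_max"

    # ...then force the host to reconnect so it renegotiates the queue
    # count, e.g. by resetting the controller.
    echo 1 > /sys/class/nvme-fabrics/ctl/nvme0/reset_controller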
>
> Though I have found another problem with this test:
>
> nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:.
> nvme nvme0: creating 8 I/O queues.
> nvme nvme0: mapped 8/0/0 default/read/poll queues.
> nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 10.161.61.228:4420
> nvme nvme0: starting error recovery
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 6.0.0-rc2+ #26 Not tainted
> ------------------------------------------------------
> kworker/6:6/17471 is trying to acquire lock:
> ffff888103d76440 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
>
> but task is already holding lock:
> ffffc90002497de0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x4ec/0xa90
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
> process_one_work+0x565/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> -> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
> __flush_workqueue+0x105/0x850
> nvmet_rdma_cm_handler+0xf9f/0x174e [nvmet_rdma]
> cma_cm_event_handler+0x77/0x2d0 [rdma_cm]
> cma_ib_req_handler+0xbd4/0x23c0 [rdma_cm]
> cm_process_work+0x2f/0x210 [ib_cm]
> cm_req_handler+0xf73/0x2020 [ib_cm]
> cm_work_handler+0x6d6/0x37c0 [ib_cm]
> process_one_work+0x5b6/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
> __mutex_lock+0x12d/0xe40
> cma_ib_req_handler+0x956/0x23c0 [rdma_cm]
> cm_process_work+0x2f/0x210 [ib_cm]
> cm_req_handler+0xf73/0x2020 [ib_cm]
> cm_work_handler+0x6d6/0x37c0 [ib_cm]
> process_one_work+0x5b6/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
> __lock_acquire+0x1a9d/0x2d90
> lock_acquire+0x15d/0x3f0
> __mutex_lock+0x12d/0xe40
> rdma_destroy_id+0x17/0x20 [rdma_cm]
> nvmet_rdma_free_queue+0x54/0x180 [nvmet_rdma]
> nvmet_rdma_release_queue_work+0x2c/0x70 [nvmet_rdma]
> process_one_work+0x5b6/0xa90
> worker_thread+0x2c0/0x710
> kthread+0x16c/0x1a0
> ret_from_fork+0x1f/0x30
>
> other info that might help us debug this:
>
> Chain exists of:
> &id_priv->handler_mutex --> (wq_completion)nvmet-wq --> (work_completion)(&queue->release_work)
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock((work_completion)(&queue->release_work));
> lock((wq_completion)nvmet-wq);
> lock((work_completion)(&queue->release_work));
> lock(&id_priv->handler_mutex);
I also observe this WARNING on my test machine with nvme_trtype=rdma. It
looks like a hidden rdma issue to me...
>
> And there is the question how to handle the different kernel logs output:
>
> c740:~/blktests # nvme_trtype=tcp ./check nvme/046
> nvme/046 (Test queue count changes on reconnect) [passed]
> runtime 32.750s ... 32.340s
> c740:~/blktests # nvme_trtype=rdma ./check nvme/046
> nvme/046 (Test queue count changes on reconnect) [failed]
> runtime 32.340s ... 24.716s
> something found in dmesg:
> [307171.053929] run blktests nvme/046 at 2022-09-13 08:32:07
> [307171.884448] rdma_rxe: loaded
> [307171.965507] eth0 speed is unknown, defaulting to 1000
> [307171.967049] eth0 speed is unknown, defaulting to 1000
> [307171.968671] eth0 speed is unknown, defaulting to 1000
> [307172.057714] infiniband eth0_rxe: set active
> [307172.058544] eth0 speed is unknown, defaulting to 1000
> [307172.058630] infiniband eth0_rxe: added eth0
> [307172.107627] eth0 speed is unknown, defaulting to 1000
> [307172.230771] nvmet: adding nsid 1 to subsystem blktests-feature-detect
> ...
> (See '/root/blktests/results/nodev/nvme/046.dmesg' for the entire message)
> c740:~/blktests # nvme_trtype=fc ./check nvme/046
> nvme/046 (Test queue count changes on reconnect) [failed]
> runtime 24.716s ... 87.454s
> --- tests/nvme/046.out 2022-09-09 16:23:22.926123227 +0200
> +++ /root/blktests/results/nodev/nvme/046.out.bad 2022-09-13 08:35:03.716097654 +0200
> @@ -1,3 +1,89 @@
> Running nvme/046
> -NQN:blktests-subsystem-1 disconnected 1 controller(s)
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> +grep: /sys/class/nvme-fabrics/ctl/nvme0/state: No such file or directory
> ...
> (Run 'diff -u tests/nvme/046.out /root/blktests/results/nodev/nvme/046.out.bad' to see the entire diff)
>
> The grep error is something I can fix in the test but I don't know how
> to handle the 'eth0 speed is unknown' kernel message which will make
> the test always fail. Is it possible to ignore them when parsing the
> kernel log for errors?
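As for the grep errors, I guess the state polling can simply tolerate the
controller sysfs node disappearing during reconnect. A rough sketch (the
helper name and loop are my own, not from the patch; a timeout would be
needed in practice):

    # Hypothetical helper: treat a missing state file as "not ready yet"
    # instead of letting grep print errors into the test output.
    nvmf_wait_for_state() {
            local state_file="/sys/class/nvme-fabrics/ctl/$1/state"
            local state="$2"

            until [[ -r "$state_file" ]] && grep -q "$state" "$state_file"; do
                    sleep 1
            done
    }

    # e.g. wait until the controller reports "live" again
    nvmf_wait_for_state nvme0 live
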
FYI, each blktests test case can define DMESG_FILTER to avoid failing on
specific keywords in dmesg. Test cases meta/011 and block/028 are
reference use cases. Having said that, I don't think 'eth0 speed is
unknown' in dmesg makes the test case fail. See _check_dmesg() in the
"check" script, which lists the keywords that blktests treats as errors.
I guess the WARNING above is probably the failure cause.
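If those messages ever do need to be suppressed, a per-test filter could
look roughly like this (the expression is just an example of the pattern;
block/028 shows the real usage):

    # Hypothetical filter: drop the harmless rxe "speed is unknown"
    # messages before blktests scans dmesg for failure keywords.
    DMESG_FILTER="grep -v 'speed is unknown'"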
--
Shin'ichiro Kawasaki