blktests failures with v6.9-rc1 kernel
Shinichiro Kawasaki
shinichiro.kawasaki@wdc.com
Thu Mar 28 22:01:15 PDT 2024
Hi all,
(I added the nbd list to the To list since an nbd test case failure is reported.)
I ran the latest blktests (git hash: 607513e64e48) with the v6.9-rc1 kernel and
observed four failures. I also checked the CKI project blktests runs with the
v6.9-rc1 kernel and found one more failure. In total, five failure symptoms
were observed, as listed below.
One notable difference from the v6.8 kernel [1] is block/011: its failure with
NVMe devices is no longer observed. However, exactly the same failure symptom
is now observed with another test case, nvme/050. I guess some kernel change
between v6.8 and v6.9-rc1 caused this symptom change.
[1] https://lore.kernel.org/linux-block/nnphuvhpwdqjwi3mdisom7iuj2qnxkf4ldzzzfzy63bfet6mas@2jcd4jzzzteu/
List of failures
================
#1: nvme/041,044 (fc transport)
#2: nvme/050
#3: srp/002, 011 (rdma_rxe driver)
#4: nbd/002
#5: zbd/010 (CKI failure)
Failure description
===================
#1: nvme/041,044 (fc transport)
With the trtype=fc configuration, nvme/041 and 044 fail with similar
error messages:
nvme/041 (Create authenticated connections) [failed]
runtime 2.677s ... 4.823s
--- tests/nvme/041.out 2023-11-29 12:57:17.206898664 +0900
+++ /home/shin/Blktests/blktests/results/nodev/nvme/041.out.bad 2024-03-19 14:50:56.399101323 +0900
@@ -2,5 +2,5 @@
Test unauthenticated connection (should fail)
disconnected 0 controller(s)
Test authenticated connection
-disconnected 1 controller(s)
+disconnected 0 controller(s)
Test complete
nvme/044 (Test bi-directional authentication) [failed]
runtime 4.740s ... 7.482s
--- tests/nvme/044.out 2023-11-29 12:57:17.212898647 +0900
+++ /home/shin/Blktests/blktests/results/nodev/nvme/044.out.bad 2024-03-19 14:51:08.062067741 +0900
@@ -4,7 +4,7 @@
Test invalid ctrl authentication (should fail)
disconnected 0 controller(s)
Test valid ctrl authentication
-disconnected 1 controller(s)
+disconnected 0 controller(s)
Test invalid ctrl key (should fail)
disconnected 0 controller(s)
...
(Run 'diff -u tests/nvme/044.out /home/shin/Blktests/blktests/results/nodev/nvme/044.out.bad' to see the entire diff)
A solution was suggested and discussed in Feb/2024 [2].
[2] https://lore.kernel.org/linux-nvme/20240221132404.6311-1-dwagner@suse.de/
#2: nvme/050
The test case fails occasionally with a QEMU NVME device. The failure cause
is a lockdep WARN about a circular dependency among ctrl->namespaces_rwsem,
dev->shutdown_lock and a workqueue work completion [9]. This WARN is the same
as the one observed with block/011 on kernels from v6.0 to v6.8, which was
reported in Sep/2022 [3].
[3] https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/
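To illustrate the cycle in [9], here is a minimal user-space sketch I wrote;
the pthread locks and the join are only stand-ins for the kernel primitives,
not the kernel code itself. namespaces_mtx stands in for
ctrl->namespaces_rwsem, shutdown_mtx for dev->shutdown_lock, and
pthread_join() for cancel_work_sync(&q->timeout_work). Note the kernel takes
the rwsem for read on both paths, so the real deadlock also needs a writer
queued on the rwsem; a plain mutex just makes the hang deterministic here.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t namespaces_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t shutdown_mtx = PTHREAD_MUTEX_INITIALIZER;

/* blk_mq_timeout_work -> nvme_timeout -> nvme_dev_disable */
static void *timeout_work(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&shutdown_mtx);	/* dep #1 in the trace */
	pthread_mutex_lock(&namespaces_mtx);	/* dep #2: nvme_start_freeze */
	pthread_mutex_unlock(&namespaces_mtx);
	pthread_mutex_unlock(&shutdown_mtx);
	return NULL;
}

int main(void)
{
	pthread_t t;

	/* nvme_reset_work -> nvme_sync_queues takes the rwsem first... */
	pthread_mutex_lock(&namespaces_mtx);
	pthread_create(&t, NULL, timeout_work, NULL);
	/*
	 * ...then flushes the timeout work (dep #0): we wait for a
	 * worker that cannot finish without the lock we hold, so this
	 * join never returns.
	 */
	pthread_join(t, NULL);
	pthread_mutex_unlock(&namespaces_mtx);
	printf("unreachable\n");
	return 0;
}

Built with "cc -pthread sketch.c", the program hangs by construction; in the
kernel, lockdep flags the same cycle before it can bite.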
#3: srp/002, 011 (rdma_rxe driver)
A test process hang is observed occasionally. It was reported to the relevant
mailing lists in Aug/2023 [4]. Blktests was modified to change the default
driver from rdma_rxe to siw to avoid impact on blktests users, but the root
cause is not yet understood.
[4] https://lore.kernel.org/linux-rdma/18a3ae8c-145b-4c7f-a8f5-67840feeb98c@acm.org/T/#mee9882c2cfd0cfff33caa04e75418576f4c7a789
#4: nbd/002
The test case fails occasionally. CKI also reported this failure for the v6.8
kernel in the past. A blktests patch reduces the failure ratio, but does not
avoid the failure completely [5]. Further investigation is needed.
nbd/002 (tests on partition handling for an nbd device) [failed]
runtime ... 0.414s
--- tests/nbd/002.out 2024-02-19 19:25:07.453721466 +0900
+++ /home/shin/kts/kernel-test-suite/sets/blktests/log/runlog/nodev/nbd/002.out.bad 2024-03-19 14:53:56.320218177 +0900
@@ -1,4 +1,4 @@
Running nbd/002
Testing IOCTL path
Testing the netlink path
-Test complete
+Didn't have partition on the netlink path
[5] https://lore.kernel.org/linux-block/lecgttcqtjsolotcowjvqwsyrphe7fucllkrzupc5guf2gbq5z@grmwusovo22b/
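For context, the failure message above means the partition node did not show
up on the netlink path by the time the test checked for it. Below is a purely
illustrative user-space sketch of tolerating such a delay; the helper name
wait_for_partition() and the node path /dev/nbd0p1 are my assumptions for
illustration, and this is not the actual blktests change in [5].

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: poll for the partition node instead of checking
 * it once, since udev may create it a moment after the netlink
 * configuration returns.
 */
static int wait_for_partition(const char *path, int timeout_ms)
{
	struct stat st;
	int waited;

	for (waited = 0; waited < timeout_ms; waited += 100) {
		if (stat(path, &st) == 0)
			return 0;	/* node exists */
		usleep(100 * 1000);
	}
	return -1;			/* timed out */
}

int main(void)
{
	/* /dev/nbd0p1 stands in for the partition node the test checks */
	return wait_for_partition("/dev/nbd0p1", 5000) ? 1 : 0;
}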
#5: zbd/010 (CKI failure)
CKI observed the failure [6], and Yi Zhang reported it to the relevant mailing
lists [7]. Chao Yu provided a fix patch for F2FS [8].
[6] https://datawarehouse.cki-project.org/issue/2508
[7] https://lore.kernel.org/linux-f2fs-devel/CAHj4cs-kfojYC9i0G73PRkYzcxCTex=-vugRFeP40g_URGvnfQ@mail.gmail.com/
[8] https://lore.kernel.org/linux-f2fs-devel/20240327074223.2216487-1-chao@kernel.org/
[9]
[ 1907.056570] ======================================================
[ 1907.058993] WARNING: possible circular locking dependency detected
[ 1907.061446] 6.9.0-rc1 #158 Not tainted
[ 1907.063639] ------------------------------------------------------
[ 1907.066017] kworker/u16:3/2996 is trying to acquire lock:
[ 1907.068423] ffff888120ab39b8 ((work_completion)(&q->timeout_work)){+.+.}-{0:0}, at: __flush_work+0x48f/0xa40
[ 1907.071311]
but task is already holding lock:
[ 1907.075601] ffff8881389f44d8 (&ctrl->namespaces_rwsem){++++}-{3:3}, at: nvme_sync_queues+0x26/0x100 [nvme_core]
[ 1907.079089]
which lock already depends on the new lock.
[ 1907.085827]
the existing dependency chain (in reverse order) is:
[ 1907.090475]
-> #2 (&ctrl->namespaces_rwsem){++++}-{3:3}:
[ 1907.094962] down_read+0x9d/0x480
[ 1907.097356] nvme_start_freeze+0x3d/0xd0 [nvme_core]
[ 1907.099900] nvme_dev_disable+0x57c/0x6b0 [nvme]
[ 1907.102365] nvme_timeout+0x8a9/0x11f0 [nvme]
[ 1907.104854] blk_mq_handle_expired+0x1a4/0x270
[ 1907.107065] bt_iter+0x22a/0x310
[ 1907.109108] blk_mq_queue_tag_busy_iter+0x809/0x1380
[ 1907.111374] blk_mq_timeout_work+0x3b9/0x550
[ 1907.113582] process_one_work+0x802/0x1720
[ 1907.115707] worker_thread+0x720/0x1300
[ 1907.117805] kthread+0x2ec/0x3d0
[ 1907.119851] ret_from_fork+0x30/0x70
[ 1907.121981] ret_from_fork_asm+0x1a/0x30
[ 1907.124167]
-> #1 (&dev->shutdown_lock){+.+.}-{3:3}:
[ 1907.128058] __mutex_lock+0x18d/0x1480
[ 1907.130117] nvme_dev_disable+0x9b/0x6b0 [nvme]
[ 1907.132239] nvme_timeout+0x8a9/0x11f0 [nvme]
[ 1907.134383] blk_mq_handle_expired+0x1a4/0x270
[ 1907.136500] bt_iter+0x22a/0x310
[ 1907.138393] blk_mq_queue_tag_busy_iter+0x809/0x1380
[ 1907.140515] blk_mq_timeout_work+0x3b9/0x550
[ 1907.142490] process_one_work+0x802/0x1720
[ 1907.144459] worker_thread+0x720/0x1300
[ 1907.146355] kthread+0x2ec/0x3d0
[ 1907.148235] ret_from_fork+0x30/0x70
[ 1907.150093] ret_from_fork_asm+0x1a/0x30
[ 1907.151741]
-> #0 ((work_completion)(&q->timeout_work)){+.+.}-{0:0}:
[ 1907.153987] __lock_acquire+0x2e98/0x5ee0
[ 1907.155190] lock_acquire+0x1ac/0x520
[ 1907.156396] __flush_work+0x4ac/0xa40
[ 1907.157483] __cancel_work_sync+0x2c5/0x3f0
[ 1907.158586] nvme_sync_queues+0x71/0x100 [nvme_core]
[ 1907.159772] nvme_reset_work+0xaa/0x4a0 [nvme]
[ 1907.160903] process_one_work+0x802/0x1720
[ 1907.161994] worker_thread+0x720/0x1300
[ 1907.163054] kthread+0x2ec/0x3d0
[ 1907.164175] ret_from_fork+0x30/0x70
[ 1907.165187] ret_from_fork_asm+0x1a/0x30
[ 1907.166199]
other info that might help us debug this:
[ 1907.168719] Chain exists of:
(work_completion)(&q->timeout_work) --> &dev->shutdown_lock --> &ctrl->namespaces_rwsem
[ 1907.171733] Possible unsafe locking scenario:
[ 1907.173507] CPU0 CPU1
[ 1907.174512] ---- ----
[ 1907.175509] rlock(&ctrl->namespaces_rwsem);
[ 1907.176490] lock(&dev->shutdown_lock);
[ 1907.177643] lock(&ctrl->namespaces_rwsem);
[ 1907.178784] lock((work_completion)(&q->timeout_work));
[ 1907.179817]
*** DEADLOCK ***
[ 1907.182067] 4 locks held by kworker/u16:3/2996:
[ 1907.183055] #0: ffff88811a053148 ((wq_completion)nvme-reset-wq){+.+.}-{0:0}, at: process_one_work+0xf57/0x1720
[ 1907.184498] #1: ffff88810fd97d90 ((work_completion)(&dev->ctrl.reset_work)){+.+.}-{0:0}, at: process_one_work+0x77f/0x1720
[ 1907.186027] #2: ffff8881389f44d8 (&ctrl->namespaces_rwsem){++++}-{3:3}, at: nvme_sync_queues+0x26/0x100 [nvme_core]
[ 1907.187564] #3: ffffffffb8e79a20 (rcu_read_lock){....}-{1:2}, at: __flush_work+0xf5/0xa40
[ 1907.188948]
stack backtrace:
[ 1907.190707] CPU: 3 PID: 2996 Comm: kworker/u16:3 Not tainted 6.9.0-rc1 #158
[ 1907.191954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[ 1907.193367] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 1907.194546] Call Trace:
[ 1907.195486] <TASK>
[ 1907.196380] dump_stack_lvl+0x73/0xb0
[ 1907.197398] check_noncircular+0x306/0x3f0
[ 1907.198435] ? __pfx_check_noncircular+0x10/0x10
[ 1907.199516] ? __pfx_lock_release+0x10/0x10
[ 1907.200576] ? lockdep_lock+0xca/0x1c0
[ 1907.201592] ? __pfx_lockdep_lock+0x10/0x10
[ 1907.202643] ? __lock_acquire+0xcba/0x5ee0
[ 1907.203707] __lock_acquire+0x2e98/0x5ee0
[ 1907.204768] ? __pfx___lock_acquire+0x10/0x10
[ 1907.205855] ? __pfx_mark_lock+0x10/0x10
[ 1907.206883] lock_acquire+0x1ac/0x520
[ 1907.207940] ? __flush_work+0x48f/0xa40
[ 1907.209032] ? __pfx_lock_acquire+0x10/0x10
[ 1907.210139] ? __pfx_lock_release+0x10/0x10
[ 1907.211193] ? lockdep_init_map_type+0x169/0x7c0
[ 1907.212324] ? mark_held_locks+0x96/0xe0
[ 1907.213349] ? __flush_work+0x48f/0xa40
[ 1907.214361] __flush_work+0x4ac/0xa40
[ 1907.215351] ? __flush_work+0x48f/0xa40
[ 1907.216397] ? __pfx___flush_work+0x10/0x10
[ 1907.217438] ? __pfx_mark_lock+0x10/0x10
[ 1907.218458] ? __pfx_wq_barrier_func+0x10/0x10
[ 1907.219544] ? mark_held_locks+0x96/0xe0
[ 1907.220569] __cancel_work_sync+0x2c5/0x3f0
[ 1907.221628] ? __pfx___cancel_work_sync+0x10/0x10
[ 1907.222703] ? __timer_delete_sync+0x17f/0x2a0
[ 1907.223777] ? __pfx_lock_acquire+0x10/0x10
[ 1907.224805] nvme_sync_queues+0x71/0x100 [nvme_core]
[ 1907.225898] nvme_reset_work+0xaa/0x4a0 [nvme]
[ 1907.226949] process_one_work+0x802/0x1720
[ 1907.228003] ? __pfx_lock_acquire+0x10/0x10
[ 1907.229043] ? __pfx_process_one_work+0x10/0x10
[ 1907.230089] ? assign_work+0x16c/0x240
[ 1907.231090] worker_thread+0x720/0x1300
[ 1907.232145] ? __pfx_worker_thread+0x10/0x10
[ 1907.233183] kthread+0x2ec/0x3d0
[ 1907.234144] ? _raw_spin_unlock_irq+0x24/0x50
[ 1907.235199] ? __pfx_kthread+0x10/0x10
[ 1907.236265] ret_from_fork+0x30/0x70
[ 1907.237223] ? __pfx_kthread+0x10/0x10
[ 1907.238177] ret_from_fork_asm+0x1a/0x30
[ 1907.239126] </TASK>