blktests failures with v6.3
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Mon May 1 01:40:06 PDT 2023
Hi all,
I ran the latest blktests (git hash: c1da4a225273) with the v6.3 kernel and
observed test case failures. I'd like to call for support to fix them. (I
myself am now working on some of the block/011 failure symptoms.)
List of failures
================
#1: block/011
#2: block/024
#3: nvme/003 (fabrics transport)
#4: nvme/030 or nvme/031 (rdma transport with siw)
#5: nvme/* (fc transport)
This list excludes failures already resolved: scsi/007 and nvme/{044,045}.
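For reference, these runs follow the stock blktests conventions. A minimal
sketch of how each configuration above can be invoked (the device path is
just an example; nvme_trtype and use_siw are the parameters named above):

  # ./config in the blktests directory
  TEST_DEVS=(/dev/nvme0n1)

  # then run the affected test cases and groups, e.g.:
  ./check block/011 block/024
  nvme_trtype=rdma ./check nvme            # likewise tcp and fc
  nvme_trtype=rdma use_siw=1 ./check nvme  # rdma over soft-iWARP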
Failure description
===================
#1: block/011
This test case shows three failure symptoms.
Symptom A:
The test case fails with NVMe devices due to a lockdep WARNING "possible
circular locking dependency detected". This was reported in Sep/2022 [1] and
a solution was discussed there. A fix is still awaited.
[1] https://lore.kernel.org/linux-block/20220930001943.zdbvolc3gkekfmcv@shindev/
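For context, block/011 issues I/O to the device while repeatedly disabling
and re-enabling its PCI function. A simplified sketch of what the test does
(not the exact test code; the PCI address and the fio_pid variable are
examples):

  pdev=0000:01:00.0    # PCI address of the test device
  while kill -0 "$fio_pid" 2>/dev/null; do
          # clear, then restore, the I/O and memory space enable
          # bits in the PCI command register
          setpci -s "$pdev" COMMAND=0:3
          sleep .2
          setpci -s "$pdev" COMMAND=3:3
          sleep .2
  done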
Symptom B:
When this test case passes for an NVMe device, the device is occasionally
left with zero capacity, which makes the following test cases fail. The
capacity change happens due to ENODEV during the test. I'm preparing a fix
patch for blktests [2].
[2] https://github.com/kawasaki/blktests/commit/f78ceec0cb95b9b2a9dcbf44999e350b740722e8
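The symptom itself is easy to check for after a run. A sketch, assuming the
controller is still present (the device and controller names are examples;
rescan_controller is the nvme sysfs attribute that triggers a namespace
rescan):

  if [[ $(blockdev --getsize64 /dev/nvme0n1) -eq 0 ]]; then
          # ask the controller to rescan its namespaces so the
          # capacity is restored before the next test case
          echo 1 > /sys/class/nvme/nvme0/rescan_controller
  fi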
Symptom C:
When the system disk and the test device belong to the same PCI device, the
system disk disappears during the test case run (e.g., on a system with two
SSDs on the same AHCI controller or HBA). This results in a test system
hang. I'm preparing patches for blktests to avoid this failure [3].
[3] https://github.com/kawasaki/blktests/commit/662c33048ba1e949348dc2fca8b4778f940b7cc5
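One way to detect the dangerous configuration up front is to compare the PCI
function behind the test device with the one behind the system disk. A
sketch (the pci_of helper and the device names are examples):

  # print the PCI address (BDF) that a block device hangs off
  pci_of() {
          readlink -f "/sys/block/$1/device" |
                  grep -Eo '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' |
                  tail -1
  }

  if [[ "$(pci_of sda)" == "$(pci_of nvme0n1)" ]]; then
          echo "system disk shares a PCI device with the test device"
  fi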
#2: block/024
This test case fails on slow machines. It was reported in Dec/2022. An issue
on the test case side is suspected; further investigation is still needed.
block/024 (do I/O faster than a jiffy and check iostats times) [failed]
runtime ... 4.347s
--- tests/block/024.out 2022-12-06 20:51:41.525066605 +0900
+++ /home/shin/kts/kernel-test-suite/sets/blktests/log/runlog/nodev/block/024.out.bad 2022-12-07 12:51:03.610924521 +0900
@@ -6,5 +6,5 @@
read 1 s
write 1 s
read 2 s
-write 3 s
+write 4 s
Test complete
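For reference, the "read N s" / "write N s" values above come from the
per-device I/O statistics, so the check can be reproduced by hand. A sketch
(sda is an example; the field layout is documented in
Documentation/block/stat.rst):

  # fields 4 and 8 of /sys/block/<dev>/stat are the total time spent
  # reading and writing, in milliseconds
  awk '{ printf "read %d s\nwrite %d s\n", $4/1000, $8/1000 }' \
          /sys/block/sda/stat

On a slow machine the writes take longer than the test expects, so the
accumulated write time crosses into the next second, which would match the
one-second difference seen in the diff above.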
#3: nvme/003 (fabrics transport)
When the nvme test group is run with trtype=rdma, tcp or fc, this test case
fails due to a lockdep WARNING "possible circular locking dependency
detected". The detailed kernel message is attached [4].
#4: nvme/030 or nvme/031 (rdma transport with siw)
When the nvme test group is run with the trtype=rdma and use_siw=1
configuration, a failure is occasionally observed at nvme/030 or nvme/031.
The failure cause is "BUG: KASAN: slab-use-after-free in __mutex_lock". The
detailed kernel message is attached [5].
#5: nvme/* (fc transport)
With the trtype=fc configuration, many of the test cases in the nvme test
group fail or hang. Daniel Wagner is driving the fix work [6].
[6] https://lore.kernel.org/linux-nvme/20230418130159.11075-1-dwagner@suse.de/
[4]
[ 134.432456] ======================================================
[ 134.433289] WARNING: possible circular locking dependency detected
[ 134.433806] 6.3.0 #62 Not tainted
[ 134.434149] ------------------------------------------------------
[ 134.434709] kworker/1:3/455 is trying to acquire lock:
[ 134.435141] ffff88813bfd9420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.435986]
but task is already holding lock:
[ 134.436554] ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350
[ 134.437459]
which lock already depends on the new lock.
[ 134.438235]
the existing dependency chain (in reverse order) is:
[ 134.438914]
-> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
[ 134.439688] process_one_work+0x7dc/0x1350
[ 134.440084] worker_thread+0xfc/0x1260
[ 134.440489] kthread+0x29e/0x340
[ 134.440857] ret_from_fork+0x2c/0x50
[ 134.441251]
-> #2 ((wq_completion)nvmet-wq){+.+.}-{0:0}:
[ 134.441885] __flush_workqueue+0x130/0x12f0
[ 134.442328] nvmet_rdma_cm_handler+0x961/0x39c0 [nvmet_rdma]
[ 134.442900] cma_cm_event_handler+0xb2/0x2f0 [rdma_cm]
[ 134.443412] cma_ib_req_handler+0x1096/0x4a50 [rdma_cm]
[ 134.443925] cm_process_work+0x46/0x3b0 [ib_cm]
[ 134.444399] cm_req_handler+0x20e2/0x61d0 [ib_cm]
[ 134.444880] cm_work_handler+0xc15/0x6ce0 [ib_cm]
[ 134.445365] process_one_work+0x843/0x1350
[ 134.445794] worker_thread+0xfc/0x1260
[ 134.446165] kthread+0x29e/0x340
[ 134.447271] ret_from_fork+0x2c/0x50
[ 134.448373]
-> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
[ 134.450542] __mutex_lock+0x186/0x18f0
[ 134.451682] cma_ib_req_handler+0xc3c/0x4a50 [rdma_cm]
[ 134.452921] cm_process_work+0x46/0x3b0 [ib_cm]
[ 134.454150] cm_req_handler+0x20e2/0x61d0 [ib_cm]
[ 134.455423] cm_work_handler+0xc15/0x6ce0 [ib_cm]
[ 134.456680] process_one_work+0x843/0x1350
[ 134.457865] worker_thread+0xfc/0x1260
[ 134.458985] kthread+0x29e/0x340
[ 134.460097] ret_from_fork+0x2c/0x50
[ 134.461294]
-> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
[ 134.463349] __lock_acquire+0x2fc0/0x60f0
[ 134.464430] lock_acquire+0x1a7/0x4e0
[ 134.465448] __mutex_lock+0x186/0x18f0
[ 134.466461] rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.467569] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[ 134.468735] nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[ 134.469904] process_one_work+0x843/0x1350
[ 134.470923] worker_thread+0xfc/0x1260
[ 134.471918] kthread+0x29e/0x340
[ 134.472902] ret_from_fork+0x2c/0x50
[ 134.473884]
other info that might help us debug this:
[ 134.476264] Chain exists of:
&id_priv->handler_mutex --> (wq_completion)nvmet-wq --> (work_completion)(&queue->release_work)
[ 134.479201] Possible unsafe locking scenario:
[ 134.480949] CPU0 CPU1
[ 134.481941] ---- ----
[ 134.482871] lock((work_completion)(&queue->release_work));
[ 134.483938] lock((wq_completion)nvmet-wq);
[ 134.485094] lock((work_completion)(&queue->release_work));
[ 134.486391] lock(&id_priv->handler_mutex);
[ 134.487384]
*** DEADLOCK ***
[ 134.489712] 2 locks held by kworker/1:3/455:
[ 134.490728] #0: ffff888127b28538 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x766/0x1350
[ 134.492084] #1: ffff888131327db0 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x793/0x1350
[ 134.493538]
stack backtrace:
[ 134.495147] CPU: 1 PID: 455 Comm: kworker/1:3 Not tainted 6.3.0 #62
[ 134.496262] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 134.497573] Workqueue: nvmet-wq nvmet_rdma_release_queue_work [nvmet_rdma]
[ 134.498769] Call Trace:
[ 134.499649] <TASK>
[ 134.500494] dump_stack_lvl+0x57/0x90
[ 134.501455] check_noncircular+0x27b/0x310
[ 134.502444] ? __pfx_mark_lock+0x10/0x10
[ 134.503413] ? __pfx_check_noncircular+0x10/0x10
[ 134.504431] ? find_held_lock+0x2d/0x110
[ 134.505399] ? lockdep_lock+0xca/0x1c0
[ 134.506351] ? __pfx_lockdep_lock+0x10/0x10
[ 134.507333] ? mark_held_locks+0x9e/0xe0
[ 134.508299] __lock_acquire+0x2fc0/0x60f0
[ 134.509272] ? __pfx___wait_for_common+0x10/0x10
[ 134.510292] ? __pfx___lock_acquire+0x10/0x10
[ 134.511296] ? rxe_sched_task+0x50/0x70 [rdma_rxe]
[ 134.512328] ? rxe_post_recv+0x1d9/0x250 [rdma_rxe]
[ 134.513366] lock_acquire+0x1a7/0x4e0
[ 134.514314] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.515353] ? __pfx_lock_acquire+0x10/0x10
[ 134.516345] ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[ 134.517447] ? __pfx___might_resched+0x10/0x10
[ 134.518455] __mutex_lock+0x186/0x18f0
[ 134.519405] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.520436] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.521463] ? __pfx___mutex_lock+0x10/0x10
[ 134.522447] ? rcu_is_watching+0x11/0xb0
[ 134.523408] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.524434] rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 134.525451] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[ 134.526528] nvmet_rdma_release_queue_work+0x3e/0x90 [nvmet_rdma]
[ 134.527643] process_one_work+0x843/0x1350
[ 134.528600] ? __pfx_lock_acquire+0x10/0x10
[ 134.529562] ? __pfx_process_one_work+0x10/0x10
[ 134.530551] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 134.531537] worker_thread+0xfc/0x1260
[ 134.532471] ? __kthread_parkme+0xc1/0x1f0
[ 134.533428] ? __pfx_worker_thread+0x10/0x10
[ 134.534385] kthread+0x29e/0x340
[ 134.535264] ? __pfx_kthread+0x10/0x10
[ 134.536164] ret_from_fork+0x2c/0x50
[ 134.537058] </TASK>
[5]
May 01 11:19:44 testnode1 kernel: ==================================================================
May 01 11:19:44 testnode1 kernel: BUG: KASAN: slab-use-after-free in __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel: Read of size 8 at addr ffff888197b37418 by task kworker/u8:0/9
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: CPU: 0 PID: 9 Comm: kworker/u8:0 Not tainted 6.3.0 #62
May 01 11:19:44 testnode1 kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
May 01 11:19:44 testnode1 kernel: Workqueue: iw_cm_wq cm_work_handler [iw_cm]
May 01 11:19:44 testnode1 kernel: Call Trace:
May 01 11:19:44 testnode1 kernel: <TASK>
May 01 11:19:44 testnode1 kernel: dump_stack_lvl+0x57/0x90
May 01 11:19:44 testnode1 kernel: print_report+0xcf/0x660
May 01 11:19:44 testnode1 kernel: ? __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel: kasan_report+0xa4/0xe0
May 01 11:19:44 testnode1 kernel: ? __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel: __mutex_lock+0x1324/0x18f0
May 01 11:19:44 testnode1 kernel: ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel: ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel: ? trace_hardirqs_on+0x12/0x100
May 01 11:19:44 testnode1 kernel: ? __pfx___mutex_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel: ? __percpu_counter_sum+0x147/0x1e0
May 01 11:19:44 testnode1 kernel: ? domain_dirty_limits+0x246/0x390
May 01 11:19:44 testnode1 kernel: ? wb_over_bg_thresh+0x4d5/0x610
May 01 11:19:44 testnode1 kernel: ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel: ? cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel: cma_iw_handler+0xac/0x4f0 [rdma_cm]
May 01 11:19:44 testnode1 kernel: ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel: ? __pfx_cma_iw_handler+0x10/0x10 [rdma_cm]
May 01 11:19:44 testnode1 kernel: ? attach_entity_load_avg+0x4e2/0x920
May 01 11:19:44 testnode1 kernel: ? _raw_spin_unlock_irqrestore+0x30/0x60
May 01 11:19:44 testnode1 kernel: ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel: cm_work_handler+0x139e/0x1c50 [iw_cm]
May 01 11:19:44 testnode1 kernel: ? __pfx_cm_work_handler+0x10/0x10 [iw_cm]
May 01 11:19:44 testnode1 kernel: ? rcu_is_watching+0x11/0xb0
May 01 11:19:44 testnode1 kernel: ? __pfx_try_to_wake_up+0x10/0x10
May 01 11:19:44 testnode1 kernel: ? __pfx_do_raw_spin_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel: ? __pfx___might_resched+0x10/0x10
May 01 11:19:44 testnode1 kernel: ? _raw_spin_unlock_irq+0x24/0x50
May 01 11:19:44 testnode1 kernel: process_one_work+0x843/0x1350
May 01 11:19:44 testnode1 kernel: ? __pfx_lock_acquire+0x10/0x10
May 01 11:19:44 testnode1 kernel: ? __pfx_process_one_work+0x10/0x10
May 01 11:19:44 testnode1 kernel: ? __pfx_do_raw_spin_lock+0x10/0x10
May 01 11:19:44 testnode1 kernel: worker_thread+0xfc/0x1260
May 01 11:19:44 testnode1 kernel: ? __pfx_worker_thread+0x10/0x10
May 01 11:19:44 testnode1 kernel: kthread+0x29e/0x340
May 01 11:19:44 testnode1 kernel: ? __pfx_kthread+0x10/0x10
May 01 11:19:44 testnode1 kernel: ret_from_fork+0x2c/0x50
May 01 11:19:44 testnode1 kernel: </TASK>
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: Allocated by task 4225:
May 01 11:19:44 testnode1 kernel: kasan_save_stack+0x2f/0x50
May 01 11:19:44 testnode1 kernel: kasan_set_track+0x21/0x30
May 01 11:19:44 testnode1 kernel: __kasan_kmalloc+0xa6/0xb0
May 01 11:19:44 testnode1 kernel: __rdma_create_id+0x5b/0x5d0 [rdma_cm]
May 01 11:19:44 testnode1 kernel: __rdma_create_kernel_id+0x12/0x40 [rdma_cm]
May 01 11:19:44 testnode1 kernel: nvme_rdma_alloc_queue+0x26a/0x5f0 [nvme_rdma]
May 01 11:19:44 testnode1 kernel: nvme_rdma_setup_ctrl+0xb84/0x1d90 [nvme_rdma]
May 01 11:19:44 testnode1 kernel: nvme_rdma_create_ctrl+0x7b5/0xd20 [nvme_rdma]
May 01 11:19:44 testnode1 kernel: nvmf_dev_write+0xddd/0x22b0 [nvme_fabrics]
May 01 11:19:44 testnode1 kernel: vfs_write+0x211/0xd50
May 01 11:19:44 testnode1 kernel: ksys_write+0x100/0x1e0
May 01 11:19:44 testnode1 kernel: do_syscall_64+0x5b/0x80
May 01 11:19:44 testnode1 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: Freed by task 4227:
May 01 11:19:44 testnode1 kernel: kasan_save_stack+0x2f/0x50
May 01 11:19:44 testnode1 kernel: kasan_set_track+0x21/0x30
May 01 11:19:44 testnode1 kernel: kasan_save_free_info+0x2a/0x50
May 01 11:19:44 testnode1 kernel: ____kasan_slab_free+0x169/0x1c0
May 01 11:19:44 testnode1 kernel: slab_free_freelist_hook+0xdb/0x1b0
May 01 11:19:44 testnode1 kernel: __kmem_cache_free+0xb8/0x2e0
May 01 11:19:44 testnode1 kernel: nvme_rdma_free_queue+0x4a/0x70 [nvme_rdma]
May 01 11:19:44 testnode1 kernel: nvme_rdma_teardown_io_queues.part.0+0x14a/0x1e0 [nvme_rdma]
May 01 11:19:44 testnode1 kernel: nvme_rdma_delete_ctrl+0x4f/0x100 [nvme_rdma]
May 01 11:19:44 testnode1 kernel: nvme_do_delete_ctrl+0x14e/0x240 [nvme_core]
May 01 11:19:44 testnode1 kernel: nvme_sysfs_delete+0xcb/0x100 [nvme_core]
May 01 11:19:44 testnode1 kernel: kernfs_fop_write_iter+0x359/0x530
May 01 11:19:44 testnode1 kernel: vfs_write+0x58f/0xd50
May 01 11:19:44 testnode1 kernel: ksys_write+0x100/0x1e0
May 01 11:19:44 testnode1 kernel: do_syscall_64+0x5b/0x80
May 01 11:19:44 testnode1 kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: The buggy address belongs to the object at ffff888197b37000
which belongs to the cache kmalloc-2k of size 2048
May 01 11:19:44 testnode1 kernel: The buggy address is located 1048 bytes inside of
freed 2048-byte region [ffff888197b37000, ffff888197b37800)
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: The buggy address belongs to the physical page:
May 01 11:19:44 testnode1 kernel: page:00000000fbe33a6e refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x197b30
May 01 11:19:44 testnode1 kernel: head:00000000fbe33a6e order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
May 01 11:19:44 testnode1 kernel: anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
May 01 11:19:44 testnode1 kernel: raw: 0017ffffc0010200 ffff888100042f00 0000000000000000 dead000000000001
May 01 11:19:44 testnode1 kernel: raw: 0000000000000000 0000000000080008 00000001ffffffff 0000000000000000
May 01 11:19:44 testnode1 kernel: page dumped because: kasan: bad access detected
May 01 11:19:44 testnode1 kernel:
May 01 11:19:44 testnode1 kernel: Memory state around the buggy address:
May 01 11:19:44 testnode1 kernel: ffff888197b37300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: ffff888197b37380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: >ffff888197b37400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: ^
May 01 11:19:44 testnode1 kernel: ffff888197b37480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: ffff888197b37500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
May 01 11:19:44 testnode1 kernel: ==================================================================