[bug report] blktests nvme/061 hang with rdma transport and siw driver

Bernard Metzler BMT at zurich.ibm.com
Tue Apr 15 06:09:15 PDT 2025


> -----Original Message-----
> From: Shinichiro Kawasaki <shinichiro.kawasaki at wdc.com>
> Sent: Tuesday, April 15, 2025 1:13 PM
> To: linux-nvme at lists.infradead.org; linux-rdma at vger.kernel.org
> Cc: Daniel Wagner <wagi at kernel.org>
> Subject: [EXTERNAL] [bug report] blktests nvme/061 hang with rdma transport and siw driver
> 
> Hello all,
> 
> Recently, a new blktests test case nvme/061 was introduced. It does
> "test fabric target teardown and setup during I/O". When I run this test
> case repeatedly with rdma transport and siw driver on the kernel
> v6.15-rc2, KASAN slab-use-after-free happens in __pwq_activate_work() [1],
> and then the test system hangs. The hang is recreated in a stable manner.
> 
> It looks like the new test case revealed a hidden problem. I observed the
> same hang with a few older kernels, v6.14 and v6.13, so the problem has
> existed for a while.
> 
> Actions toward a fix will be appreciated. I'm willing to run tests with
> debug patches or fix candidate patches.
> 
> 

<snip>


Hi Shinichiro,

That appears to be an interesting new test! I get an immediate
Oops when trying rxe as an alternative software RDMA driver.

I'll look into siw. I'm not sure I can fix it within the next two
weeks since I'm out of office and traveling, but I will try to find
some time.

Thanks, Bernard.

Here is the rxe oops:

[  106.826346] rdma_rxe: loaded
[  106.832164] loop: module loaded
[  107.066868] run blktests nvme/061 at 2025-04-15 15:03:04
[  107.081270] infiniband eno1_rxe: set active
[  107.081274] infiniband eno1_rxe: added eno1
[  107.089683] infiniband enp4s0f4d1_rxe: set active
[  107.089687] infiniband enp4s0f4d1_rxe: added enp4s0f4d1
[  107.264770] loop0: detected capacity change from 0 to 2097152
[  107.267376] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  107.271276] nvmet_rdma: enabling port 0 (10.0.0.2:4420)
[  107.312957] BUG: kernel NULL pointer dereference, address: 0000000000000028
[  107.312973] #PF: supervisor read access in kernel mode
[  107.312979] #PF: error_code(0x0000) - not-present page
[  107.312986] PGD 0 P4D 0 
[  107.312992] Oops: Oops: 0000 [#1] SMP PTI
[  107.312999] CPU: 1 UID: 0 PID: 123 Comm: kworker/u32:4 Not tainted 6.15.0-rc2 #1 PREEMPT(undef) 
[  107.313008] Hardware name: LENOVO 10A6S05601/SHARKBAY, BIOS FBKTD8AUS 09/17/2019
[  107.313016] Workqueue: rxe_wq do_work [rdma_rxe]
[  107.313030] RIP: 0010:rxe_mr_copy+0x58/0x230 [rdma_rxe]
[  107.313041] Code: 83 7f 7c 04 49 89 f6 48 89 d3 41 89 cd 0f 84 f9 00 00 00 89 ca e8 68 f7 ff ff 85 c0 0f 85 95 01 00 00 49 8b 84 24 f0 00 00 00 <f6> 40 28 02 74 28 44 8b 45 d4 44 89 e9 48 89 da 4c 89 f6 4c 89 e7
[  107.313055] RSP: 0018:ffffb00b40467cc8 EFLAGS: 00010246
[  107.313062] RAX: 0000000000000000 RBX: ffff8f64434f804a RCX: 0000000000000400
[  107.313070] RDX: 0000000000000400 RSI: ffff8f64b8c9cc00 RDI: ffff8f64bef78a00
[  107.313077] RBP: ffffb00b40467d00 R08: 0000000000000000 R09: ffff8f6440b68e00
[  107.313084] R10: ffffb00b40467d50 R11: ffff8f6440b68e00 R12: ffff8f64bef78a00
[  107.313091] R13: 0000000000000400 R14: ffff8f64b8c9c800 R15: ffff8f64470d1000
[  107.313098] FS:  0000000000000000(0000) GS:ffff8f6b8dc9e000(0000) knlGS:0000000000000000
[  107.313106] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.313129] CR2: 0000000000000028 CR3: 000000069d81a004 CR4: 00000000001706f0
[  107.313148] Call Trace:
[  107.313164]  <TASK>
[  107.313170]  rxe_receiver+0x1310/0x26d0 [rdma_rxe]
[  107.313180]  do_task+0x6b/0x1f0 [rdma_rxe]
[  107.313189]  do_work+0xe/0x20 [rdma_rxe]
[  107.313198]  process_one_work+0x1b3/0x400
[  107.313206]  worker_thread+0x25b/0x370
[  107.313212]  kthread+0x116/0x240
[  107.313218]  ? __pfx_worker_thread+0x10/0x10
[  107.313225]  ? _raw_spin_unlock_irq+0x17/0x40
[  107.313233]  ? __pfx_kthread+0x10/0x10
[  107.313239]  ret_from_fork+0x3c/0x60
[  107.313246]  ? __pfx_kthread+0x10/0x10
[  107.313253]  ret_from_fork_asm+0x1a/0x30
[  107.313260]  </TASK>
[  107.313263] Modules linked in: loop rdma_rxe ip6_udp_tunnel udp_tunnel ib_uverbs nvmet_rdma nvmet nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm ib_core nvme_core ip_set nfnetlink sunrpc intel_rapl_msr snd_hda_codec_realtek intel_rapl_common iTCO_wdt snd_hda_codec_generic iTCO_vendor_support snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_scodec_component snd_hda_intel intel_powerclamp coretemp snd_intel_dspcfg snd_hda_codec rapl snd_hwdep intel_cstate snd_hda_core snd_pcm mei_me intel_uncore snd_timer i2c_i801 mei i2c_smbus snd lpc_ich soundcore xfs csiostor i915 drm_client_lib i2c_algo_bit drm_buddy ttm drm_display_helper drm_kms_helper cxgb4 e1000e scsi_transport_fc drm ptp pps_core video wmi fuse
[  107.313341] CR2: 0000000000000028
[  107.313346] ---[ end trace 0000000000000000 ]---
[  107.313354] RIP: 0010:rxe_mr_copy+0x58/0x230 [rdma_rxe]
[  107.313366] Code: 83 7f 7c 04 49 89 f6 48 89 d3 41 89 cd 0f 84 f9 00 00 00 89 ca e8 68 f7 ff ff 85 c0 0f 85 95 01 00 00 49 8b 84 24 f0 00 00 00 <f6> 40 28 02 74 28 44 8b 45 d4 44 89 e9 48 89 da 4c 89 f6 4c 89 e7
[  107.313381] RSP: 0018:ffffb00b40467cc8 EFLAGS: 00010246
[  107.313388] RAX: 0000000000000000 RBX: ffff8f64434f804a RCX: 0000000000000400
[  107.313396] RDX: 0000000000000400 RSI: ffff8f64b8c9cc00 RDI: ffff8f64bef78a00
[  107.313403] RBP: ffffb00b40467d00 R08: 0000000000000000 R09: ffff8f6440b68e00
[  107.313411] R10: ffffb00b40467d50 R11: ffff8f6440b68e00 R12: ffff8f64bef78a00
[  107.313418] R13: 0000000000000400 R14: ffff8f64b8c9c800 R15: ffff8f64470d1000
[  107.313426] FS:  0000000000000000(0000) GS:ffff8f6b8dc9e000(0000) knlGS:0000000000000000
[  107.313434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.313441] CR2: 0000000000000028 CR3: 000000069d81a004 CR4: 00000000001706f0
[  107.313449] note: kworker/u32:4[123] exited with irqs disabled


