Kernel panic is seen while running iozone over multiple ports and toggling the link on TOT kernel

Dakshaja Uppalapati dakshaja at chelsio.com
Mon Sep 28 13:58:10 EDT 2020


Hi all,
I observed the following trace with the TOT Linux kernel while running NVMF over multiple ports and toggling the link/interface. The issue is observed very intermittently.
Attached is the target configuration, which uses actual disks.

On the host, I started iozone and then toggled the multiport interfaces, interface1 down for 5 seconds and interface2 down for 8 seconds, one after the other in a loop.
I observed the kernel panic below after a couple of hours.
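
Roughly, the toggle loop looks like the sketch below (written in C only for illustration; the interface names eth1/eth2 and the "ip link set" commands are assumptions, the real interfaces come from the attached configuration):

/* Illustrative sketch of the link-toggle loop described above.
 * eth1/eth2 and "ip link set" are assumptions, not the actual setup.
 */
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		/* take interface1 down for 5 seconds, then bring it back */
		system("ip link set eth1 down");
		sleep(5);
		system("ip link set eth1 up");

		/* take interface2 down for 8 seconds, then bring it back */
		system("ip link set eth2 down");
		sleep(8);
		system("ip link set eth2 up");
	}
	return 0;
}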

[142799.524961] BUG: kernel NULL pointer dereference, address: 0000000000000198
[142799.524965] #PF: supervisor write access in kernel mode
[142799.524966] #PF: error_code(0x0002) - not-present page
[142799.524967] PGD 0 P4D 0
[142799.524970] Oops: 0002 [#1] SMP PTI
[142799.524973] CPU: 1 PID: 16 Comm: ksoftirqd/1 Kdump: loaded Tainted: G S      W         5.9.0-rc6 #1
[142799.524974] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0 01/28/2016
[142799.524980] RIP: 0010:blk_mq_free_request+0x80/0x110
[142799.524982] Code: 00 00 00 00 8b 53 18 b8 01 00 00 00 84 d2 74 0b 31 c0 81 e2 00 08 06 00 0f 95 c0 48 83 84 c5 80 00 00 00 01 f6 43 1c 40 74 08 <f0> 41 ff 8d 98 01 00 00 8b 05 5a 4a c4 01 85 c0 75 5e 49 8b 7c 24
[142799.524983] RSP: 0018:ffffbb96c0123dc0 EFLAGS: 00010202
[142799.524984] RAX: 0000000000000000 RBX: ffff9e6b70f60280 RCX: 0000000000000018
[142799.524986] RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff9e6b70f60280
[142799.524987] RBP: ffffdb96be8fd400 R08: 0000000000000000 R09: 0000000000000000
[142799.524988] R10: 00513cf381a1baf1 R11: 0000000000000000 R12: ffff9e6b590f7698
[142799.524989] R13: 0000000000000000 R14: 0000000000000004 R15: ffffffff948050c0
[142799.524990] FS:  0000000000000000(0000) GS:ffff9e6befc40000(0000) knlGS:0000000000000000
[142799.524992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[142799.524993] CR2: 0000000000000198 CR3: 00000001d6a0a001 CR4: 00000000003706e0
[142799.524994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[142799.524995] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[142799.524995] Call Trace:
[142799.525006]  nvme_keep_alive_end_io+0x15/0x70 [nvme_core]
[142799.525011]  nvme_rdma_complete_rq+0x68/0xc0 [nvme_rdma]
[142799.525014]  ? set_next_entity+0xae/0x1f0
[142799.525016]  blk_done_softirq+0x95/0xc0
[142799.525021]  __do_softirq+0xde/0x2ec
[142799.525025]  ? sort_range+0x20/0x20
[142799.525029]  run_ksoftirqd+0x1a/0x20
[142799.525031]  smpboot_thread_fn+0xc5/0x160
[142799.525034]  kthread+0x116/0x130
[142799.525036]  ? kthread_park+0x80/0x80
[142799.525040]  ret_from_fork+0x22/0x30

The keep-alive request structure is freed and is then accessed by nvme_keep_alive_end_io(). It looks like a probable race to me.
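
To make the suspicion concrete, below is a minimal user-space model of that kind of race (illustrative only, this is not the kernel code): one thread plays the normal completion path that runs the end_io callback and frees the request, while a second thread plays a teardown/error-recovery path that frees the same request with no ordering against the first.

/* Illustrative model of the suspected use-after-free race. */
#include <pthread.h>
#include <stdlib.h>

struct fake_request {
	int state;                      /* stands in for the request state */
};

static struct fake_request *ka_req;     /* shared "keep-alive request" */

/* Normal completion: the end_io callback touches the request and
 * then frees it. */
static void *completion_path(void *arg)
{
	struct fake_request *rq = ka_req;

	rq->state = 0;                  /* use-after-free if teardown won the race */
	free(rq);
	return NULL;
}

/* Teardown/error recovery: frees the same request with no ordering
 * against the completion path. */
static void *teardown_path(void *arg)
{
	free(ka_req);                   /* double free if completion won the race */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	ka_req = calloc(1, sizeof(*ka_req));

	pthread_create(&t1, NULL, completion_path, NULL);
	pthread_create(&t2, NULL, teardown_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

When run under AddressSanitizer or valgrind, whichever thread loses the race triggers a use-after-free/double-free report, which mirrors the failure mode suggested by the trace above.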

Thanks,
Dakshaja

-------------- next part --------------
A non-text attachment was scrubbed...
Name: target_config_128
Type: application/octet-stream
Size: 78311 bytes
Desc: target_config_128
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20200928/1453f9cb/attachment-0001.obj>

