nvmet-tcp kernel crashes consistently when doing 64k rw with fio.

Mark Ruijter mruijter at primelogic.nl
Wed Aug 11 00:58:28 PDT 2021


When I attach an initiator to an nvmet-tcp target running kernel 5.10.57, the target system crashes whenever the initiator runs fio with a 64K block size.

mix:     rwmixread=100
rw:      rw
blksize: 64k
qdepth:  128
jobs:    12
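
For anyone trying to reproduce this, the parameters above can be turned into a fio command line roughly as follows. This is only a sketch: the device path, the job name, and the ioengine/direct settings are assumptions and are not taken from the failing setup; only rw, rwmixread, bs, iodepth and numjobs come from the list above.

```python
import shlex


def build_fio_cmd(target_dev="/dev/nvme1n1"):
    """Build a fio argv list from the parameters listed in the report.

    target_dev, the job name, --ioengine and --direct are assumptions made
    for illustration; they were not stated in the original report.
    """
    return [
        "fio",
        "--name=nvmet-tcp-64k",       # hypothetical job name
        f"--filename={target_dev}",   # hypothetical initiator-side device
        "--ioengine=libaio",          # assumption
        "--direct=1",                 # assumption
        "--rw=rw",                    # rw: rw
        "--rwmixread=100",            # mix: rwmixread=100
        "--bs=64k",                   # block size: 64k
        "--iodepth=128",              # qdepth: 128
        "--numjobs=12",               # jobs: 12
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_fio_cmd()))
```

The command is printed rather than executed, so the device path can be checked before any I/O is issued against it.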

dmesg will first show _many_ of these messages before the system crashes and reboots.
---
messages-20210720:2021-07-19T16:06:40.084137-06:00 gold kernel: [ 3402.950666] nvmet_tcp: failed cmd 00000000873b89c1 id 22 opcode 2, data_len: 65536
messages-20210720:2021-07-19T16:06:40.084137-06:00 gold kernel: [ 3402.950667] nvmet_tcp: failed cmd 0000000022c079ff id 23 opcode 2, data_len: 65536
messages-20210720:2021-07-19T16:06:40.084138-06:00 gold kernel: [ 3402.950669] nvmet_tcp: failed cmd 0000000093ff4775 id 9 opcode 2, data_len: 65536
messages-20210720:2021-07-19T16:06:40.084138-06:00 gold kernel: [ 3402.950671] nvmet_tcp: failed cmd 00000000dcb0e105 id 10 opcode 2, data_len: 65536
---
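
Since there are a lot of these lines, a small script can help summarize them per opcode and transfer size. The sketch below assumes the message format is exactly as shown above; the opcode names use the NVMe I/O command set values (1 = write, 2 = read), and the function and variable names are made up for this example.

```python
import re
from collections import Counter

# Matches the "failed cmd" lines as they appear in the log excerpt above.
FAILED_CMD = re.compile(
    r"nvmet_tcp: failed cmd (?P<cmd>[0-9a-f]+) id (?P<id>\d+) "
    r"opcode (?P<opcode>\d+), data_len: (?P<len>\d+)"
)

# NVMe I/O command set opcodes relevant here.
OPCODES = {1: "write", 2: "read"}


def summarize_failed_cmds(lines):
    """Count failed commands per (operation, data_len) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line)
        if match is None:
            continue  # not a "failed cmd" line (e.g. register dump output)
        opcode = int(match.group("opcode"))
        name = OPCODES.get(opcode, f"opcode {opcode}")
        counts[(name, int(match.group("len")))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[ 3402.950666] nvmet_tcp: failed cmd 00000000873b89c1 id 22 opcode 2, data_len: 65536",
        "[ 3402.950667] nvmet_tcp: failed cmd 0000000022c079ff id 23 opcode 2, data_len: 65536",
    ]
    for (name, length), count in summarize_failed_cmds(sample).items():
        print(f"{name} len={length}: {count}")
```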

We tested with kernels 5.3.18 (SuSE) and 5.10.57.
Running 64K IO with fio triggers this problem consistently.
The automated test that we run tests 64K reads and 64K writes.

I managed to grab a stack trace from the older SuSE kernel which may be helpful:
--
[65980.188661] nvmet_tcp: failed cmd 00000000d4ea0295 id 113 opcode 1, data_len: 65536
[65980.188663] #PF: error_code(0x0000) - not-present page
[65980.188665] nvmet_tcp: failed cmd 000000009fad62b2 id 114 opcode 1, data_len: 65536
[65980.188665] PGD 0 P4D 0 
[65980.188674] Oops: 0000 [#1] SMP NOPTI
[65980.188677] CPU: 0 PID: 4193 Comm: kworker/0:7H Kdump: loaded Tainted: G           OE  X  N 5.3.18-24.37-default #1 SLE15-SP2
[65980.188678] Hardware name: Supermicro, Supermicro, Supermicro, Supermicro SYS-2029U-TN24R4T, SYS-2029U-TN24R4T, SYS-2029U-TN24R4T, SYS-2029U-TN24R4T/X11DP
[65980.188683] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
[65980.188685] RIP: 0010:nvmet_tcp_map_pdu_iovec+0x66/0xf0 [nvmet_tcp]
[65980.188687] Code: 48 05 ff 0f 00 00 48 c1 e8 0c 81 e5 ff 0f 00 00 89 87 b8 01 00 00 48 c1 e0 05 48 03 47 30 45 85 f6 0f 84 81 00 00 00 41 89 ec <8b> 70 0c 48 8b 08 8b 50 08 29 ee 44 39 f6 41 0f 47 f6 48 83 e1 fc
[65980.188690] RSP: 0018:ffff9aa9602a7d20 EFLAGS: 00010206
[65980.188691] RAX: 0000000000000000 RBX: ffff8f6f208c1c80 RCX: f28d67c52c83a000
[65980.188692] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff8f6ecfd8a200
[65980.188693] nvmet_tcp: failed cmd 0000000044957200 id 115 opcode 1, data_len: 65536
[65980.188695] RBP: 0000000000000000 R08: 0000000000006ee0 R09: 0000000000000030
[65980.188696] nvmet_tcp: failed cmd 000000002387173b id 127 opcode 1, data_len: 65536
[65980.188697] R10: 0000000000000010 R11: fefefefefefefeff R12: 0000000000000000
[65980.188698] nvmet_tcp: failed cmd 00000000dc7f4316 id 58 opcode 1, data_len: 65536
[65980.188700] nvmet_tcp: failed cmd 000000002b203233 id 60 opcode 1, data_len: 65536
[65980.188701] R13: ffff8f8eb4f1e510 R14: 000000000000f000 R15: ffff8f8e6f05e810
[65980.188702] nvmet_tcp: failed cmd 00000000d75d54dd id 61 opcode 1, data_len: 65536
[65980.188704] nvmet_tcp: failed cmd 00000000040df628 id 62 opcode 1, data_len: 65536
[65980.188705] FS:  0000000000000000(0000) GS:ffff8f6fafe00000(0000) knlGS:0000000000000000
[65980.188706] nvmet_tcp: failed cmd 00000000d2515f84 id 63 opcode 1, data_len: 65536
[65980.188708] nvmet_tcp: failed cmd 000000009705fc83 id 64 opcode 1, data_len: 65536
[65980.188709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[65980.188710] nvmet_tcp: failed cmd 00000000f219ad1a id 65 opcode 1, data_len: 65536
[65980.188712] nvmet_tcp: failed cmd 00000000c6c09c57 id 66 opcode 1, data_len: 65536
[65980.188713] CR2: 000000000000000c CR3: 0000003ed41a2002 CR4: 00000000007606f0
[65980.188714] nvmet_tcp: failed cmd 0000000008d40238 id 67 opcode 1, data_len: 65536
[65980.188715] nvmet_tcp: failed cmd 000000006f8e7979 id 68 opcode 1, data_len: 65536
[65980.188717] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[65980.188718] nvmet_tcp: failed cmd 000000006d2f85bb id 69 opcode 1, data_len: 65536
[65980.188720] nvmet_tcp: failed cmd 00000000aaccbded id 70 opcode 1, data_len: 65536
[65980.188721] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[65980.188722] PKRU: 55555554
[65980.188723] Call Trace:
[65980.188727]  nvmet_tcp_try_recv_pdu+0x3dc/0x6f0 [nvmet_tcp]
[65980.188734]  ? __switch_to_asm+0x34/0x70
[65980.188735]  ? __switch_to_asm+0x40/0x70
[65980.188736]  ? __switch_to_asm+0x34/0x70
[65980.188737]  ? __switch_to_asm+0x40/0x70
[65980.188738]  ? __switch_to_asm+0x34/0x70
[65980.188739]  ? __switch_to_asm+0x40/0x70
[65980.188741]  nvmet_tcp_io_work+0x6d/0xa80 [nvmet_tcp]
[65980.188743]  ? __switch_to_asm+0x34/0x70
[65980.188746]  process_one_work+0x1f4/0x3e0
[65980.188748]  worker_thread+0x2d/0x3e0
[65980.188750]  ? process_one_work+0x3e0/0x3e0
[65980.188752]  kthread+0x10d/0x130
[65980.188753]  ? kthread_park+0xa0/0xa0
[65980.188755]  ret_from_fork+0x1f/0x40
[65980.188757] Modules linked in: st sr_mod cdrom lp parport_pc ppdev parport xfrm_user xsk_diag sctp_diag udp_diag raw_diag unix_diag af_packet_diag netlink_diag binfmt_misc xfs dm_snapshot dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio raid0 md_mod tcp_diag inet_diag scst_vdisk(OENN) nfsd auth_rpcgss nfs_acl lockd grace loop nvme nvme_core xt_tcpudp ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter nvmet_rdma nvmet_tcp null_blk nvmet iscsi_scst(OENN) scst(OENN) dlm sctp rpcrdma sunrpc rdma_ucm ib_iser rdma_cm iw_cm libiscsi ib_ipoib scsi_transport_iscsi ib_cm configfs ib_umad mlx4_ib mlx4_en mlx4_core mlx5_ib ib_uverbs ib_core mlx5_core mlxfw tls pci_hyperv_intf(XX) ipmi_watchdog
[65980.188788]  af_packet iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs msr intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp ipmi_ssif coretemp ast drm_vram_helper kvm_intel i2c_algo_bit ttm kvm drm_kms_helper irqbypass crc32_pclmul ixgbe drm ghash_clmulni_intel xfrm_algo aesni_intel mei_me libphy syscopyarea aes_x86_64 sysfillrect crypto_simd lpc_ich sysimgblt mdio ioatdma cryptd glue_helper joydev fb_sys_fops i2c_i801 mei mfd_core dca ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq acpi_pad button btrfs libcrc32c xor hid_generic usbhid raid6_pq sd_mod crc32c_intel xhci_pci xhci_hcd ahci libahci usbcore libata vmd wmi sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod [last unloaded: parport_pc]
[65980.188824] Supported: No, Unsupported modules are loaded
[65980.188826] CR2: 000000000000000c



