[PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
Aristeu Rozanski
aris at redhat.com
Fri Nov 14 19:56:40 PST 2025
Hi Ewan,
On Fri, Nov 14, 2025 at 12:47:14PM -0500, Ewan Milne wrote:
> Could you maybe add WARN_ON(!list_empty(&rport->ls_rcv_list)); so we'll
> find out?
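For reference, this is roughly where I placed the check. A minimal sketch from my reading of drivers/nvme/host/fc.c, assuming the rport release path is nvme_fc_free_rport() (a name I'm inferring; only ls_rcv_list comes from your suggestion):

static void
nvme_fc_free_rport(struct kref *ref)
{
	struct nvme_fc_rport *rport =
		container_of(ref, struct nvme_fc_rport, ref);
	...
	/* debug only: no LS receive op should still be queued on the
	 * rport by the time its refcount drops to zero */
	WARN_ON(!list_empty(&rport->ls_rcv_list));
	...
}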
I reproduced it twice with just the warning added; the warning never triggered, and interestingly both runs hit the same variation, with the workqueue itself running into a list corruption:
[ 4326.096413] nvmet: Created discovery controller 2 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 4326.097073] nvme nvme1: NVME-FC{1}: controller connect complete
[ 4326.097078] nvme nvme1: NVME-FC{1}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 4326.097570] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 4326.110785] list_del corruption, 00000000f88351f8->next is NULL
[ 4326.110813] ------------[ cut here ]------------
[ 4326.110814] kernel BUG at lib/list_debug.c:52!
[ 4326.110893] monitor event: 0040 ilc:2 [#1]SMP
[ 4326.110899] Modules linked in: nvme_fcloop nvmet_fc nvmet nvme_fc nvme_fabrics nvme nvme_core nvme_keyring nvme_auth sunrpc rfkill virtio_gpu virtio_dma_buf drm_client_lib drm_shmem_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_kms_helper fb virtio_net virtio_input net_failover failover vfio_ccw mdev vfio_iommu_type1 vfio iommufd drm fuse drm_panel_orientation_quirks font loop i2c_core nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock ctcm fsm qeth ccwgroup zfcp scsi_transport_fc qdio dasd_fba_mod dasd_eckd_mod dasd_mod xfs ghash_s390 prng des_s390 libdes sha3_512_s390 sha3_256_s390 virtio_blk sha_common dm_mirror dm_region_hash dm_log dm_mod paes_s390 crypto_engine pkey_cca pkey_ep11 zcrypt pkey_pckmo pkey aes_s390 [last unloaded: nvmet]
[ 4326.110958] CPU: 0 UID: 0 PID: 164768 Comm: kworker/u9:5 Kdump: loaded Not tainted 6.18.0-0.rc0.53c18dc078bb.1.RHEL100912.el10.s390x #1 NONE
[ 4326.110962] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
[ 4326.110964] Workqueue: \x98 0x0 (nvmet-wq)
[ 4326.110980] Krnl PSW : 0404e00180000000 000001e1297369f2 (__list_del_entry_valid_or_report+0x112/0x130)
[ 4326.110987] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 4326.110989] Krnl GPRS: 0000000000000030 0000000000000000 0000000000000033 000001e12a9136e8
[ 4326.110992] 00000001fe51d000 0000000000000000 0000000000000000 fffffffffffffff8
[ 4326.110993] 0000000080238028 0000000000000000 0000000000000000 00000000f88351f8
[ 4326.110995] 00000001016f6900 0000000000000000 000001e1297369ee 000001612a8cbca8
[ 4326.111003] Krnl Code: 000001e1297369e2: c02000441908 larl %r2,000001e129fb9bf2
000001e1297369e8: c0e5ffcb7e58 brasl %r14,000001e1290a6698
#000001e1297369ee: af000000 mc 0,0
>000001e1297369f2: b9040032 lgr %r3,%r2
000001e1297369f6: c02000441913 larl %r2,000001e129fb9c1c
000001e1297369fc: c0e5ffcb7e4e brasl %r14,000001e1290a6698
000001e129736a02: af000000 mc 0,0
000001e129736a06: 0707 bcr 0,%r7
[ 4326.111015] Call Trace:
[ 4326.111017] [<000001e1297369f2>] __list_del_entry_valid_or_report+0x112/0x130
[ 4326.111019] ([<000001e1297369ee>] __list_del_entry_valid_or_report+0x10e/0x130)
[ 4326.111022] [<000001e129130d58>] move_linked_works+0x68/0xe0
[ 4326.111027] [<000001e129134554>] worker_thread+0x1f4/0x440
[ 4326.111030] [<000001e12914036c>] kthread+0x12c/0x280
[ 4326.111034] [<000001e1290b9d6c>] __ret_from_fork+0x3c/0x140
[ 4326.111038] [<000001e129bfa11a>] ret_from_fork+0xa/0x30
[ 4326.111042] Last Breaking-Event-Address:
[ 4326.111043] [<000001e1290a66e4>] _printk+0x4c/0x58
[ 4326.111048] Kernel panic - not syncing: Fatal exception: panic_on_oops
Something else I found interesting and saw many times, which might help:
[ 4325.112850] nvme nvme0: qid 0: authenticated
[ 4325.113206] nvme nvme0: NVME-FC{0}: controller connect complete
[ 4325.113745] nvme nvme0: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1", hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 4325.129993] (NULL device *): {0:0} Association freed
[ 4325.130010] (NULL device *): Disconnect LS failed: No Association
[ 4325.178364] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 4325.410043] (NULL device *): {0:1} Association deleted
[ 4325.430596] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000001 rport wwpn 0x20001100ab000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 4325.430754] (NULL device *): queue 0 connect admin queue failed (-111).
[ 4325.430758] nvme nvme0: NVME-FC{0}: reset: Reconnect attempt failed (-111)
[ 4325.430761] nvme nvme0: NVME-FC{0}: Reconnect attempt in 2 seconds
[ 4325.430767] nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", hostnqn: nqn.2014-08.org.nvmexpress:uuid:2543b704-63fd-45d1-bccd-32e33962e07b
[ 4325.470264] (NULL device *): {0:1} Association freed
[ 4325.470282] (NULL device *): Disconnect LS failed: No Association
[ 4325.112825] nvme nvme0: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
(the "NULL device *" part)
I agree my patch looks like it's fixing the symptom rather than the root cause, but I haven't touched the nvme code before, so it's very possible I'm missing the bigger picture.
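For context, the idea behind the patch in $subject is simply to make sure the LS receive handler can no longer run or be queued once the rport goes away, roughly like this (a sketch only, assuming the work item is rport->lsrcv_work and it lands in the same spot as the WARN_ON above; the actual patch may look different):

static void
nvme_fc_free_rport(struct kref *ref)
{
	struct nvme_fc_rport *rport =
		container_of(ref, struct nvme_fc_rport, ref);
	...
	/* make sure the LS receive work is neither queued nor running
	 * before the rport memory is released */
	flush_work(&rport->lsrcv_work);
	...
}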
--
Aristeu