[PATCH] nvme: fc: stop lsrcv workqueue before freeing a rport
Aristeu Rozanski
aris at redhat.com
Fri Nov 14 19:56:40 PST 2025
Hi Ewan,
On Fri, Nov 14, 2025 at 12:47:14PM -0500, Ewan Milne wrote:
> Could you maybe add WARN_ON(!list_empty(&rport->ls_rcv_list)); so we'll
> find out?
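For reference, this is roughly where I placed the check. A minimal sketch from my reading of drivers/nvme/host/fc.c, assuming the rport release path is nvme_fc_free_rport() (a name I'm inferring; only ls_rcv_list comes from your suggestion):

static void
nvme_fc_free_rport(struct kref *ref)
{
	struct nvme_fc_rport *rport =
		container_of(ref, struct nvme_fc_rport, ref);
	...
	/* debug only: no LS receive op should still be queued on the
	 * rport by the time its refcount drops to zero */
	WARN_ON(!list_empty(&rport->ls_rcv_list));
	...
}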
I reproduced it twice with just the warning added; the warning never triggered, and interestingly both runs hit the same variation, with the workqueue itself running into a list corruption:
[ 4326.096413] nvmet: Created discovery controller 2 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 4326.097073] nvme nvme1: NVME-FC{1}: controller connect complete
[ 4326.097078] nvme nvme1: NVME-FC{1}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 4326.097570] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 4326.110785] list_del corruption, 00000000f88351f8->next is NULL
[ 4326.110813] ------------[ cut here ]------------
[ 4326.110814] kernel BUG at lib/list_debug.c:52!
[ 4326.110893] monitor event: 0040 ilc:2 [#1]SMP
[ 4326.110899] Modules linked in: nvme_fcloop nvmet_fc nvmet nvme_fc nvme_fabrics nvme nvme_core nvme_keyring nvme_auth sunrpc rfkill virtio_gpu virtio_dma_buf drm_client_lib drm_shmem_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_kms_helper fb virtio_net virtio_input net_failover failover vfio_ccw mdev vfio_iommu_type1 vfio iommufd drm fuse drm_panel_orientation_quirks font loop i2c_core nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock ctcm fsm qeth ccwgroup zfcp scsi_transport_fc qdio dasd_fba_mod dasd_eckd_mod dasd_mod xfs ghash_s390 prng des_s390 libdes sha3_512_s390 sha3_256_s390 virtio_blk sha_common dm_mirror dm_region_hash dm_log dm_mod paes_s390 crypto_engine pkey_cca pkey_ep11 zcrypt pkey_pckmo pkey aes_s390 [last unloaded: nvmet]
[ 4326.110958] CPU: 0 UID: 0 PID: 164768 Comm: kworker/u9:5 Kdump: loaded Not tainted 6.18.0-0.rc0.53c18dc078bb.1.RHEL100912.el10.s390x #1 NONE
[ 4326.110962] Hardware name: IBM 8561 LT1 400 (KVM/Linux)
[ 4326.110964] Workqueue: \x98 0x0 (nvmet-wq)
[ 4326.110980] Krnl PSW : 0404e00180000000 000001e1297369f2 (__list_del_entry_valid_or_report+0x112/0x130)
[ 4326.110987] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 4326.110989] Krnl GPRS: 0000000000000030 0000000000000000 0000000000000033 000001e12a9136e8
[ 4326.110992] 00000001fe51d000 0000000000000000 0000000000000000 fffffffffffffff8
[ 4326.110993] 0000000080238028 0000000000000000 0000000000000000 00000000f88351f8
[ 4326.110995] 00000001016f6900 0000000000000000 000001e1297369ee 000001612a8cbca8
[ 4326.111003] Krnl Code: 000001e1297369e2: c02000441908 larl %r2,000001e129fb9bf2
000001e1297369e8: c0e5ffcb7e58 brasl %r14,000001e1290a6698
#000001e1297369ee: af000000 mc 0,0
>000001e1297369f2: b9040032 lgr %r3,%r2
000001e1297369f6: c02000441913 larl %r2,000001e129fb9c1c
000001e1297369fc: c0e5ffcb7e4e brasl %r14,000001e1290a6698
000001e129736a02: af000000 mc 0,0
000001e129736a06: 0707 bcr 0,%r7
[ 4326.111015] Call Trace:
[ 4326.111017] [<000001e1297369f2>] __list_del_entry_valid_or_report+0x112/0x130
[ 4326.111019] ([<000001e1297369ee>] __list_del_entry_valid_or_report+0x10e/0x130)
[ 4326.111022] [<000001e129130d58>] move_linked_works+0x68/0xe0
[ 4326.111027] [<000001e129134554>] worker_thread+0x1f4/0x440
[ 4326.111030] [<000001e12914036c>] kthread+0x12c/0x280
[ 4326.111034] [<000001e1290b9d6c>] __ret_from_fork+0x3c/0x140
[ 4326.111038] [<000001e129bfa11a>] ret_from_fork+0xa/0x30
[ 4326.111042] Last Breaking-Event-Address:
[ 4326.111043] [<000001e1290a66e4>] _printk+0x4c/0x58
[ 4326.111048] Kernel panic - not syncing: Fatal exception: panic_on_oops
Something else I found interesting and saw many times, which might help:
[ 4325.112850] nvme nvme0: qid 0: authenticated
[ 4325.113206] nvme nvme0: NVME-FC{0}: controller connect complete
[ 4325.113745] nvme nvme0: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1", hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 4325.129993] (NULL device *): {0:0} Association freed
[ 4325.130010] (NULL device *): Disconnect LS failed: No Association
[ 4325.178364] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 4325.410043] (NULL device *): {0:1} Association deleted
[ 4325.430596] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000001 rport wwpn 0x20001100ab000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 4325.430754] (NULL device *): queue 0 connect admin queue failed (-111).
[ 4325.430758] nvme nvme0: NVME-FC{0}: reset: Reconnect attempt failed (-111)
[ 4325.430761] nvme nvme0: NVME-FC{0}: Reconnect attempt in 2 seconds
[ 4325.430767] nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", hostnqn: nqn.2014-08.org.nvmexpress:uuid:2543b704-63fd-45d1-bccd-32e33962e07b
[ 4325.470264] (NULL device *): {0:1} Association freed
[ 4325.470282] (NULL device *): Disconnect LS failed: No Association
[ 4325.112825] nvme nvme0: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
(the "NULL device *" part)
I agree my patch looks like it's fixing the symptom rather than the root cause, but I haven't touched the nvme code before, so it's very possible I'm missing the bigger picture.
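For context, the idea behind the patch in $subject is simply to make sure the LS receive handler can no longer run or be queued once the rport goes away, roughly like this (a sketch only, assuming the work item is rport->lsrcv_work and it lands in the same spot as the WARN_ON above; the actual patch may look different):

static void
nvme_fc_free_rport(struct kref *ref)
{
	struct nvme_fc_rport *rport =
		container_of(ref, struct nvme_fc_rport, ref);
	...
	/* make sure the LS receive work is neither queued nor running
	 * before the rport memory is released */
	flush_work(&rport->lsrcv_work);
	...
}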
--
Aristeu