nvme-6.3: blktests nvme/003 null pointer for fc transport

Daniel Wagner dwagner at suse.de
Wed Feb 1 00:10:03 PST 2023


On Tue, Jan 31, 2023 at 03:27:20PM -0800, James Smart wrote:
> On 1/30/2023 3:15 AM, Daniel Wagner wrote:
> > Hi,
> > 
> > Hannes added support to blktests to enable the FC transport. As a result,
> > blktests is able to crash the kernel. I haven't started to look into it yet.
> > 
> >   run blktests nvme/003 at 2023-01-30 12:08:28
> >   nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> >   BUG: kernel NULL pointer dereference, address: 0000000000000010
> >   #PF: supervisor read access in kernel mode
> >   #PF: error_code(0x0000) - not-present page
> >   PGD 0 P4D 0
> >   Oops: 0000 [#1] PREEMPT SMP NOPTI
> >   CPU: 3 PID: 14200 Comm: nvme Not tainted 6.2.0-rc4+ #1 c9afca57e89c2aa3234427c2cd6cd4dd2984ae8c
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >   RIP: 0010:nvme_alloc_admin_tag_set+0x38/0x120 [nvme_core]
> 
> bingo...
> 
> The cleanup patch "nvme-fc: use the tagset alloc/free helpers" introduced
> it.
> 
> See the fix patch "nvme-fc: Fix initialization order" from about 10 days ago,
> which corrects it.
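
To illustrate the class of bug James is pointing at, here is a minimal,
hypothetical sketch of an initialization-order problem; the demo_* names are
invented for illustration and this is not the actual nvme-fc code:

#include <linux/errno.h>

struct demo_opts {
	unsigned int nr_io_queues;
};

struct demo_ctrl {
	struct demo_opts *opts;		/* assigned late in the buggy ordering */
};

/* Helper that assumes ctrl->opts is already valid. */
static int demo_alloc_admin_tag_set(struct demo_ctrl *ctrl)
{
	/* NULL pointer dereference if ctrl->opts has not been set yet */
	return ctrl->opts->nr_io_queues ? 0 : -EINVAL;
}

static int demo_init_ctrl(struct demo_ctrl *ctrl, struct demo_opts *opts)
{
	int ret;

	/* Buggy order: the helper runs before ctrl->opts is assigned ... */
	ret = demo_alloc_admin_tag_set(ctrl);
	if (ret)
		return ret;

	/* ... moving this assignment above the call is the fix in the sketch. */
	ctrl->opts = opts;
	return 0;
}

The sketch only shows why calling a setup helper before the controller is
fully initialized trips a NULL dereference; the real fix patch reorders the
actual nvme-fc initialization.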

Good call! That problem is gone. Now I am running into the next problem:


run blktests nvme/003 at 2023-02-01 09:01:58
nvmet: adding nsid 1 to subsystem blktests-subsystem-1
debugfs: Directory 'nvme0' with parent '/' already present!
nvme0: failed to create debugfs attr
nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002  rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): {0:0} Association created
nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:2cd2c43b-a90a-45c1-a8cd-86b33ab273b5.
nvme nvme0: NVME-FC{0}: controller connect complete
nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme1: NVME-FC{1}: create association : host wwpn 0x20001100aa000002  rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:1} Association created
nvmet: creating nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:2cd2c43b-a90a-45c1-a8cd-86b33ab273b5.
nvme nvme1: NVME-FC{1}: controller connect complete
nvme nvme1: NVME-FC{1}: new ctrl: NQN "blktests-subsystem-1"
nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme1: rescanning namespaces.
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
nvme nvme1: NVME-FC{1}: io failed due to lldd error 6
nvme nvme1: NVME-FC{1}: transport association event: transport detected io error
nvme nvme1: NVME-FC{1}: resetting controller
debugfs: Directory 'nvme0' with parent '/' already present!
nvme0: failed to create debugfs attr
nvme nvme0: NVME-FC{0}: create association : host wwpn 0x20001100aa000002  rport wwpn 0x20001100aa000001: NQN "nqn.2014-08.org.nvmexpress.discovery"
(NULL device *): {0:0} Association created
nvmet: connect request for invalid subsystem nqn.2014-08.org.nvmexpress.discovery!
nvme nvme0: Connect Invalid Data Parameter, subsysnqn "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme0: NVME-FC{0}: reset: Reconnect attempt failed (16770)
nvme nvme0: NVME-FC{0}: reconnect failure
nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
nvme nvme1: NVME-FC{1}: create association : host wwpn 0x20001100aa000002  rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
(NULL device *): {0:2} Association created
nvmet: connect request for invalid subsystem blktests-subsystem-1!
nvme nvme1: Connect Invalid Data Parameter, subsysnqn "blktests-subsystem-1"
nvme nvme1: NVME-FC{1}: reset: Reconnect attempt failed (16770)
nvme nvme1: NVME-FC{1}: reconnect failure
nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
(NULL device *): {0:1} Association deleted
(NULL device *): {0:1} Association freed
(NULL device *): Disconnect LS failed: No Association
(NULL device *): {0:0} Association deleted
(NULL device *): {0:0} Association freed
(NULL device *): Disconnect LS failed: No Association
(NULL device *): {0:2} Association deleted
(NULL device *): {0:2} Association freed
===========================================
WARNING: possible recursive locking detected
6.2.0-rc4+ #1 Not tainted
--------------------------------------------
kworker/3:13/21736 is trying to acquire lock:
ffff93fe60e251e8 ((work_completion)(&tport->ls_work)){+.+.}-{0:0}, at: __flush_work+0x38/0xc0

but task is already holding lock:
ffffa693c1a4fe68 ((work_completion)(&tport->ls_work)){+.+.}-{0:0}, at: process_one_work+0x236/0x410

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock((work_completion)(&tport->ls_work));
  lock((work_completion)(&tport->ls_work));

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by kworker/3:13/21736:
 #0: ffff93fe9882ff48 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x210/0x410
 #1: ffffa693c1a4fe68 ((work_completion)(&tport->ls_work)){+.+.}-{0:0}, at: process_one_work+0x236/0x410

stack backtrace:
CPU: 3 PID: 21736 Comm: kworker/3:13 Not tainted 6.2.0-rc4+ #1 c9afca57e89c2aa3234427c2cd6cd4dd2984ae8c
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: nvmet-wq fcloop_tport_lsrqst_work [nvme_fcloop]
Call Trace:
 <TASK>
 dump_stack_lvl+0x5a/0x80
 validate_chain+0x1e02/0x2730
 ? lockdep_unlock+0x51/0xc0
 ? lockdep_unlock+0x51/0xc0
 ? validate_chain+0x7c8/0x2730
 ? validate_chain+0x7c8/0x2730
 ? lock_release+0xf7/0x400
 ? __lock_acquire+0x8f7/0xbc0
 ? lock_is_held_type+0x8b/0x110
 __lock_acquire+0x8f7/0xbc0
 ? lock_is_held_type+0x8b/0x110
 lock_acquire+0xb0/0x180
 ? __flush_work+0x38/0xc0
 ? lock_is_held_type+0x8b/0x110
 __flush_work+0x57/0xc0
 ? __flush_work+0x38/0xc0
 ? lockdep_hardirqs_on_prepare+0x16e/0x220
 ? _raw_spin_unlock_irqrestore+0x32/0x50
 ? __free_pages_ok+0x289/0x4a0
 fcloop_targetport_delete+0x15/0xa0 [nvme_fcloop 2f0f07bf6746e7b4ec6dff8d226e919dac60e0f1]
 nvmet_fc_tgtport_put+0x132/0x190 [nvmet_fc cec46c808c67bf2a4fefe925da2828f0b3f2334a]
 nvmet_fc_disconnect_assoc_done+0x7c/0xa0 [nvmet_fc cec46c808c67bf2a4fefe925da2828f0b3f2334a]
 fcloop_tport_lsrqst_work+0x6c/0xa0 [nvme_fcloop 2f0f07bf6746e7b4ec6dff8d226e919dac60e0f1]
 process_one_work+0x26a/0x410
 worker_thread+0x21d/0x490
 ? rcu_lock_release+0x20/0x20
 kthread+0xe1/0x100
 ? kthread_blkcg+0x30/0x30
 ret_from_fork+0x1f/0x30
 </TASK>
nvme_fcloop: fcloop_exit: Failed deleting target port
nvme_fcloop: fcloop_exit: Failed deleting remote port
nvme_fcloop: fcloop_exit: Failed deleting target port
nvme_fcloop: fcloop_exit: Failed deleting remote port
nvme_fcloop: fcloop_exit: Failed deleting target port
nvme_fcloop: fcloop_exit: Failed deleting remote port


The last two messages never stop.
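
For what it is worth, the lockdep splat above has the shape of a work handler
flushing its own work item. A minimal sketch of that pattern follows; the
demo_* names are invented and this is not the actual fcloop code:

#include <linux/slab.h>
#include <linux/workqueue.h>

struct demo_tport {
	struct work_struct ls_work;
};

static void demo_tport_delete(struct demo_tport *tport)
{
	/*
	 * When this is reached from inside demo_ls_work_fn(), flush_work()
	 * waits for the work item that is currently executing this very
	 * call chain: lockdep reports a recursive acquisition of the
	 * work_completion pseudo-lock and the flush never completes.
	 */
	flush_work(&tport->ls_work);
	kfree(tport);
}

static void demo_ls_work_fn(struct work_struct *work)
{
	struct demo_tport *tport =
		container_of(work, struct demo_tport, ls_work);

	/* ... handle the LS request, drop the final reference ... */
	demo_tport_delete(tport);	/* self-flush from within ls_work */
}

The sketch is only meant to show why lockdep complains here, not how the real
fcloop teardown should resolve it.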


