[PATCH] nvmet-rdma: Release connections synchronously
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Sun May 14 03:38:09 PDT 2023
On May 11, 2023 / 08:03, Bart Van Assche wrote:
> Lockdep complains about flushing queue->release_work in the NVMe RDMA target
> driver because all instances of queue->release_work have the same lock class
> key. Assigning a different lock class key to each release_work instance is
> difficult because dynamically registered lock class keys must be unregistered
> and the last use of the work_struct lock class key happens after the work
> callback has returned. Hence rework the code for releasing connections as
> follows:
> * Add an argument to __nvmet_rdma_queue_disconnect() and also to
> nvmet_rdma_queue_disconnect() that indicates whether or not these functions
> are called from inside a CM handler callback.
> * Do not call rdma_destroy_id() if called from inside a CM handler callback.
> * Let these functions return a negative value if the caller should free the
> CM ID.
> * Let the CM handler return a negative value if its caller should free the
> CM ID.
> * Remove queue->release_work since releasing connections now happens
> synchronously.
>
> This patch suppresses the following lockdep complaint:
Bart, thanks for the v2 patch. I applied it to kernel v6.3 and ran the blktests
nvme test group with the siw transport. Unfortunately, I observed a test
process hang at nvme/003 [1]. The kernel reported a lockdep WARN on
id_priv->handler_mutex. The patch adds the "from_cm_handler" argument to
__nvmet_rdma_queue_disconnect() to skip rdma_destroy_id() when the call comes
from a CM handler. However, rdma_destroy_id() is still called in the call path
below:
__nvmet_rdma_queue_disconnect()
  nvmet_rdma_free_queue()
    nvmet_rdma_destroy_queue_ib()
      rdma_destroy_id()
So I think from_cm_handler should also be added to nvmet_rdma_free_queue() and
nvmet_rdma_destroy_queue_ib(), so that from_cm_handler is propagated and
rdma_destroy_id() is not called in nvmet_rdma_destroy_queue_ib(). I created a
quick patch with this approach [2] and observed that it avoids the process hang.
However, when I did a further test using the v2 patch and the fix [2], I
observed another process hang at nvme/009 [3]. This time, a deadlock on
subsys->lock happened in nvmet_ctrl_free(); see the call chain sketched below.
I created a dirty patch to avoid it [4], but it caused a lockdep WARN about a
possible deadlock between subsys->lock and id_priv->handler_mutex. I also
observed a hang at nvme/044, related to nvmet_fatal_error_handler.
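For reference, the recursive locking in [3] happens through the call chain
below (frames taken from the lockdep report):

nvmet_port_del_ctrls()                ... takes subsys->lock
  nvmet_rdma_delete_ctrl()
    __nvmet_rdma_queue_disconnect()
      nvmet_rdma_free_queue()
        nvmet_sq_destroy()
          nvmet_ctrl_free()           ... takes subsys->lock again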
It looks like the removal of queue->release_work triggers more lock-related
issues. Do you think this approach is the only way to address the original
issue? If so, it will need multiple follow-up fixes, I think.
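For what it's worth, my guess at the scenario behind the subsys->lock vs
id_priv->handler_mutex WARN is an ABBA inversion roughly like the sketch below
(my own reconstruction, not copied from a lockdep report): the configfs removal
path destroys the CM ID while holding subsys->lock, while the CM handler path
frees the controller while holding id_priv->handler_mutex.

  configfs removal path                  CM event handler path
  ---------------------                  ---------------------
  lock(&subsys->lock);                   lock(&id_priv->handler_mutex);
  rdma_destroy_id()                      nvmet_ctrl_free()
    lock(&id_priv->handler_mutex);         lock(&subsys->lock);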
[1]
[ 44.824261] run blktests nvme/003 at 2023-05-14 16:32:23
[ 44.925161] SoftiWARP attached
[ 45.013065] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 45.022749] iwpm_register_pid: Unable to send a nlmsg (client = 2)
[ 45.025434] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[ 45.072814] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[ 45.078662] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.2.15:4420
[ 55.141297] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 55.172082] ============================================
[ 55.172563] WARNING: possible recursive locking detected
[ 55.173018] 6.3.0+ #5 Tainted: G E
[ 55.173426] --------------------------------------------
[ 55.173876] kworker/u8:4/88 is trying to acquire lock:
[ 55.174312] ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.175099]
but task is already holding lock:
[ 55.175667] ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: cma_iw_handler+0xac/0x500 [rdma_cm]
[ 55.176445]
other info that might help us debug this:
[ 55.177027] Possible unsafe locking scenario:
[ 55.177567]        CPU0
[ 55.177873]        ----
[ 55.178113]   lock(&id_priv->handler_mutex);
[ 55.178488]   lock(&id_priv->handler_mutex);
[ 55.178864]
*** DEADLOCK ***
[ 55.179482] May be due to missing lock nesting notation
[ 55.180102] 3 locks held by kworker/u8:4/88:
[ 55.180516] #0: ffff888129888138 ((wq_completion)iw_cm_wq){+.+.}-{0:0}, at: process_one_work+0x706/0x1390
[ 55.181318] #1: ffff88811cadfdb0 ((work_completion)(&work->work)#2){+.+.}-{0:0}, at: process_one_work+0x733/0x1390
[ 55.182152] #2: ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: cma_iw_handler+0xac/0x500 [rdma_cm]
[ 55.183007]
stack backtrace:
[ 55.183458] CPU: 1 PID: 88 Comm: kworker/u8:4 Tainted: G E 6.3.0+ #5
[ 55.184099] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 55.185570] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
[ 55.186823] Call Trace:
[ 55.187832] <TASK>
[ 55.188813] dump_stack_lvl+0x57/0x90
[ 55.189866] __lock_acquire+0x12f5/0x5eb0
[ 55.190934] ? __pfx___lock_acquire+0x10/0x10
[ 55.192038] ? static_obj+0x89/0xc0
[ 55.193062] ? lockdep_init_map_type+0x169/0x7c0
[ 55.194175] lock_acquire+0x1a7/0x4e0
[ 55.195262] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.196394] ? __pfx_lock_acquire+0x10/0x10
[ 55.197470] ? lock_is_held_type+0xce/0x120
[ 55.198541] ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[ 55.199755] ? __pfx___might_resched+0x10/0x10
[ 55.200846] __mutex_lock+0x185/0x18b0
[ 55.201857] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.202922] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.203996] ? __pfx___mutex_lock+0x10/0x10
[ 55.204980] ? rcu_is_watching+0x11/0xb0
[ 55.205936] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.206988] rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.207994] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[ 55.209093] __nvmet_rdma_queue_disconnect+0x3c0/0x7b0 [nvmet_rdma]
[ 55.210215] nvmet_rdma_cm_handler+0x765/0x3a10 [nvmet_rdma]
[ 55.211304] ? mark_lock+0x101/0x16d0
[ 55.212201] ? __pfx___might_resched+0x10/0x10
[ 55.213109] ? rcu_is_watching+0x11/0xb0
[ 55.214040] ? __pfx_nvmet_rdma_cm_handler+0x10/0x10 [nvmet_rdma]
[ 55.215117] ? cma_iw_handler+0xac/0x500 [rdma_cm]
[ 55.216098] ? __pfx___mutex_lock+0x10/0x10
[ 55.216999] ? mark_lock+0x101/0x16d0
[ 55.217923] ? __pfx_mark_lock+0x10/0x10
[ 55.218837] cma_cm_event_handler+0xb2/0x2f0 [rdma_cm]
[ 55.219850] cma_iw_handler+0x344/0x500 [rdma_cm]
[ 55.220841] ? __pfx_cma_iw_handler+0x10/0x10 [rdma_cm]
[ 55.221874] ? _raw_spin_unlock_irqrestore+0x4c/0x60
[ 55.222889] cm_work_handler+0xb57/0x1c50 [iw_cm]
[ 55.223881] ? __pfx_cm_work_handler+0x10/0x10 [iw_cm]
[ 55.224916] ? lock_release+0x378/0x650
[ 55.225869] ? __schedule+0x10c8/0x5c80
[ 55.226816] process_one_work+0x7e4/0x1390
[ 55.227747] ? __pfx_lock_acquire+0x10/0x10
[ 55.228714] ? __pfx_process_one_work+0x10/0x10
[ 55.229689] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 55.230614] worker_thread+0xf7/0x12b0
[ 55.231476] ? __kthread_parkme+0xc1/0x1f0
[ 55.232378] ? __pfx_worker_thread+0x10/0x10
[ 55.233261] kthread+0x29e/0x340
[ 55.234071] ? __pfx_kthread+0x10/0x10
[ 55.234938] ret_from_fork+0x2c/0x50
[ 55.235772] </TASK>
[2]
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index cfd9d4368248..7d909a88d856 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1333,23 +1333,25 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	goto out;
 }
 
-static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue)
+static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue,
+					bool from_cm_handler)
 {
 	ib_drain_qp(queue->qp);
-	if (queue->cm_id)
+	if (!from_cm_handler && queue->cm_id)
 		rdma_destroy_id(queue->cm_id);
 	ib_destroy_qp(queue->qp);
 	ib_cq_pool_put(queue->cq, queue->recv_queue_size + 2 *
 		       queue->send_queue_size + 1);
 }
 
-static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
+static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue,
+				  bool from_cm_handler)
 {
 	pr_debug("freeing queue %d\n", queue->idx);
 
 	nvmet_sq_destroy(&queue->nvme_sq);
-	nvmet_rdma_destroy_queue_ib(queue);
+	nvmet_rdma_destroy_queue_ib(queue, from_cm_handler);
 	if (!queue->nsrq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				     queue->recv_queue_size,
@@ -1583,7 +1585,7 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 	return 0;
 
 free_queue:
-	nvmet_rdma_free_queue(queue);
+	nvmet_rdma_free_queue(queue, true);
 put_device:
 	kref_put(&ndev->ref, nvmet_rdma_free_dev);
 
@@ -1659,9 +1661,7 @@ static int __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue,
 		rdma_disconnect(queue->cm_id);
 		if (from_cm_handler)
 			ret = -ENODEV;
-		else
-			rdma_destroy_id(queue->cm_id);
-		nvmet_rdma_free_queue(queue);
+		nvmet_rdma_free_queue(queue, from_cm_handler);
 		kref_put(&dev->ref, nvmet_rdma_free_dev);
 	}
 
@@ -1700,7 +1700,7 @@ static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 
 	pr_err("failed to connect queue %d\n", queue->idx);
 
-	nvmet_rdma_free_queue(queue);
+	nvmet_rdma_free_queue(queue, true);
 	kref_put(&dev->ref, nvmet_rdma_free_dev);
 }
[3]
[ 83.653727] run blktests nvme/009 at 2023-05-14 16:23:38
[ 83.745101] SoftiWARP attached
[ 83.818301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 83.839954] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[ 83.878445] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[ 83.885342] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
[ 83.888614] nvme nvme1: creating 4 I/O queues.
[ 83.922458] nvme nvme1: mapped 4/0/0 default/read/poll queues.
[ 83.927734] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.15:4420
[ 85.333528] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
[ 85.429371] ============================================
[ 85.430868] WARNING: possible recursive locking detected
[ 85.432353] 6.3.0+ #5 Tainted: G E
[ 85.433785] --------------------------------------------
[ 85.435283] rm/1789 is trying to acquire lock:
[ 85.436688] ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.438606]
but task is already holding lock:
[ 85.441070] ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_port_del_ctrls+0x2a/0x130 [nvmet]
[ 85.443037]
other info that might help us debug this:
[ 85.445618] Possible unsafe locking scenario:
[ 85.448068]        CPU0
[ 85.449320]        ----
[ 85.450545]   lock(&subsys->lock#2);
[ 85.451891]   lock(&subsys->lock#2);
[ 85.453211]
*** DEADLOCK ***
[ 85.456225] May be due to missing lock nesting notation
[ 85.458498] 5 locks held by rm/1789:
[ 85.459675] #0: ffff8881218b0460 (sb_writers#15){.+.+}-{0:0}, at: do_unlinkat+0x182/0x590
[ 85.461326] #1: ffff8881089ac2b8 (&default_group_class[depth - 1]#4/1){+.+.}-{3:3}, at: do_unlinkat+0x202/0x590
[ 85.463099] #2: ffff88810913d1d0 (&sb->s_type->i_mutex_key#22){++++}-{3:3}, at: vfs_unlink+0xb7/0x810
[ 85.464829] #3: ffffffffc166fb90 (nvmet_config_sem){++++}-{3:3}, at: nvmet_port_subsys_drop_link+0x4c/0x2e0 [nvmet]
[ 85.466702] #4: ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_port_del_ctrls+0x2a/0x130 [nvmet]
[ 85.468507]
stack backtrace:
[ 85.470497] CPU: 1 PID: 1789 Comm: rm Tainted: G E 6.3.0+ #5
[ 85.471962] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 85.473666] Call Trace:
[ 85.474722] <TASK>
[ 85.475753] dump_stack_lvl+0x57/0x90
[ 85.476912] __lock_acquire+0x12f5/0x5eb0
[ 85.478095] ? __pfx_lock_release+0x10/0x10
[ 85.479333] ? __pfx___lock_acquire+0x10/0x10
[ 85.480549] ? mark_held_locks+0x96/0xe0
[ 85.481722] lock_acquire+0x1a7/0x4e0
[ 85.482855] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.484097] ? __pfx_lock_acquire+0x10/0x10
[ 85.485290] ? lock_is_held_type+0xce/0x120
[ 85.486421] ? kasan_set_track+0x21/0x30
[ 85.487507] ? kasan_save_free_info+0x2a/0x50
[ 85.488599] ? __pfx___might_resched+0x10/0x10
[ 85.489727] ? nvmet_sq_destroy+0xbd/0x3f0 [nvmet]
[ 85.490876] ? nvmet_rdma_free_queue+0x30/0x390 [nvmet_rdma]
[ 85.492123] ? __nvmet_rdma_queue_disconnect+0x3cd/0x780 [nvmet_rdma]
[ 85.493471] __mutex_lock+0x185/0x18b0
[ 85.494536] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.495680] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.496774] ? __pfx_mark_lock+0x10/0x10
[ 85.497794] ? __pfx___mutex_lock+0x10/0x10
[ 85.498892] ? __pfx_timer_delete+0x10/0x10
[ 85.499955] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.501058] nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.502144] ? __cancel_work+0x13d/0x220
[ 85.503184] ? lockdep_hardirqs_on+0x7d/0x100
[ 85.504248] ? __pfx_nvmet_ctrl_free+0x10/0x10 [nvmet]
[ 85.505385] ? __pfx___cancel_work+0x10/0x10
[ 85.506418] ? nvmet_sq_destroy+0xbd/0x3f0 [nvmet]
[ 85.507510] ? rcu_is_watching+0x11/0xb0
[ 85.508499] nvmet_sq_destroy+0x238/0x3f0 [nvmet]
[ 85.509598] nvmet_rdma_free_queue+0x30/0x390 [nvmet_rdma]
[ 85.510753] __nvmet_rdma_queue_disconnect+0x3cd/0x780 [nvmet_rdma]
[ 85.512023] nvmet_rdma_delete_ctrl+0x1a0/0x250 [nvmet_rdma]
[ 85.513266] nvmet_port_del_ctrls+0xe8/0x130 [nvmet]
[ 85.514422] nvmet_port_subsys_drop_link+0x1be/0x2e0 [nvmet]
[ 85.515657] ? __pfx_nvmet_port_subsys_drop_link+0x10/0x10 [nvmet]
[ 85.516935] configfs_unlink+0x382/0x7a0
[ 85.518006] vfs_unlink+0x29e/0x810
[ 85.519019] do_unlinkat+0x30c/0x590
[ 85.520014] ? __pfx_do_unlinkat+0x10/0x10
[ 85.521072] ? strncpy_from_user+0x66/0x2b0
[ 85.522119] ? getname_flags.part.0+0x8d/0x430
[ 85.523184] __x64_sys_unlinkat+0x92/0xe0
[ 85.524200] do_syscall_64+0x5c/0x90
[ 85.525105] ? asm_exc_page_fault+0x22/0x30
[ 85.526099] ? lockdep_hardirqs_on+0x7d/0x100
[ 85.527084] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 85.528157] RIP: 0033:0x7ffb731f9adb
[ 85.529079] Code: 73 01 c3 48 8b 0d 55 63 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 63 0d 00 f7 d8 64 89 01 48
[ 85.531986] RSP: 002b:00007ffdf5cca688 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[ 85.533397] RAX: ffffffffffffffda RBX: 0000561b08d0d810 RCX: 00007ffb731f9adb
[ 85.534706] RDX: 0000000000000000 RSI: 0000561b08d0c5f0 RDI: 00000000ffffff9c
[ 85.536022] RBP: 0000561b08d0c560 R08: 0000000000000003 R09: 0000000000000000
[ 85.537384] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000000
[ 85.538740] R13: 00007ffdf5cca7a0 R14: 0000000000000000 R15: 00007ffdf5cca7a0
[ 85.540079] </TASK>
[ 94.320147] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[ 94.326018] nvmet: ctrl 1 fatal error occurred!
[4]
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 3935165048e7..4589412c28c8 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -306,13 +306,15 @@ EXPORT_SYMBOL_GPL(nvmet_unregister_transport);
 
 void nvmet_port_del_ctrls(struct nvmet_port *port, struct nvmet_subsys *subsys)
 {
-	struct nvmet_ctrl *ctrl;
+	struct nvmet_ctrl *ctrl, *tmp;
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+	subsys->ctrls_owner = current;
+	list_for_each_entry_safe(ctrl, tmp, &subsys->ctrls, subsys_entry) {
 		if (ctrl->port == port)
 			ctrl->ops->delete_ctrl(ctrl);
 	}
+	subsys->ctrls_owner = NULL;
 	mutex_unlock(&subsys->lock);
 }
 
@@ -1476,11 +1478,18 @@ static void nvmet_ctrl_free(struct kref *ref)
 {
 	struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
 	struct nvmet_subsys *subsys = ctrl->subsys;
+	bool recursive_lock = false;
 
-	mutex_lock(&subsys->lock);
+	if (!mutex_trylock(&subsys->lock)) {
+		if (subsys->ctrls_owner == current)
+			recursive_lock = true;
+		else
+			mutex_lock(&subsys->lock);
+	}
 	nvmet_release_p2p_ns_map(ctrl);
 	list_del(&ctrl->subsys_entry);
-	mutex_unlock(&subsys->lock);
+	if (!recursive_lock)
+		mutex_unlock(&subsys->lock);
 
 	nvmet_stop_keep_alive_timer(ctrl);
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 89bedfcd974c..03569c712ae1 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -246,6 +246,7 @@ struct nvmet_subsys {
 	u16			cntlid_max;
 
 	struct list_head	ctrls;
+	struct task_struct	*ctrls_owner;
 
 	struct list_head	hosts;
 	bool			allow_any_host;
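To spell out the idea in [4]: nvmet_port_del_ctrls() records the current task
in ctrls_owner while it holds subsys->lock, so that a nested nvmet_ctrl_free()
on the same task can detect the recursion (mutex_trylock() fails and
ctrls_owner == current) and skip taking and dropping the lock again. That
avoids the self-deadlock, but lockdep cannot see this hand-rolled recursion
check, and as noted above the lock ordering against id_priv->handler_mutex
remains.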