[PATCH] nvmet-rdma: Release connections synchronously
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Sun May 14 03:38:09 PDT 2023
On May 11, 2023 / 08:03, Bart Van Assche wrote:
> Lockdep complains about flushing queue->release_work in the NVMe RDMA target
> driver because all instances of queue->release_work have the same lock class
> key. Assigning a different lock class key to each release_work instance is
> difficult because dynamically registered lock class keys must be unregistered
> and the last use of the work_struct lock class key happens after the work
> callback has returned. Hence rework the code for releasing connections as
> follows:
> * Add an argument to __nvmet_rdma_queue_disconnect() and also to
> nvmet_rdma_queue_disconnect() that indicates whether or not these functions
> are called from inside a CM handler callback.
> * Do not call rdma_destroy_id() if called from inside a CM handler callback.
> * Let these functions return a negative value if the caller should free the
> CM ID.
> * Let the CM handler return a negative value if its caller should free the
> CM ID.
> * Remove queue->release_work since releasing connections now happens
> synchronously.
>
> This patch suppresses the following lockdep complaint:
Bart, thanks for the v2 patch. I applied it to kernel v6.3 and ran the blktests
nvme test group with the siw transport. Unfortunately, I observed a test
process hang at nvme/003 [1]. The kernel reported a lockdep WARN on
id_priv->handler_mutex. The patch adds the "from_cm_handler" argument to
__nvmet_rdma_queue_disconnect() to skip rdma_destroy_id() when the call comes
from a CM handler. However, rdma_destroy_id() is still called in the call path
below:
__nvmet_rdma_queue_disconnect()
  nvmet_rdma_free_queue()
    nvmet_rdma_destroy_queue_ib()
      rdma_destroy_id()
So I think from_cm_handler should also be added to nvmet_rdma_free_queue() and
nvmet_rdma_destroy_queue_ib(), so that from_cm_handler is propagated and
rdma_destroy_id() is not called in nvmet_rdma_destroy_queue_ib(). I created a
quick patch with this approach [2] and observed that it avoids the process hang.
However, when I did a further test using the v2 patch and the fix [2], I
observed another process hang at nvme/009 [3]. This time, a deadlock on
subsys->lock happened in nvmet_ctrl_free(); see the call chain sketched below.
I created a dirty patch to avoid it [4], but it caused a lockdep WARN about a
possible deadlock between subsys->lock and id_priv->handler_mutex. I also
observed a hang at nvme/044, related to nvmet_fatal_error_handler.
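For reference, the recursive locking in [3] happens through the call chain
below (frames taken from the lockdep report):

nvmet_port_del_ctrls()                ... takes subsys->lock
  nvmet_rdma_delete_ctrl()
    __nvmet_rdma_queue_disconnect()
      nvmet_rdma_free_queue()
        nvmet_sq_destroy()
          nvmet_ctrl_free()           ... takes subsys->lock again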
It looks like the removal of queue->release_work triggers more lock-related
issues. Do you think this approach is the only way to address the original
issue? If so, it will need multiple follow-up fixes, I think.
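For what it's worth, my guess at the scenario behind the subsys->lock vs
id_priv->handler_mutex WARN is an ABBA inversion roughly like the sketch below
(my own reconstruction, not copied from a lockdep report): the configfs removal
path destroys the CM ID while holding subsys->lock, while the CM handler path
frees the controller while holding id_priv->handler_mutex.

  configfs removal path                  CM event handler path
  ---------------------                  ---------------------
  lock(&subsys->lock);                   lock(&id_priv->handler_mutex);
  rdma_destroy_id()                      nvmet_ctrl_free()
    lock(&id_priv->handler_mutex);         lock(&subsys->lock);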
[1]
[ 44.824261] run blktests nvme/003 at 2023-05-14 16:32:23
[ 44.925161] SoftiWARP attached
[ 45.013065] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 45.022749] iwpm_register_pid: Unable to send a nlmsg (client = 2)
[ 45.025434] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[ 45.072814] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[ 45.078662] nvme nvme1: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.2.15:4420
[ 55.141297] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[ 55.172082] ============================================
[ 55.172563] WARNING: possible recursive locking detected
[ 55.173018] 6.3.0+ #5 Tainted: G E
[ 55.173426] --------------------------------------------
[ 55.173876] kworker/u8:4/88 is trying to acquire lock:
[ 55.174312] ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.175099]
but task is already holding lock:
[ 55.175667] ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: cma_iw_handler+0xac/0x500 [rdma_cm]
[ 55.176445]
other info that might help us debug this:
[ 55.177027] Possible unsafe locking scenario:
[ 55.177567]        CPU0
[ 55.177873]        ----
[ 55.178113]   lock(&id_priv->handler_mutex);
[ 55.178488]   lock(&id_priv->handler_mutex);
[ 55.178864]
*** DEADLOCK ***
[ 55.179482] May be due to missing lock nesting notation
[ 55.180102] 3 locks held by kworker/u8:4/88:
[ 55.180516] #0: ffff888129888138 ((wq_completion)iw_cm_wq){+.+.}-{0:0}, at: process_one_work+0x706/0x1390
[ 55.181318] #1: ffff88811cadfdb0 ((work_completion)(&work->work)#2){+.+.}-{0:0}, at: process_one_work+0x733/0x1390
[ 55.182152] #2: ffff88811ecd2420 (&id_priv->handler_mutex){+.+.}-{3:3}, at: cma_iw_handler+0xac/0x500 [rdma_cm]
[ 55.183007]
stack backtrace:
[ 55.183458] CPU: 1 PID: 88 Comm: kworker/u8:4 Tainted: G E 6.3.0+ #5
[ 55.184099] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 55.185570] Workqueue: iw_cm_wq cm_work_handler [iw_cm]
[ 55.186823] Call Trace:
[ 55.187832] <TASK>
[ 55.188813] dump_stack_lvl+0x57/0x90
[ 55.189866] __lock_acquire+0x12f5/0x5eb0
[ 55.190934] ? __pfx___lock_acquire+0x10/0x10
[ 55.192038] ? static_obj+0x89/0xc0
[ 55.193062] ? lockdep_init_map_type+0x169/0x7c0
[ 55.194175] lock_acquire+0x1a7/0x4e0
[ 55.195262] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.196394] ? __pfx_lock_acquire+0x10/0x10
[ 55.197470] ? lock_is_held_type+0xce/0x120
[ 55.198541] ? __pfx_ib_drain_qp_done+0x10/0x10 [ib_core]
[ 55.199755] ? __pfx___might_resched+0x10/0x10
[ 55.200846] __mutex_lock+0x185/0x18b0
[ 55.201857] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.202922] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.203996] ? __pfx___mutex_lock+0x10/0x10
[ 55.204980] ? rcu_is_watching+0x11/0xb0
[ 55.205936] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.206988] rdma_destroy_id+0x17/0x20 [rdma_cm]
[ 55.207994] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
[ 55.209093] __nvmet_rdma_queue_disconnect+0x3c0/0x7b0 [nvmet_rdma]
[ 55.210215] nvmet_rdma_cm_handler+0x765/0x3a10 [nvmet_rdma]
[ 55.211304] ? mark_lock+0x101/0x16d0
[ 55.212201] ? __pfx___might_resched+0x10/0x10
[ 55.213109] ? rcu_is_watching+0x11/0xb0
[ 55.214040] ? __pfx_nvmet_rdma_cm_handler+0x10/0x10 [nvmet_rdma]
[ 55.215117] ? cma_iw_handler+0xac/0x500 [rdma_cm]
[ 55.216098] ? __pfx___mutex_lock+0x10/0x10
[ 55.216999] ? mark_lock+0x101/0x16d0
[ 55.217923] ? __pfx_mark_lock+0x10/0x10
[ 55.218837] cma_cm_event_handler+0xb2/0x2f0 [rdma_cm]
[ 55.219850] cma_iw_handler+0x344/0x500 [rdma_cm]
[ 55.220841] ? __pfx_cma_iw_handler+0x10/0x10 [rdma_cm]
[ 55.221874] ? _raw_spin_unlock_irqrestore+0x4c/0x60
[ 55.222889] cm_work_handler+0xb57/0x1c50 [iw_cm]
[ 55.223881] ? __pfx_cm_work_handler+0x10/0x10 [iw_cm]
[ 55.224916] ? lock_release+0x378/0x650
[ 55.225869] ? __schedule+0x10c8/0x5c80
[ 55.226816] process_one_work+0x7e4/0x1390
[ 55.227747] ? __pfx_lock_acquire+0x10/0x10
[ 55.228714] ? __pfx_process_one_work+0x10/0x10
[ 55.229689] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 55.230614] worker_thread+0xf7/0x12b0
[ 55.231476] ? __kthread_parkme+0xc1/0x1f0
[ 55.232378] ? __pfx_worker_thread+0x10/0x10
[ 55.233261] kthread+0x29e/0x340
[ 55.234071] ? __pfx_kthread+0x10/0x10
[ 55.234938] ret_from_fork+0x2c/0x50
[ 55.235772] </TASK>
[2]
diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index cfd9d4368248..7d909a88d856 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1333,23 +1333,25 @@ static int nvmet_rdma_create_queue_ib(struct nvmet_rdma_queue *queue)
 	goto out;
 }
 
-static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue)
+static void nvmet_rdma_destroy_queue_ib(struct nvmet_rdma_queue *queue,
+					bool from_cm_handler)
 {
 	ib_drain_qp(queue->qp);
-	if (queue->cm_id)
+	if (!from_cm_handler && queue->cm_id)
 		rdma_destroy_id(queue->cm_id);
 	ib_destroy_qp(queue->qp);
 	ib_cq_pool_put(queue->cq, queue->recv_queue_size + 2 *
 		       queue->send_queue_size + 1);
 }
 
-static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
+static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue,
+				  bool from_cm_handler)
 {
 	pr_debug("freeing queue %d\n", queue->idx);
 
 	nvmet_sq_destroy(&queue->nvme_sq);
-	nvmet_rdma_destroy_queue_ib(queue);
+	nvmet_rdma_destroy_queue_ib(queue, from_cm_handler);
 	if (!queue->nsrq) {
 		nvmet_rdma_free_cmds(queue->dev, queue->cmds,
 				     queue->recv_queue_size,
@@ -1583,7 +1585,7 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
 	return 0;
 
 free_queue:
-	nvmet_rdma_free_queue(queue);
+	nvmet_rdma_free_queue(queue, true);
 put_device:
 	kref_put(&ndev->ref, nvmet_rdma_free_dev);
 
@@ -1659,9 +1661,7 @@ static int __nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue,
 		rdma_disconnect(queue->cm_id);
 		if (from_cm_handler)
 			ret = -ENODEV;
-		else
-			rdma_destroy_id(queue->cm_id);
-		nvmet_rdma_free_queue(queue);
+		nvmet_rdma_free_queue(queue, from_cm_handler);
 		kref_put(&dev->ref, nvmet_rdma_free_dev);
 	}
 
@@ -1700,7 +1700,7 @@ static void nvmet_rdma_queue_connect_fail(struct rdma_cm_id *cm_id,
 
 	pr_err("failed to connect queue %d\n", queue->idx);
 
-	nvmet_rdma_free_queue(queue);
+	nvmet_rdma_free_queue(queue, true);
 	kref_put(&dev->ref, nvmet_rdma_free_dev);
 }
[3]
[ 83.653727] run blktests nvme/009 at 2023-05-14 16:23:38
[ 83.745101] SoftiWARP attached
[ 83.818301] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 83.839954] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[ 83.878445] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:00000000-0000-0000-0000-000000000000.
[ 83.885342] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
[ 83.888614] nvme nvme1: creating 4 I/O queues.
[ 83.922458] nvme nvme1: mapped 4/0/0 default/read/poll queues.
[ 83.927734] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.15:4420
[ 85.333528] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
[ 85.429371] ============================================
[ 85.430868] WARNING: possible recursive locking detected
[ 85.432353] 6.3.0+ #5 Tainted: G E
[ 85.433785] --------------------------------------------
[ 85.435283] rm/1789 is trying to acquire lock:
[ 85.436688] ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.438606]
but task is already holding lock:
[ 85.441070] ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_port_del_ctrls+0x2a/0x130 [nvmet]
[ 85.443037]
other info that might help us debug this:
[ 85.445618] Possible unsafe locking scenario:
[ 85.448068]        CPU0
[ 85.449320]        ----
[ 85.450545]   lock(&subsys->lock#2);
[ 85.451891]   lock(&subsys->lock#2);
[ 85.453211]
*** DEADLOCK ***
[ 85.456225] May be due to missing lock nesting notation
[ 85.458498] 5 locks held by rm/1789:
[ 85.459675] #0: ffff8881218b0460 (sb_writers#15){.+.+}-{0:0}, at: do_unlinkat+0x182/0x590
[ 85.461326] #1: ffff8881089ac2b8 (&default_group_class[depth - 1]#4/1){+.+.}-{3:3}, at: do_unlinkat+0x202/0x590
[ 85.463099] #2: ffff88810913d1d0 (&sb->s_type->i_mutex_key#22){++++}-{3:3}, at: vfs_unlink+0xb7/0x810
[ 85.464829] #3: ffffffffc166fb90 (nvmet_config_sem){++++}-{3:3}, at: nvmet_port_subsys_drop_link+0x4c/0x2e0 [nvmet]
[ 85.466702] #4: ffff8881215cb870 (&subsys->lock#2){+.+.}-{3:3}, at: nvmet_port_del_ctrls+0x2a/0x130 [nvmet]
[ 85.468507]
stack backtrace:
[ 85.470497] CPU: 1 PID: 1789 Comm: rm Tainted: G E 6.3.0+ #5
[ 85.471962] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc38 04/01/2014
[ 85.473666] Call Trace:
[ 85.474722] <TASK>
[ 85.475753] dump_stack_lvl+0x57/0x90
[ 85.476912] __lock_acquire+0x12f5/0x5eb0
[ 85.478095] ? __pfx_lock_release+0x10/0x10
[ 85.479333] ? __pfx___lock_acquire+0x10/0x10
[ 85.480549] ? mark_held_locks+0x96/0xe0
[ 85.481722] lock_acquire+0x1a7/0x4e0
[ 85.482855] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.484097] ? __pfx_lock_acquire+0x10/0x10
[ 85.485290] ? lock_is_held_type+0xce/0x120
[ 85.486421] ? kasan_set_track+0x21/0x30
[ 85.487507] ? kasan_save_free_info+0x2a/0x50
[ 85.488599] ? __pfx___might_resched+0x10/0x10
[ 85.489727] ? nvmet_sq_destroy+0xbd/0x3f0 [nvmet]
[ 85.490876] ? nvmet_rdma_free_queue+0x30/0x390 [nvmet_rdma]
[ 85.492123] ? __nvmet_rdma_queue_disconnect+0x3cd/0x780 [nvmet_rdma]
[ 85.493471] __mutex_lock+0x185/0x18b0
[ 85.494536] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.495680] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.496774] ? __pfx_mark_lock+0x10/0x10
[ 85.497794] ? __pfx___mutex_lock+0x10/0x10
[ 85.498892] ? __pfx_timer_delete+0x10/0x10
[ 85.499955] ? nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.501058] nvmet_ctrl_free+0xb4/0x800 [nvmet]
[ 85.502144] ? __cancel_work+0x13d/0x220
[ 85.503184] ? lockdep_hardirqs_on+0x7d/0x100
[ 85.504248] ? __pfx_nvmet_ctrl_free+0x10/0x10 [nvmet]
[ 85.505385] ? __pfx___cancel_work+0x10/0x10
[ 85.506418] ? nvmet_sq_destroy+0xbd/0x3f0 [nvmet]
[ 85.507510] ? rcu_is_watching+0x11/0xb0
[ 85.508499] nvmet_sq_destroy+0x238/0x3f0 [nvmet]
[ 85.509598] nvmet_rdma_free_queue+0x30/0x390 [nvmet_rdma]
[ 85.510753] __nvmet_rdma_queue_disconnect+0x3cd/0x780 [nvmet_rdma]
[ 85.512023] nvmet_rdma_delete_ctrl+0x1a0/0x250 [nvmet_rdma]
[ 85.513266] nvmet_port_del_ctrls+0xe8/0x130 [nvmet]
[ 85.514422] nvmet_port_subsys_drop_link+0x1be/0x2e0 [nvmet]
[ 85.515657] ? __pfx_nvmet_port_subsys_drop_link+0x10/0x10 [nvmet]
[ 85.516935] configfs_unlink+0x382/0x7a0
[ 85.518006] vfs_unlink+0x29e/0x810
[ 85.519019] do_unlinkat+0x30c/0x590
[ 85.520014] ? __pfx_do_unlinkat+0x10/0x10
[ 85.521072] ? strncpy_from_user+0x66/0x2b0
[ 85.522119] ? getname_flags.part.0+0x8d/0x430
[ 85.523184] __x64_sys_unlinkat+0x92/0xe0
[ 85.524200] do_syscall_64+0x5c/0x90
[ 85.525105] ? asm_exc_page_fault+0x22/0x30
[ 85.526099] ? lockdep_hardirqs_on+0x7d/0x100
[ 85.527084] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 85.528157] RIP: 0033:0x7ffb731f9adb
[ 85.529079] Code: 73 01 c3 48 8b 0d 55 63 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 25 63 0d 00 f7 d8 64 89 01 48
[ 85.531986] RSP: 002b:00007ffdf5cca688 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
[ 85.533397] RAX: ffffffffffffffda RBX: 0000561b08d0d810 RCX: 00007ffb731f9adb
[ 85.534706] RDX: 0000000000000000 RSI: 0000561b08d0c5f0 RDI: 00000000ffffff9c
[ 85.536022] RBP: 0000561b08d0c560 R08: 0000000000000003 R09: 0000000000000000
[ 85.537384] R10: 0000000000000100 R11: 0000000000000246 R12: 0000000000000000
[ 85.538740] R13: 00007ffdf5cca7a0 R14: 0000000000000000 R15: 00007ffdf5cca7a0
[ 85.540079] </TASK>
[ 94.320147] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[ 94.326018] nvmet: ctrl 1 fatal error occurred!
[4]
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 3935165048e7..4589412c28c8 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -306,13 +306,15 @@ EXPORT_SYMBOL_GPL(nvmet_unregister_transport);
 
 void nvmet_port_del_ctrls(struct nvmet_port *port, struct nvmet_subsys *subsys)
 {
-	struct nvmet_ctrl *ctrl;
+	struct nvmet_ctrl *ctrl, *tmp;
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+	subsys->ctrls_owner = current;
+	list_for_each_entry_safe(ctrl, tmp, &subsys->ctrls, subsys_entry) {
 		if (ctrl->port == port)
 			ctrl->ops->delete_ctrl(ctrl);
 	}
+	subsys->ctrls_owner = NULL;
 	mutex_unlock(&subsys->lock);
 }
 
@@ -1476,11 +1478,18 @@ static void nvmet_ctrl_free(struct kref *ref)
 {
 	struct nvmet_ctrl *ctrl = container_of(ref, struct nvmet_ctrl, ref);
 	struct nvmet_subsys *subsys = ctrl->subsys;
+	bool recursive_lock = false;
 
-	mutex_lock(&subsys->lock);
+	if (!mutex_trylock(&subsys->lock)) {
+		if (subsys->ctrls_owner == current)
+			recursive_lock = true;
+		else
+			mutex_lock(&subsys->lock);
+	}
 	nvmet_release_p2p_ns_map(ctrl);
 	list_del(&ctrl->subsys_entry);
-	mutex_unlock(&subsys->lock);
+	if (!recursive_lock)
+		mutex_unlock(&subsys->lock);
 
 	nvmet_stop_keep_alive_timer(ctrl);
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 89bedfcd974c..03569c712ae1 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -246,6 +246,7 @@ struct nvmet_subsys {
 	u16			cntlid_max;
 
 	struct list_head	ctrls;
+	struct task_struct	*ctrls_owner;
 
 	struct list_head	hosts;
 	bool			allow_any_host;
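To spell out the idea in [4]: nvmet_port_del_ctrls() records the current task
in ctrls_owner while it holds subsys->lock, so that a nested nvmet_ctrl_free()
on the same task can detect the recursion (mutex_trylock() fails and
ctrls_owner == current) and skip taking and dropping the lock again. That
avoids the self-deadlock, but lockdep cannot see this hand-rolled recursion
check, and as noted above the lock ordering against id_priv->handler_mutex
remains.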