[bug report] WARNING: possible circular locking dependency detected with blktests nvme-tcp nvme/013
Chris Leech
cleech at redhat.com
Mon Feb 14 20:21:03 PST 2022
On 9/29/21 4:06 AM, Sagi Grimberg wrote:
>> Hello
>>
>> Bellow warning was triggered with blktests nvme-tcp on latest
>> 5.15.0-rc3, pls help check it.
>
> Hey Yi, thanks for reporting..
>
> I'm trying to understand the circular locking here. I'm trying to
> understand if this is due to having an allocation in the
> network send path that may cause a circular dependency.
Hi. I've spent a bit of time trying to make sense of this lockdep
complaint, and I have a few observations and a couple of experiments.

First, the stack trace where lockdep complains is a direct send from
nvme_tcp_queue_request, and it holds an impressive number of locks, as
you'd expect from a single call chain that runs from a VFS syscall,
through XFS, the blk-mq layer, the nvme driver, and finally into the
TCP code for a sendpage call.
Second, the hypothetical conflicting lock chain involves a setsockopt
syscall, which can take a page fault during a copy_from_user while
holding the sk_lock. This seems, to me, to be the key to how lockdep
thinks the sk_lock could be held before the various XFS locks.

If that could actually happen, holding the sk_lock while waiting for a
page fault to be handled, while that same socket was the transport
path to the storage the fault needed, would be a very bad thing
indeed. But the socket used by nvme-tcp is never exposed to userspace,
so this looks like a false positive. And lockdep isn't complaining
about the lock recursing, but about a circular dependency with the XFS
locks.
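
To make the recorded dependency concrete, here's a rough sketch of the
pattern lockdep saw on the setsockopt side. This is illustrative only
(the function name is made up, it is not the real ip_setsockopt code);
the point is the lock ordering it creates:

/* Illustrative sketch only, not the real ip_setsockopt() implementation.
 * What matters is the ordering lockdep records: sk_lock is taken first,
 * and copy_from_user() may then fault and take mm->mmap_lock under it.
 * From there a file-backed fault can reach filesystem locks, completing
 * the sk_lock -> mmap_lock -> sb_internal -> xfs ilock chain.
 */
static int example_setsockopt(struct sock *sk, char __user *optval,
			      unsigned int optlen)
{
	int val;

	lock_sock(sk);			/* takes sk_lock-AF_INET */

	/* may page fault -> acquires mm->mmap_lock with sk_lock held */
	if (copy_from_user(&val, optval, sizeof(val))) {
		release_sock(sk);
		return -EFAULT;
	}

	/* ... apply the option ... */
	release_sock(sk);
	return 0;
}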
Experiment #1: disable the direct send path so everything is sent from
workqueue context. This separates the sk_lock held by kernel_sendpage
from the filesystem locks. A quick hack, but it confirms that this
breaks the lockdep cycle (a note on where the sends end up instead
follows the patch).
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -294,7 +294,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
 	 * directly, otherwise queue io_work. Also, only do that if we
 	 * are on the same cpu, so we don't introduce contention.
 	 */
-	if (queue->io_cpu == raw_smp_processor_id() &&
+	if (false && queue->io_cpu == raw_smp_processor_id() &&
 	    sync && empty && mutex_trylock(&queue->send_mutex)) {
 		queue->more_requests = !last;
 		nvme_tcp_send_all(queue);
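
With the trylock branch disabled, nothing is sent inline from the
submitting context; the last request of a batch kicks io_work instead.
For reference, the tail of nvme_tcp_queue_request() looks roughly like
this in 5.15 (paraphrasing from memory, so treat it as a sketch):

	} else if (last) {
		/* defer to io_work, which runs in workqueue context and
		 * so holds no filesystem locks when it later takes
		 * sk_lock via kernel_sendpage()
		 */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}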
Experiment #2: create a dedicated lock_class_key for the socket used
by nvme-tcp. This separates lockdep's analysis of it from general
AF_INET socket use. Because the socket used by nvme-tcp is only ever
touched through in-kernel interfaces, there is never a copy_from_user
with its sk_lock held. The ceph networking code does the same thing
for similar reasons; for comparison, see the snippet after the patch.
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -14,12 +14,17 @@
 #include <linux/blk-mq.h>
 #include <crypto/hash.h>
 #include <net/busy_poll.h>
+#include <linux/lockdep.h>
 
 #include "nvme.h"
 #include "fabrics.h"
 
 struct nvme_tcp_queue;
 
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key nvme_tcp_socket_class;
+#endif
+
 /* Define the socket priority to use for connections were it is desirable
  * that the NIC consider performing optimized packet processing or filtering.
  * A non-zero value being sufficient to indicate general consideration of any
@@ -1427,6 +1432,10 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 		goto err_destroy_mutex;
 	}
 
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_class(&queue->sock->sk->sk_lock, &nvme_tcp_socket_class);
+#endif
+
 	/* Single syn retry */
 	tcp_sock_set_syncnt(queue->sock->sk, 1);
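
For comparison, this is roughly what the ceph messenger does for its
in-kernel sockets (quoting net/ceph/messenger.c from memory, so the
exact context may differ slightly; see ceph_tcp_connect() for the real
thing):

static struct lock_class_key socket_class;
...
#ifdef CONFIG_LOCKDEP
	lockdep_set_class(&sock->sk->sk_lock, &socket_class);
#endif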
I'm not sure why I can't reproduce this with ext4 as the filesystem.
Also, I may have a "sleeping function called in atomic context" report
with a similar enough call trace that the direct send path deserves a
closer look.
- Chris Leech
>> [ 2153.718873] run blktests nvme/013 at 2021-09-28 07:14:57
>> [ 2154.082894] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>> [ 2154.116627] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>> [ 2154.174352] nvmet: creating controller 1 for subsystem
>> blktests-subsystem-1 for NQN
>> nqn.2014-08.org.nvmexpress:uuid:c83e4244a1d04d318ced2447da679f67.
>> [ 2154.196889] nvme nvme0: creating 40 I/O queues.
>> [ 2154.222389] nvme nvme0: mapped 40/0/0 default/read/poll queues.
>> [ 2154.275779] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr
>> 127.0.0.1:4420
>> [ 2155.648483] XFS (nvme0n1): Mounting V5 Filesystem
>> [ 2155.717630] XFS (nvme0n1): Ending clean mount
>> [ 2155.734834] xfs filesystem being mounted at /mnt/blktests supports
>> timestamps until 2038 (0x7fffffff)
>>
>> [ 2155.932675] ======================================================
>> [ 2155.939575] WARNING: possible circular locking dependency detected
>> [ 2155.946467] 5.15.0-rc3 #1 Not tainted
>> [ 2155.950551] ------------------------------------------------------
>> [ 2155.957447] fio/1496 is trying to acquire lock:
>> [ 2155.962504] ffff93d58ffe2470 (sk_lock-AF_INET){+.+.}-{0:0}, at:
>> tcp_sendpage+0x23/0x80
>> [ 2155.971355]
>> but task is already holding lock:
>> [ 2155.977866] ffff93d59bb7f8f0 (&xfs_dir_ilock_class/5){+.+.}-{3:3},
>> at: xfs_ilock+0xcf/0x290 [xfs]
>> [ 2155.987885]
>> which lock already depends on the new lock.
>>
>> [ 2155.997012]
>> the existing dependency chain (in reverse order) is:
>> [ 2156.005362]
>> -> #3 (&xfs_dir_ilock_class/5){+.+.}-{3:3}:
>> [ 2156.012843] down_write_nested+0x49/0x120
>> [ 2156.017904] xfs_ilock+0xcf/0x290 [xfs]
>> [ 2156.022850] xfs_create+0x1e8/0x660 [xfs]
>> [ 2156.027989] xfs_generic_create+0xfa/0x340 [xfs]
>> [ 2156.033805] vfs_mkdir+0x12b/0x1e0
>> [ 2156.038194] do_mkdirat+0x10a/0x130
>> [ 2156.042669] __x64_sys_mkdir+0x48/0x70
>> [ 2156.047434] do_syscall_64+0x3b/0x90
>> [ 2156.052013] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 2156.058238]
>> -> #2 (sb_internal){.+.+}-{0:0}:
>> [ 2156.064654] xfs_trans_alloc+0x1a1/0x3c0 [xfs]
>> [ 2156.070281] xfs_vn_update_time+0xc6/0x260 [xfs]
>> [ 2156.076097] touch_atime+0x129/0x230
>> [ 2156.080672] xfs_file_mmap+0x84/0x90 [xfs]
>> [ 2156.085906] mmap_region+0x497/0x6b0
>> [ 2156.090480] do_mmap+0x362/0x560
>> [ 2156.094663] vm_mmap_pgoff+0xb8/0x150
>> [ 2156.099330] ksys_mmap_pgoff+0x198/0x1f0
>> [ 2156.104288] do_syscall_64+0x3b/0x90
>> [ 2156.108860] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 2156.115078]
>> -> #1 (&mm->mmap_lock#2){++++}-{3:3}:
>> [ 2156.121978] __might_fault+0x56/0x80
>> [ 2156.126553] _copy_from_user+0x20/0xa0
>> [ 2156.131327] ip_setsockopt+0x123d/0x15c0
>> [ 2156.136287] __sys_setsockopt+0xba/0x1a0
>> [ 2156.141250] __x64_sys_setsockopt+0x1b/0x20
>> [ 2156.146499] do_syscall_64+0x3b/0x90
>> [ 2156.151069] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 2156.157289]
>> -> #0 (sk_lock-AF_INET){+.+.}-{0:0}:
>> [ 2156.164093] __lock_acquire+0x11fe/0x1e00
>> [ 2156.169154] lock_acquire+0xc4/0x2e0
>> [ 2156.173727] lock_sock_nested+0x2e/0x80
>> [ 2156.178593] tcp_sendpage+0x23/0x80
>> [ 2156.183068] inet_sendpage+0x4f/0x80
>> [ 2156.187645] kernel_sendpage+0x62/0xf0
>> [ 2156.192413] nvme_tcp_try_send+0x128/0x810 [nvme_tcp]
>> [ 2156.198633] nvme_tcp_queue_rq+0x35a/0x380 [nvme_tcp]
>> [ 2156.204853] __blk_mq_try_issue_directly+0x120/0x1c0
>> [ 2156.210983] blk_mq_try_issue_list_directly+0x157/0x250
>> [ 2156.217400] blk_mq_sched_insert_requests+0x241/0x2a0
>> [ 2156.223620] blk_mq_flush_plug_list+0x114/0x210
>> [ 2156.229256] blk_finish_plug+0x21/0x30
>> [ 2156.234028] _xfs_buf_ioapply+0x30a/0x3c0 [xfs]
>> [ 2156.239747] __xfs_buf_submit+0x92/0x260 [xfs]
>> [ 2156.245360] xfs_buf_read_map+0xe9/0x2a0 [xfs]
>> [ 2156.250969] xfs_trans_read_buf_map+0x1bb/0x550 [xfs]
>> [ 2156.257274] xfs_btree_read_buf_block.constprop.0+0x75/0xb0
>> [xfs]
>> [ 2156.264732] xfs_btree_lookup_get_block+0x87/0x160 [xfs]
>> [ 2156.271316] xfs_btree_lookup+0x123/0x4d0 [xfs]
>> [ 2156.277027] xfs_dialloc_ag_update_inobt+0x3a/0x180 [xfs]
>> [ 2156.283709] xfs_dialloc+0x43f/0x800 [xfs]
>> [ 2156.288938] xfs_create+0x404/0x660 [xfs]
>> [ 2156.294075] xfs_generic_create+0xfa/0x340 [xfs]
>> [ 2156.299891] lookup_open.isra.0+0x537/0x670
>> [ 2156.305141] path_openat+0x2aa/0xa20
>> [ 2156.309713] do_filp_open+0x9f/0x130
>> [ 2156.314290] do_sys_openat2+0x7a/0x140
>> [ 2156.319057] __x64_sys_openat+0x45/0x70
>> [ 2156.323918] do_syscall_64+0x3b/0x90
>> [ 2156.328491] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 2156.334712]
>> other info that might help us debug this:
>>
>> [ 2156.343644] Chain exists of:
>>                  sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5
>>
>> [ 2156.355972] Possible unsafe locking scenario:
>>
>> [ 2156.362579] CPU0 CPU1
>> [ 2156.367632] ---- ----
>> [ 2156.372685]   lock(&xfs_dir_ilock_class/5);
>> [ 2156.377351]                                lock(sb_internal);
>> [ 2156.383764]                                lock(&xfs_dir_ilock_class/5);
>> [ 2156.391242]   lock(sk_lock-AF_INET);
>> [ 2156.395230]  *** DEADLOCK ***
>>
>> [ 2156.401834] 6 locks held by fio/1496:
>> [ 2156.405919] #0: ffff93d584c36498 (sb_writers#13){.+.+}-{0:0}, at:
>> path_openat+0x9fc/0xa20
>> [ 2156.415147] #1: ffff93d59bb7fb30
>> (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at:
>> path_openat+0x296/0xa20
>> [ 2156.426693] #2: ffff93d584c366b8 (sb_internal){.+.+}-{0:0}, at:
>> xfs_trans_alloc_icreate+0x41/0xd0 [xfs]
>> [ 2156.437363] #3: ffff93d59bb7f8f0
>> (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
>> [ 2156.447838] #4: ffff93d5d75f4c60 (hctx->srcu){....}-{0:0}, at:
>> hctx_lock+0x51/0xd0
>> [ 2156.456386] #5: ffff93d59d7bc3f0 (&queue->send_mutex){+.+.}-{3:3},
>> at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]
>> [ 2156.467748]
>> stack backtrace:
>> [ 2156.472612] CPU: 27 PID: 1496 Comm: fio Not tainted 5.15.0-rc3 #1
>> [ 2156.479410] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS
>> RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
>> [ 2156.490578] Call Trace:
>> [ 2156.493312] dump_stack_lvl+0x57/0x72
>> [ 2156.497404] check_noncircular+0xdf/0x100
>> [ 2156.501879] ? usage_match+0x20/0x20
>> [ 2156.505867] ? __bfs+0xf5/0x230
>> [ 2156.509373] __lock_acquire+0x11fe/0x1e00
>> [ 2156.513850] lock_acquire+0xc4/0x2e0
>> [ 2156.517829] ? tcp_sendpage+0x23/0x80
>> [ 2156.521917] lock_sock_nested+0x2e/0x80
>> [ 2156.526199] ? tcp_sendpage+0x23/0x80
>> [ 2156.530285] tcp_sendpage+0x23/0x80
>> [ 2156.534179] inet_sendpage+0x4f/0x80
>> [ 2156.538170] kernel_sendpage+0x62/0xf0
>> [ 2156.542355] nvme_tcp_try_send+0x128/0x810 [nvme_tcp]
>> [ 2156.547994] ? mutex_trylock+0x13f/0x160
>> [ 2156.552374] ? nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]
>> [ 2156.558210] nvme_tcp_queue_rq+0x35a/0x380 [nvme_tcp]
>> [ 2156.563850] __blk_mq_try_issue_directly+0x120/0x1c0
>> [ 2156.569395] blk_mq_try_issue_list_directly+0x157/0x250
>> [ 2156.575230] blk_mq_sched_insert_requests+0x241/0x2a0
>> [ 2156.580869] blk_mq_flush_plug_list+0x114/0x210
>> [ 2156.585926] blk_finish_plug+0x21/0x30
>> [ 2156.590110] _xfs_buf_ioapply+0x30a/0x3c0 [xfs]
>> [ 2156.595252] __xfs_buf_submit+0x92/0x260 [xfs]
>> [ 2156.600295] xfs_buf_read_map+0xe9/0x2a0 [xfs]
>> [ 2156.605335] xfs_trans_read_buf_map+0x1bb/0x550 [xfs]
>> [ 2156.611060] ? xfs_btree_read_buf_block.constprop.0+0x75/0xb0 [xfs]
>> [ 2156.618129] xfs_btree_read_buf_block.constprop.0+0x75/0xb0 [xfs]
>> [ 2156.625006] xfs_btree_lookup_get_block+0x87/0x160 [xfs]
>> [ 2156.631008] xfs_btree_lookup+0x123/0x4d0 [xfs]
>> [ 2156.636137] ? rcu_read_lock_sched_held+0x3f/0x80
>> [ 2156.641395] ? kmem_cache_alloc+0x28a/0x340
>> [ 2156.646070] xfs_dialloc_ag_update_inobt+0x3a/0x180 [xfs]
>> [ 2156.652172] xfs_dialloc+0x43f/0x800 [xfs]
>> [ 2156.656822] ? lock_acquired+0xbc/0x380
>> [ 2156.661104] xfs_create+0x404/0x660 [xfs]
>> [ 2156.665663] xfs_generic_create+0xfa/0x340 [xfs]
>> [ 2156.670900] lookup_open.isra.0+0x537/0x670
>> [ 2156.675573] path_openat+0x2aa/0xa20
>> [ 2156.679564] do_filp_open+0x9f/0x130
>> [ 2156.683554] ? lock_release+0x28a/0x400
>> [ 2156.687840] ? _raw_spin_unlock+0x1f/0x30
>> [ 2156.692314] ? alloc_fd+0x12f/0x1f0
>> [ 2156.696210] do_sys_openat2+0x7a/0x140
>> [ 2156.700394] __x64_sys_openat+0x45/0x70
>> [ 2156.704676] do_syscall_64+0x3b/0x90
>> [ 2156.708658] entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [ 2156.714297] RIP: 0033:0x7fcb40bc7584
>> [ 2156.718290] Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 56 8b f8 ff
>> 44 8b 54 24 0c 44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00
>> 00 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 89 44 24 0c e8 98 8b f8 ff
>> 8b 44
>> [ 2156.739247] RSP: 002b:00007ffda2d2e600 EFLAGS: 00000293 ORIG_RAX:
>> 0000000000000101
>> [ 2156.747698] RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
>> 00007fcb40bc7584
>> [ 2156.755660] RDX: 0000000000000041 RSI: 00007fcb3fad72b0 RDI:
>> 00000000ffffff9c
>> [ 2156.763626] RBP: 00007fcb3fad72b0 R08: 0000000000000000 R09:
>> 0000000000000000
>> [ 2156.771591] R10: 00000000000001a4 R11: 0000000000000293 R12:
>> 0000000000000041
>> [ 2156.779553] R13: 0000000000000000 R14: 00007fcb3fad7110 R15:
>> 000000003b600000
>>
>