[PATCH 1/1] nvme-tcp: lockdep: annotate in-kernel sockets
Yi Zhang
yi.zhang at redhat.com
Mon Feb 21 02:16:23 PST 2022
Hi Chris
Thanks for your patch, the WARNING issue was fixed, but the tests
nvme/012 nvme/013 nvme/014 nvme/015 nvme/038 will take more time than
before, you can check the execution log below:
#nvme_trtype=tcp ./check nvme/
nvme/003 (test if we're sending keep-alives to a discovery controller) [passed]
runtime 10.403s ... 10.399s
nvme/004 (test nvme and nvmet UUID NS descriptors) [passed]
runtime 1.539s ... 1.552s
nvme/005 (reset local loopback target) [passed]
runtime 1.740s ... 1.738s
nvme/006 (create an NVMeOF target with a block device-backed ns) [passed]
runtime 0.266s ... 0.259s
nvme/007 (create an NVMeOF target with a file-backed ns) [passed]
runtime 0.312s ... 0.315s
nvme/008 (create an NVMeOF host with a block device-backed ns) [passed]
runtime 1.602s ... 1.575s
nvme/009 (create an NVMeOF host with a file-backed ns) [passed]
runtime 1.623s ... 1.624s
nvme/010 (run data verification fio job on NVMeOF block device-backed
ns) [passed]
runtime 112.965s ... 112.267s
nvme/011 (run data verification fio job on NVMeOF file-backed ns) [passed]
runtime 219.577s ... 219.101s
nvme/012 (run mkfs and data verification fio job on NVMeOF block
device-backed ns) [passed]
runtime 88.482s ... 180.961s
nvme/013 (run mkfs and data verification fio job on NVMeOF file-backed
ns) [passed]
runtime 142.780s ... 257.008s
nvme/014 (flush a NVMeOF block device-backed ns) [passed]
runtime 9.771s ... 18.027s
nvme/015 (unit test for NVMe flush for file backed ns) [passed]
runtime 9.157s ... 15.595s
nvme/018 (unit test NVMe-oF out of range access on a file backend) [passed]
runtime 1.587s ... 1.784s
nvme/019 (test NVMe DSM Discard command on NVMeOF block-device ns) [passed]
runtime 1.360s ... 1.571s
nvme/020 (test NVMe DSM Discard command on NVMeOF file-backed ns) [passed]
runtime 1.578s ... 1.752s
nvme/021 (test NVMe list command on NVMeOF file-backed ns) [passed]
runtime 1.571s ... 1.756s
nvme/022 (test NVMe reset command on NVMeOF file-backed ns) [passed]
runtime 1.715s ... 1.947s
nvme/023 (test NVMe smart-log command on NVMeOF block-device ns) [passed]
runtime 1.379s ... 1.553s
nvme/024 (test NVMe smart-log command on NVMeOF file-backed ns) [passed]
runtime 1.585s ... 1.758s
nvme/025 (test NVMe effects-log command on NVMeOF file-backed ns) [passed]
runtime 1.570s ... 1.762s
nvme/026 (test NVMe ns-descs command on NVMeOF file-backed ns) [passed]
runtime 1.550s ... 1.740s
nvme/027 (test NVMe ns-rescan command on NVMeOF file-backed ns) [passed]
runtime 1.558s ... 1.723s
nvme/028 (test NVMe list-subsys command on NVMeOF file-backed ns) [passed]
runtime 1.535s ... 1.720s
nvme/029 (test userspace IO via nvme-cli read/write interface) [passed]
runtime 1.580s ... 2.058s
nvme/030 (ensure the discovery generation counter is updated
appropriately) [passed]
runtime 0.408s ... 0.467s
nvme/031 (test deletion of NVMeOF controllers immediately after setup) [passed]
runtime 2.116s ... 2.820s
nvme/038 (test deletion of NVMeOF subsystem without enabling) [passed]
runtime 0.087s ... 0.140s
On Wed, Feb 16, 2022 at 10:26 AM Chris Leech <cleech at redhat.com> wrote:
>
> Put NVMe/TCP sockets in their own class to avoid some lockdep warnings.
> Sockets created by nvme-tcp are not exposed to user-space, and will not
> trigger certain code paths that the general socket API exposes.
>
> Lockdep complains about a circular dependency between the socket and
> filesystem locks, because setsockopt can trigger a page fault with a
> socket lock held, but nvme-tcp sends requests on the socket while file
> system locks are held.
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.15.0-rc3 #1 Not tainted
> ------------------------------------------------------
> fio/1496 is trying to acquire lock:
> (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendpage+0x23/0x80
>
> but task is already holding lock:
> (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
>
> which lock already depends on the new lock.
>
> other info that might help us debug this:
>
> chain exists of:
> sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5
>
> Possible unsafe locking scenario:
>
> CPU0 CPU1
> ---- ----
> lock(&xfs_dir_ilock_class/5);
> lock(sb_internal);
> lock(&xfs_dir_ilock_class/5);
> lock(sk_lock-AF_INET);
>
> *** DEADLOCK ***
>
> 6 locks held by fio/1496:
> #0: (sb_writers#13){.+.+}-{0:0}, at: path_openat+0x9fc/0xa20
> #1: (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: path_openat+0x296/0xa20
> #2: (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc_icreate+0x41/0xd0 [xfs]
> #3: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
> #4: (hctx->srcu){....}-{0:0}, at: hctx_lock+0x51/0xd0
> #5: (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]
>
> This annotation lets lockdep analyze nvme-tcp controlled sockets
> independently of what the user-space sockets API does.
>
> Link: https://lore.kernel.org/linux-nvme/CAHj4cs9MDYLJ+q+2_GXUK9HxFizv2pxUryUR0toX974M040z7g@mail.gmail.com/
>
> Signed-off-by: Chris Leech <cleech at redhat.com>
> ---
> drivers/nvme/host/tcp.c | 40 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 891a36d02e7c..f1b27e8b9b70 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -30,6 +30,44 @@ static int so_priority;
> module_param(so_priority, int, 0644);
> MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +/* lockdep can detect a circular dependency of the form
> + * sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
> + * because dependencies are tracked for both nvme-tcp and user contexts. Using
> + * a separate class prevents lockdep from conflating nvme-tcp socket use with
> + * user-space socket API use.
> + */
> +static struct lock_class_key nvme_tcp_sk_key[2];
> +static struct lock_class_key nvme_tcp_slock_key[2];
> +
> +static void nvme_tcp_reclassify_socket(struct socket *sock)
> +{
> + struct sock *sk = sock->sk;
> +
> + if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
> + return;
> +
> + switch (sk->sk_family) {
> + case AF_INET:
> + sock_lock_init_class_and_name(sk, "slock-AF_INET-NVME",
> + &nvme_tcp_slock_key[0],
> + "sk_lock-AF_INET-NVME",
> + &nvme_tcp_sk_key[0]);
> + break;
> + case AF_INET6:
> + sock_lock_init_class_and_name(sk, "slock-AF_INET6-NVME",
> + &nvme_tcp_slock_key[1],
> + "sk_lock-AF_INET6-NVME",
> + &nvme_tcp_sk_key[1]);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + }
> +}
> +#else
> +static void nvme_tcp_reclassify_socket(struct socket *sock) { }
> +#endif
> +
> enum nvme_tcp_send_state {
> NVME_TCP_SEND_CMD_PDU = 0,
> NVME_TCP_SEND_H2C_PDU,
> @@ -1435,6 +1473,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
> goto err_destroy_mutex;
> }
>
> + nvme_tcp_reclassify_socket(queue->sock);
> +
> /* Single syn retry */
> tcp_sock_set_syncnt(queue->sock->sk, 1);
>
> --
> 2.34.1
>
>
--
Best Regards,
Yi Zhang
More information about the Linux-nvme
mailing list