[PATCH 1/1] nvme-tcp: lockdep: annotate in-kernel sockets

Mon Feb 21 02:16:23 PST 2022

Hi Chris
Thanks for your patch, the WARNING issue was fixed, but the tests
nvme/012 nvme/013 nvme/014 nvme/015 nvme/038 will take more time than
before, you can check the execution log below:


#nvme_trtype=tcp  ./check nvme/
nvme/003 (test if we're sending keep-alives to a discovery controller) [passed]
    runtime  10.403s  ...  10.399s
nvme/004 (test nvme and nvmet UUID NS descriptors)           [passed]
    runtime  1.539s  ...  1.552s
nvme/005 (reset local loopback target)                       [passed]
    runtime  1.740s  ...  1.738s
nvme/006 (create an NVMeOF target with a block device-backed ns) [passed]
    runtime  0.266s  ...  0.259s
nvme/007 (create an NVMeOF target with a file-backed ns)     [passed]
    runtime  0.312s  ...  0.315s
nvme/008 (create an NVMeOF host with a block device-backed ns) [passed]
    runtime  1.602s  ...  1.575s
nvme/009 (create an NVMeOF host with a file-backed ns)       [passed]
    runtime  1.623s  ...  1.624s
nvme/010 (run data verification fio job on NVMeOF block device-backed
ns) [passed]
    runtime  112.965s  ...  112.267s
nvme/011 (run data verification fio job on NVMeOF file-backed ns) [passed]
    runtime  219.577s  ...  219.101s
nvme/012 (run mkfs and data verification fio job on NVMeOF block
device-backed ns) [passed]
    runtime  88.482s  ...  180.961s
nvme/013 (run mkfs and data verification fio job on NVMeOF file-backed
ns) [passed]
    runtime  142.780s  ...  257.008s
nvme/014 (flush a NVMeOF block device-backed ns)             [passed]
    runtime  9.771s  ...  18.027s
nvme/015 (unit test for NVMe flush for file backed ns)       [passed]
    runtime  9.157s  ...  15.595s
nvme/018 (unit test NVMe-oF out of range access on a file backend) [passed]
    runtime  1.587s  ...  1.784s
nvme/019 (test NVMe DSM Discard command on NVMeOF block-device ns) [passed]
    runtime  1.360s  ...  1.571s
nvme/020 (test NVMe DSM Discard command on NVMeOF file-backed ns) [passed]
    runtime  1.578s  ...  1.752s
nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
    runtime  1.571s  ...  1.756s
nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [passed]
    runtime  1.715s  ...  1.947s
nvme/023 (test NVMe smart-log command on NVMeOF block-device ns) [passed]
    runtime  1.379s  ...  1.553s
nvme/024 (test NVMe smart-log command on NVMeOF file-backed ns) [passed]
    runtime  1.585s  ...  1.758s
nvme/025 (test NVMe effects-log command on NVMeOF file-backed ns) [passed]
    runtime  1.570s  ...  1.762s
nvme/026 (test NVMe ns-descs command on NVMeOF file-backed ns) [passed]
    runtime  1.550s  ...  1.740s
nvme/027 (test NVMe ns-rescan command on NVMeOF file-backed ns) [passed]
    runtime  1.558s  ...  1.723s
nvme/028 (test NVMe list-subsys command on NVMeOF file-backed ns) [passed]
    runtime  1.535s  ...  1.720s
nvme/029 (test userspace IO via nvme-cli read/write interface) [passed]
    runtime  1.580s  ...  2.058s
nvme/030 (ensure the discovery generation counter is updated
appropriately) [passed]
    runtime  0.408s  ...  0.467s
nvme/031 (test deletion of NVMeOF controllers immediately after setup) [passed]
    runtime  2.116s  ...  2.820s
nvme/038 (test deletion of NVMeOF subsystem without enabling) [passed]
    runtime  0.087s  ...  0.140s
On Wed, Feb 16, 2022 at 10:26 AM Chris Leech <cleech at redhat.com> wrote:
>
> Put NVMe/TCP sockets in their own class to avoid some lockdep warnings.
> Sockets created by nvme-tcp are not exposed to user-space, and will not
> trigger certain code paths that the general socket API exposes.
>
> Lockdep complains about a circular dependency between the socket and
> filesystem locks, because setsockopt can trigger a page fault with a
> socket lock held, but nvme-tcp sends requests on the socket while file
> system locks are held.
>
>   ======================================================
>   WARNING: possible circular locking dependency detected
>   5.15.0-rc3 #1 Not tainted
>   ------------------------------------------------------
>   fio/1496 is trying to acquire lock:
>   (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendpage+0x23/0x80
>
>   but task is already holding lock:
>   (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
>
>   which lock already depends on the new lock.
>
>   other info that might help us debug this:
>
>   chain exists of:
>    sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5
>
>   Possible unsafe locking scenario:
>
>         CPU0                    CPU1
>         ----                    ----
>    lock(&xfs_dir_ilock_class/5);
>                                 lock(sb_internal);
>                                 lock(&xfs_dir_ilock_class/5);
>    lock(sk_lock-AF_INET);
>
>   *** DEADLOCK ***
>
>   6 locks held by fio/1496:
>    #0: (sb_writers#13){.+.+}-{0:0}, at: path_openat+0x9fc/0xa20
>    #1: (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: path_openat+0x296/0xa20
>    #2: (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc_icreate+0x41/0xd0 [xfs]
>    #3: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
>    #4: (hctx->srcu){....}-{0:0}, at: hctx_lock+0x51/0xd0
>    #5: (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]
>
> This annotation lets lockdep analyze nvme-tcp controlled sockets
> independently of what the user-space sockets API does.
>
> Link: https://lore.kernel.org/linux-nvme/CAHj4cs9MDYLJ+q+2_GXUK9HxFizv2pxUryUR0toX974M040z7g@mail.gmail.com/
>
> Signed-off-by: Chris Leech <cleech at redhat.com>
> ---
>  drivers/nvme/host/tcp.c | 40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 891a36d02e7c..f1b27e8b9b70 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -30,6 +30,44 @@ static int so_priority;
>  module_param(so_priority, int, 0644);
>  MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +/* lockdep can detect a circular dependency of the form
> + *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
> + * because dependencies are tracked for both nvme-tcp and user contexts. Using
> + * a separate class prevents lockdep from conflating nvme-tcp socket use with
> + * user-space socket API use.
> + */
> +static struct lock_class_key nvme_tcp_sk_key[2];
> +static struct lock_class_key nvme_tcp_slock_key[2];
> +
> +static void nvme_tcp_reclassify_socket(struct socket *sock)
> +{
> +       struct sock *sk = sock->sk;
> +
> +       if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
> +               return;
> +
> +       switch (sk->sk_family) {
> +       case AF_INET:
> +               sock_lock_init_class_and_name(sk, "slock-AF_INET-NVME",
> +                                             &nvme_tcp_slock_key[0],
> +                                             "sk_lock-AF_INET-NVME",
> +                                             &nvme_tcp_sk_key[0]);
> +               break;
> +       case AF_INET6:
> +               sock_lock_init_class_and_name(sk, "slock-AF_INET6-NVME",
> +                                             &nvme_tcp_slock_key[1],
> +                                             "sk_lock-AF_INET6-NVME",
> +                                             &nvme_tcp_sk_key[1]);
> +               break;
> +       default:
> +               WARN_ON_ONCE(1);
> +       }
> +}
> +#else
> +static void nvme_tcp_reclassify_socket(struct socket *sock) { }
> +#endif
> +
>  enum nvme_tcp_send_state {
>         NVME_TCP_SEND_CMD_PDU = 0,
>         NVME_TCP_SEND_H2C_PDU,
> @@ -1435,6 +1473,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>                 goto err_destroy_mutex;
>         }
>
> +       nvme_tcp_reclassify_socket(queue->sock);
> +
>         /* Single syn retry */
>         tcp_sock_set_syncnt(queue->sock->sk, 1);
>
> --
> 2.34.1
>
>


-- 
Best Regards,
  Yi Zhang