[PATCH 1/1] nvme-tcp: lockdep: annotate in-kernel sockets

Sagi Grimberg sagi at grimberg.me
Mon Feb 21 03:37:35 PST 2022


> Hi Chris
> Thanks for your patch, the WARNING issue was fixed, but the tests
> nvme/012 nvme/013 nvme/014 nvme/015 nvme/038 will take more time than
> before, you can check the execution log below:

Strange... Chris, can you look into the difference?

> 
> 
> #nvme_trtype=tcp  ./check nvme/
> nvme/003 (test if we're sending keep-alives to a discovery controller) [passed]
>      runtime  10.403s  ...  10.399s
> nvme/004 (test nvme and nvmet UUID NS descriptors)           [passed]
>      runtime  1.539s  ...  1.552s
> nvme/005 (reset local loopback target)                       [passed]
>      runtime  1.740s  ...  1.738s
> nvme/006 (create an NVMeOF target with a block device-backed ns) [passed]
>      runtime  0.266s  ...  0.259s
> nvme/007 (create an NVMeOF target with a file-backed ns)     [passed]
>      runtime  0.312s  ...  0.315s
> nvme/008 (create an NVMeOF host with a block device-backed ns) [passed]
>      runtime  1.602s  ...  1.575s
> nvme/009 (create an NVMeOF host with a file-backed ns)       [passed]
>      runtime  1.623s  ...  1.624s
> nvme/010 (run data verification fio job on NVMeOF block device-backed
> ns) [passed]
>      runtime  112.965s  ...  112.267s
> nvme/011 (run data verification fio job on NVMeOF file-backed ns) [passed]
>      runtime  219.577s  ...  219.101s
> nvme/012 (run mkfs and data verification fio job on NVMeOF block
> device-backed ns) [passed]
>      runtime  88.482s  ...  180.961s
> nvme/013 (run mkfs and data verification fio job on NVMeOF file-backed
> ns) [passed]
>      runtime  142.780s  ...  257.008s
> nvme/014 (flush a NVMeOF block device-backed ns)             [passed]
>      runtime  9.771s  ...  18.027s
> nvme/015 (unit test for NVMe flush for file backed ns)       [passed]
>      runtime  9.157s  ...  15.595s
> nvme/018 (unit test NVMe-oF out of range access on a file backend) [passed]
>      runtime  1.587s  ...  1.784s
> nvme/019 (test NVMe DSM Discard command on NVMeOF block-device ns) [passed]
>      runtime  1.360s  ...  1.571s
> nvme/020 (test NVMe DSM Discard command on NVMeOF file-backed ns) [passed]
>      runtime  1.578s  ...  1.752s
> nvme/021 (test NVMe list command on NVMeOF file-backed ns)   [passed]
>      runtime  1.571s  ...  1.756s
> nvme/022 (test NVMe reset command on NVMeOF file-backed ns)  [passed]
>      runtime  1.715s  ...  1.947s
> nvme/023 (test NVMe smart-log command on NVMeOF block-device ns) [passed]
>      runtime  1.379s  ...  1.553s
> nvme/024 (test NVMe smart-log command on NVMeOF file-backed ns) [passed]
>      runtime  1.585s  ...  1.758s
> nvme/025 (test NVMe effects-log command on NVMeOF file-backed ns) [passed]
>      runtime  1.570s  ...  1.762s
> nvme/026 (test NVMe ns-descs command on NVMeOF file-backed ns) [passed]
>      runtime  1.550s  ...  1.740s
> nvme/027 (test NVMe ns-rescan command on NVMeOF file-backed ns) [passed]
>      runtime  1.558s  ...  1.723s
> nvme/028 (test NVMe list-subsys command on NVMeOF file-backed ns) [passed]
>      runtime  1.535s  ...  1.720s
> nvme/029 (test userspace IO via nvme-cli read/write interface) [passed]
>      runtime  1.580s  ...  2.058s
> nvme/030 (ensure the discovery generation counter is updated
> appropriately) [passed]
>      runtime  0.408s  ...  0.467s
> nvme/031 (test deletion of NVMeOF controllers immediately after setup) [passed]
>      runtime  2.116s  ...  2.820s
> nvme/038 (test deletion of NVMeOF subsystem without enabling) [passed]
>      runtime  0.087s  ...  0.140s
> On Wed, Feb 16, 2022 at 10:26 AM Chris Leech <cleech at redhat.com> wrote:
>>
>> Put NVMe/TCP sockets in their own class to avoid some lockdep warnings.
>> Sockets created by nvme-tcp are not exposed to user-space, and will not
>> trigger certain code paths that the general socket API exposes.
>>
>> Lockdep complains about a circular dependency between the socket and
>> filesystem locks, because setsockopt can trigger a page fault with a
>> socket lock held, but nvme-tcp sends requests on the socket while file
>> system locks are held.
>>
>>    ======================================================
>>    WARNING: possible circular locking dependency detected
>>    5.15.0-rc3 #1 Not tainted
>>    ------------------------------------------------------
>>    fio/1496 is trying to acquire lock:
>>    (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_sendpage+0x23/0x80
>>
>>    but task is already holding lock:
>>    (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
>>
>>    which lock already depends on the new lock.
>>
>>    other info that might help us debug this:
>>
>>    chain exists of:
>>     sk_lock-AF_INET --> sb_internal --> &xfs_dir_ilock_class/5
>>
>>    Possible unsafe locking scenario:
>>
>>          CPU0                    CPU1
>>          ----                    ----
>>     lock(&xfs_dir_ilock_class/5);
>>                                  lock(sb_internal);
>>                                  lock(&xfs_dir_ilock_class/5);
>>     lock(sk_lock-AF_INET);
>>
>>    *** DEADLOCK ***
>>
>>    6 locks held by fio/1496:
>>     #0: (sb_writers#13){.+.+}-{0:0}, at: path_openat+0x9fc/0xa20
>>     #1: (&inode->i_sb->s_type->i_mutex_dir_key){++++}-{3:3}, at: path_openat+0x296/0xa20
>>     #2: (sb_internal){.+.+}-{0:0}, at: xfs_trans_alloc_icreate+0x41/0xd0 [xfs]
>>     #3: (&xfs_dir_ilock_class/5){+.+.}-{3:3}, at: xfs_ilock+0xcf/0x290 [xfs]
>>     #4: (hctx->srcu){....}-{0:0}, at: hctx_lock+0x51/0xd0
>>     #5: (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0x33e/0x380 [nvme_tcp]
>>
>> This annotation lets lockdep analyze nvme-tcp controlled sockets
>> independently of what the user-space sockets API does.
>>
>> Link: https://lore.kernel.org/linux-nvme/CAHj4cs9MDYLJ+q+2_GXUK9HxFizv2pxUryUR0toX974M040z7g@mail.gmail.com/
>>
>> Signed-off-by: Chris Leech <cleech at redhat.com>
>> ---
>>   drivers/nvme/host/tcp.c | 40 ++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 40 insertions(+)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 891a36d02e7c..f1b27e8b9b70 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -30,6 +30,44 @@ static int so_priority;
>>   module_param(so_priority, int, 0644);
>>   MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>>
>> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> +/* lockdep can detect a circular dependency of the form
>> + *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
>> + * because dependencies are tracked for both nvme-tcp and user contexts. Using
>> + * a separate class prevents lockdep from conflating nvme-tcp socket use with
>> + * user-space socket API use.
>> + */
>> +static struct lock_class_key nvme_tcp_sk_key[2];
>> +static struct lock_class_key nvme_tcp_slock_key[2];
>> +
>> +static void nvme_tcp_reclassify_socket(struct socket *sock)
>> +{
>> +       struct sock *sk = sock->sk;
>> +
>> +       if (WARN_ON_ONCE(!sock_allow_reclassification(sk)))
>> +               return;
>> +
>> +       switch (sk->sk_family) {
>> +       case AF_INET:
>> +               sock_lock_init_class_and_name(sk, "slock-AF_INET-NVME",
>> +                                             &nvme_tcp_slock_key[0],
>> +                                             "sk_lock-AF_INET-NVME",
>> +                                             &nvme_tcp_sk_key[0]);
>> +               break;
>> +       case AF_INET6:
>> +               sock_lock_init_class_and_name(sk, "slock-AF_INET6-NVME",
>> +                                             &nvme_tcp_slock_key[1],
>> +                                             "sk_lock-AF_INET6-NVME",
>> +                                             &nvme_tcp_sk_key[1]);
>> +               break;
>> +       default:
>> +               WARN_ON_ONCE(1);
>> +       }
>> +}
>> +#else
>> +static void nvme_tcp_reclassify_socket(struct socket *sock) { }
>> +#endif
>> +
>>   enum nvme_tcp_send_state {
>>          NVME_TCP_SEND_CMD_PDU = 0,
>>          NVME_TCP_SEND_H2C_PDU,
>> @@ -1435,6 +1473,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
>>                  goto err_destroy_mutex;
>>          }
>>
>> +       nvme_tcp_reclassify_socket(queue->sock);
>> +
>>          /* Single syn retry */
>>          tcp_sock_set_syncnt(queue->sock->sk, 1);
>>
>> --
>> 2.34.1
>>
>>
> 
> 



More information about the Linux-nvme mailing list