[PATCH] nvme-tcp: Fix possible circular locking when deleting a controller under memory pressure

Daniel Wagner dwagner at suse.de
Mon Oct 24 05:19:38 PDT 2022


On Sun, Oct 23, 2022 at 11:04:43AM +0300, Sagi Grimberg wrote:
> When destroying a queue, the call to sock_release may require the
> network stack to allocate an skb to send a FIN/RST. When that happens
> under memory pressure, there is a need to reclaim memory, which in
> turn may ask the nvme-tcp device to write out dirty pages. This is
> not possible, however, because a ctrl teardown is in progress.
> 
> Set PF_MEMALLOC for the task that releases the socket to grant it
> access to the PF_MEMALLOC reserves. In addition, do the same for the
> nvme-tcp I/O thread, as its writes may also originate from swap itself
> and it should be more resilient to memory pressure situations.
> 
> This fixes the following lockdep complaint:
> --
> ======================================================
>  WARNING: possible circular locking dependency detected
>  6.0.0-rc2+ #25 Tainted: G        W
>  ------------------------------------------------------
>  kswapd0/92 is trying to acquire lock:
>  ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
> 
>  but task is already holding lock:
>  ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
> 
>  which lock already depends on the new lock.
> 
>  the existing dependency chain (in reverse order) is:
> 
>  -> #1 (fs_reclaim){+.+.}-{0:0}:
>         fs_reclaim_acquire+0x11e/0x160
>         kmem_cache_alloc_node+0x44/0x530
>         __alloc_skb+0x158/0x230
>         tcp_send_active_reset+0x7e/0x730
>         tcp_disconnect+0x1272/0x1ae0
>         __tcp_close+0x707/0xd90
>         tcp_close+0x26/0x80
>         inet_release+0xfa/0x220
>         sock_release+0x85/0x1a0
>         nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
>         nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
>         nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
>         kernfs_fop_write_iter+0x356/0x530
>         vfs_write+0x4e8/0xce0
>         ksys_write+0xfd/0x1d0
>         do_syscall_64+0x58/0x80
>         entry_SYSCALL_64_after_hwframe+0x63/0xcd
> 
>  -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
>         __lock_acquire+0x2a0c/0x5690
>         lock_acquire+0x18e/0x4f0
>         lock_sock_nested+0x37/0xc0
>         tcp_sendpage+0x23/0xa0
>         inet_sendpage+0xad/0x120
>         kernel_sendpage+0x156/0x440
>         nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
>         nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
>         __blk_mq_try_issue_directly+0x452/0x660
>         blk_mq_plug_issue_direct.constprop.0+0x207/0x700
>         blk_mq_flush_plug_list+0x6f5/0xc70
>         __blk_flush_plug+0x264/0x410
>         blk_finish_plug+0x4b/0xa0
>         shrink_lruvec+0x1263/0x1ea0
>         shrink_node+0x736/0x1a80
>         balance_pgdat+0x740/0x10d0
>         kswapd+0x5f2/0xaf0
>         kthread+0x256/0x2f0
>         ret_from_fork+0x1f/0x30
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(fs_reclaim);
>                                lock(sk_lock-AF_INET-NVME);
>                                lock(fs_reclaim);
>   lock(sk_lock-AF_INET-NVME);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kswapd0/92:
>  #0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
>  #1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
>  #2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]
> 
> Reported-by: Daniel Wagner <dwagner at suse.de>
> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>

Tested-by: Daniel Wagner <dwagner at suse.de>
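
For reference, the core of the approach described above is to enter a
PF_MEMALLOC scope around the socket release, roughly along these lines
(just a sketch based on the commit message, not necessarily the exact
hunk from the patch; memalloc_noreclaim_save/restore are the existing
helpers in <linux/sched/mm.h> that set and clear PF_MEMALLOC on
current):

    #include <linux/sched/mm.h>	/* memalloc_noreclaim_save/restore */

    static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
    {
            struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
            struct nvme_tcp_queue *queue = &ctrl->queues[qid];
            unsigned int noreclaim_flag;

            /* ... existing queue teardown (crypto, etc.) ... */

            /*
             * sock_release() may allocate an skb to send a FIN/RST.
             * Marking the task PF_MEMALLOC lets that allocation dip
             * into the memory reserves instead of recursing into
             * reclaim (and from there back into nvme-tcp).
             */
            noreclaim_flag = memalloc_noreclaim_save();
            sock_release(queue->sock);
            memalloc_noreclaim_restore(noreclaim_flag);
    }

The same pattern would apply at the start of the nvme-tcp I/O work
context, so that writes originating from swap are also served from the
reserves under memory pressure.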
