[PATCH V3] nvme-tcp: teardown circular locking fixes

Daniel Wagner dwagner at suse.de
Thu Feb 26 23:56:38 PST 2026


On Wed, Feb 25, 2026 at 06:56:58PM -0800, Chaitanya Kulkarni wrote:
> When a controller reset is triggered via sysfs (by writing to
> /sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down
> and re-establishes all queues. Releasing the socket with fput() defers
> the actual cleanup to task_work or, when called from a kernel thread
> such as the reset worker, to the delayed_fput workqueue. This deferred
> cleanup can race with the subsequent queue re-allocation during reset,
> potentially leading to use-after-free or resource conflicts.
> 
> Replace fput() with __fput_sync() to ensure synchronous socket release,
> guaranteeing that all socket resources are fully cleaned up before the
> function returns. This prevents races during controller reset where
> new queue setup may begin before the old socket is fully released.
> 
> * Call chain during reset:
>   nvme_reset_ctrl_work()
>     -> nvme_tcp_teardown_ctrl()
>       -> nvme_tcp_teardown_io_queues()
>         -> nvme_tcp_free_io_queues()
>           -> nvme_tcp_free_queue()       <-- fput() -> __fput_sync()
>       -> nvme_tcp_teardown_admin_queue()
>         -> nvme_tcp_free_admin_queue()
>           -> nvme_tcp_free_queue()       <-- fput() -> __fput_sync()
>     -> nvme_tcp_setup_ctrl()             <-- race with deferred fput
> 
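Here is a rough userspace model of the ordering problem described above, for illustration only. The names struct sock_slot, release_deferred(), release_sync(), run_deferred_work() and setup_queue() are hypothetical and are not kernel APIs. The single shared slot simplifies whatever resource the old and new queues actually contend for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: a resource that a new queue can
 * only claim once the previous owner has released it. */
struct sock_slot {
    bool in_use;
};

/* Stand-in for the delayed_fput list: releases queued for later. */
#define MAX_DEFERRED 8
static struct sock_slot *deferred[MAX_DEFERRED];
static size_t ndeferred;

/* fput()-like: queue the release and return immediately. */
static void release_deferred(struct sock_slot *s)
{
    assert(ndeferred < MAX_DEFERRED);
    deferred[ndeferred++] = s;
}

/* __fput_sync()-like: the release is complete when this returns. */
static void release_sync(struct sock_slot *s)
{
    s->in_use = false;
}

/* Stand-in for the workqueue eventually running the queued releases. */
static void run_deferred_work(void)
{
    for (size_t i = 0; i < ndeferred; i++)
        deferred[i]->in_use = false;
    ndeferred = 0;
}

/* Hypothetical setup step: fails while the old owner is still alive. */
static bool setup_queue(struct sock_slot *s)
{
    if (s->in_use)
        return false;
    s->in_use = true;
    return true;
}
```

Calling setup_queue() after release_deferred() but before run_deferred_work() fails, which models the window the patch closes. After release_sync(), it succeeds immediately.
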
> memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks
> performing memory reclaim work that need reserve access. While PF_MEMALLOC
> prevents the task from entering direct reclaim (causing __need_reclaim() to
> return false), it does not strip __GFP_IO from gfp flags. The allocator can
> therefore still trigger writeback I/O when __GFP_IO remains set, which is
> unsafe when the caller holds block layer locks.
> 
> Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes
> current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in
> the scope, making it safe to allocate memory while holding elevator_lock and
> set->srcu.
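
A similar sketch of the gfp scoping argument, again as a userspace model. The X_* bit values are made up for illustration and do not match the kernel's. noio_save()/noio_restore() and noreclaim_save()/noreclaim_restore() follow the save/restore pattern of memalloc_noio_save() and memalloc_noreclaim_save(). gfp_context() mirrors only the NOIO branch of current_gfp_context(), and need_reclaim() mirrors the PF_MEMALLOC check in __need_reclaim().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real values differ. */
#define X_GFP_DIRECT_RECLAIM 0x1u
#define X_GFP_IO             0x2u
#define X_GFP_FS             0x4u
#define X_GFP_KERNEL (X_GFP_DIRECT_RECLAIM | X_GFP_IO | X_GFP_FS)

#define X_PF_MEMALLOC      0x10u
#define X_PF_MEMALLOC_NOIO 0x20u

/* Stand-in for current->flags of the calling task. */
static unsigned int task_flags;

/* Enter a NOIO scope; return the previous NOIO bit so scopes can nest. */
static unsigned int noio_save(void)
{
    unsigned int old = task_flags & X_PF_MEMALLOC_NOIO;
    task_flags |= X_PF_MEMALLOC_NOIO;
    return old;
}

static void noio_restore(unsigned int old)
{
    assert((old & ~X_PF_MEMALLOC_NOIO) == 0);
    task_flags = (task_flags & ~X_PF_MEMALLOC_NOIO) | old;
}

/* Enter a PF_MEMALLOC scope, same save/restore pattern. */
static unsigned int noreclaim_save(void)
{
    unsigned int old = task_flags & X_PF_MEMALLOC;
    task_flags |= X_PF_MEMALLOC;
    return old;
}

static void noreclaim_restore(unsigned int old)
{
    task_flags = (task_flags & ~X_PF_MEMALLOC) | old;
}

/* Only the NOIO scope rewrites the mask: strip IO and FS. */
static unsigned int gfp_context(unsigned int gfp)
{
    if (task_flags & X_PF_MEMALLOC_NOIO)
        gfp &= ~(X_GFP_IO | X_GFP_FS);
    return gfp;
}

/* PF_MEMALLOC keeps the task out of direct reclaim, mask untouched. */
static bool need_reclaim(unsigned int gfp)
{
    if (!(gfp & X_GFP_DIRECT_RECLAIM))
        return false;
    if (task_flags & X_PF_MEMALLOC)
        return false;
    return true;
}
```

Inside a noreclaim_save() scope, need_reclaim(X_GFP_KERNEL) is false, but gfp_context(X_GFP_KERNEL) still carries X_GFP_IO. Inside a noio_save() scope, both X_GFP_IO and X_GFP_FS are cleared. Saving only the previous NOIO bit lets nested scopes restore correctly.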

I gave it a test run on my local machine with blktests and it passed.
But then again, I was not able to reproduce the original issue in the
first place either.

Reviewed-by: Daniel Wagner <dwagner at suse.de>



More information about the Linux-nvme mailing list