[Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock

Chaitanya Kulkarni chaitanyak at nvidia.com
Tue Mar 24 21:29:20 PDT 2026


+linux-nvme list as well.

On 3/24/26 21:22, Shivam Kumar wrote:
> Hi Greg, Christoph, Sagi, Chaitanya
>
> Following up on my previous report (percpu_ref underflow in
> nvmet_tcp_try_recv_ddgst). During testing with the percpu_ref patch
> applied, I identified a separate vulnerability triggered by the same
> PoC. This one affects the queue kref (not the SQ percpu_ref) and has a
> different root cause.
>
> # Bug Overview
>
> nvmet_tcp_handle_icreq() sets queue->state = NVMET_TCP_Q_LIVE without
> holding state_lock and without checking whether teardown has already
> started. When a client closes the connection immediately after sending
> an ICReq and a malformed CapsuleCmd, the FIN can arrive and be
> processed before io_work reads the ICReq from the socket buffer. The
> FIN handler sets queue->state = NVMET_TCP_Q_DISCONNECTING and calls
> kref_put (1->0). Then io_work processes the ICReq and handle_icreq
> blindly overwrites the state back to NVMET_TCP_Q_LIVE. This defeats
> the guard in nvmet_tcp_schedule_release_queue(), which relies on state
> == NVMET_TCP_Q_DISCONNECTING to prevent a second kref_put.
>
> When the subsequent digest error triggers nvmet_tcp_fatal_error() ->
> kernel_sock_shutdown(), the socket teardown synchronously processes
> any pending TCP packets in the socket backlog via release_sock() ->
> tcp_done() -> nvmet_tcp_state_change() ->
> nvmet_tcp_schedule_release_queue(). Because the state was overwritten
> to LIVE, the guard passes and kref_put is called on an already-zero
> kref, triggering the underflow.
>
> # Root Cause
>
> The vulnerable code in nvmet_tcp_handle_icreq() (tcp.c):
>
> queue->state = NVMET_TCP_Q_LIVE;
> nvmet_prepare_receive_pdu(queue);
> return 0;
>
> This state transition is not protected by state_lock and does not
> check the current state. All other state transitions to
> NVMET_TCP_Q_DISCONNECTING in nvmet_tcp_schedule_release_queue() are
> protected by the spinlock, but handle_icreq bypasses this
> synchronization entirely.
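For reference, the guarded pattern used by nvmet_tcp_schedule_release_queue() looks roughly like this; sketch only, the exact body in your tree may differ, and the kref_put placement follows the debug trace in the report:

```c
/*
 * Sketch of the guarded transition, per the behavior described above.
 * The state check and the write are atomic with respect to each other
 * under state_lock, so only the first caller moves the queue to
 * DISCONNECTING and drops the reference. handle_icreq bypasses this.
 */
spin_lock_bh(&queue->state_lock);
if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
	queue->state = NVMET_TCP_Q_DISCONNECTING;
	kref_put(&queue->kref, nvmet_tcp_release_queue); /* release fn name hypothetical */
	/* ... schedule release work ... */
}
spin_unlock_bh(&queue->state_lock);
```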
>
> # Race Sequence
>
> The PoC sends ICReq + CapsuleCmd + close() in rapid succession. On the wire:
>
> [TCP handshake] -> [ICReq] -> [CapsuleCmd] -> [FIN]
>
> All data arrives in order in the socket buffer, but the FIN triggers
> a TCP state change callback (softirq context) that races with io_work
> (process context) reading the application data.
>
> # ftrace evidence (queue 29)
>
> Step 1: FIN arrives, teardown begins (softirq context):
>
> kworker/-74 1b..2. 125826488us : nvmet_tcp_state_change: DEBUG queue
> 29: In schedule_release before checking state variable state is 0,
> kref=1, caller=tcp_fin+0x278/0x4f0
> kworker/-74 1b..2. 125826492us : nvmet_tcp_state_change: DEBUG queue
> 29: kref_put in schedule_release state changed to 3, kref=1,
> caller=tcp_fin+0x278/0x4f0
>
> State transitions from CONNECTING(0) to DISCONNECTING(3), kref decremented 1->0.
>
> Step 2: io_work processes ICReq, overwrites state (process context):
>
> kworker/-74 1..... 125826529us : nvmet_tcp_try_recv_pdu: DEBUG queue
> 29: Inside nvmet_tcp_handle_icreq_pdu before changing state is 3,
> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
> kworker/-74 1..... 125826532us : nvmet_tcp_try_recv_pdu: DEBUG queue
> 29: Inside nvmet_tcp_handle_icreq_pdu after changing state is 2,
> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
>
> handle_icreq sees state=DISCONNECTING(3) but overwrites it to LIVE(2)
> without holding state_lock; the kref is already 0.
>
> Step 3: Digest error triggers fatal_error -> kernel_sock_shutdown
> (process context):
>
> kworker/-74 1..... 125831291us : nvmet_tcp_fatal_error: DEBUG queue
> 29: fatal_error, state=2, ctrl=0000000000000000,
> caller=nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
>
> Step 4: kernel_sock_shutdown -> release_sock processes backlog -> tcp_done
> re-enters state_change, guard bypassed because state is LIVE(2):
>
> kworker/-74 1b..2. 125831626us : nvmet_tcp_state_change: DEBUG queue
> 29: In schedule_release before checking state variable state is 2,
> kref=0, caller=tcp_done+0x1d7/0x350
> kworker/-74 1b..2. 125831629us : nvmet_tcp_state_change: DEBUG queue
> 29: kref_put in schedule_release state changed to 3, kref=0,
> caller=tcp_done+0x1d7/0x350
>
> schedule_release_queue sees state=LIVE(2), so the guard passes and
> kref_put is called on a kref that is already 0 -> underflow -> CRASH.
>
> The critical evidence is Step 2: handle_icreq reads state=3(DISCONNECTING)
> and writes state=2(LIVE) — proving the unsynchronized overwrite.
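The interleaving in Steps 1-4 can be replayed with a few lines of plain Python; names mirror tcp.c, but this models only the state/kref logic, not the kernel code:

```python
# Minimal model of the queue state/kref interaction, replaying the
# ftrace interleaving above. Pure illustration; not kernel code.
CONNECTING, LIVE, DISCONNECTING = 0, 2, 3

class Q:
    def __init__(self):
        self.state, self.kref = CONNECTING, 1

def schedule_release(q):
    # Guard as in nvmet_tcp_schedule_release_queue(): only the first
    # transition to DISCONNECTING drops the reference.
    if q.state != DISCONNECTING:
        q.state = DISCONNECTING
        q.kref -= 1

def handle_icreq(q, guarded):
    if guarded and q.state == DISCONNECTING:
        return                 # teardown already started: bail out
    q.state = LIVE             # current code: unconditional overwrite

def replay(guarded):
    q = Q()
    schedule_release(q)        # Step 1: FIN handler (softirq)
    handle_icreq(q, guarded)   # Step 2: io_work processes ICReq
    schedule_release(q)        # Step 4: tcp_done() re-enters
    return q.kref

print(replay(guarded=False))   # -1: second put on a zero kref
print(replay(guarded=True))    #  0: guard holds, single put
```

With the unconditional overwrite the second schedule_release() passes the guard and the count goes negative; with the state check under the (modeled) lock it stays at zero.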
>
> # Additional Observation: Double fatal_error Call
>
> The ftrace also shows that fatal_error is called twice per error: once
> from nvmet_tcp_try_recv_ddgst (digest mismatch) and again from
> nvmet_tcp_socket_error in nvmet_tcp_io_work (propagated error return).
> Both calls invoke kernel_sock_shutdown, doubling the window for the
> reentrant state_change.
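If this path gets touched anyway, a one-shot flag would collapse the double call; sketch only, NVMET_TCP_F_FATAL and queue->flags are made-up names:

```c
/*
 * Sketch: make fatal_error one-shot. test_and_set_bit() lets the first
 * caller win and turns the second invocation into a no-op, halving the
 * window for the reentrant state_change. Flag name is hypothetical.
 */
static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
{
	if (test_and_set_bit(NVMET_TCP_F_FATAL, &queue->flags))
		return;		/* already shut this queue down once */
	queue->rcv_state = NVMET_TCP_RECV_ERR;
	if (queue->nvme_sq.ctrl)
		nvmet_ctrl_fatal_error(queue->nvme_sq.ctrl);
	else
		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
}
```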
>
> # Crash Trace (with percpu_ref patch applied)
>
> refcount_t: underflow; use-after-free.
> WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xc5/0x110
> Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
> Call Trace:
> <TASK>
> nvmet_tcp_state_change+0x3e6/0x470
> tcp_done+0x1d7/0x350
> tcp_rcv_state_process+0x392c/0x6d80
> tcp_v4_do_rcv+0x1b1/0xa80
> __release_sock+0x2ab/0x340
> release_sock+0x58/0x1c0
> inet_shutdown+0x1f6/0x3c0
> nvmet_tcp_fatal_error+0x13c/0x1c0
> nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
> nvmet_tcp_io_work+0xd43/0x1fc0
> process_one_work+0x8cf/0x1a40
>
> # Affected Code
>
> - Subsystem: drivers/nvme/target/tcp.c
> - Function: nvmet_tcp_handle_icreq()
> - Tested on: Linux 7.0-rc3
>
> # Suggested Fix Direction
>
> The state transition in nvmet_tcp_handle_icreq() needs to be
> synchronized under state_lock with a check for
> NVMET_TCP_Q_DISCONNECTING. If the queue is already tearing down,
> handle_icreq should not set the state to LIVE and should stop
> processing. I'll defer to you on the exact error handling semantics
> (return value and cleanup path).
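FWIW, the shape I'd expect is something like the following; the -ENOTCONN return is a placeholder for whatever error/cleanup semantics fit here:

```c
/*
 * Sketch of the fix in nvmet_tcp_handle_icreq(): take state_lock around
 * the transition and refuse to resurrect a queue that is already being
 * torn down. Return value and cleanup path are placeholders.
 */
spin_lock_bh(&queue->state_lock);
if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
	/* Teardown already started; do not overwrite the state. */
	spin_unlock_bh(&queue->state_lock);
	return -ENOTCONN;
}
queue->state = NVMET_TCP_Q_LIVE;
spin_unlock_bh(&queue->state_lock);

nvmet_prepare_receive_pdu(queue);
return 0;
```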
>
> This bug is independent of the percpu_ref underflow in the previous
> report. The percpu_ref patch fixes the SQ refcount underflow; this bug
> affects the queue kref through an unsynchronized state transition.
>
> I am available for further discussion and testing.
>
> Thanks,
> Shivam Kumar
> PhD Student, Computer Science
> SYNE Lab, Syracuse University
> skumar47 at syr.edu

I'll try to have a look at this over the weekend.

-ck




More information about the Linux-nvme mailing list