[Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock
Shivam Kumar
kumar.shivam43666 at gmail.com
Sun Apr 5 12:55:19 PDT 2026
Hi all,
Just following up on this report. Please let me know if there are any
questions about the analysis or if any additional information would be
helpful.
Thanks,
Shivam Kumar
On Wed, Mar 25, 2026 at 12:29 AM Chaitanya Kulkarni
<chaitanyak at nvidia.com> wrote:
>
> +linux-nvme list as well.
>
> On 3/24/26 21:22, Shivam Kumar wrote:
> > Hi Greg, Christoph, Sagi, Chaitanya
> >
> > Following up on my previous report (percpu_ref underflow in
> > nvmet_tcp_try_recv_ddgst). During testing with the percpu_ref patch
> > applied, I identified a separate vulnerability triggered by the same
> > PoC. This one affects the queue kref (not the SQ percpu_ref) and has a
> > different root cause.
> >
> > # Bug Overview
> >
> > nvmet_tcp_handle_icreq() sets queue->state = NVMET_TCP_Q_LIVE without
> > holding state_lock and without checking whether teardown has already
> > started. When a client closes the connection immediately after sending
> > an ICReq and a malformed CapsuleCmd, the FIN can arrive and be
> > processed before io_work reads the ICReq from the socket buffer. The
> > FIN handler sets queue->state = NVMET_TCP_Q_DISCONNECTING and calls
> > kref_put (1->0). Then io_work processes the ICReq and handle_icreq
> > blindly overwrites the state back to NVMET_TCP_Q_LIVE. This defeats
> > the guard in nvmet_tcp_schedule_release_queue(), which relies on state
> > == NVMET_TCP_Q_DISCONNECTING to prevent a second kref_put.
> >
> > When the subsequent digest error triggers nvmet_tcp_fatal_error() ->
> > kernel_sock_shutdown(), the socket teardown synchronously processes
> > any pending TCP packets in the socket backlog via release_sock() ->
> > tcp_done() -> nvmet_tcp_state_change() ->
> > nvmet_tcp_schedule_release_queue(). Because the state was overwritten
> > to LIVE, the guard passes and kref_put is called on an already-zero
> > kref, triggering the underflow.
> >
> > # Root Cause
> >
> > The vulnerable code in nvmet_tcp_handle_icreq() (tcp.c):
> >
> > queue->state = NVMET_TCP_Q_LIVE;
> > nvmet_prepare_receive_pdu(queue);
> > return 0;
> >
> > This state transition is not protected by state_lock and does not
> > check the current state. All other state transitions to
> > NVMET_TCP_Q_DISCONNECTING in nvmet_tcp_schedule_release_queue() are
> > protected by the spinlock, but handle_icreq bypasses this
> > synchronization entirely.
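> >
> > For contrast, the guarded transition in
> > nvmet_tcp_schedule_release_queue() has roughly this shape (a sketch
> > from memory, not verbatim from tcp.c; the kref release callback name
> > here is an assumption):
> >
> >     spin_lock_bh(&queue->state_lock);
> >     if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
> >             queue->state = NVMET_TCP_Q_DISCONNECTING;
> >             /* drop the queue kref exactly once */
> >             kref_put(&queue->kref, nvmet_tcp_release_queue);
> >     }
> >     spin_unlock_bh(&queue->state_lock);
> >
> > handle_icreq writes queue->state with no equivalent lock or check, so
> > a concurrent DISCONNECTING transition can be silently undone.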
> >
> > # Race Sequence
> >
> > The PoC sends ICReq + CapsuleCmd + close() in rapid succession. On the wire:
> >
> > [TCP handshake] -> [ICReq] -> [CapsuleCmd] -> [FIN]
> >
> > All data arrives in order in the socket buffer, but the FIN triggers
> > a TCP state change callback (softirq context) that races with io_work
> > (process context) reading the application data.
> >
> > # ftrace evidence (queue 29)
> >
> > Step 1: FIN arrives, teardown begins (softirq context):
> >
> > kworker/-74 1b..2. 125826488us : nvmet_tcp_state_change: DEBUG queue
> > 29: In schedule_release before checking state variable state is 0,
> > kref=1, caller=tcp_fin+0x278/0x4f0
> > kworker/-74 1b..2. 125826492us : nvmet_tcp_state_change: DEBUG queue
> > 29: kref_put in schedule_release state changed to 3, kref=1,
> > caller=tcp_fin+0x278/0x4f0
> >
> > State transitions from CONNECTING(0) to DISCONNECTING(3), kref decremented 1->0.
> >
> > Step 2: io_work processes ICReq, overwrites state (process context):
> >
> > kworker/-74 1..... 125826529us : nvmet_tcp_try_recv_pdu: DEBUG queue
> > 29: Inside nvmet_tcp_handle_icreq_pdu before changing state is 3,
> > kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
> > kworker/-74 1..... 125826532us : nvmet_tcp_try_recv_pdu: DEBUG queue
> > 29: Inside nvmet_tcp_handle_icreq_pdu after changing state is 2,
> > kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
> >
> > handle_icreq sees state=DISCONNECTING(3) but overwrites it to LIVE(2)
> > without holding state_lock. kref is already 0.
> >
> > Step 3: Digest error triggers fatal_error -> kernel_sock_shutdown
> > (process context):
> >
> > kworker/-74 1..... 125831291us : nvmet_tcp_fatal_error: DEBUG queue
> > 29: fatal_error, state=2, ctrl=0000000000000000,
> > caller=nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
> >
> > Step 4: kernel_sock_shutdown -> release_sock processes backlog -> tcp_done
> > re-enters state_change, guard bypassed because state is LIVE(2):
> >
> > kworker/-74 1b..2. 125831626us : nvmet_tcp_state_change: DEBUG queue
> > 29: In schedule_release before checking state variable state is 2,
> > kref=0, caller=tcp_done+0x1d7/0x350
> > kworker/-74 1b..2. 125831629us : nvmet_tcp_state_change: DEBUG queue
> > 29: kref_put in schedule_release state changed to 3, kref=0,
> > caller=tcp_done+0x1d7/0x350
> >
> > schedule_release_queue sees state=LIVE(2), the guard passes, and
> > kref_put is called on kref=0 -> underflow -> CRASH.
> >
> > The critical evidence is Step 2: handle_icreq reads state=3(DISCONNECTING)
> > and writes state=2(LIVE) — proving the unsynchronized overwrite.
> >
> > # Additional Observation: Double fatal_error Call
> >
> > The ftrace also shows that fatal_error is called twice per error: once
> > from nvmet_tcp_try_recv_ddgst (digest mismatch) and again from
> > nvmet_tcp_socket_error in nvmet_tcp_io_work (propagated error return).
> > Both calls invoke kernel_sock_shutdown, doubling the window for the
> > reentrant state_change.
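> >
> > If useful, one possible way to collapse the two calls into one would
> > be an atomic test-and-set at the top of nvmet_tcp_fatal_error()
> > (sketch only; the NVMET_TCP_Q_FATAL bit and the queue->flags field do
> > not exist in tcp.c today, the naming is mine):
> >
> >     /* only the first caller performs the socket shutdown */
> >     if (test_and_set_bit(NVMET_TCP_Q_FATAL, &queue->flags))
> >             return;
> >     kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> >
> > That would halve the reentrancy window, though it does not address
> > the state overwrite itself.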
> >
> > # Crash Trace (with percpu_ref patch applied)
> >
> > refcount_t: underflow; use-after-free.
> > WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xc5/0x110
> > Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
> > Call Trace:
> > <TASK>
> > nvmet_tcp_state_change+0x3e6/0x470
> > tcp_done+0x1d7/0x350
> > tcp_rcv_state_process+0x392c/0x6d80
> > tcp_v4_do_rcv+0x1b1/0xa80
> > __release_sock+0x2ab/0x340
> > release_sock+0x58/0x1c0
> > inet_shutdown+0x1f6/0x3c0
> > nvmet_tcp_fatal_error+0x13c/0x1c0
> > nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
> > nvmet_tcp_io_work+0xd43/0x1fc0
> > process_one_work+0x8cf/0x1a40
> >
> > # Affected Code
> >
> > - Subsystem: drivers/nvme/target/tcp.c
> > - Function: nvmet_tcp_handle_icreq()
> > - Tested on: Linux 7.0-rc3
> >
> > # Suggested Fix Direction
> >
> > The state transition in nvmet_tcp_handle_icreq() needs to be
> > synchronized under state_lock with a check for
> > NVMET_TCP_Q_DISCONNECTING. If the queue is already tearing down,
> > handle_icreq should not set the state to LIVE and should stop
> > processing. I'll defer to you on the exact error handling semantics
> > (return value and cleanup path).
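> >
> > Concretely, something along these lines (a sketch only; the -ENOTCONN
> > return is a placeholder for whatever error semantics you prefer):
> >
> >     spin_lock_bh(&queue->state_lock);
> >     if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
> >             spin_unlock_bh(&queue->state_lock);
> >             /* teardown already started; do not revive the queue */
> >             return -ENOTCONN;
> >     }
> >     queue->state = NVMET_TCP_Q_LIVE;
> >     spin_unlock_bh(&queue->state_lock);
> >     nvmet_prepare_receive_pdu(queue);
> >     return 0;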
> >
> > This bug is independent of the percpu_ref underflow in the previous
> > report. The percpu_ref patch fixes the SQ refcount underflow; this bug
> > affects the queue kref through an unsynchronized state transition.
> >
> > I am available for further discussion and testing.
> >
> > Thanks,
> > Shivam Kumar
> > PhD Student, Computer Science
> > SYNE Lab, Syracuse University
> > skumar47 at syr.edu
>
> I'll try to have a look at this over the weekend.
>
> -ck
>
>