[Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock

Chaitanya Kulkarni chaitanyak at nvidia.com
Sun Apr 5 21:23:45 PDT 2026


Shivam,


On 4/5/26 12:55, Shivam Kumar wrote:
> Hi all,
>
> Just following up on this report. Please let me know if there are any
> questions about the analysis or if any additional information would be
> helpful.
>
> Thanks,
> Shivam Kumar
>
> On Wed, Mar 25, 2026 at 12:29 AM Chaitanya Kulkarni
> <chaitanyak at nvidia.com> wrote:
>> +linux-nvme list as well.
>>
>> On 3/24/26 21:22, Shivam Kumar wrote:
>>> Hi Greg, Christoph, Sagi, Chaitanya
>>>
>>> Following up on my previous report (percpu_ref underflow in
>>> nvmet_tcp_try_recv_ddgst). During testing with the percpu_ref patch
>>> applied, I identified a separate vulnerability triggered by the same
>>> PoC. This one affects the queue kref (not the SQ percpu_ref) and has a
>>> different root cause.
>>>
>>> # Bug Overview
>>>
>>> nvmet_tcp_handle_icreq() sets queue->state = NVMET_TCP_Q_LIVE without
>>> holding state_lock and without checking whether teardown has already
>>> started. When a client closes the connection immediately after sending
>>> an ICReq and a malformed CapsuleCmd, the FIN can arrive and be
>>> processed before io_work reads the ICReq from the socket buffer. The
>>> FIN handler sets queue->state = NVMET_TCP_Q_DISCONNECTING and calls
>>> kref_put (1->0). Then io_work processes the ICReq and handle_icreq
>>> blindly overwrites the state back to NVMET_TCP_Q_LIVE. This defeats
>>> the guard in nvmet_tcp_schedule_release_queue(), which relies on state
>>> == NVMET_TCP_Q_DISCONNECTING to prevent a second kref_put.
>>>
>>> When the subsequent digest error triggers nvmet_tcp_fatal_error() ->
>>> kernel_sock_shutdown(), the socket teardown synchronously processes
>>> any pending TCP packets in the socket backlog via release_sock() ->
>>> tcp_done() -> nvmet_tcp_state_change() ->
>>> nvmet_tcp_schedule_release_queue(). Because the state was overwritten
>>> to LIVE, the guard passes and kref_put is called on an already-zero
>>> kref, triggering the underflow.
>>>
>>> # Root Cause
>>>
>>> The vulnerable code in nvmet_tcp_handle_icreq() (tcp.c):
>>>
>>> queue->state = NVMET_TCP_Q_LIVE;
>>> nvmet_prepare_receive_pdu(queue);
>>> return 0;
>>>
>>> This state transition is not protected by state_lock and does not
>>> check the current state. All other state transitions to
>>> NVMET_TCP_Q_DISCONNECTING in nvmet_tcp_schedule_release_queue() are
>>> protected by the spinlock, but handle_icreq bypasses this
>>> synchronization entirely.
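>>>
>>> For comparison, the teardown path takes state_lock and only drops the
>>> queue reference when the state is not already DISCONNECTING. A
>>> paraphrased sketch of that guarded pattern as it behaves in my tree
>>> (the kref field and release callback names are approximations, not a
>>> verbatim copy of tcp.c):
>>>
>>> static void nvmet_tcp_schedule_release_queue(struct nvmet_tcp_queue *queue)
>>> {
>>>         spin_lock_bh(&queue->state_lock);
>>>         if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
>>>                 queue->state = NVMET_TCP_Q_DISCONNECTING;
>>>                 /* drop the queue reference exactly once per teardown */
>>>                 kref_put(&queue->kref, nvmet_tcp_release_queue);
>>>         }
>>>         spin_unlock_bh(&queue->state_lock);
>>> }
>>>
>>> handle_icreq writes queue->state outside this lock, so a DISCONNECTING
>>> value set here can be silently overwritten back to LIVE.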
>>>
>>> # Race Sequence
>>>
>>> The PoC sends ICReq + CapsuleCmd + close() in rapid succession. On the wire:
>>>
>>> [TCP handshake] -> [ICReq] -> [CapsuleCmd] -> [FIN]
>>>
>>> All data arrives in order in the socket buffer, but the FIN triggers
>>> a TCP state change callback (softirq context) that races with io_work
>>> (process context) reading the application data.
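>>>
>>> The FIN is delivered through the socket's sk_state_change callback,
>>> which the target installs on the queue socket. An abridged sketch of
>>> that path (paraphrased from tcp.c; the FIN_WAIT2/LAST_ACK cases that
>>> take no action are omitted):
>>>
>>> static void nvmet_tcp_state_change(struct sock *sk)
>>> {
>>>         struct nvmet_tcp_queue *queue;
>>>
>>>         read_lock_bh(&sk->sk_callback_lock);
>>>         queue = sk->sk_user_data;
>>>         if (!queue)
>>>                 goto done;
>>>
>>>         switch (sk->sk_state) {
>>>         case TCP_FIN_WAIT1:
>>>         case TCP_CLOSE_WAIT:
>>>         case TCP_CLOSE:
>>>                 /* softirq context, racing with io_work in process context */
>>>                 nvmet_tcp_schedule_release_queue(queue);
>>>                 break;
>>>         default:
>>>                 break;
>>>         }
>>> done:
>>>         read_unlock_bh(&sk->sk_callback_lock);
>>> }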
>>>
>>> # ftrace evidence (queue 29)
>>>
>>> Step 1: FIN arrives, teardown begins (softirq context):
>>>
>>> kworker/-74 1b..2. 125826488us : nvmet_tcp_state_change: DEBUG queue
>>> 29: In schedule_release before checking state variable state is 0,
>>> kref=1, caller=tcp_fin+0x278/0x4f0
>>> kworker/-74 1b..2. 125826492us : nvmet_tcp_state_change: DEBUG queue
>>> 29: kref_put in schedule_release state changed to 3, kref=1,
>>> caller=tcp_fin+0x278/0x4f0
>>>
>>> State transitions from CONNECTING(0) to DISCONNECTING(3), kref decremented 1->0.
>>>
>>> Step 2: io_work processes ICReq, overwrites state (process context):
>>>
>>> kworker/-74 1..... 125826529us : nvmet_tcp_try_recv_pdu: DEBUG queue
>>> 29: Inside nvmet_tcp_handle_icreq_pdu before changing state is 3,
>>> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
>>> kworker/-74 1..... 125826532us : nvmet_tcp_try_recv_pdu: DEBUG queue
>>> 29: Inside nvmet_tcp_handle_icreq_pdu after changing state is 2,
>>> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
>>>
>>> handle_icreq sees state=DISCONNECTING(3) but overwrites it to LIVE(2)
>>> without holding state_lock. kref is already 0.
>>>
>>> Step 3: Digest error triggers fatal_error -> kernel_sock_shutdown
>>> (process context):
>>>
>>> kworker/-74 1..... 125831291us : nvmet_tcp_fatal_error: DEBUG queue
>>> 29: fatal_error, state=2, ctrl=0000000000000000,
>>> caller=nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
>>>
>>> Step 4: kernel_sock_shutdown -> release_sock processes backlog -> tcp_done
>>> re-enters state_change, guard bypassed because state is LIVE(2):
>>>
>>> kworker/-74 1b..2. 125831626us : nvmet_tcp_state_change: DEBUG queue
>>> 29: In schedule_release before checking state variable state is 2,
>>> kref=0, caller=tcp_done+0x1d7/0x350
>>> kworker/-74 1b..2. 125831629us : nvmet_tcp_state_change: DEBUG queue
>>> 29: kref_put in schedule_release state changed to 3, kref=0,
>>> caller=tcp_done+0x1d7/0x350
>>>
>>> schedule_release_queue sees state=LIVE(2), guard passes,
>>> kref_put called on kref=0 -> underflow -> CRASH.
>>>
>>> The critical evidence is Step 2: handle_icreq reads state=3(DISCONNECTING)
>>> and writes state=2(LIVE) — proving the unsynchronized overwrite.
>>>
>>> # Additional Observation: Double fatal_error Call
>>>
>>> The ftrace also shows that fatal_error is called twice per error: once
>>> from nvmet_tcp_try_recv_ddgst (digest mismatch) and again from
>>> nvmet_tcp_socket_error in nvmet_tcp_io_work (propagated error return).
>>> Both calls invoke kernel_sock_shutdown, doubling the window for the
>>> reentrant state_change.
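>>>
>>> Concretely, nvmet_tcp_try_recv_ddgst() calls nvmet_tcp_fatal_error()
>>> itself and then returns -EPROTO; that negative return is handed to
>>> nvmet_tcp_socket_error(), which does roughly (paraphrased from tcp.c):
>>>
>>> static void nvmet_tcp_socket_error(struct nvmet_tcp_queue *queue, int status)
>>> {
>>>         queue->rcv_state = NVMET_TCP_RECV_ERR;
>>>         if (status == -EPIPE || status == -ECONNRESET)
>>>                 kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>>>         else
>>>                 nvmet_tcp_fatal_error(queue); /* second shutdown for -EPROTO */
>>> }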
>>>
>>> # Crash Trace (with percpu_ref patch applied)
>>>
>>> refcount_t: underflow; use-after-free.
>>> WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xc5/0x110
>>> Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
>>> Call Trace:
>>> <TASK>
>>> nvmet_tcp_state_change+0x3e6/0x470
>>> tcp_done+0x1d7/0x350
>>> tcp_rcv_state_process+0x392c/0x6d80
>>> tcp_v4_do_rcv+0x1b1/0xa80
>>> __release_sock+0x2ab/0x340
>>> release_sock+0x58/0x1c0
>>> inet_shutdown+0x1f6/0x3c0
>>> nvmet_tcp_fatal_error+0x13c/0x1c0
>>> nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
>>> nvmet_tcp_io_work+0xd43/0x1fc0
>>> process_one_work+0x8cf/0x1a40
>>>
>>> # Affected Code
>>>
>>> - Subsystem: drivers/nvme/target/tcp.c
>>> - Function: nvmet_tcp_handle_icreq()
>>> - Tested on: Linux 7.0-rc3
>>>
>>> # Suggested Fix Direction
>>>
>>> The state transition in nvmet_tcp_handle_icreq() needs to be
>>> synchronized under state_lock with a check for
>>> NVMET_TCP_Q_DISCONNECTING. If the queue is already tearing down,
>>> handle_icreq should not set the state to LIVE and should stop
>>> processing. I'll defer to you on the exact error handling semantics
>>> (return value and cleanup path).
>>>
>>> This bug is independent of the percpu_ref underflow in the previous
>>> report. The percpu_ref patch fixes the SQ refcount underflow; this bug
>>> affects the queue kref through an unsynchronized state transition.
>>>
>>> I am available for further discussion and testing.
>>>
>>> Thanks,
>>> Shivam Kumar
>>> PhD Student, Computer Science
>>> SYNE Lab, Syracuse University
>>> skumar47 at syr.edu
>> I'll try to have a look at this over the weekend.
>>
>> -ck
>>
>>
Let's avoid top posting. Sorry for the delay.

Can the following patch fix the refcount underflow issue?

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 4b8b02341ddc..69e971b179ae 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1310,7 +1310,8 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
  			queue->idx, cmd->req.cmd->common.command_id,
  			queue->pdu.cmd.hdr.type, le32_to_cpu(cmd->recv_ddgst),
  			le32_to_cpu(cmd->exp_ddgst));
-		nvmet_req_uninit(&cmd->req);
+		if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED))
+			nvmet_req_uninit(&cmd->req);
  		nvmet_tcp_free_cmd_buffers(cmd);
  		nvmet_tcp_fatal_error(queue);
  		ret = -EPROTO;
-- 
2.39.5
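
The idea is that NVMET_TCP_F_INIT_FAILED is only set in
nvmet_tcp_handle_req_failure(), i.e. when nvmet_req_init() has already
failed and we keep receiving purely to drain the inline data. A failed
nvmet_req_init() never took the sq percpu_ref, so calling
nvmet_req_uninit() on such a command in the ddgst error path is exactly
the unbalanced percpu_ref_put() from $SUBJECT. Abridged, from memory:

static void nvmet_tcp_handle_req_failure(struct nvmet_tcp_queue *queue,
		struct nvmet_tcp_cmd *cmd, struct nvmet_req *req)
{
	...
	/* nvmet_req_init() failed, receive the inline data only to drain it */
	queue->rcv_state = NVMET_TCP_RECV_DATA;
	...
	cmd->flags |= NVMET_TCP_F_INIT_FAILED;
}

Such a command still walks through the DATA and DDGST receive states,
which is how it reaches the digest-mismatch error path in the first place.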

and can something like the following fix the race between ICReq handling
and queue teardown?

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 69e971b179ae..5c03a6505319 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -408,6 +408,8 @@ static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
  static void nvmet_tcp_socket_error(struct nvmet_tcp_queue *queue, int status)
  {
  	queue->rcv_state = NVMET_TCP_RECV_ERR;
+	if (status == -ESHUTDOWN)
+		return;
  	if (status == -EPIPE || status == -ECONNRESET)
  		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
  	else
@@ -922,11 +924,21 @@ static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
  	iov.iov_len = sizeof(*icresp);
  	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
  	if (ret < 0) {
-		queue->state = NVMET_TCP_Q_FAILED;
+		spin_lock_bh(&queue->state_lock);
+		if (queue->state != NVMET_TCP_Q_DISCONNECTING)
+			queue->state = NVMET_TCP_Q_FAILED;
+		spin_unlock_bh(&queue->state_lock);
  		return ret; /* queue removal will cleanup */
  	}
  
+	spin_lock_bh(&queue->state_lock);
+	if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
+		spin_unlock_bh(&queue->state_lock);
+		/* Tell nvmet_tcp_socket_error() teardown is already in progress. */
+		return -ESHUTDOWN;
+	}
  	queue->state = NVMET_TCP_Q_LIVE;
+	spin_unlock_bh(&queue->state_lock);
  	nvmet_prepare_receive_pdu(queue);
  	return 0;
  }
-- 
2.39.5
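
With these two hunks, a handle_icreq that observes DISCONNECTING returns
-ESHUTDOWN, which propagates up through nvmet_tcp_try_recv_pdu() and
nvmet_tcp_try_recv_one() into the receive loop, and the receive loop hands
any negative return to nvmet_tcp_socket_error(). Roughly (paraphrased from
memory):

static int nvmet_tcp_try_recv(struct nvmet_tcp_queue *queue,
		int budget, int *recvs)
{
	int i, ret = 0;

	for (i = 0; i < budget; i++) {
		ret = nvmet_tcp_try_recv_one(queue);
		if (unlikely(ret < 0)) {
			/*
			 * With the hunk above, -ESHUTDOWN only marks
			 * NVMET_TCP_RECV_ERR and returns, instead of shutting
			 * the socket down a second time while the teardown
			 * triggered by the FIN is already in flight.
			 */
			nvmet_tcp_socket_error(queue, ret);
			goto done;
		} else if (ret == 0) {
			break;
		}
		(*recvs)++;
	}
done:
	return ret;
}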


-ck



