[Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock

Shivam Kumar kumar.shivam43666 at gmail.com
Mon Apr 6 12:25:37 PDT 2026


On Mon, Apr 6, 2026 at 12:24 AM Chaitanya Kulkarni
<chaitanyak at nvidia.com> wrote:
>
> Shivam,
>
>
> On 4/5/26 12:55, Shivam Kumar wrote:
> > Hi all,
> >
> > Just following up on this report. Please let me know if there are any
> > questions about the analysis or if any additional information would be
> > helpful.
> >
> > Thanks,
> > Shivam Kumar
> >
> > On Wed, Mar 25, 2026 at 12:29 AM Chaitanya Kulkarni
> > <chaitanyak at nvidia.com> wrote:
> >> +linux-nvme list as well.
> >>
> >> On 3/24/26 21:22, Shivam Kumar wrote:
> >>> Hi Greg, Christoph, Sagi, Chaitanya,
> >>>
> >>> Following up on my previous report (percpu_ref underflow in
> >>> nvmet_tcp_try_recv_ddgst). During testing with the percpu_ref patch
> >>> applied, I identified a separate vulnerability triggered by the same
> >>> PoC. This one affects the queue kref (not the SQ percpu_ref) and has a
> >>> different root cause.
> >>>
> >>> # Bug Overview
> >>>
> >>> nvmet_tcp_handle_icreq() sets queue->state = NVMET_TCP_Q_LIVE without
> >>> holding state_lock and without checking whether teardown has already
> >>> started. When a client closes the connection immediately after sending
> >>> an ICReq and a malformed CapsuleCmd, the FIN can arrive and be
> >>> processed before io_work reads the ICReq from the socket buffer. The
> >>> FIN handler sets queue->state = NVMET_TCP_Q_DISCONNECTING and calls
> >>> kref_put (1->0). Then io_work processes the ICReq and handle_icreq
> >>> blindly overwrites the state back to NVMET_TCP_Q_LIVE. This defeats
> >>> the guard in nvmet_tcp_schedule_release_queue(), which relies on state
> >>> == NVMET_TCP_Q_DISCONNECTING to prevent a second kref_put.
> >>>
> >>> When the subsequent digest error triggers nvmet_tcp_fatal_error() ->
> >>> kernel_sock_shutdown(), the socket teardown synchronously processes
> >>> any pending TCP packets in the socket backlog via release_sock() ->
> >>> tcp_done() -> nvmet_tcp_state_change() ->
> >>> nvmet_tcp_schedule_release_queue(). Because the state was overwritten
> >>> to LIVE, the guard passes and kref_put is called on an already-zero
> >>> kref, triggering the underflow.
> >>>
> >>> # Root Cause
> >>>
> >>> The vulnerable code in nvmet_tcp_handle_icreq() (tcp.c):
> >>>
> >>> queue->state = NVMET_TCP_Q_LIVE;
> >>> nvmet_prepare_receive_pdu(queue);
> >>> return 0;
> >>>
> >>> This state transition is not protected by state_lock and does not
> >>> check the current state. All other state transitions to
> >>> NVMET_TCP_Q_DISCONNECTING in nvmet_tcp_schedule_release_queue() are
> >>> protected by the spinlock, but handle_icreq bypasses this
> >>> synchronization entirely.
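> >>>
> >>> For reference, the guarded transition in
> >>> nvmet_tcp_schedule_release_queue() looks roughly like the sketch
> >>> below (paraphrased from the description above rather than copied
> >>> from the tree; the kref field and release callback names are
> >>> illustrative):
> >>>
> >>> spin_lock_bh(&queue->state_lock);
> >>> if (queue->state != NVMET_TCP_Q_DISCONNECTING) {
> >>>         queue->state = NVMET_TCP_Q_DISCONNECTING;
> >>>         /* drop the queue reference; callback name illustrative */
> >>>         kref_put(&queue->kref, nvmet_tcp_release_queue);
> >>> }
> >>> spin_unlock_bh(&queue->state_lock);
> >>>
> >>> handle_icreq performs its state write with no such locking or check.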
> >>>
> >>> # Race Sequence
> >>>
> >>> The PoC sends ICReq + CapsuleCmd + close() in rapid succession. On the wire:
> >>>
> >>> [TCP handshake] -> [ICReq] -> [CapsuleCmd] -> [FIN]
> >>>
> >>> All data arrives in order in the socket buffer, but the FIN triggers
> >>> a TCP state change callback (softirq context) that races with io_work
> >>> (process context) reading the application data.
> >>>
> >>> # ftrace evidence (queue 29)
> >>>
> >>> Step 1: FIN arrives, teardown begins (softirq context):
> >>>
> >>> kworker/-74 1b..2. 125826488us : nvmet_tcp_state_change: DEBUG queue
> >>> 29: In schedule_release before checking state variable state is 0,
> >>> kref=1, caller=tcp_fin+0x278/0x4f0
> >>> kworker/-74 1b..2. 125826492us : nvmet_tcp_state_change: DEBUG queue
> >>> 29: kref_put in schedule_release state changed to 3, kref=1,
> >>> caller=tcp_fin+0x278/0x4f0
> >>>
> >>> State transitions from CONNECTING(0) to DISCONNECTING(3), kref decremented 1->0.
> >>>
> >>> Step 2: io_work processes ICReq, overwrites state (process context):
> >>>
> >>> kworker/-74 1..... 125826529us : nvmet_tcp_try_recv_pdu: DEBUG queue
> >>> 29: Inside nvmet_tcp_handle_icreq_pdu before changing state is 3,
> >>> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
> >>> kworker/-74 1..... 125826532us : nvmet_tcp_try_recv_pdu: DEBUG queue
> >>> 29: Inside nvmet_tcp_handle_icreq_pdu after changing state is 2,
> >>> kref=0, caller=nvmet_tcp_io_work+0x12b/0x1fc0
> >>>
> >>> handle_icreq sees state=DISCONNECTING(3) but overwrites it to LIVE(2)
> >>> without holding state_lock; the kref is already 0.
> >>>
> >>> Step 3: Digest error triggers fatal_error -> kernel_sock_shutdown
> >>> (process context):
> >>>
> >>> kworker/-74 1..... 125831291us : nvmet_tcp_fatal_error: DEBUG queue
> >>> 29: fatal_error, state=2, ctrl=0000000000000000,
> >>> caller=nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
> >>>
> >>> Step 4: kernel_sock_shutdown -> release_sock processes backlog -> tcp_done
> >>> re-enters state_change, guard bypassed because state is LIVE(2):
> >>>
> >>> kworker/-74 1b..2. 125831626us : nvmet_tcp_state_change: DEBUG queue
> >>> 29: In schedule_release before checking state variable state is 2,
> >>> kref=0, caller=tcp_done+0x1d7/0x350
> >>> kworker/-74 1b..2. 125831629us : nvmet_tcp_state_change: DEBUG queue
> >>> 29: kref_put in schedule_release state changed to 3, kref=0,
> >>> caller=tcp_done+0x1d7/0x350
> >>>
> >>> schedule_release_queue sees state=LIVE(2), the guard passes, and
> >>> kref_put is called on kref=0 -> underflow -> CRASH.
> >>>
> >>> The critical evidence is Step 2: handle_icreq reads state=3(DISCONNECTING)
> >>> and writes state=2(LIVE) — proving the unsynchronized overwrite.
> >>>
> >>> # Additional Observation: Double fatal_error Call
> >>>
> >>> The ftrace also shows that fatal_error is called twice per error: once
> >>> from nvmet_tcp_try_recv_ddgst (digest mismatch) and again from
> >>> nvmet_tcp_socket_error in nvmet_tcp_io_work (propagated error return).
> >>> Both calls invoke kernel_sock_shutdown, doubling the window for the
> >>> reentrant state_change.
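> >>>
> >>> Roughly, the two paths are (paraphrased from the trace above):
> >>>
> >>> nvmet_tcp_try_recv_ddgst()
> >>>   -> digest mismatch -> nvmet_tcp_fatal_error()   [first shutdown]
> >>>   -> returns -EPROTO to the caller
> >>> nvmet_tcp_io_work()
> >>>   -> sees the negative return -> nvmet_tcp_socket_error()
> >>>   -> nvmet_tcp_fatal_error()                      [second shutdown]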
> >>>
> >>> # Crash Trace (with percpu_ref patch applied)
> >>>
> >>> refcount_t: underflow; use-after-free.
> >>> WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xc5/0x110
> >>> Workqueue: nvmet_tcp_wq nvmet_tcp_io_work
> >>> Call Trace:
> >>> <TASK>
> >>> nvmet_tcp_state_change+0x3e6/0x470
> >>> tcp_done+0x1d7/0x350
> >>> tcp_rcv_state_process+0x392c/0x6d80
> >>> tcp_v4_do_rcv+0x1b1/0xa80
> >>> __release_sock+0x2ab/0x340
> >>> release_sock+0x58/0x1c0
> >>> inet_shutdown+0x1f6/0x3c0
> >>> nvmet_tcp_fatal_error+0x13c/0x1c0
> >>> nvmet_tcp_try_recv_ddgst+0x74f/0x9a0
> >>> nvmet_tcp_io_work+0xd43/0x1fc0
> >>> process_one_work+0x8cf/0x1a40
> >>>
> >>> # Affected Code
> >>>
> >>> - Subsystem: drivers/nvme/target/tcp.c
> >>> - Function: nvmet_tcp_handle_icreq()
> >>> - Tested on: Linux 7.0-rc3
> >>>
> >>> # Suggested Fix Direction
> >>>
> >>> The state transition in nvmet_tcp_handle_icreq() needs to be
> >>> synchronized under state_lock with a check for
> >>> NVMET_TCP_Q_DISCONNECTING. If the queue is already tearing down,
> >>> handle_icreq should not set the state to LIVE and should stop
> >>> processing. I'll defer to you on the exact error handling semantics
> >>> (return value and cleanup path).
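> >>>
> >>> A minimal sketch of that direction, with the error handling left as
> >>> a placeholder:
> >>>
> >>> spin_lock_bh(&queue->state_lock);
> >>> if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
> >>>         spin_unlock_bh(&queue->state_lock);
> >>>         /* placeholder errno; exact semantics up to the maintainers */
> >>>         return -EAGAIN;
> >>> }
> >>> queue->state = NVMET_TCP_Q_LIVE;
> >>> spin_unlock_bh(&queue->state_lock);
> >>> nvmet_prepare_receive_pdu(queue);
> >>> return 0;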
> >>>
> >>> This bug is independent of the percpu_ref underflow in the previous
> >>> report. The percpu_ref patch fixes the SQ refcount underflow; this bug
> >>> affects the queue kref through an unsynchronized state transition.
> >>>
> >>> I am available for further discussion and testing.
> >>>
> >>> Thanks,
> >>> Shivam Kumar
> >>> PhD Student, Computer Science
> >>> SYNE Lab, Syracuse University
> >>> skumar47 at syr.edu
> >> I'll try to have a look at this over the weekend.
> >>
> >> -ck
> >>
> >>
> Let's avoid top posting. Sorry for the delay.
>
> Can the following patch fix the refcount underflow issue?
>
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 4b8b02341ddc..69e971b179ae 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -1310,7 +1310,8 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
>                         queue->idx, cmd->req.cmd->common.command_id,
>                         queue->pdu.cmd.hdr.type, le32_to_cpu(cmd->recv_ddgst),
>                         le32_to_cpu(cmd->exp_ddgst));
> -               nvmet_req_uninit(&cmd->req);
> +               if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED))
> +                       nvmet_req_uninit(&cmd->req);
>                 nvmet_tcp_free_cmd_buffers(cmd);
>                 nvmet_tcp_fatal_error(queue);
>                 ret = -EPROTO;
> --
> 2.39.5
>
> and can something like the following fix the race between ICReq
> handling and queue teardown?
>
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 69e971b179ae..5c03a6505319 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -408,6 +408,8 @@ static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
>   static void nvmet_tcp_socket_error(struct nvmet_tcp_queue *queue, int status)
>   {
>         queue->rcv_state = NVMET_TCP_RECV_ERR;
> +       if (status == -ESHUTDOWN)
> +               return;
>         if (status == -EPIPE || status == -ECONNRESET)
>                 kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>         else
> @@ -922,11 +924,21 @@ static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
>         iov.iov_len = sizeof(*icresp);
>         ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
>         if (ret < 0) {
> -               queue->state = NVMET_TCP_Q_FAILED;
> +               spin_lock_bh(&queue->state_lock);
> +               if (queue->state != NVMET_TCP_Q_DISCONNECTING)
> +                       queue->state = NVMET_TCP_Q_FAILED;
> +               spin_unlock_bh(&queue->state_lock);
>                 return ret; /* queue removal will cleanup */
>         }
>
> +       spin_lock_bh(&queue->state_lock);
> +       if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
> +               spin_unlock_bh(&queue->state_lock);
> +               /* Tell nvmet_tcp_socket_error() teardown is already in progress. */
> +               return -ESHUTDOWN;
> +       }
>         queue->state = NVMET_TCP_Q_LIVE;
> +       spin_unlock_bh(&queue->state_lock);
>         nvmet_prepare_receive_pdu(queue);
>         return 0;
>   }
> --
> 2.39.5
>
>
> -ck
>
>

Hi Chaitanya,

I tested both patches; together they fix both crash paths.

Patch 1 is the same fix I sent earlier, which has Christoph's Reviewed-by.

Tested-by: Shivam Kumar <kumar.shivam43666 at gmail.com>

Thanks,
Shivam Kumar


