[Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock

Chaitanya Kulkarni chaitanyak at nvidia.com
Mon Apr 6 15:16:10 PDT 2026


Sagi,

On 4/6/26 12:25 PM, Shivam Kumar wrote:

>> Can the following patch fix the refcount underflow issue?
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 4b8b02341ddc..69e971b179ae 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -1310,7 +1310,8 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
>>                          queue->idx, cmd->req.cmd->common.command_id,
>>                          queue->pdu.cmd.hdr.type, le32_to_cpu(cmd->recv_ddgst),
>>                          le32_to_cpu(cmd->exp_ddgst));
>> -               nvmet_req_uninit(&cmd->req);
>> +               if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED))
>> +                       nvmet_req_uninit(&cmd->req);
>>                  nvmet_tcp_free_cmd_buffers(cmd);
>>                  nvmet_tcp_fatal_error(queue);
>>                  ret = -EPROTO;
>> --
>> 2.39.5
>>
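
For Sagi's benefit, a quick summary of why the guard is needed: when
nvmet_req_init() fails it completes the request on its own failure path and
leaves no sq reference held for the command (the transport then marks the cmd
NVMET_TCP_F_INIT_FAILED while it drains the inline data), so the unconditional
nvmet_req_uninit() in the ddgst error path puts a reference the command never
owned. Below is a minimal userspace model of that imbalance, with made-up names
and a plain counter standing in for the percpu_ref; it is only a sketch, not
the target code itself:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct sq { int ref; };                         /* stands in for nvmet_sq + its percpu_ref */
struct cmd { bool init_failed; struct sq *sq; };

/* model of nvmet_req_init(): on failure the request is completed and no ref is left held */
static bool req_init(struct cmd *c, struct sq *sq, bool parse_fails)
{
	c->sq = sq;
	if (parse_fails) {
		c->init_failed = true;          /* NVMET_TCP_F_INIT_FAILED */
		return false;                   /* no reference was taken */
	}
	c->sq->ref++;                           /* percpu_ref_tryget_live() */
	return true;
}

/* model of nvmet_req_uninit(): unconditionally puts the reference */
static void req_uninit(struct cmd *c)
{
	c->sq->ref--;                           /* percpu_ref_put() */
}

int main(void)
{
	struct sq sq = { .ref = 1 };            /* queue's own reference */
	struct cmd c = { 0 };

	req_init(&c, &sq, true);                /* init fails, e.g. bad opcode */

	/* ddgst error path: only uninit if init actually succeeded */
	if (!c.init_failed)
		req_uninit(&c);

	assert(sq.ref == 1);                    /* without the guard this underflows */
	printf("sq.ref = %d\n", sq.ref);
	return 0;
}
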
>> and something like the following can fix the race between ICReq handling
>> and queue teardown?
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 69e971b179ae..5c03a6505319 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -408,6 +408,8 @@ static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
>>    static void nvmet_tcp_socket_error(struct nvmet_tcp_queue *queue, int status)
>>    {
>>          queue->rcv_state = NVMET_TCP_RECV_ERR;
>> +       if (status == -ESHUTDOWN)
>> +               return;
>>          if (status == -EPIPE || status == -ECONNRESET)
>>                  kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>>          else
>> @@ -922,11 +924,21 @@ static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
>>          iov.iov_len = sizeof(*icresp);
>>          ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
>>          if (ret < 0) {
>> -               queue->state = NVMET_TCP_Q_FAILED;
>> +               spin_lock_bh(&queue->state_lock);
>> +               if (queue->state != NVMET_TCP_Q_DISCONNECTING)
>> +                       queue->state = NVMET_TCP_Q_FAILED;
>> +               spin_unlock_bh(&queue->state_lock);
>>                  return ret; /* queue removal will cleanup */
>>          }
>>
>> +       spin_lock_bh(&queue->state_lock);
>> +       if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
>> +               spin_unlock_bh(&queue->state_lock);
>> +               /* Tell nvmet_tcp_socket_error() teardown is already in progress. */
>> +               return -ESHUTDOWN;
>> +       }
>>          queue->state = NVMET_TCP_Q_LIVE;
>> +       spin_unlock_bh(&queue->state_lock);
>>          nvmet_prepare_receive_pdu(queue);
>>          return 0;
>>    }
>> --
>> 2.39.5
>>
>>
>> -ck
>>
>>
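
And the race the second patch closes, as I understand it: handle_icreq writes
queue->state without holding state_lock, so it can overwrite
NVMET_TCP_Q_DISCONNECTING with NVMET_TCP_Q_LIVE after teardown has already
started, and the release work and io_work then disagree about who owns the
queue. Checking the state under state_lock and backing off with -ESHUTDOWN
lets teardown win. A rough userspace model of the ordering (pthread mutex in
place of the bh spinlock, hypothetical names, not the target code itself):

#include <assert.h>
#include <pthread.h>
#include <stdio.h>

enum q_state { Q_CONNECTING, Q_LIVE, Q_DISCONNECTING, Q_FAILED };

struct queue {
	pthread_mutex_t state_lock;             /* stands in for queue->state_lock */
	enum q_state state;
};

/* teardown side: roughly what nvmet_tcp_schedule_release_queue() does */
static void schedule_release(struct queue *q)
{
	pthread_mutex_lock(&q->state_lock);
	if (q->state != Q_DISCONNECTING)
		q->state = Q_DISCONNECTING;     /* ... and queue the release work */
	pthread_mutex_unlock(&q->state_lock);
}

/* ICReq side: only go LIVE if teardown has not started */
static int icreq_done(struct queue *q)
{
	int ret = 0;

	pthread_mutex_lock(&q->state_lock);
	if (q->state == Q_DISCONNECTING)
		ret = -1;                       /* -ESHUTDOWN: back off, teardown owns the queue */
	else
		q->state = Q_LIVE;
	pthread_mutex_unlock(&q->state_lock);
	return ret;
}

int main(void)
{
	struct queue q = {
		.state_lock = PTHREAD_MUTEX_INITIALIZER,
		.state = Q_CONNECTING,
	};

	schedule_release(&q);                   /* teardown wins the race */
	assert(icreq_done(&q) != 0);            /* ICReq path does not resurrect the queue */
	assert(q.state == Q_DISCONNECTING);
	printf("state stays DISCONNECTING\n");
	return 0;
}
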
> Hi Chaitanya
>
> I tested both patches; they fix both crash paths.
>
> Patch 1 is the same fix I sent earlier, which has Christoph's Reviewed-by.
>
> Tested-by: Shivam Kumar <kumar.shivam43666 at gmail.com>
>
> Thanks,
> Shivam Kumar

Can you please take a look before I send a series?

-ck



