[Bug Report] nvmet-tcp: unbalanced percpu_ref_put on data digest error after nvmet_req_init failure causes refcount underflow, use-after-free, and permanent workqueue deadlock
Chaitanya Kulkarni
chaitanyak at nvidia.com
Mon Apr 6 15:16:10 PDT 2026
Sagi,
On 4/6/26 12:25 PM, Shivam Kumar wrote:
>> Can the following patch fix the refcount underflow issue?
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 4b8b02341ddc..69e971b179ae 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -1310,7 +1310,8 @@ static int nvmet_tcp_try_recv_ddgst(struct nvmet_tcp_queue *queue)
>>  			queue->idx, cmd->req.cmd->common.command_id,
>>  			queue->pdu.cmd.hdr.type, le32_to_cpu(cmd->recv_ddgst),
>>  			le32_to_cpu(cmd->exp_ddgst));
>> -		nvmet_req_uninit(&cmd->req);
>> +		if (!(cmd->flags & NVMET_TCP_F_INIT_FAILED))
>> +			nvmet_req_uninit(&cmd->req);
>>  		nvmet_tcp_free_cmd_buffers(cmd);
>>  		nvmet_tcp_fatal_error(queue);
>>  		ret = -EPROTO;
>> --
>> 2.39.5
>>
>> and can something like the following fix the race between ICReq handling
>> and queue teardown?
>>
>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>> index 69e971b179ae..5c03a6505319 100644
>> --- a/drivers/nvme/target/tcp.c
>> +++ b/drivers/nvme/target/tcp.c
>> @@ -408,6 +408,8 @@ static void nvmet_tcp_fatal_error(struct nvmet_tcp_queue *queue)
>>  static void nvmet_tcp_socket_error(struct nvmet_tcp_queue *queue, int status)
>>  {
>>  	queue->rcv_state = NVMET_TCP_RECV_ERR;
>> +	if (status == -ESHUTDOWN)
>> +		return;
>>  	if (status == -EPIPE || status == -ECONNRESET)
>>  		kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>>  	else
>> @@ -922,11 +924,21 @@ static int nvmet_tcp_handle_icreq(struct nvmet_tcp_queue *queue)
>>  	iov.iov_len = sizeof(*icresp);
>>  	ret = kernel_sendmsg(queue->sock, &msg, &iov, 1, iov.iov_len);
>>  	if (ret < 0) {
>> -		queue->state = NVMET_TCP_Q_FAILED;
>> +		spin_lock_bh(&queue->state_lock);
>> +		if (queue->state != NVMET_TCP_Q_DISCONNECTING)
>> +			queue->state = NVMET_TCP_Q_FAILED;
>> +		spin_unlock_bh(&queue->state_lock);
>>  		return ret; /* queue removal will cleanup */
>>  	}
>> 
>> +	spin_lock_bh(&queue->state_lock);
>> +	if (queue->state == NVMET_TCP_Q_DISCONNECTING) {
>> +		spin_unlock_bh(&queue->state_lock);
>> +		/* Tell nvmet_tcp_socket_error() teardown is already in progress. */
>> +		return -ESHUTDOWN;
>> +	}
>>  	queue->state = NVMET_TCP_Q_LIVE;
>> +	spin_unlock_bh(&queue->state_lock);
>>  	nvmet_prepare_receive_pdu(queue);
>>  	return 0;
>>  }
>> --
>> 2.39.5
>>
>>
>> -ck
>>
>>
> Hi Chaitanya
>
> I tested both patches; they fix both crash paths.
>
> Patch 1 is the same fix I sent earlier, which has Christoph's Reviewed-by.
>
> Tested-by: Shivam Kumar <kumar.shivam43666 at gmail.com>
>
> Thanks,
> Shivam Kumar
Can you please take a look before I send a series?
-ck
More information about the Linux-nvme mailing list