[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
Sagi Grimberg
sagi at grimberg.me
Sat Apr 19 09:10:33 PDT 2025
On 4/18/25 21:55, Kamaljit Singh wrote:
> Sagi,
>
> From: Sagi Grimberg <sagi at grimberg.me>
> Date: Friday, April 18, 2025 at 03:51
>>> Sagi,
>>> I tried both of these patches but looks like #1 causes an infinite loop. dmesg was full of panics.
>>> I had tried just #1 and later with #1+#2. Both failed the same way.
>
>> That's no good :)
>
>> Can you share the panics? This version should be identical to what
>> Hannes introduced in:
> Sorry, I'm posting ~200 lines at the bottom, starting from the beginning of the panic until the next "---cut here---". I can send the whole dmesg as attachment as well. Please let me know.
No need.
However, the log prints are WARN messages that do not exist in any of
the patches.
>
> The patches I applied are these:
> 1: nvme-tcp: partial write tcp-q-wake-sender untested by Sagi
> 2: nvme-tcp: recv-side EAGAIN untested fix by Sagi
> 3: nvme-tcp: fix I/O stalls on congested sockets
The patch I suggested was a replacement for patch (3). Didn't you
get conflicts?
> 4: nvme-tcp: sanitize request list handling
> 5: nvme-tcp: avoid inline sending when handling R2T PDUs
> 6: tls: use independent record sz for data sends <--- this is my internal/temp fix for a TLS send issue we found
> You can probably ignore #6.
Ignoring.
> [4] nvme-tcp: sanitize request list handling
>
> Validate the request in nvme_tcp_handle_r2t() to ensure it's not
> part of any list, otherwise a malicious R2T PDU might inject a
> loop in request list processing.
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 40dce5dc3ff5..4647d5d547f8 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -455,6 +455,7 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue)
> }
>
> list_del(&req->entry);
> + init_llist_node(&req->lentry);
> return req;
> }
>
> @@ -562,6 +563,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
> req->queue = queue;
> nvme_req(rq)->ctrl = &ctrl->ctrl;
> nvme_req(rq)->cmd = &pdu->cmd;
> + init_llist_node(&req->lentry);
> + INIT_LIST_HEAD(&req->entry);
>
> return 0;
> }
> @@ -773,8 +776,12 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
>
> nvme_tcp_setup_h2c_data_pdu(req);
>
> + WARN_ON(queue->request == req);
> + WARN_ON(llist_on_list(&req->lentry));
> + WARN_ON(!list_empty(&req->entry));
These are not included in Hannes' patch afaict.
I'm assuming that these are the stack traces that are firing...
Can you please share the output of gdb:
l *nvme_tcp_recv_skb+0x115e
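i.e. something like the below, assuming the module was built with debug
info (the .ko path is just a guess for your build tree):
--
$ gdb drivers/nvme/host/nvme-tcp.ko
(gdb) l *nvme_tcp_recv_skb+0x115e
--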
My assumption is that in your case, queue->request still points at the
current request when the R2T for it arrives, which is very much
possible...
So the patch "nvme-tcp: sanitize request list handling" is wrong here:
queue->request may legitimately still be referenced when its R2T arrives.
I think this patch needs the following:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index f2c829f1aff6..c2d4893ecb94 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -767,8 +767,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 		return -EPROTO;
 	}
 
-	if (queue->request == req ||
-	    llist_on_list(&req->lentry) ||
+	if (llist_on_list(&req->lentry) ||
 	    !list_empty(&req->entry)) {
 		dev_err(queue->ctrl->ctrl.device,
 			"req %d unexpected r2t while processing request\n",
--
Or, in your case, just remove this WARN from the patch:
--
WARN_ON(queue->request == req);
--
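For context, here is a rough sketch of why queue->request can
legitimately equal req at that point. This is paraphrased from my
reading of the send path, so treat it as approximate rather than a
literal quote of the driver:
--
/* sketch only -- paraphrased, not the actual driver code */
static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_request *req;

	if (!queue->request) {
		queue->request = nvme_tcp_fetch_request(queue);
		if (!queue->request)
			return 0;
	}
	req = queue->request;

	/*
	 * On a congested socket the send path can return here without
	 * clearing queue->request (e.g. after a partial send), so the
	 * recv side may process the R2T for this request while
	 * queue->request == req.
	 */

	/* ... rest of the send path elided ... */
	return 0;
}
--
If that is what's happening, hitting queue->request == req does not by
itself indicate a malicious or broken controller, which is why I'd drop
that part of the check.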
> ================================
> Snapshot of panic messages:
>
> ...
> [2025-04-16 22:55:32.992] [Wed Apr 16 22:55:34 2025] nvme nvme0: QID 0x0: traddr=10.10.10.223,trsvcid=4420,sport=0x8ac6,dport=0x1144
> [2025-04-16 22:55:33.445] [Wed Apr 16 22:55:34 2025] nvme nvme0: D3 entry latency set to 16 seconds
> [2025-04-16 22:55:33.898] [Wed Apr 16 22:55:35 2025] nvme nvme0: queue_size 128 > ctrl sqsize 17, clamping down
> [2025-04-16 22:55:33.929] [Wed Apr 16 22:55:35 2025] nvme nvme0: creating 3 I/O queues.
> [2025-04-16 22:55:35.398] [Wed Apr 16 22:55:36 2025] nvme nvme0: mapped 3/0/0 default/read/poll queues.
> [2025-04-16 22:55:35.445] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x1: traddr=10.10.10.223,trsvcid=4420,sport=0x8ac8,dport=0x1144
> [2025-04-16 22:55:35.508] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x2: traddr=10.10.10.223,trsvcid=4420,sport=0x8ad4,dport=0x1144
> [2025-04-16 22:55:35.554] [Wed Apr 16 22:55:36 2025] nvme nvme0: QID 0x3: traddr=10.10.10.223,trsvcid=4420,sport=0x8ad8,dport=0x1144
> [2025-04-16 22:55:35.632] [Wed Apr 16 22:55:37 2025] nvme nvme0: new ctrl: NQN "nqn.2015-09.com.wdc:nvme.1", addr 10.10.10.223:4420, hostnqn: nqn.2014-08.org.nvmexpress:host.0
> [2025-04-16 22:55:36.195] [Wed Apr 16 22:55:37 2025] nvme nvme0: Failed to get ANA log: 16386
> [2025-04-16 22:55:43.023]
> [2025-04-16 22:55:43.492]
> [2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] ------------[ cut here ]------------
> [2025-04-16 22:56:24.758] [Wed Apr 16 22:56:26 2025] WARNING: CPU: 22 PID: 134878 at tcp.c:782 nvme_tcp_recv_skb+0x115e/0x11c0 [nvme_tcp]
My assumption is that this is a result of
WARN_ON(queue->request == req);
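Btw, if gdb is inconvenient, scripts/faddr2line from the kernel tree
should resolve the same offset (again assuming the module was built with
debug info; adjust the .ko path to match your tree):
--
$ ./scripts/faddr2line drivers/nvme/host/nvme-tcp.ko nvme_tcp_recv_skb+0x115e/0x11c0
--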