[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
Sagi Grimberg
sagi at grimberg.me
Sat May 17 03:12:16 PDT 2025
On 17/05/2025 13:01, Sagi Grimberg wrote:
>
>
> On 14/05/2025 9:35, Hannes Reinecke wrote:
>> On 5/13/25 21:24, Kamaljit Singh wrote:
>>> Hi Sagi, Hannes,
>>>
>>> On 09/11/2025 02:11, Sagi Grimberg wrote:
>>>>> IO timeouts are still occurring with Writes. The only Read that timed
>>>>> out was most likely due to the path error. It takes ~4.5 hours to
>>>>> fail.
>>>>>
>>>>> However, this test does not fail if either ECN is off or if digests
>>>>> are not enabled. These passing combinations were run for 16+ hours
>>>>> without any issues. Both ECN and Header+Data Digests need to be
>>>>> turned
>>>>> on for it to fail.
>>>>>
>>>>> Do you have a failing test as well? If so, is it quicker to cause the
>>>>> failure? Would you mind sharing any details?
>>>>>
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 2 (f002) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 1 (2001) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 4 (c004) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: starting error recovery
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 15 (000f) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 6 (5006) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 3 (2003) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] block nvme1n3: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 8 (0008) type 4
>>>>> opcode 0x2 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 14 (400e) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 13 (100d) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n4: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n2: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 5 (5005) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 7 (0007) type 4
>>>>> opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 11 (a00b) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: I/O tag 12 (f00c) type
>>>>> 4 opcode 0x1 (I/O Cmd) QID 4 timeout
>>>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] block nvme1n1: no usable path -
>>>>> requeuing I/O
>>>>> [2025-05-07 19:57:13.295] nvme nvme1: Reconnecting in 10
>>>>> seconds...
>>>>>
>>>>> In the current build I had these patches on top of the "nvme-6.16"
>>>>> branch:
>>>>> 41b2c90a51bd nvme-tcp: sanitize request list handling
>>>>> 9260acd6c230 nvme-tcp: fix I/O stalls on congested sockets
>>>>
>>>> Kamaljit, with the prior version of the patchset (the proposal with
>>>> the
>>>> wake_sender flag) did this not reproduce regardless of ECN?
>>> With the last patchset, when ECN=off, we did not see any IO timeouts
>>> even with a weekend long test. This was true for both cases, i.e. with
>>> Inband Auth and with SecureConcat.
>>>
>>> With ECN=on & HD+DD=on IO timeout still persists for both Inband Auth &
>>> SC. I’m currently debugging a possible target side issue with ECN. I’ll
>>> let you know once I have some resolution.
>>>
>>> I don't have any clear indications of the original kernel issue to be
>>> able to differentiate it from the current target-side issue. So if you
>>> want to go ahead and merge those two patchsets, that may be fine for
>>> now.
>>>
>> Thanks a lot for your confirmation.
>> We continue to have issues with high load or oversubscribed fabrics.
>> But this patchset is addressing the problem of I/O timeouts during
>> _connect_, which I would argue is a different story.
>
> We still need to hunt these down. I'm still puzzled as to why adding
> the WAKE_SENDER flag made this issue disappear. I'll have another look
> at this patch.
>
> For now, I think we can go with this patchset, and then incrementally
> fix the remains.
Kamaljit, can you check the following patch on top of the patchset from
Hannes, and see whether it still gets a reproduction?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 835e29014841..2f5f2fcfb078 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
read_lock_bh(&sk->sk_callback_lock);
queue = sk->sk_user_data;
- if (likely(queue && sk_stream_is_writeable(sk))) {
+ if (likely(queue)) {
clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
--
I think the sk_stream_is_writeable() check here may be preventing the
scheduling of io_work. Now that io_work itself also stops based on
sk_stream_is_writeable(), we should probably still schedule it from
write_space unconditionally.