[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Kamaljit Singh Kamaljit.Singh1 at wdc.com
Wed May 28 15:45:43 PDT 2025


Hi Hannes & Sagi,

On 5/27/25 23:33, Hannes Reinecke wrote:
>>> On 27/05/2025 2:49, Sagi Grimberg wrote:
>>>>>
>>>>> We still need to hunt these down. I'm still puzzled why adding the
>>>>> WAKE_SENDER flag was able to make this issue disappear? I'll have
>>>>> another look at this patch.
>>>>>
>>>>> For now, I think we can go with this patchset, and then incrementally
>>>>> fix the remains.
>>>>
>>>> Kamaljit, can you check the following patch on top of the patchset
>>>> from Hannes that gets a reproduction?
>>>>
>>>> --
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 835e29014841..2f5f2fcfb078 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>>>
>>>>          read_lock_bh(&sk->sk_callback_lock);
>>>>          queue = sk->sk_user_data;
>>>> -       if (likely(queue && sk_stream_is_writeable(sk))) {
>>>> +       if (likely(queue)) {
>>>>                  clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>>>                  queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>>          }
>>>> --
>>>>
>>>> I think this may be preventing the scheduling of io_work. But now
>>>> that io_work is also ceasing based on sk_stream_is_writeable, we
>>>> should probably still schedule it.
>>>
>>> Kamaljit? Any info on this?
>>
>> Sorry for the delayed reply. We've been hitting one issue after another,
>> hence the delays. Hoping that by tomorrow I should have clean results with
>> a clear a-b comparison of only the changes you suggested.

Sagi,
After several more runs, I still don't have any definitive results for the
driver change you suggested. The tests did fail with I/O timeouts, but
the root causes were clearly unrelated to the kernel/driver.


>Just to make you aware, I've found another issue wrt
>sk_stream_is_writeable(); we are checking it from the data_ready
>callback, and do not schedule io_work if no write space is available.
>Problem here is that the data_ready callback indicates that there
>_should_ be space available, so by checking sk_stream_is_writeable()
>we already assume that the actual write space might change, and
>the callback might not be a reliable indicator.
>But from that follows that also the opposite might be true, namely
>that write space _might_ be available by the time io_work is scheduled.
>I'll repost the series.

Hannes, thank you for the V4 patchset. It looks like V4 includes the last
change that Sagi suggested, i.e. removing sk_stream_is_writeable(sk) from
nvme_tcp_write_space. We will continue testing with your V4 patchset and
provide any feedback.
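
For reference, the difference between the two variants of the write_space
callback can be sketched as a small userspace toy model. This is not kernel
code: toy_sock, toy_queue and the toy_* functions are made-up names for
illustration, and the writeable check is only a simplified assumption
modelled on "free space is at least half of what is queued".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration only, NOT the kernel structures. */
struct toy_sock {
	int sndbuf;	/* total send buffer size in bytes */
	int queued;	/* bytes currently queued for sending */
};

struct toy_queue {
	int io_work_scheduled;	/* how many times io_work was queued */
};

/* Simplified assumption: writeable if free space >= half of queued bytes. */
static bool toy_is_writeable(const struct toy_sock *sk)
{
	int free_space = sk->sndbuf - sk->queued;

	return free_space >= sk->queued / 2;
}

/* Old behaviour: schedule io_work only if writeable at callback time. */
static void toy_write_space_gated(const struct toy_sock *sk,
				  struct toy_queue *q)
{
	assert(q != NULL);
	if (toy_is_writeable(sk))
		q->io_work_scheduled++;
}

/* Suggested behaviour: always schedule; io_work re-checks the socket. */
static void toy_write_space_always(const struct toy_sock *sk,
				   struct toy_queue *q)
{
	(void)sk;	/* socket state is deliberately not consulted here */
	assert(q != NULL);
	q->io_work_scheduled++;
}
```

With sndbuf = 100 and queued = 80, the gated variant schedules nothing, while
the ungated variant schedules io_work once and leaves the writeable decision
to io_work itself.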

Thanks,
Kamaljit


