[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Hannes Reinecke hare at suse.de
Tue May 27 23:33:00 PDT 2025


On 5/28/25 03:43, Kamaljit Singh wrote:
> Hi Sagi,
> 
>> On 27/05/2025 2:49, Sagi Grimberg wrote:
>>>>
>>>> We still need to hunt these down. I'm still puzzled why adding the
>>>> WAKE_SENDER flag was able to make this issue disappear? I'll have
>>>> another look at this patch.
>>>>
>>>> For now, I think we can go with this patchset, and then incrementally
>>>> fix the remains.
>>>
>>> Kamaljit, can you check the following patch on top of the patchset
>>> from Hannes that gets a reproduction?
>>>
>>> --
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 835e29014841..2f5f2fcfb078 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1075,7 +1075,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>>
>>>          read_lock_bh(&sk->sk_callback_lock);
>>>          queue = sk->sk_user_data;
>>> -       if (likely(queue && sk_stream_is_writeable(sk))) {
>>> +       if (likely(queue)) {
>>>                  clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>>                  queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>          }
>>> --
>>>
>>> I think this may be preventing the scheduling of io_work. But now
>>> that io_work also stops based on sk_stream_is_writeable, we should
>>> probably still schedule it.
>>
>> Kamaljit? Any info on this?
> 
> Sorry for the delayed reply. We've been hitting one issue after another,
> hence the delays. Hoping that by tomorrow I should have clean results with
> a clear a-b comparison of only the changes you suggested.
> 
Just to make you aware, I've found another issue wrt
sk_stream_is_writeable(): we are checking it from the data_ready
callback, and do not schedule io_work if no write space is available.
The problem here is that the callback indicates that there
_should_ be space available, so by checking sk_stream_is_writeable()
we already assume that the actual write space might have changed, and
that the callback might not be a reliable indicator.
But from that it follows that the opposite might also be true, namely
that write space _might_ be available by the time io_work is scheduled.
I'll repost the series.
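
To illustrate the reasoning above, here is a minimal userspace sketch; it is
not kernel code and not the actual patch. All model_* names, the buffer sizes
and the threshold are hypothetical. It assumes that send-buffer space is freed
after the callback fires but before io_work runs, and that no further callback
arrives afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_sock {
	int free_space;	/* bytes of send buffer currently free */
	int min_space;	/* below this the socket counts as not writeable */
};

struct model_queue {
	struct model_sock sk;
	int pending;		/* bytes still waiting to be sent */
	bool work_scheduled;	/* stands in for queueing io_work */
};

static bool model_is_writeable(const struct model_sock *sk)
{
	return sk->free_space >= sk->min_space;
}

/* Callback variant that only schedules io_work if the socket is writeable. */
static void model_callback_gated(struct model_queue *q)
{
	if (model_is_writeable(&q->sk))
		q->work_scheduled = true;
}

/* Callback variant that always schedules io_work. */
static void model_callback_unconditional(struct model_queue *q)
{
	q->work_scheduled = true;
}

/* io_work runs only if scheduled and re-checks the write space itself. */
static int model_io_work(struct model_queue *q)
{
	int sent = 0;

	assert(q->pending >= 0);
	if (!q->work_scheduled)
		return 0;
	q->work_scheduled = false;
	if (model_is_writeable(&q->sk)) {
		sent = q->pending < q->sk.free_space ?
			q->pending : q->sk.free_space;
		q->pending -= sent;
		q->sk.free_space -= sent;
	}
	return sent;
}

/* Returns the number of bytes left pending after one callback + io_work. */
static int model_run(bool gated)
{
	struct model_queue q = {
		.sk = { .free_space = 0, .min_space = 4096 },
		.pending = 8192,
		.work_scheduled = false,
	};

	if (gated)
		model_callback_gated(&q);
	else
		model_callback_unconditional(&q);

	/* Space is freed only after the callback has already run. */
	q.sk.free_space = 16384;
	model_io_work(&q);
	return q.pending;
}
```

In this model model_run(true) leaves all 8192 bytes pending, since io_work was
never scheduled, while model_run(false) drains the queue, since io_work
re-checks the write space on its own.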

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


