[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Sagi Grimberg sagi at grimberg.me
Mon Apr 14 13:24:54 PDT 2025



On 14/04/2025 9:21, Hannes Reinecke wrote:
> On 4/14/25 01:09, Sagi Grimberg wrote:
>>
>>
>> On 03/04/2025 9:55, Hannes Reinecke wrote:
>>> When the socket is busy processing nvme_tcp_try_recv() might
>>> return -EAGAIN, but this doesn't automatically imply that
>>> the sending side is blocked, too.
>>> So check if there are pending requests once nvme_tcp_try_recv()
>>> returns -EAGAIN and continue with the sending loop to avoid
>>> I/O stalls.
>>>
>>> Acked-by: Chris Leech <cleech at redhat.com>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>>   drivers/nvme/host/tcp.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 1a319cb86453..87f1d7a4ea06 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct 
>>> work_struct *w)
>>>           result = nvme_tcp_try_recv(queue);
>>>           if (result > 0)
>>>               pending = true;
>>> -        else if (unlikely(result < 0))
>>> +        else if (unlikely(result < 0) && result != -EAGAIN)
>>>               return;
>>
>> The way that the send path was done - is that EAGAIN returns 0 
>> (success returns >0, failure returns <0)
>> Perhaps we can make recv do the same?
>>
> I guess we can.
>
>>> +        if (nvme_tcp_queue_has_pending(queue))
>>> +            pending = true;
>>> +
>>
>> Something is not clear to me, this suggest that try_send was not able 
>> to send data on the socket, shouldn't a .write_space() callback wake 
>> you when
> > the socket send buffer gets some space? Why> do you immediately try 
> more even if you're sendmsg is returning EAGAIN?
>>
> But that's precisely the point: sendmsg() did _not_ return -EAGAIN.
> recvmsg() did.

What did try_send return? Why didn't it set pending?

>
>
> We just imply that nvme_tcp_try_send() will return -EAGAIN once 
> nvme_tcp_try_recv() did.
>
> Which is wrong; -EAGAIN on reception can be due to a number of factors,
> and does _not_ imply in any shape or form that sending will exhibit
> the same error.

I get that recv returning EAGAIN, we should not return from io_work(). 
The second check
is not clear to me. If try_recv got stuff, we set pending=1, if try_send 
was able to send, we set
pending=1. They should be independent. Why is this check in the end of 
the loop correct.

It can be that the queue has pending requests, but there is no room in 
the socket send buffer,
otherwise, it would have set pending=1 no?

I am missing something.



More information about the Linux-nvme mailing list