[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Hannes Reinecke hare at suse.de
Sun Apr 13 23:21:59 PDT 2025


On 4/14/25 01:09, Sagi Grimberg wrote:
> 
> 
> On 03/04/2025 9:55, Hannes Reinecke wrote:
>> When the socket is busy processing, nvme_tcp_try_recv() might
>> return -EAGAIN, but this doesn't automatically imply that
>> the sending side is blocked, too.
>> So check if there are pending requests once nvme_tcp_try_recv()
>> returns -EAGAIN and continue with the sending loop to avoid
>> I/O stalls.
>>
>> Acked-by: Chris Leech <cleech at redhat.com>
>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>> ---
>>   drivers/nvme/host/tcp.c | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 1a319cb86453..87f1d7a4ea06 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>           result = nvme_tcp_try_recv(queue);
>>           if (result > 0)
>>               pending = true;
>> -        else if (unlikely(result < 0))
>> +        else if (unlikely(result < 0) && result != -EAGAIN)
>>               return;
> 
> The send path is written so that EAGAIN returns 0 (success returns >0,
> failure returns <0).
> Perhaps we can make recv do the same?
> 
I guess we can.
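
As a rough illustration of that convention (a userspace sketch only, not
driver code; model_normalize_result() and model_classify() are hypothetical
names made up for this example), folding -EAGAIN into 0 would look
something like this:

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper, not part of drivers/nvme/host/tcp.c:
 * fold -EAGAIN into 0 so that callers only ever see
 *   > 0  progress was made
 *   = 0  nothing to do right now (includes "would block")
 *   < 0  hard error
 */
static int model_normalize_result(int raw)
{
	return raw == -EAGAIN ? 0 : raw;
}

enum model_step { MODEL_PROGRESS, MODEL_IDLE, MODEL_ERROR };

/* Classify a raw result the way a caller of the normalized API would. */
static enum model_step model_classify(int raw)
{
	int r = model_normalize_result(raw);

	if (r > 0)
		return MODEL_PROGRESS;
	if (r == 0)
		return MODEL_IDLE;
	/* -EAGAIN was folded into 0 above, so only real errors remain. */
	assert(r != -EAGAIN);
	return MODEL_ERROR;
}
```

With that, the caller no longer needs a special case for -EAGAIN; any
negative value can be treated as a real failure.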

>> +        if (nvme_tcp_queue_has_pending(queue))
>> +            pending = true;
>> +
> 
> Something is not clear to me. This suggests that try_send was not able to
> send data on the socket. Shouldn't a .write_space() callback wake you when
> the socket send buffer gets some space? Why do you immediately try
> more even if your sendmsg is returning EAGAIN?
> 
But that's precisely the point: sendmsg() did _not_ return -EAGAIN.
recvmsg() did.

We just assume that nvme_tcp_try_send() will return -EAGAIN once
nvme_tcp_try_recv() did.

Which is wrong; -EAGAIN on reception can be due to a number of factors,
and does _not_ imply in any shape or form that sending will exhibit
the same error.
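
To make the effect concrete, here is a minimal userspace model of the
loop. It is a sketch under simplifying assumptions, not the driver code:
struct model_queue and all model_*() functions are hypothetical, the
send step always succeeds while requests are queued, and the receive
step returns a fixed value. The treat_eagain_as_fatal flag selects
between bailing out on any negative receive result and skipping
-EAGAIN as in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the queue; not struct nvme_tcp_queue. */
struct model_queue {
	int pending_sends;	/* requests still waiting to be sent */
	int recv_result;	/* fixed result of the modelled receive step */
	int sent;		/* requests sent so far */
};

/* Modelled send step: 1 if a request was sent, 0 if nothing is queued. */
static int model_try_send(struct model_queue *q)
{
	if (q->pending_sends == 0)
		return 0;
	q->pending_sends--;
	q->sent++;
	return 1;
}

/* Modelled receive step: always reports the configured result. */
static int model_try_recv(struct model_queue *q)
{
	return q->recv_result;
}

/* One pass of the loop; returns true if another pass is warranted. */
static bool model_io_pass(struct model_queue *q, bool treat_eagain_as_fatal)
{
	bool pending = false;
	int result;

	result = model_try_send(q);
	if (result > 0)
		pending = true;
	else if (result < 0)
		return false;

	result = model_try_recv(q);
	if (result > 0)
		pending = true;
	else if (result < 0 && (treat_eagain_as_fatal || result != -EAGAIN))
		return false;

	/* Receive may have stalled, but queued sends still need servicing. */
	if (q->pending_sends > 0)
		pending = true;

	return pending;
}

/* Run passes until nothing is pending; returns the number of requests sent. */
static int model_io_work(struct model_queue *q, bool treat_eagain_as_fatal,
			 int max_passes)
{
	assert(max_passes > 0);

	for (int pass = 0; pass < max_passes; pass++) {
		if (!model_io_pass(q, treat_eagain_as_fatal))
			break;
	}
	return q->sent;
}
```

With three queued requests and a receive step that keeps returning
-EAGAIN, the model sends one request and stops when -EAGAIN is treated
as fatal, and sends all three when -EAGAIN is skipped and the pending
check is applied.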

> This is specific to TLS I assume here?

Not necessarily; I've seen it with my performance testing on 10GigE
with normal TCP, too. TLS is just an easier way to reproduce it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


