[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Wed Apr 16 00:56:04 PDT 2025

On 4/15/25 23:35, Sagi Grimberg wrote:
> 
> 
> On 15/04/2025 10:07, Hannes Reinecke wrote:
>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>> When the socket is busy processing nvme_tcp_try_recv() might
>>> return -EAGAIN, but this doesn't automatically imply that
>>> the sending side is blocked, too.
>>> So check if there are pending requests once nvme_tcp_try_recv()
>>> returns -EAGAIN and continue with the sending loop to avoid
>>> I/O stalls.
>>>
>>> Acked-by: Chris Leech <cleech at redhat.com>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>>   drivers/nvme/host/tcp.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 1a319cb86453..87f1d7a4ea06 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct 
>>> work_struct *w)
>>>           result = nvme_tcp_try_recv(queue);
>>>           if (result > 0)
>>>               pending = true;
>>> -        else if (unlikely(result < 0))
>>> +        else if (unlikely(result < 0) && result != -EAGAIN)
>>>               return;
>>>   +        if (nvme_tcp_queue_has_pending(queue))
>>> +            pending = true;
>>> +
>>>           if (!pending || !queue->rd_enabled)
>>>               return;
>>
>> The various 'try_send' function will return -EAGAIN for a partial send.
>> But it doesn't indicate a blocked Tx, rather we should retry directly.
>> Hence this check.
>>
>> Unless you tell me differently and even a partial send will cause
>> ->write_space() to be invoked, then we wouldn't _need_ it.
> 
> Umm, that is my understanding. If you tried to send X and were able to
> send Y where Y < X, you shouldn't have to keep trying in a busy loop, 
> the stack should tell you when you can send again.
> 
Okay, I could try that.

>> It would
>> still be an optimisation as we're saving the round-trip via socket
>> callbacks.
> 
> But you are doing a busy loop on a socket that cannot accept new data, 
> there are other sockets that the kthread can be working on.
> 
But we might be _sending_ data, right?

>>
>> We could aim for a different error here, to differentiate between a
>> 'real' EAGAIN and a partial send.
>> Whatever you prefer.
> 
> I still don't understand why a partial send warrants a busy loop call to 
> sock_sendmsg...
> 
This is, it's not just sendmsg. It's the combination of send and recv.
In my tests I have seen sendmsg return with a partial/incomplete status,
consequently recvmsg has nothing to receive, and io_work stops.
And that is what the patch fixes.

I haven't checked the ->write_space() callback (which should've been 
triggered), but my feeling is that the ->write_space() callback
hit when we were still busy processing, so the queue_work() went
nowhere.

Maybe we can fix it with setting a flag when ->write_space()
triggers ('hey, more data to process'), to be evaluated during
the io_work() loop. But that would be pretty close to the
check 'nvme_tcp_queue_has_pending', so I'm not sure if we
gain anything.

And I really haven't seen any detrimental performance effects
with this patch; in the contrary, performance was on par,
and if anything standard deviation went down.

> My assumption is that the call right after the partial send, will see 
> EAGAIN error. But I may
> be missing something here... I just never expected that a partial write 
> means that we must busy loop sending to the socket.
> 
Well ... we do it in nvme_tcp_try_send_data().

> What does a blocking sendmsg do under the hood? does it also follow this 
> practice?

How would I know. But I haven't seen any negative effects
during performance testing with this patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich