[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets
Hannes Reinecke
hare at suse.de
Wed Apr 16 00:56:04 PDT 2025
On 4/15/25 23:35, Sagi Grimberg wrote:
>
>
> On 15/04/2025 10:07, Hannes Reinecke wrote:
>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>> When the socket is busy processing nvme_tcp_try_recv() might
>>> return -EAGAIN, but this doesn't automatically imply that
>>> the sending side is blocked, too.
>>> So check if there are pending requests once nvme_tcp_try_recv()
>>> returns -EAGAIN and continue with the sending loop to avoid
>>> I/O stalls.
>>>
>>> Acked-by: Chris Leech <cleech at redhat.com>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>> drivers/nvme/host/tcp.c | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 1a319cb86453..87f1d7a4ea06 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct
>>> work_struct *w)
>>> result = nvme_tcp_try_recv(queue);
>>> if (result > 0)
>>> pending = true;
>>> - else if (unlikely(result < 0))
>>> + else if (unlikely(result < 0) && result != -EAGAIN)
>>> return;
>>> + if (nvme_tcp_queue_has_pending(queue))
>>> + pending = true;
>>> +
>>> if (!pending || !queue->rd_enabled)
>>> return;
>>
>> The various 'try_send' function will return -EAGAIN for a partial send.
>> But it doesn't indicate a blocked Tx, rather we should retry directly.
>> Hence this check.
>>
>> Unless you tell me differently and even a partial send will cause
>> ->write_space() to be invoked, then we wouldn't _need_ it.
>
> Umm, that is my understanding. If you tried to send X and were able to
> send Y where Y < X, you shouldn't have to keep trying in a busy loop,
> the stack should tell you when you can send again.
>
Okay, I could try that.
>> It would
>> still be an optimisation as we're saving the round-trip via socket
>> callbacks.
>
> But you are doing a busy loop on a socket that cannot accept new data,
> there are other sockets that the kthread can be working on.
>
But we might be _sending_ data, right?
>>
>> We could aim for a different error here, to differentiate between a
>> 'real' EAGAIN and a partial send.
>> Whatever you prefer.
>
> I still don't understand why a partial send warrants a busy loop call to
> sock_sendmsg...
>
This is, it's not just sendmsg. It's the combination of send and recv.
In my tests I have seen sendmsg return with a partial/incomplete status,
consequently recvmsg has nothing to receive, and io_work stops.
And that is what the patch fixes.
I haven't checked the ->write_space() callback (which should've been
triggered), but my feeling is that the ->write_space() callback
hit when we were still busy processing, so the queue_work() went
nowhere.
Maybe we can fix it with setting a flag when ->write_space()
triggers ('hey, more data to process'), to be evaluated during
the io_work() loop. But that would be pretty close to the
check 'nvme_tcp_queue_has_pending', so I'm not sure if we
gain anything.
And I really haven't seen any detrimental performance effects
with this patch; in the contrary, performance was on par,
and if anything standard deviation went down.
> My assumption is that the call right after the partial send, will see
> EAGAIN error. But I may
> be missing something here... I just never expected that a partial write
> means that we must busy loop sending to the socket.
>
Well ... we do it in nvme_tcp_try_send_data().
> What does a blocking sendmsg do under the hood? does it also follow this
> practice?
How would I know. But I haven't seen any negative effects
during performance testing with this patch.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
More information about the Linux-nvme
mailing list