[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Sagi Grimberg sagi at grimberg.me
Wed Apr 16 15:09:12 PDT 2025



On 16/04/2025 10:56, Hannes Reinecke wrote:
> On 4/15/25 23:35, Sagi Grimberg wrote:
>>
>>
>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>> When the socket is busy processing, nvme_tcp_try_recv() might
>>>> return -EAGAIN, but this doesn't automatically imply that
>>>> the sending side is blocked, too.
>>>> So check if there are pending requests once nvme_tcp_try_recv()
>>>> returns -EAGAIN, and continue with the sending loop to avoid
>>>> I/O stalls.
>>>>
>>>> Acked-by: Chris Leech <cleech at redhat.com>
>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>> ---
>>>>   drivers/nvme/host/tcp.c | 5 ++++-
>>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>> --- a/drivers/nvme/host/tcp.c
>>>> +++ b/drivers/nvme/host/tcp.c
>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>>           result = nvme_tcp_try_recv(queue);
>>>>           if (result > 0)
>>>>               pending = true;
>>>> -        else if (unlikely(result < 0))
>>>> +        else if (unlikely(result < 0) && result != -EAGAIN)
>>>>               return;
>>>>
>>>> +        if (nvme_tcp_queue_has_pending(queue))
>>>> +            pending = true;
>>>> +
>>>>           if (!pending || !queue->rd_enabled)
>>>>               return;
>>>
>>> The various 'try_send' functions will return -EAGAIN for a partial send.
>>> But that doesn't indicate a blocked Tx; rather, we should retry directly.
>>> Hence this check.
>>>
>>> Unless you tell me differently, and even a partial send causes
>>> ->write_space() to be invoked, we wouldn't _need_ it.
>>
>> Umm, that is my understanding. If you tried to send X and were able to
>> send Y where Y < X, you shouldn't have to keep trying in a busy loop;
>> the stack should tell you when you can send again.
>>
> Okay, I could try that.
>
>>> It would
>>> still be an optimisation as we're saving the round-trip via socket
>>> callbacks.
>>
>> But you are doing a busy loop on a socket that cannot accept new
>> data, while there are other sockets that the kthread could be working on.
>>
> But we might be _sending_ data, right?

I'd say that the odds are you're not sending in the next attempt...

>
>>>
>>> We could aim for a different error here, to differentiate between a
>>> 'real' EAGAIN and a partial send.
>>> Whatever you prefer.
>>
>> I still don't understand why a partial send warrants a busy loop call 
>> to sock_sendmsg...
>>
> The thing is, it's not just sendmsg. It's the combination of send and recv.

I agree with you that the recv fix is correct.

> In my tests I have seen sendmsg return with a partial/incomplete status,
> consequently recvmsg has nothing to receive, and io_work stops.
> And that is what the patch fixes.

The question is whether the recv fix is sufficient.

>
> I haven't checked the ->write_space() callback (which should've been
> triggered), but my feeling is that the ->write_space() callback
> hit while we were still busy processing, so the queue_work() went
> nowhere.

That makes sense.

>
> Maybe we can fix it by setting a flag when ->write_space()
> triggers ('hey, more data to process'), to be evaluated during
> the io_work() loop. But that would be pretty close to the
> 'nvme_tcp_queue_has_pending' check, so I'm not sure if we
> gain anything.

How about checking sk_stream_is_writeable() instead?
I think we also need to flag the queue when ->write_space() fires during
io_work...
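
For reference, nvme_tcp_queue_has_pending() only looks at our internal
send queues; it says nothing about whether the socket can actually take
more data. Roughly (quoting from memory, so double-check against the tree):

static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
{
        /* requests staged for sending, regardless of socket space */
        return !list_empty(&queue->send_list) ||
                !llist_empty(&queue->req_list);
}

Which is why I'd gate it on sk_stream_is_writeable() below.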

>
> And I really haven't seen any detrimental performance effects
> with this patch; on the contrary, performance was on par,
> and if anything the standard deviation went down.

Still, you agree that busy-looping on a socket that may not have space is
less efficient, when there are other queues waiting to execute io_work?

I agree there is a problem here, but I am trying to figure out what we
want to do about it.

How about these two (untested) patches:
[1 based on your recv-side fix]:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 72d260201d8c..4eb9a2dec07e 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
         queue->nr_cqe = 0;
         consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
         release_sock(sk);
-       return consumed;
+       return consumed == -EAGAIN ? 0 : consumed;
  }

  static void nvme_tcp_io_work(struct work_struct *w)
--
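
The idea in [1]: -EAGAIN from ->read_sock() just means there was nothing
to receive, so normalize it to 0 ('nothing consumed') instead of having
io_work treat it as an error and return early.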

[2 based on your partial write fix]:
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4eb9a2dec07e..daf59e75cf15 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
         NVME_TCP_Q_LIVE         = 1,
         NVME_TCP_Q_POLLING      = 2,
         NVME_TCP_Q_IO_CPU_SET   = 3,
+       NVME_TCP_Q_WAKE_SENDER  = 4,
  };

  enum nvme_tcp_recv_state {
@@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
         queue = sk->sk_user_data;
         if (likely(queue && sk_stream_is_writeable(sk))) {
                 clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+               set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
                 queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
         }
         read_unlock_bh(&sk->sk_callback_lock);
@@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
                 container_of(w, struct nvme_tcp_queue, io_work);
         unsigned long deadline = jiffies + msecs_to_jiffies(1);

+       clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
         do {
                 bool pending = false;
                 int result;
@@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
                 else if (unlikely(result < 0))
                         return;

-               if (!pending || !queue->rd_enabled)
+               /* did we get some space after spending time in recv ? */
+               if (nvme_tcp_queue_has_pending(queue) &&
+                   sk_stream_is_writeable(queue->sock->sk))
+                       pending = true;
+
+               if (!queue->rd_enabled)
+                       return;
+
+               if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags))
                         return;

         } while (!time_after(jiffies, deadline)); /* quota is exhausted */
--
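
The NVME_TCP_Q_WAKE_SENDER bit is meant to close the race you described
above: if ->write_space() fires while io_work is still running, the
queue_work_on() may be a no-op, but the flag survives and keeps the loop
going for another iteration instead of returning. Again, untested.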


