[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Sagi Grimberg sagi at grimberg.me
Fri Apr 25 15:09:09 PDT 2025



On 26/04/2025 0:55, Sagi Grimberg wrote:
>
>
> On 24/04/2025 14:26, Hannes Reinecke wrote:
>> On 4/17/25 00:09, Sagi Grimberg wrote:
>>>
>>>
>>> On 16/04/2025 10:56, Hannes Reinecke wrote:
>>>> On 4/15/25 23:35, Sagi Grimberg wrote:
>>>>>
>>>>>
>>>>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>>>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>>>>> When the socket is busy processing, nvme_tcp_try_recv() might
>>>>>>> return -EAGAIN, but this doesn't automatically imply that
>>>>>>> the sending side is blocked, too.
>>>>>>> So check if there are pending requests once nvme_tcp_try_recv()
>>>>>>> returns -EAGAIN and continue with the sending loop to avoid
>>>>>>> I/O stalls.
>>>>>>>
>>>>>>> Acked-by: Chris Leech <cleech at redhat.com>
>>>>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>>>>> ---
>>>>>>>   drivers/nvme/host/tcp.c | 5 ++++-
>>>>>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>>>>>           result = nvme_tcp_try_recv(queue);
>>>>>>>           if (result > 0)
>>>>>>>               pending = true;
>>>>>>> -        else if (unlikely(result < 0))
>>>>>>> +        else if (unlikely(result < 0) && result != -EAGAIN)
>>>>>>>               return;
>>>>>>>
>>>>>>> +        if (nvme_tcp_queue_has_pending(queue))
>>>>>>> +            pending = true;
>>>>>>> +
>>>>>>>           if (!pending || !queue->rd_enabled)
>>>>>>>               return;
>>>>>>
>>>>>> The various 'try_send' functions will return -EAGAIN for a partial
>>>>>> send. But that doesn't indicate a blocked Tx; rather, we should
>>>>>> retry directly. Hence this check.
>>>>>>
>>>>>> Unless you tell me that even a partial send will cause
>>>>>> ->write_space() to be invoked; in that case we wouldn't _need_ it.
>>>>>
>>>>> Umm, that is my understanding. If you tried to send X and were
>>>>> able to send Y where Y < X, you shouldn't have to keep trying in a
>>>>> busy loop; the stack should tell you when you can send again.
>>>>>
>>>> Okay, I could try that.
>>>>
>>>>>> It would
>>>>>> still be an optimisation as we're saving the round-trip via socket
>>>>>> callbacks.
>>>>>
>>>>> But you are doing a busy loop on a socket that cannot accept new
>>>>> data; there are other sockets that the kthread could be working on.
>>>>>
>>>> But we might be _sending_ data, right?
>>>
>>> I'd say that odds are you're not in the next attempt...
>>>
>>>>
>>>>>>
>>>>>> We could aim for a different error here, to differentiate between a
>>>>>> 'real' EAGAIN and a partial send.
>>>>>> Whatever you prefer.
>>>>>
>>>>> I still don't understand why a partial send warrants a busy loop 
>>>>> call to sock_sendmsg...
>>>>>
>>>> Thing is, it's not just sendmsg. It's the combination of send and recv.
>>>
>>> I agree with you that the recv fix is correct.
>>>
>>>> In my tests I have seen sendmsg return with a partial/incomplete
>>>> status; consequently recvmsg has nothing to receive, and io_work
>>>> stops. And that is what the patch fixes.
>>>
>>> The question is whether the recv fix alone is sufficient.
>>>
>>>>
>>>> I haven't checked the ->write_space() callback (which should've 
>>>> been triggered), but my feeling is that the ->write_space() callback
>>>> hit when we were still busy processing, so the queue_work() went
>>>> nowhere.
>>>
>>> That makes sense.
>>>
>>>>
>>>> Maybe we can fix it by setting a flag when ->write_space()
>>>> triggers ('hey, more data to process'), to be evaluated during
>>>> the io_work() loop. But that would be pretty close to the
>>>> check 'nvme_tcp_queue_has_pending', so I'm not sure if we
>>>> gain anything.
>>>
>>> How about checking sk_stream_is_writeable() instead?
>>> I think we also need to flag the queue when write_space() fires
>>> during IO work...
>>>
>>>>
>>>> And I really haven't seen any detrimental performance effects
>>>> with this patch; on the contrary, performance was on par,
>>>> and if anything the standard deviation went down.
>>>
>>> Still, you agree that busy-looping a socket that may not have space
>>> is less efficient, when there are other queues waiting to execute
>>> io_work?
>>>
>>> I agree there is a problem here, but I am trying to figure out what
>>> we want to do about it.
>>>
>>> How about these two (untested) patches:
>>> [1 based on your recv-side fix]:
>>> --
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 72d260201d8c..4eb9a2dec07e 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>>          queue->nr_cqe = 0;
>>>          consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>>          release_sock(sk);
>>> -       return consumed;
>>> +       return consumed == -EAGAIN ? 0 : consumed;
>>>   }
>>>
>>>   static void nvme_tcp_io_work(struct work_struct *w)
>>> -- 
>>>
>>> [2 based on your partial write fix]:
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 4eb9a2dec07e..daf59e75cf15 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
>>>          NVME_TCP_Q_LIVE         = 1,
>>>          NVME_TCP_Q_POLLING      = 2,
>>>          NVME_TCP_Q_IO_CPU_SET   = 3,
>>> +       NVME_TCP_Q_WAKE_SENDER  = 4,
>>>   };
>>>
>>>   enum nvme_tcp_recv_state {
>>> @@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>>          queue = sk->sk_user_data;
>>>          if (likely(queue && sk_stream_is_writeable(sk))) {
>>>                  clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>> +               set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>>>                  queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>          }
>>>          read_unlock_bh(&sk->sk_callback_lock);
>>> @@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>                  container_of(w, struct nvme_tcp_queue, io_work);
>>>          unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>
>>> +       clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>>>          do {
>>>                  bool pending = false;
>>>                  int result;
>>> @@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>                  else if (unlikely(result < 0))
>>>                          return;
>>>
>>> -               if (!pending || !queue->rd_enabled)
>>> +               /* did we get some space after spending time in recv? */
>>> +               if (nvme_tcp_queue_has_pending(queue) &&
>>> +                   sk_stream_is_writeable(queue->sock->sk))
>>> +                       pending = true;
>>> +
>>> +               if (!queue->rd_enabled)
>>> +                       return;
>>> +
>>> +               if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags))
>>>                          return;
>>>
>>>          } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>> -- 
>>>
>> While sifting through the patches to come up with a next version:
>> Why do you check for Q_WAKE_SENDER after checking for
>> sk_stream_is_writeable()?
>> Surely sk_stream_is_writeable() is a necessary condition for
>> nvme_tcp_try_send(), no?
>
> Right after the sk_stream_is_writeable() check, new room could have
> opened up in the socket send buffer, write_space() fired and queued
> io_work, all before we got to the pending check. This is the "lost
> event" problem.

Actually, this is incorrect: the queue_work(io_work) issued from
write_space() will re-queue io_work even if io_work is currently
running, so the event is not lost. Hence I'm not sure that this flag
is needed.
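
If it helps to convince ourselves, here is a throwaway sketch (not
meant for the tree; all names are made up) showing that queueing a
work item while it is executing makes it run once more after the
current invocation completes, which is why a write_space() firing in
the middle of io_work should not be lost:
--
// SPDX-License-Identifier: GPL-2.0
/* Toy module: queue a work item again while it is running and observe
 * that it executes a second time once the first invocation finishes.
 */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

static struct work_struct demo_work;
static int runs;

static void demo_fn(struct work_struct *w)
{
	runs++;
	pr_info("demo work: run %d\n", runs);
	msleep(100);		/* pretend io_work is busy processing */
}

static int __init demo_init(void)
{
	INIT_WORK(&demo_work, demo_fn);
	schedule_work(&demo_work);
	msleep(10);			/* let the first run start */
	schedule_work(&demo_work);	/* "write_space" fires mid-run */
	return 0;			/* expect "run 2" shortly after */
}

static void __exit demo_exit(void)
{
	flush_work(&demo_work);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
--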

Kamaljit, can you run with the below and report if it fixes your problem?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4d20fcc0a230..835e29014841 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1390,6 +1390,11 @@ static void nvme_tcp_io_work(struct work_struct *w)
                 else if (unlikely(result < 0))
                         return;

+               /* did we get some space after spending time in recv? */
+               if (nvme_tcp_queue_has_pending(queue) &&
+                   sk_stream_is_writeable(queue->sock->sk))
+                       pending = true;
+
                 if (!pending || !queue->rd_enabled)
                         return;
--
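
For context, the two helpers this check leans on look roughly like
this in current mainline (quoting from memory, so please double-check
against your tree):
--
/* drivers/nvme/host/tcp.c */
static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
{
	return !list_empty(&queue->send_list) ||
		!llist_empty(&queue->req_list);
}

/* include/net/sock.h */
static inline bool sk_stream_is_writeable(const struct sock *sk)
{
	return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk);
}
--
So the send side is only re-armed when there are queued requests and
the socket actually has room, instead of unconditionally looping on
-EAGAIN.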




