[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Sagi Grimberg sagi at grimberg.me
Fri Apr 25 14:55:43 PDT 2025



On 24/04/2025 14:26, Hannes Reinecke wrote:
> On 4/17/25 00:09, Sagi Grimberg wrote:
>>
>>
>> On 16/04/2025 10:56, Hannes Reinecke wrote:
>>> On 4/15/25 23:35, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>>>> When the socket is busy processing, nvme_tcp_try_recv() might
>>>>>> return -EAGAIN, but this doesn't automatically imply that
>>>>>> the sending side is blocked, too.
>>>>>> So check whether there are pending requests once nvme_tcp_try_recv()
>>>>>> returns -EAGAIN, and continue with the sending loop to avoid
>>>>>> I/O stalls.
>>>>>>
>>>>>> Acked-by: Chris Leech <cleech at redhat.com>
>>>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>>>> ---
>>>>>>   drivers/nvme/host/tcp.c | 5 ++++-
>>>>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>>>> --- a/drivers/nvme/host/tcp.c
>>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>>>>           result = nvme_tcp_try_recv(queue);
>>>>>>           if (result > 0)
>>>>>>               pending = true;
>>>>>> -        else if (unlikely(result < 0))
>>>>>> +        else if (unlikely(result < 0) && result != -EAGAIN)
>>>>>>               return;
>>>>>>
>>>>>> +        if (nvme_tcp_queue_has_pending(queue))
>>>>>> +            pending = true;
>>>>>> +
>>>>>>           if (!pending || !queue->rd_enabled)
>>>>>>               return;
>>>>>
>>>>> The various 'try_send' functions will return -EAGAIN for a
>>>>> partial send. But that doesn't indicate a blocked Tx; rather, we
>>>>> should retry directly. Hence this check.
>>>>>
>>>>> Unless you tell me differently and even a partial send causes
>>>>> ->write_space() to be invoked, in which case we wouldn't _need_ it.
>>>>
>>>> Umm, that is my understanding. If you tried to send X and were able to
>>>> send Y where Y < X, you shouldn't have to keep trying in a busy
>>>> loop; the stack should tell you when you can send again.
>>>>
>>> Okay, I could try that.
>>>
>>>>> It would
>>>>> still be an optimisation as we're saving the round-trip via socket
>>>>> callbacks.
>>>>
>>>> But you are doing a busy loop on a socket that cannot accept new
>>>> data; there are other sockets that the kthread could be working on.
>>>>
>>> But we might be _sending_ data, right?
>>
>> I'd say that odds are you're not in the next attempt...
>>
>>>
>>>>>
>>>>> We could aim for a different error here, to differentiate between a
>>>>> 'real' EAGAIN and a partial send.
>>>>> Whatever you prefer.
>>>>
>>>> I still don't understand why a partial send warrants a busy-loop
>>>> call to sock_sendmsg...
>>>>
>>> Thing is, it's not just sendmsg. It's the combination of send and recv.
>>
>> I agree with you that the recv fix is correct.
>>
>>> In my tests I have seen sendmsg return with a partial/incomplete
>>> status; consequently recvmsg has nothing to receive, and io_work
>>> stops. And that is what the patch fixes.
>>
>> The question is if the recv fix is sufficient?
>>
>>>
>>> I haven't checked the ->write_space() callback (which should've
>>> been triggered), but my feeling is that ->write_space() fired
>>> while we were still busy processing, so the queue_work() went
>>> nowhere.
>>
>> That makes sense.
>>
>>>
>>> Maybe we can fix it by setting a flag when ->write_space()
>>> triggers ('hey, more data to process'), to be evaluated during
>>> the io_work() loop. But that would be pretty close to the
>>> 'nvme_tcp_queue_has_pending' check, so I'm not sure if we
>>> gain anything.
>>
>> How about checking sk_stream_is_writeable() instead?
>> I think we also need to flag the queue for the write_space during IO 
>> work...
>>
>>>
>>> And I really haven't seen any detrimental performance effects
>>> with this patch; on the contrary, performance was on par,
>>> and if anything the standard deviation went down.
>>
>> Still, you agree that busy-looping on a socket that may not have
>> space is less efficient, when there are other queues waiting to
>> execute io_work?
>>
>> I agree there is a problem here, but I am trying to figure out
>> what we want to do about it.
>>
>> How about these two (untested) patches:
>> [1 based on your recv-side fix]:
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 72d260201d8c..4eb9a2dec07e 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>          queue->nr_cqe = 0;
>>          consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>          release_sock(sk);
>> -       return consumed;
>> +       return consumed == -EAGAIN ? 0 : consumed;
>>   }
>>
>>   static void nvme_tcp_io_work(struct work_struct *w)
>> -- 
>>
>> [2 based on your partial write fix]:
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 4eb9a2dec07e..daf59e75cf15 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
>>          NVME_TCP_Q_LIVE         = 1,
>>          NVME_TCP_Q_POLLING      = 2,
>>          NVME_TCP_Q_IO_CPU_SET   = 3,
>> +       NVME_TCP_Q_WAKE_SENDER  = 4,
>>   };
>>
>>   enum nvme_tcp_recv_state {
>> @@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>          queue = sk->sk_user_data;
>>          if (likely(queue && sk_stream_is_writeable(sk))) {
>>                  clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>> +               set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>>                  queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>          }
>>          read_unlock_bh(&sk->sk_callback_lock);
>> @@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>                  container_of(w, struct nvme_tcp_queue, io_work);
>>          unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>
>> +       clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>>          do {
>>                  bool pending = false;
>>                  int result;
>> @@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>                  else if (unlikely(result < 0))
>>                          return;
>>
>> -               if (!pending || !queue->rd_enabled)
>> +               /* did we get some space after spending time in recv ? */
>> +               if (nvme_tcp_queue_has_pending(queue) &&
>> +                   sk_stream_is_writeable(queue->sock->sk))
>> +                       pending = true;
>> +
>> +               if (!queue->rd_enabled)
>> +                       return;
>> +
>> +               if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags))
>>                          return;
>>
>>          } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>> -- 
>>
> While trying to sift through the patches to come up with a next
> version: why do you check for Q_WAKE_SENDER after checking for
> sk_stream_is_writeable()?
> Surely sk_stream_is_writeable() is a necessary condition for
> nvme_tcp_try_send(), no?

Right after the sk_stream_is_writeable() check, new room could have
opened up in the socket send buffer, ->write_space() could have fired
and queued io_work, all before we get to the pending check. This is
the "lost event" problem.
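To make the ordering explicit, here is a condensed sketch of the flag
handshake from patch 2 above (locking, the writeable test in the
callback, and error handling elided; illustrative, not the full
driver code):

static void nvme_tcp_write_space(struct sock *sk)
{
        struct nvme_tcp_queue *queue = sk->sk_user_data;

        /* record the event even if io_work is already running */
        set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);

        /* clear first: any ->write_space() from here on is seen below */
        clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
        do {
                bool pending = false;

                /* ... try_send/try_recv, may set pending ... */

                /*
                 * The window:
                 *
                 *   io_work                        ->write_space()
                 *   -------                        ---------------
                 *   sk_stream_is_writeable()
                 *     -> false
                 *                                  room opens in the
                 *                                  send buffer
                 *                                  set_bit(WAKE_SENDER)
                 *                                  queue_work_on()
                 *   test_bit(WAKE_SENDER) -> true,
                 *   keep looping instead of
                 *   returning with pending == false
                 */
                if (!pending &&
                    !test_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags))
                        return;
        } while (true); /* the real code bounds this with a deadline */
}

Because the flag is cleared before the loop and tested after the
writeable check, a ->write_space() that fires in between is never
lost: either test_bit() sees it here, or the freshly queued io_work
picks it up on the next run.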


