[PATCH 3/3] nvme-tcp: fix I/O stalls on congested sockets

Hannes Reinecke hare at suse.de
Thu Apr 24 04:26:26 PDT 2025


On 4/17/25 00:09, Sagi Grimberg wrote:
> 
> 
> On 16/04/2025 10:56, Hannes Reinecke wrote:
>> On 4/15/25 23:35, Sagi Grimberg wrote:
>>>
>>>
>>> On 15/04/2025 10:07, Hannes Reinecke wrote:
>>>> On 4/3/25 08:55, Hannes Reinecke wrote:
>>>>> When the socket is busy processing, nvme_tcp_try_recv() might
>>>>> return -EAGAIN, but this doesn't automatically imply that
>>>>> the sending side is blocked, too.
>>>>> So check if there are pending requests once nvme_tcp_try_recv()
>>>>> returns -EAGAIN and continue with the sending loop to avoid
>>>>> I/O stalls.
>>>>>
>>>>> Acked-by: Chris Leech <cleech at redhat.com>
>>>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>>>> ---
>>>>>   drivers/nvme/host/tcp.c | 5 ++++-
>>>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>>>> index 1a319cb86453..87f1d7a4ea06 100644
>>>>> --- a/drivers/nvme/host/tcp.c
>>>>> +++ b/drivers/nvme/host/tcp.c
>>>>> @@ -1389,9 +1389,12 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>>>           result = nvme_tcp_try_recv(queue);
>>>>>           if (result > 0)
>>>>>               pending = true;
>>>>> -        else if (unlikely(result < 0))
>>>>> +        else if (unlikely(result < 0) && result != -EAGAIN)
>>>>>               return;
>>>>>
>>>>> +        if (nvme_tcp_queue_has_pending(queue))
>>>>> +            pending = true;
>>>>> +
>>>>>           if (!pending || !queue->rd_enabled)
>>>>>               return;
>>>>
>>>> The various 'try_send' functions will return -EAGAIN for a partial send.
>>>> But that doesn't indicate a blocked Tx; rather, we should retry directly.
>>>> Hence this check.
>>>>
>>>> Unless you tell me differently and even a partial send will cause
>>>> ->write_space() to be invoked, in which case we wouldn't _need_ it.
>>>
>>> Umm, that is my understanding. If you tried to send X and were able to
>>> send Y where Y < X, you shouldn't have to keep trying in a busy loop;
>>> the stack should tell you when you can send again.
>>>
>> Okay, I could try that.
>>
>>>> It would
>>>> still be an optimisation as we're saving the round-trip via socket
>>>> callbacks.
>>>
>>> But you are doing a busy loop on a socket that cannot accept new
>>> data; there are other sockets that the kthread could be working on.
>>>
>> But we might be _sending_ data, right?
> 
> I'd say that odds are you're not in the next attempt...
> 
>>
>>>>
>>>> We could aim for a different error here, to differentiate between a
>>>> 'real' EAGAIN and a partial send.
>>>> Whatever you prefer.
>>>
>>> I still don't understand why a partial send warrants a busy loop call 
>>> to sock_sendmsg...
>>>
>> Thing is, it's not just sendmsg. It's the combination of send and recv.
> 
> I agree with you that the recv fix is correct.
> 
>> In my tests I have seen sendmsg return with a partial/incomplete status,
>> consequently recvmsg has nothing to receive, and io_work stops.
>> And that is what the patch fixes.
> 
> The question is whether the recv fix is sufficient.
> 
>>
>> I haven't checked the ->write_space() callback (which should've been
>> triggered), but my feeling is that it hit while we were still busy
>> processing, so the queue_work() went nowhere.
> 
> That makes sense.
> 
>>
>> Maybe we can fix it by setting a flag when ->write_space()
>> triggers ('hey, more data to process'), to be evaluated during
>> the io_work() loop. But that would be pretty close to the
>> 'nvme_tcp_queue_has_pending' check, so I'm not sure if we
>> gain anything.
> 
> How about checking sk_stream_is_writeable() instead?
> I think we also need to flag the queue for the write_space during IO 
> work...
> 
>>
>> And I really haven't seen any detrimental performance effects
>> with this patch; on the contrary, performance was on par,
>> and if anything the standard deviation went down.
> 
> Still you agree that busy looping a socket that may not have space is 
> less efficient?
> When there are other queues waiting to execute io_work?
> 
> I agree there is a problem here, but I am trying to figure out what we 
> want to do about it.
> 
> How about these two (untested) patches:
> [1 based on your recv-side fix]:
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 72d260201d8c..4eb9a2dec07e 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1348,7 +1348,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>          queue->nr_cqe = 0;
>          consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>          release_sock(sk);
> -       return consumed;
> +       return consumed == -EAGAIN ? 0 : consumed;
>   }
> 
>   static void nvme_tcp_io_work(struct work_struct *w)
> -- 
> 
> [2 based on your partial write fix]:
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 4eb9a2dec07e..daf59e75cf15 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -129,6 +129,7 @@ enum nvme_tcp_queue_flags {
>          NVME_TCP_Q_LIVE         = 1,
>          NVME_TCP_Q_POLLING      = 2,
>          NVME_TCP_Q_IO_CPU_SET   = 3,
> +       NVME_TCP_Q_WAKE_SENDER  = 4,
>   };
> 
>   enum nvme_tcp_recv_state {
> @@ -1063,6 +1064,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>          queue = sk->sk_user_data;
>          if (likely(queue && sk_stream_is_writeable(sk))) {
>                  clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
> +               set_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>                  queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>          }
>          read_unlock_bh(&sk->sk_callback_lock);
> @@ -1357,6 +1359,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>                  container_of(w, struct nvme_tcp_queue, io_work);
>          unsigned long deadline = jiffies + msecs_to_jiffies(1);
> 
> +       clear_bit(NVME_TCP_Q_WAKE_SENDER, &queue->flags);
>          do {
>                  bool pending = false;
>                  int result;
> @@ -1376,7 +1379,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>                  else if (unlikely(result < 0))
>                          return;
> 
> -               if (!pending || !queue->rd_enabled)
> +               /* did we get some space after spending time in recv ? */
> +               if (nvme_tcp_queue_has_pending(queue) &&
> +                   sk_stream_is_writeable(queue->sock->sk))
> +                       pending = true;
> +
> +               if (!queue->rd_enabled)
> +                       return;
> +
> +               if (!pending && !test_bit(NVME_TCP_Q_WAKE_SENDER, 
> &queue->flags))
>                          return;
> 
>          } while (!time_after(jiffies, deadline)); /* quota is exhausted */
> -- 
> 
While trying to sift through the patches to come up with a next version:
why do you check for Q_WAKE_SENDER after checking for
sk_stream_is_writeable()?
Surely sk_stream_is_writeable() is a necessary condition for
nvme_tcp_try_send(), no?

Maybe it's a better idea to check sk_stream_is_writeable() just
before nvme_tcp_try_send()...
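Something like this (untested sketch only; the surrounding io_work
context is from memory, so treat it as an illustration rather than a
proper patch):

	do {
		bool pending = false;
		int result;

		/*
		 * Only attempt a send when the stack actually has space;
		 * otherwise rely on ->write_space() to reschedule io_work
		 * once the socket becomes writeable again.
		 */
		if (sk_stream_is_writeable(queue->sock->sk) &&
		    mutex_trylock(&queue->send_mutex)) {
			result = nvme_tcp_try_send(queue);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
			else if (unlikely(result < 0))
				break;
		}
		...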

Let me see how things pan out.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


