[PATCH 1/3] nvme-tcp: improve rx/tx fairness
Hannes Reinecke
hare at suse.de
Mon Jul 8 08:50:31 PDT 2024
On 7/8/24 16:25, Sagi Grimberg wrote:
>
>
> On 08/07/2024 16:21, Hannes Reinecke wrote:
>> On 7/8/24 13:57, Sagi Grimberg wrote:
>>> Hey Hannes, thanks for doing this.
>>>
[ .. ]
>>> That is a significant decline in relative perf. I'm counting 12.5%...
>>> Can you explain why that is?
>>>
>> Not really :-(
>> But then fairness cuts both ways; so I am not surprised that some
>> workloads suffer here.
>
> Well, 12.5% is rather steep, don't you agree?
> I'm even more surprised that seq/rand read make a difference...
>
Yeah; but I guess I found the issue now (we should interrupt data
transfers when the deadline expires, not just PDU transfers).
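
Roughly what I'm testing now; condensed, with the existing helpers from
tcp.c, and only the deadline check plus the early-return accounting
being new:

/*
 * Sketch only: run the deadline check after every chunk consumed from
 * the skb (PDU header, data, digest), so a large data transfer gets
 * interrupted as well, not only complete PDUs.
 */
static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
			     unsigned int offset, size_t len)
{
	struct nvme_tcp_queue *queue = desc->arg.data;
	size_t consumed = len;
	int result;

	while (len) {
		switch (nvme_tcp_recv_state(queue)) {
		case NVME_TCP_RECV_PDU:
			result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
			break;
		case NVME_TCP_RECV_DATA:
			result = nvme_tcp_recv_data(queue, skb, &offset, &len);
			break;
		case NVME_TCP_RECV_DDGST:
			result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
			break;
		default:
			result = -EFAULT;
		}
		if (result) {
			/* error handling elided */
			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
			return result;
		}
		/* new: stop mid data transfer once the budget is gone */
		if (time_after(jiffies, queue->deadline)) {
			desc->count = 0;
			break;
		}
	}
	/* report back only what was actually consumed */
	return consumed - len;
}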
[ .. ]
>>>
>>> I don't see why you need to keep this in the queue struct. You could
>>> have
>>> easily initialize it in the read_descriptor_t and test against it.
>>>
>> Because I wanted to interrupt the receive side for large data
>> transfers, and haven't found a way to pass the deadline into
>> ->read_sock().
>
> desc.count is for you to interpret however you want it to be no?
>
Except for the magic 'desc.count == 0' value; but yeah, that should work.
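Something like this, then; a sketch only, where the budget value and how
nvme_tcp_recv_skb() decrements the remaining count are still open:

/*
 * Sketch: use rd_desc.count as a receive budget instead of the current
 * "do we want to read at all" flag.  tcp_read_sock() stops looping as
 * soon as desc->count reaches 0, which is the magic value mentioned
 * above, so the budget has to start out non-zero.
 */
static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, size_t budget)
{
	struct socket *sock = queue->sock;
	struct sock *sk = sock->sk;
	read_descriptor_t rd_desc;
	int consumed;

	rd_desc.arg.data = queue;
	rd_desc.count = budget;		/* e.g. 64k bytes, or a PDU count */
	lock_sock(sk);
	queue->nr_cqe = 0;
	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
	release_sock(sk);
	return consumed;
}

nvme_tcp_recv_skb() would then subtract whatever it consumed from
desc->count and stop once it reaches zero, instead of comparing jiffies
against queue->deadline.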
>>
>>>> /* send state */
>>>> struct nvme_tcp_request *request;
>>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>>>> }
>>>> }
>>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>>> + unsigned long deadline)
>>>> {
>>>> int ret;
>>>> /* drain the send queue as much as we can... */
>>>> do {
>>>> ret = nvme_tcp_try_send(queue);
>>>> + if (time_after(jiffies, deadline))
>>>> + break;
>>>> } while (ret > 0);
>>>> + return ret;
>>>
>>> I think you want a different interface, nvme_tcp_send_budgeted(queue,
>>> budget).
>>> I don't know what you pass here, but jiffies is a rather large
>>> granularity...
>>>
>> Hmm. I've been using jiffies as the io_work loop had been counting
>> in jiffies. But let me check what would happen if I move that over to
>> PDU / size counting.
>
> Cool.
>
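For reference, I read that as something like the following (the name
nvme_tcp_send_budgeted() and the counting of send passes are my
interpretation, not existing code):

/*
 * Sketch of a budgeted send path: bound the drain loop by a number of
 * send attempts (roughly PDUs) instead of a jiffies deadline.
 */
static inline int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue,
					 unsigned int budget)
{
	int ret;

	/* drain the send queue, but no further than 'budget' passes */
	do {
		ret = nvme_tcp_try_send(queue);
	} while (ret > 0 && --budget > 0);

	return ret;
}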
>>
>>>> }
>>>> static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
>>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>> bool sync, bool last)
>>>> {
>>>> struct nvme_tcp_queue *queue = req->queue;
>>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>> bool empty;
>>>> empty = llist_add(&req->lentry, &queue->req_list) &&
>>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>> */
>>>> if (queue->io_cpu == raw_smp_processor_id() &&
>>>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>> - nvme_tcp_send_all(queue);
>>>> + nvme_tcp_send_all(queue, deadline);
>>>> mutex_unlock(&queue->send_mutex);
>>>> }
>>>
>>> Umm, spend up to a millisecond in queue_request? Sounds like way
>>> too much...
>>> Did you ever see this deadline exceeded? sends should be rather quick...
>>>
>> Of _course_ it's too long. That's kinda the point.
>> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
>> so _that_ is the least of my worries ...
>
> This has nothing to do with the network latency afaict. This is just a
> matter of how long does it take to write to the socket buffer (or in our case,
> construct skbs and add them to the socket as TX is zero copy).
>
Not the transfer itself? I would have thought so ...
>>
>>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>> return result;
>>>> }
>>>> + if (time_after(jiffies, queue->deadline)) {
>>>> + desc->count = 0;
>>>> + break;
>>>> + }
>>>> +
>>>
>>> That is still not right.
>>> You don't want to spend a full deadline reading from the socket, and
>>> then spend a full deadline writing to the socket...
>>>
>> Yes, and no.
>> Problem is that the current code serializes writes and reads.
>> Essentially, for each iteration we first do a write, and then a read.
>> If we spend the full deadline on write we will need to reschedule,
>> but we then _again_ start with writes. This leads to a heavy preference
>> for writing, and negative performance impact.
>
> Yes, the assumption we should take is that the tx/rx limits allow us to
> switch between the two at least once or twice in the io_work while loop,
> and that they get a fair-ish share of the quota.
>
>>
>>> You want the io_work to take a full deadline, and send budgets of
>>> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
>>> 64k bytes. We want to get to some magic value that presents a
>>> sane behavior, that confidently fits inside a deadline, and is fair.
>>>
>> Easier said than done.
>> Biggest problem is that most of the latency increase comes from the
>> actual 'sendmsg()' and 'read_sock()' calls.
>
> This is a bit surprising, because read_sock reads data that has already
> been processed by the TCP stack and is ready to simply be copied back to
> the user pages. The same goes for sendmsg: the socket is non-blocking, so
> I would not expect it to stall for unusually long periods. Maybe there is
> an issue hiding in how we call send/recv?
>
Hmm. That could be.
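Regardless, this is roughly the io_work structure I want to try next:
per-iteration budgets for both directions, so the loop switches between
tx and rx a few times within the deadline. The budget values are made
up, and nvme_tcp_send_budgeted() / the budgeted nvme_tcp_try_recv() are
the sketches from above, not actual driver code:

static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;
		int result;

		if (mutex_trylock(&queue->send_mutex)) {
			/* e.g. at most 8 send passes per iteration */
			result = nvme_tcp_send_budgeted(queue, 8);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
			else if (unlikely(result < 0))
				break;
		}

		/* e.g. at most 64k received bytes per iteration */
		result = nvme_tcp_try_recv(queue, SZ_64K);
		if (result > 0)
			pending = true;
		else if (unlikely(result < 0))
			return;

		if (!pending || !queue->rd_enabled)
			return;
	} while (!time_after(jiffies, deadline)); /* quota is exhausted */

	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}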
>> And the only way of inhibiting that would be to check _prior_ whether
>> we can issue the call in the first place.
>> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).
>
> The socket is non-blocking. I always thought that SOCK_NOSPACE serves
> blocking sockets... AFAIK, there shouldn't be a real difference between
> checking NOSPACE vs. attempting sendmsg on a non-blocking socket... But
> I could be wrong here.
>
Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we even
get the 'write_space()' callback?
It gets triggered quite often, and checking for the SOCK_NOSPACE bit
before sending drops the number of invocations quite significantly.
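For reference, the check in the earlier patchset was roughly this (the
helper name is made up, it just illustrates the idea):

/*
 * Sketch of the SOCK_NOSPACE test: skip the sendmsg() attempt when the
 * socket has already signalled that it is out of space, and let the
 * write_space() callback reschedule io_work once room is available.
 */
static bool nvme_tcp_queue_writeable(struct nvme_tcp_queue *queue)
{
	struct sock *sk = queue->sock->sk;

	if (test_bit(SOCK_NOSPACE, &sk->sk_socket->flags) &&
	    !sk_stream_is_writeable(sk))
		return false;

	return true;
}

Whether that is actually needed for a non-blocking socket is exactly the
question above; it mainly avoids going into sendmsg() just to get
-EAGAIN back.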
[ .. ]
>>>> break;
>>>> }
>>>> - result = nvme_tcp_try_recv(queue);
>>>> + result = nvme_tcp_try_recv(queue, rx_deadline);
>>>
>>> I think you want a more frequent alternation of sends/receives. The
>>> granularity of a 1ms budget may be too coarse?
>>>
>> The problem is not the granularity, the problem is the latency spikes
>> I'm seeing when issuing 'sendmsg' or 'read_sock'.
>
> I think it's possible that there is another underlying issue, because
> these calls should NOT block for a long period of time on non-blocking
> sockets.
>
I did some instrumentation, and that certainly pointed in that
direction. But I will have to redo it to get some meaningful data here.
Another point really might be the granularity; jiffies isn't a good
measurement here (even with HZ=1000 we only get millisecond granularity,
so it might be that we're incurring fluctuations from the jiffies
accounting itself).
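If granularity really is part of the problem, one option would be to
move the budget over to ktime; just a sketch, with a made-up 500us
value:

	/* in nvme_tcp_io_work(): ns-granularity budget instead of jiffies */
	ktime_t deadline = ktime_add_us(ktime_get(), 500);	/* 500us, made up */

	do {
		/* ... send/recv passes as before ... */
	} while (ktime_before(ktime_get(), deadline));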
>>
>>>> if (result > 0)
>>>> pending = true;
>>>> else if (unlikely(result < 0))
>>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>> if (!pending || !queue->rd_enabled)
>>>> return;
>>>> - } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>>> + } while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
>>>> +
>>>> + overrun = jiffies - rx_deadline;
>>>> + if (nvme_tcp_queue_id(queue) > 0 &&
>>>> + overrun > msecs_to_jiffies(10))
>>>> + dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
>>>> + nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>>
>>> Umm, ok. why 10? why not 2? or 3?
>>> Do you expect io_work to spend more time executing?
>>>
>> Eg 4000 msecs like on my testbed?
>
> If you see ANYTHING in the IO path stalling for 4 seconds then we have a
> bigger issue here.
>
You don't say ...
>> Yes.
>>
>>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>> }
>>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>>> {
>>>> struct nvme_tcp_queue *queue = hctx->driver_data;
>>>> struct sock *sk = queue->sock->sk;
>>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>> if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>>> return 0;
>>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>>> set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>>> if (sk_can_busy_loop(sk) &&
>>>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>>> sk_busy_loop(sk, true);
>>>> - nvme_tcp_try_recv(queue);
>>>> + nvme_tcp_try_recv(queue, deadline);
>>>
>>> spend a millisecond in nvme_tcp_poll() ??
>>> Isn't it too long?
>> Haven't tried with polling, so can't honestly answer.
>>
>> In the end, it all boils down to numbers.
>> I have 2 controllers with 32 queues and 256 requests.
>> Running on a 10GigE link one should be getting a net throughput
>> of 1GB/s, or, with 4k requests, 256k IOPS.
>> Or a latency of 0.25 milliseconds per command.
>> In the optimal case. As I'm seeing a bandwidth of around
>> 500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal case_.
>>
>> So no, I don't think the 1 milliseconds is too long.
>
> nvme_tcp_poll is NOT designed to poll for the entirety of an individual
> IO lifetime. It's a non-selective polling interface wired to io_uring.
> It should peek the socket, consume whatever is there, and break. It most
> definitely should not spend 1ms in this check...
Oh, I know that.
That's why I said: "Haven't tried with polling."
The calculation is independent of polling.
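To spell out the arithmetic: roughly 1GB/s of payload over 10GigE
divided by 4k per request gives the ~256k IOPS; spread over 2
controllers times 32 queues that is about 4k IOPS, i.e. 0.25
milliseconds per command, per queue. At the ~500MB/s I'm actually
seeing, that doubles to the 0.5 milliseconds quoted above.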
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich