[PATCH 1/3] nvme-tcp: improve rx/tx fairness
Sagi Grimberg
sagi at grimberg.me
Mon Jul 8 07:25:03 PDT 2024
On 08/07/2024 16:21, Hannes Reinecke wrote:
> On 7/8/24 13:57, Sagi Grimberg wrote:
>> Hey Hannes, thanks for doing this.
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> We need to restrict both sides, rx and tx, to only run for a certain time
>>> to ensure that we're not blocking the other side and inducing starvation.
>>> So pass in a 'deadline' value to nvme_tcp_send_all()
>>
>> Please split the addition of nvme_tcp_send_all() to a separate prep
>> patch.
>>
> Okay, no problem.
>
>>> and nvme_tcp_try_recv()
>>> and break out of the loop if the deadline is reached.
>>
>> I think we want to limit the rx/tx in pdus/bytes. This will also
>> allow us
>> to possibly do burst rx from data-ready.
>>
> PDUs are not the best scheduling boundary here, as each PDU can be of
> a different size, and the network interface most definitely is limited by
> the number of bytes transferred.
Well, PDUs are at least A proxy for bytes in terms of limits, but I agree
bytes may be more appropriate.
>
>>>
>>> As we now have a timestamp we can also use it to print out a warning
>>> if the actual time spent exceeds the deadline.
>>>
>>> Performance comparison:
>>>                 baseline    rx/tx fairness
>>> 4k seq write:   449MiB/s    480MiB/s
>>> 4k rand write:  410MiB/s    481MiB/s
>>> 4k seq read:    478MiB/s    481MiB/s
>>> 4k rand read:   547MiB/s    480MiB/s
>>>
>>> Random read is ever so disappointing, but that will be fixed with
>>> the later patches.
>>
>> That is a significant decline in relative perf. I'm counting 12.5%...
>> Can you explain why that is?
>>
> Not really :-(
> But then fairness cuts both ways; so I am not surprised that some
> workloads suffer here.
Well, 12.5% is rather steep, don't you agree?
I'm even more surprised that seq vs. rand read makes a difference...
>
>
>> How does this look for multiple controllers?
>>
> Haven't really checked (yet); it would make a rather weak case if
> we killed performance just to scale better ...
>
>>
>>
>>>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>> drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>>> 1 file changed, 29 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 0873b3949355..f621d3ba89b2 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>>> size_t data_remaining;
>>> size_t ddgst_remaining;
>>> unsigned int nr_cqe;
>>> + unsigned long deadline;
>>
>> I don't see why you need to keep this in the queue struct. You could
>> easily initialize it in the read_descriptor_t and test against it.
>>
> Because I wanted to interrupt the receive side for large data
> transfers, and haven't found a way to pass the deadline into
> ->read_sock().
desc.count is for you to interpret however you want, no?
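Just to illustrate the idea (a sketch, not your patch: nvme_tcp_try_recv_budgeted
is a made-up name and the budget handling is illustrative): tcp_read_sock()
stops calling the recv actor once desc->count drops to zero, so a byte budget
can live in the descriptor instead of the queue struct:

static int nvme_tcp_try_recv_budgeted(struct nvme_tcp_queue *queue,
                                      size_t budget)
{
        struct socket *sock = queue->sock;
        struct sock *sk = sock->sk;
        read_descriptor_t rd_desc;
        int consumed;

        rd_desc.arg.data = queue;
        rd_desc.count = budget;         /* bytes we are willing to consume */

        lock_sock(sk);
        queue->nr_cqe = 0;
        consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
        release_sock(sk);
        return consumed;
}

In nvme_tcp_recv_skb() you would then subtract the bytes actually processed
from desc->count, so read_sock() bails out once the budget is spent.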
>
>>> /* send state */
>>> struct nvme_tcp_request *request;
>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>>> }
>>> }
>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>> + unsigned long deadline)
>>> {
>>> int ret;
>>> /* drain the send queue as much as we can... */
>>> do {
>>> ret = nvme_tcp_try_send(queue);
>>> + if (time_after(jiffies, deadline))
>>> + break;
>>> } while (ret > 0);
>>> + return ret;
>>
>> I think you want a different interface, nvme_tcp_send_budgeted(queue, budget).
>> I don't know what you pass here, but jiffies is a rather large granularity...
>>
> Hmm. I've been using jiffies as the io_work loop had been counting
> in jiffies. But let me check what would happen if I move that over to
> PDU / size counting.
Cool.
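Something along these lines is what I have in mind (only a sketch:
nvme_tcp_send_budgeted() is a made-up name, and it assumes nvme_tcp_try_send()
is changed to return the number of bytes it pushed out, which it does not do
today):

static int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue, int budget)
{
        int ret;

        /* drain the send queue until the byte budget is used up */
        do {
                ret = nvme_tcp_try_send(queue);
                if (ret <= 0)
                        break;
                budget -= ret;  /* assumes try_send reports bytes written */
        } while (budget > 0);

        return ret;
}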
>
>>> }
>>> static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>> bool sync, bool last)
>>> {
>>> struct nvme_tcp_queue *queue = req->queue;
>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> bool empty;
>>> empty = llist_add(&req->lentry, &queue->req_list) &&
>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>> */
>>> if (queue->io_cpu == raw_smp_processor_id() &&
>>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>>> - nvme_tcp_send_all(queue);
>>> + nvme_tcp_send_all(queue, deadline);
>>> mutex_unlock(&queue->send_mutex);
>>> }
>>
>> Umm, spend up to a millisecond in queue_request? Sounds like way
>> too much...
>> Did you ever see this deadline exceeded? Sends should be rather quick...
>>
> Of _course_ it's too long. That's kinda the point.
> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
> so _that_ is the least of my worries ...
This has nothing to do with the network latency afaict. This is just a matter
of how long it takes to write to the socket buffer (or in our case, construct
skbs and add them to the socket, as TX is zero copy).
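(For context, the TX path splices the data pages into the socket instead of
copying them, roughly like the sketch below; nvme_tcp_send_page() is an
illustrative helper, not the driver's exact send path.)

static int nvme_tcp_send_page(struct socket *sock, struct page *page,
                              size_t offset, size_t len)
{
        struct bio_vec bvec;
        struct msghdr msg = {
                .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
        };

        /* hand the page itself to the stack; no data copy into the skb */
        bvec_set_page(&bvec, page, len, offset);
        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
        return sock_sendmsg(sock, &msg);
}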
>
>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>> return result;
>>> }
>>> + if (time_after(jiffies, queue->deadline)) {
>>> + desc->count = 0;
>>> + break;
>>> + }
>>> +
>>
>> That is still not right.
>> You don't want to spend a full deadline reading from the socket, and
>> then spend a full deadline writing to the socket...
>>
> Yes, and no.
> Problem is that the current code serializes writes and reads.
> Essentially, for each iteration we first do a write, and then a read.
> If we spend the full deadline on writes we will need to reschedule,
> but we then _again_ start with writes. This leads to a heavy preference
> for writing, and a negative performance impact.
Yes, the assumption we should take is that the tx/rx limits allow switching
between the two at least once or twice in the io_work while loop, and that
they each get a fair'ish share of the quota.
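Roughly this shape for io_work, in other words (a sketch only: the
NVME_TCP_*_BUDGET values and the budgeted helpers are hypothetical names,
and the error handling is simplified):

/* illustrative per-pass budgets, SZ_64K from <linux/sizes.h> */
#define NVME_TCP_SEND_BUDGET    SZ_64K
#define NVME_TCP_RECV_BUDGET    SZ_64K

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);
        unsigned long deadline = jiffies + msecs_to_jiffies(1);

        do {
                bool pending = false;
                int result;

                if (mutex_trylock(&queue->send_mutex)) {
                        /* bounded tx pass */
                        result = nvme_tcp_send_budgeted(queue,
                                                        NVME_TCP_SEND_BUDGET);
                        mutex_unlock(&queue->send_mutex);
                        if (result > 0)
                                pending = true;
                        else if (unlikely(result < 0))
                                break;
                }

                /* bounded rx pass, so tx and rx interleave within the 1ms */
                result = nvme_tcp_try_recv_budgeted(queue,
                                                    NVME_TCP_RECV_BUDGET);
                if (result > 0)
                        pending = true;
                else if (unlikely(result < 0))
                        return;

                if (!pending || !queue->rd_enabled)
                        return;

        } while (!time_after(jiffies, deadline)); /* overall quota exhausted */

        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}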
>
>> You want the io_work to take a full deadline, and pass budgets to
>> try_send and try_recv. And set them to sane counts. Say 8 PDUs, or
>> 64k bytes. We want to get to some magic value that presents
>> sane behavior, confidently fits inside a deadline, and is fair.
>>
> Easier said than done.
> Biggest problem is that most of the latency increase comes from the
> actual 'sendmsg()' and 'read_sock()' calls.
This is a bit surprising, because read_sock reads data that has already been
processed by the tcp stack and is ready to simply be copied back to the user
pages. Same for sendmsg: the socket is non-blocking, so I would not
expect it to stall for unusually long periods. Maybe there is an issue
hiding in how we call send/recv?
> And the only way of inhibiting that would be to check _prior_ whether
> we can issue the call in the first place.
> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).
The socket is non-blocking. I always thought that SOCK_NOSPACE serves
blocking sockets... afaik, there shouldn't be a real difference between
checking NOSPACE vs. attempting sendmsg on a non-blocking socket...
But I could be wrong here...
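To illustrate the point: assuming the send path keeps passing MSG_DONTWAIT
(which nvme-tcp already does), a full send buffer surfaces as -EAGAIN rather
than a stall, which is roughly the same information a NOSPACE-style check
would give you up front. nvme_tcp_send_chunk() here is a hypothetical helper,
not driver code:

static int nvme_tcp_send_chunk(struct socket *sock, void *data, size_t len)
{
        struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL };
        struct kvec iov = { .iov_base = data, .iov_len = len };

        /* optional pre-check, comparable to testing for NOSPACE */
        if (!sk_stream_is_writeable(sock->sk))
                return -EAGAIN;

        /* non-blocking: a full buffer returns -EAGAIN instead of sleeping */
        return kernel_sendmsg(sock, &msg, &iov, 1, len);
}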
>
>>> }
>>> - return consumed;
>>> + return consumed - len;
>>> }
>>> static void nvme_tcp_data_ready(struct sock *sk)
>>> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>>> return ret;
>>> }
>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned long deadline)
>>> {
>>> struct socket *sock = queue->sock;
>>> struct sock *sk = sock->sk;
>>> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> rd_desc.count = 1;
>>> lock_sock(sk);
>>> queue->nr_cqe = 0;
>>> + queue->deadline = deadline;
>>> consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>> release_sock(sk);
>>> return consumed;
>>> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>> {
>>> struct nvme_tcp_queue *queue =
>>> container_of(w, struct nvme_tcp_queue, io_work);
>>> - unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> + unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
>>> + unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), overrun;
>>> do {
>>> bool pending = false;
>>> int result;
>>> if (mutex_trylock(&queue->send_mutex)) {
>>> - result = nvme_tcp_try_send(queue);
>>> + result = nvme_tcp_send_all(queue, tx_deadline);
>>> mutex_unlock(&queue->send_mutex);
>>> if (result > 0)
>>> pending = true;
>>> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>> break;
>>> }
>>> - result = nvme_tcp_try_recv(queue);
>>> + result = nvme_tcp_try_recv(queue, rx_deadline);
>>
>> I think you want a more frequent alternation of sends/receives. The
>> granularity of a 1ms budget may be too coarse?
>>
> The problem is not the granularity, the problem is the latency spikes
> I'm seeing when issuing 'sendmsg' or 'read_sock'.
I think it's possible that there is another underlying issue, because these
calls should NOT block for long periods of time on non-blocking sockets.
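If you want to confirm where the time actually goes, a quick (purely
hypothetical) instrumentation wrapper around the unpatched nvme_tcp_try_recv()
would already tell us a lot, something like:

static int nvme_tcp_try_recv_timed(struct nvme_tcp_queue *queue)
{
        ktime_t start;
        s64 us;
        int ret;

        start = ktime_get();
        ret = nvme_tcp_try_recv(queue);
        us = ktime_us_delta(ktime_get(), start);

        if (us > 1000)  /* arbitrary 1ms threshold */
                dev_dbg(queue->ctrl->ctrl.device,
                        "queue %d: read_sock took %lld us\n",
                        nvme_tcp_queue_id(queue), us);
        return ret;
}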
>
>>> if (result > 0)
>>> pending = true;
>>> else if (unlikely(result < 0))
>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>> if (!pending || !queue->rd_enabled)
>>> return;
>>> - } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>> + } while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
>>> +
>>> + overrun = jiffies - rx_deadline;
>>> + if (nvme_tcp_queue_id(queue) > 0 &&
>>> + overrun > msecs_to_jiffies(10))
>>> + dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
>>> + nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>
>> Umm, ok. why 10? why not 2? or 3?
>> Do you expect io_work to spend more time executing?
>>
> Eg 4000 msecs like on my testbed?
If you see ANYTHING in the IO path stalling for 4 seconds then we have a
bigger issue here.
> Yes.
>
>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> }
>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>> {
>>> struct nvme_tcp_queue *queue = hctx->driver_data;
>>> struct sock *sk = queue->sock->sk;
>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>> return 0;
>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>> set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>> if (sk_can_busy_loop(sk) &&
>>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>> sk_busy_loop(sk, true);
>>> - nvme_tcp_try_recv(queue);
>>> + nvme_tcp_try_recv(queue, deadline);
>>
>> Spend a millisecond in nvme_tcp_poll()??
>> Isn't that too long?
> Haven't tried with polling, so can't honestly answer.
>
> In the end, it all boils down to numbers.
> I have 2 controllers with 32 queues and 256 requests.
> Running on a 10GigE link one should be getting a net throughput
> of 1GB/s, or, with 4k requests, 256k IOPS. Spread over the 2 x 32 queues
> that is 4k IOPS, or a latency of 0.25 milliseconds per command, per queue.
> In the optimal case. As I'm seeing a bandwidth of around
> 500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal case_.
>
> So no, I don't think the 1 millisecond is too long.
nvme_tcp_poll is NOT designed to poll for the entirety of an individual
IO lifetime. It's a non-selective polling interface wired to io_uring.
It should peek the socket, consume whatever is there, and break.
It most definitely should not spend 1ms in this check...
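In other words, the poll path should keep roughly the shape it already has
upstream; the sketch below is essentially today's code, just without any
deadline argument:

static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
        struct nvme_tcp_queue *queue = hctx->driver_data;
        struct sock *sk = queue->sock->sk;
        int ret;

        if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
                return 0;

        set_bit(NVME_TCP_Q_POLLING, &queue->flags);
        if (sk_can_busy_loop(sk) &&
            skb_queue_empty_lockless(&sk->sk_receive_queue))
                sk_busy_loop(sk, true);

        /* one bounded pass over whatever is already queued on the socket */
        ret = nvme_tcp_try_recv(queue);

        clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
        return ret < 0 ? ret : queue->nr_cqe;
}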