[PATCH 1/3] nvme-tcp: improve rx/tx fairness
Sagi Grimberg
sagi at grimberg.me
Mon Jul 8 07:25:03 PDT 2024
On 08/07/2024 16:21, Hannes Reinecke wrote:
> On 7/8/24 13:57, Sagi Grimberg wrote:
>> Hey Hannes, thanks for doing this.
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> We need to restrict both sides, rx and tx, to only run for a certain time
>>> to ensure that we're not blocking the other side and inducing starvation.
>>> So pass in a 'deadline' value to nvme_tcp_send_all()
>>
>> Please split the addition of nvme_tcp_send_all() to a separate prep
>> patch.
>>
> Okay, no problem.
>
>>> and nvme_tcp_try_recv()
>>> and break out of the loop if the deadline is reached.
>>
>> I think we want to limit the rx/tx in pdus/bytes. This will also
>> allow us
>> to possibly do burst rx from data-ready.
>>
> PDUs are not the best scheduling boundary here, as each PDU can be of
> a different size, and the network interface most definitely is limited by
> the number of bytes transferred.
Well, PDUs are at least A proxy for bytes in terms of limits, but I agree
bytes may be more appropriate.
>
>>>
>>> As we now have a timestamp we can also use it to print out a warning
>>> if the actual time spent exceeds the deadline.
>>>
>>> Performance comparison:
>>>                 baseline    rx/tx fairness
>>> 4k seq write:   449MiB/s    480MiB/s
>>> 4k rand write:  410MiB/s    481MiB/s
>>> 4k seq read:    478MiB/s    481MiB/s
>>> 4k rand read:   547MiB/s    480MiB/s
>>>
>>> Random read is ever so disappointing, but that will be fixed with
>>> the later patches.
>>
>> That is a significant decline in relative perf. I'm counting 12.5%...
>> Can you explain why that is?
>>
> Not really :-(
> But then fairness cuts both ways; so I am not surprised that some
> workloads suffer here.
Well, 12.5% is rather steep, don't you agree?
I'm even more surprised that seq vs. rand read makes a difference...
>
>
>> How does this look for multiple controllers?
>>
> Haven't really checked (yet); it would make a rather weak case if
> we killed performance just to scale better ...
>
>>
>>
>>>
>>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>>> ---
>>> drivers/nvme/host/tcp.c | 38 +++++++++++++++++++++++++++++---------
>>> 1 file changed, 29 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 0873b3949355..f621d3ba89b2 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -153,6 +153,7 @@ struct nvme_tcp_queue {
>>> size_t data_remaining;
>>> size_t ddgst_remaining;
>>> unsigned int nr_cqe;
>>> + unsigned long deadline;
>>
>> I don't see why you need to keep this in the queue struct. You could
>> easily initialize it in the read_descriptor_t and test against it.
>>
> Because I wanted to interrupt the receive side for large data
> transfers, and haven't found a way to pass the deadline into
> ->read_sock().
desc.count is for you to interpret however you want, no?
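Just to illustrate the idea (a sketch, not your patch: nvme_tcp_try_recv_budgeted
is a made-up name and the budget handling is illustrative): tcp_read_sock()
stops calling the recv actor once desc->count drops to zero, so a byte budget
can live in the descriptor instead of the queue struct:

static int nvme_tcp_try_recv_budgeted(struct nvme_tcp_queue *queue,
                                      size_t budget)
{
        struct socket *sock = queue->sock;
        struct sock *sk = sock->sk;
        read_descriptor_t rd_desc;
        int consumed;

        rd_desc.arg.data = queue;
        rd_desc.count = budget;         /* bytes we are willing to consume */

        lock_sock(sk);
        queue->nr_cqe = 0;
        consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
        release_sock(sk);
        return consumed;
}

In nvme_tcp_recv_skb() you would then subtract the bytes actually processed
from desc->count, so read_sock() bails out once the budget is spent.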
>
>>> /* send state */
>>> struct nvme_tcp_request *request;
>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>>> }
>>> }
>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>> + unsigned long deadline)
>>> {
>>> int ret;
>>> /* drain the send queue as much as we can... */
>>> do {
>>> ret = nvme_tcp_try_send(queue);
>>> + if (time_after(jiffies, deadline))
>>> + break;
>>> } while (ret > 0);
>>> + return ret;
>>
>> I think you want a different interface, nvme_tcp_send_budgeted(queue, budget).
>> I don't know what you pass here, but jiffies is a rather large granularity...
>>
> Hmm. I've been using jiffies as the io_work loop had been counting
> in jiffies. But let me check what would happen if I move that over to
> PDU / size counting.
Cool.
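Something along these lines is what I have in mind (only a sketch:
nvme_tcp_send_budgeted() is a made-up name, and it assumes nvme_tcp_try_send()
is changed to return the number of bytes it pushed out, which it does not do
today):

static int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue, int budget)
{
        int ret;

        /* drain the send queue until the byte budget is used up */
        do {
                ret = nvme_tcp_try_send(queue);
                if (ret <= 0)
                        break;
                budget -= ret;  /* assumes try_send reports bytes written */
        } while (budget > 0);

        return ret;
}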
>
>>> }
>>> static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>> bool sync, bool last)
>>> {
>>> struct nvme_tcp_queue *queue = req->queue;
>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> bool empty;
>>> empty = llist_add(&req->lentry, &queue->req_list) &&
>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>> */
>>> if (queue->io_cpu == raw_smp_processor_id() &&
>>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>>> - nvme_tcp_send_all(queue);
>>> + nvme_tcp_send_all(queue, deadline);
>>> mutex_unlock(&queue->send_mutex);
>>> }
>>
>> Umm, spend up to a millisecond in queue_request? Sounds like way
>> too much...
>> Did you ever see this deadline exceeded? Sends should be rather quick...
>>
> Of _course_ it's too long. That's kinda the point.
> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
> so _that_ is the least of my worries ...
This has nothing to do with the network latency afaict. This is just a matter
of how long it takes to write to the socket buffer (or in our case, construct
skbs and add them to the socket, as TX is zero copy).
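(For context, the TX path splices the data pages into the socket instead of
copying them, roughly like the sketch below; nvme_tcp_send_page() is an
illustrative helper, not the driver's exact send path.)

static int nvme_tcp_send_page(struct socket *sock, struct page *page,
                              size_t offset, size_t len)
{
        struct bio_vec bvec;
        struct msghdr msg = {
                .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
        };

        /* hand the page itself to the stack; no data copy into the skb */
        bvec_set_page(&bvec, page, len, offset);
        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
        return sock_sendmsg(sock, &msg);
}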
>
>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>> return result;
>>> }
>>> + if (time_after(jiffies, queue->deadline)) {
>>> + desc->count = 0;
>>> + break;
>>> + }
>>> +
>>
>> That is still not right.
>> You don't want to spend a full deadline reading from the socket, and
>> then spend a full deadline writing to the socket...
>>
> Yes, and no.
> Problem is that the current code serializes writes and reads.
> Essentially, for each iteration we first do a write, and then a read.
> If we spend the full deadline on writes we will need to reschedule,
> but we then _again_ start with writes. This leads to a heavy preference
> for writing, and a negative performance impact.
Yes, the assumption we should take is that the tx/rx limits allow switching
between the two at least once or twice in the io_work while loop, and that
they each get a fair'ish share of the quota.
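Roughly this shape for io_work, in other words (a sketch only: the
NVME_TCP_*_BUDGET values and the budgeted helpers are hypothetical names,
and the error handling is simplified):

/* illustrative per-pass budgets, SZ_64K from <linux/sizes.h> */
#define NVME_TCP_SEND_BUDGET    SZ_64K
#define NVME_TCP_RECV_BUDGET    SZ_64K

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);
        unsigned long deadline = jiffies + msecs_to_jiffies(1);

        do {
                bool pending = false;
                int result;

                if (mutex_trylock(&queue->send_mutex)) {
                        /* bounded tx pass */
                        result = nvme_tcp_send_budgeted(queue,
                                                        NVME_TCP_SEND_BUDGET);
                        mutex_unlock(&queue->send_mutex);
                        if (result > 0)
                                pending = true;
                        else if (unlikely(result < 0))
                                break;
                }

                /* bounded rx pass, so tx and rx interleave within the 1ms */
                result = nvme_tcp_try_recv_budgeted(queue,
                                                    NVME_TCP_RECV_BUDGET);
                if (result > 0)
                        pending = true;
                else if (unlikely(result < 0))
                        return;

                if (!pending || !queue->rd_enabled)
                        return;

        } while (!time_after(jiffies, deadline)); /* overall quota exhausted */

        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}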
>
>> You want the io_work to take a full deadline, and pass budgets to
>> try_send and try_recv. And set them to sane counts. Say 8 PDUs, or
>> 64k bytes. We want to get to some magic value that presents
>> sane behavior, confidently fits inside a deadline, and is fair.
>>
> Easier said than done.
> Biggest problem is that most of the latency increase comes from the
> actual 'sendmsg()' and 'read_sock()' calls.
This is a bit surprising, because read_sock reads data that has already been
processed by the tcp stack and is ready to simply be copied back to the user
pages. Same for sendmsg: the socket is non-blocking, so I would not
expect it to stall for unusually long periods. Maybe there is an issue
hiding in how we call send/recv?
> And the only way of inhibiting that would be to check _prior_ whether
> we can issue the call in the first place.
> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).
The socket is non-blocking. I always thought that SOCK_NOSPACE serves
blocking sockets... afaik, there shouldn't be a real difference between
checking NOSPACE vs. attempting sendmsg on a non-blocking socket...
But I could be wrong here...
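To illustrate the point: assuming the send path keeps passing MSG_DONTWAIT
(which nvme-tcp already does), a full send buffer surfaces as -EAGAIN rather
than a stall, which is roughly the same information a NOSPACE-style check
would give you up front. nvme_tcp_send_chunk() here is a hypothetical helper,
not driver code:

static int nvme_tcp_send_chunk(struct socket *sock, void *data, size_t len)
{
        struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL };
        struct kvec iov = { .iov_base = data, .iov_len = len };

        /* optional pre-check, comparable to testing for NOSPACE */
        if (!sk_stream_is_writeable(sock->sk))
                return -EAGAIN;

        /* non-blocking: a full buffer returns -EAGAIN instead of sleeping */
        return kernel_sendmsg(sock, &msg, &iov, 1, len);
}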
>
>>> }
>>> - return consumed;
>>> + return consumed - len;
>>> }
>>> static void nvme_tcp_data_ready(struct sock *sk)
>>> @@ -1258,7 +1269,7 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>>> return ret;
>>> }
>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> +static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, unsigned long deadline)
>>> {
>>> struct socket *sock = queue->sock;
>>> struct sock *sk = sock->sk;
>>> @@ -1269,6 +1280,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> rd_desc.count = 1;
>>> lock_sock(sk);
>>> queue->nr_cqe = 0;
>>> + queue->deadline = deadline;
>>> consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>> release_sock(sk);
>>> return consumed;
>>> @@ -1278,14 +1290,15 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>> {
>>> struct nvme_tcp_queue *queue =
>>> container_of(w, struct nvme_tcp_queue, io_work);
>>> - unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> + unsigned long tx_deadline = jiffies + msecs_to_jiffies(1);
>>> + unsigned long rx_deadline = tx_deadline + msecs_to_jiffies(1), overrun;
>>> do {
>>> bool pending = false;
>>> int result;
>>> if (mutex_trylock(&queue->send_mutex)) {
>>> - result = nvme_tcp_try_send(queue);
>>> + result = nvme_tcp_send_all(queue, tx_deadline);
>>> mutex_unlock(&queue->send_mutex);
>>> if (result > 0)
>>> pending = true;
>>> @@ -1293,7 +1306,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>> break;
>>> }
>>> - result = nvme_tcp_try_recv(queue);
>>> + result = nvme_tcp_try_recv(queue, rx_deadline);
>>
>> I think you want a more frequent alternation of sends/receives. The
>> granularity of a 1ms budget may be too coarse?
>>
> The problem is not the granularity, the problem is the latency spikes
> I'm seeing when issuing 'sendmsg' or 'read_sock'.
I think it's possible that there is another underlying issue, because these
calls should NOT block for long periods of time on non-blocking sockets.
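If you want to confirm where the time actually goes, a quick (purely
hypothetical) instrumentation wrapper around the unpatched nvme_tcp_try_recv()
would already tell us a lot, something like:

static int nvme_tcp_try_recv_timed(struct nvme_tcp_queue *queue)
{
        ktime_t start;
        s64 us;
        int ret;

        start = ktime_get();
        ret = nvme_tcp_try_recv(queue);
        us = ktime_us_delta(ktime_get(), start);

        if (us > 1000)  /* arbitrary 1ms threshold */
                dev_dbg(queue->ctrl->ctrl.device,
                        "queue %d: read_sock took %lld us\n",
                        nvme_tcp_queue_id(queue), us);
        return ret;
}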
>
>>> if (result > 0)
>>> pending = true;
>>> else if (unlikely(result < 0))
>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>> if (!pending || !queue->rd_enabled)
>>> return;
>>> - } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>> + } while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
>>> +
>>> + overrun = jiffies - rx_deadline;
>>> + if (nvme_tcp_queue_id(queue) > 0 &&
>>> + overrun > msecs_to_jiffies(10))
>>> + dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
>>> + nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>
>> Umm, ok. why 10? why not 2? or 3?
>> Do you expect io_work to spend more time executing?
>>
> Eg 4000 msecs like on my testbed?
If you see ANYTHING in the IO path stalling for 4 seconds then we have a
bigger issue here.
> Yes.
>
>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>> }
>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>> {
>>> struct nvme_tcp_queue *queue = hctx->driver_data;
>>> struct sock *sk = queue->sock->sk;
>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>> if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>> return 0;
>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>> set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>> if (sk_can_busy_loop(sk) &&
>>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>> sk_busy_loop(sk, true);
>>> - nvme_tcp_try_recv(queue);
>>> + nvme_tcp_try_recv(queue, deadline);
>>
>> Spend a millisecond in nvme_tcp_poll()??
>> Isn't that too long?
> Haven't tried with polling, so can't honestly answer.
>
> In the end, it all boils down to numbers.
> I have 2 controllers with 32 queues and 256 requests.
> Running on a 10GigE link one should be getting a net throughput
> of 1GB/s, or, with 4k requests, 256k IOPS. Spread over the 2 x 32 queues
> that is 4k IOPS, or a latency of 0.25 milliseconds per command, per queue.
> In the optimal case. As I'm seeing a bandwidth of around
> 500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal case_.
>
> So no, I don't think the 1 millisecond is too long.
nvme_tcp_poll is NOT designed to poll for the entirety of an individual
IO lifetime. It's a non-selective polling interface wired to io_uring.
It should peek the socket, consume whatever is there, and break.
It most definitely should not spend 1ms in this check...
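In other words, the poll path should keep roughly the shape it already has
upstream; the sketch below is essentially today's code, just without any
deadline argument:

static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
        struct nvme_tcp_queue *queue = hctx->driver_data;
        struct sock *sk = queue->sock->sk;
        int ret;

        if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
                return 0;

        set_bit(NVME_TCP_Q_POLLING, &queue->flags);
        if (sk_can_busy_loop(sk) &&
            skb_queue_empty_lockless(&sk->sk_receive_queue))
                sk_busy_loop(sk, true);

        /* one bounded pass over whatever is already queued on the socket */
        ret = nvme_tcp_try_recv(queue);

        clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
        return ret < 0 ? ret : queue->nr_cqe;
}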