[PATCH 1/3] nvme-tcp: improve rx/tx fairness
Hannes Reinecke
hare at suse.de
Mon Jul 8 08:50:31 PDT 2024
On 7/8/24 16:25, Sagi Grimberg wrote:
>
>
> On 08/07/2024 16:21, Hannes Reinecke wrote:
>> On 7/8/24 13:57, Sagi Grimberg wrote:
>>> Hey Hannes, thanks for doing this.
>>>
[ .. ]
>>> That is a significant decline in relative perf. I'm counting 12.5%...
>>> Can you explain why that is?
>>>
>> Not really :-(
>> But then fairness cuts both ways; so I am not surprised that some
>> workloads suffer here.
>
> Well, 12.5% is rather steep, don't you agree?
> I'm even more surprised that seq/rand read make a difference...
>
Yeah; but I guess I found the issue now (we should interrupt data
transfers when the deadline expires, not just PDU transfers).
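
Roughly what I'm testing now; condensed, with the existing helpers from
tcp.c, and only the deadline check plus the early-return accounting
being new:

/*
 * Sketch only: run the deadline check after every chunk consumed from
 * the skb (PDU header, data, digest), so a large data transfer gets
 * interrupted as well, not only complete PDUs.
 */
static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
			     unsigned int offset, size_t len)
{
	struct nvme_tcp_queue *queue = desc->arg.data;
	size_t consumed = len;
	int result;

	while (len) {
		switch (nvme_tcp_recv_state(queue)) {
		case NVME_TCP_RECV_PDU:
			result = nvme_tcp_recv_pdu(queue, skb, &offset, &len);
			break;
		case NVME_TCP_RECV_DATA:
			result = nvme_tcp_recv_data(queue, skb, &offset, &len);
			break;
		case NVME_TCP_RECV_DDGST:
			result = nvme_tcp_recv_ddgst(queue, skb, &offset, &len);
			break;
		default:
			result = -EFAULT;
		}
		if (result) {
			/* error handling elided */
			nvme_tcp_error_recovery(&queue->ctrl->ctrl);
			return result;
		}
		/* new: stop mid data transfer once the budget is gone */
		if (time_after(jiffies, queue->deadline)) {
			desc->count = 0;
			break;
		}
	}
	/* report back only what was actually consumed */
	return consumed - len;
}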
[ .. ]
>>>
>>> I don't see why you need to keep this in the queue struct. You could
>>> have
>>> easily initialize it in the read_descriptor_t and test against it.
>>>
>> Because I wanted to interrupt the receive side for large data
>> transfers, and haven't found a way to pass the deadline into
>> ->read_sock().
>
> desc.count is for you to interpret however you want it to be no?
>
Except for the magic 'desc.count == 0' value; but yeah, that should work.
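Something like this, then; a sketch only, where the budget value and how
nvme_tcp_recv_skb() decrements the remaining count are still open:

/*
 * Sketch: use rd_desc.count as a receive budget instead of the current
 * "do we want to read at all" flag.  tcp_read_sock() stops looping as
 * soon as desc->count reaches 0, which is the magic value mentioned
 * above, so the budget has to start out non-zero.
 */
static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue, size_t budget)
{
	struct socket *sock = queue->sock;
	struct sock *sk = sock->sk;
	read_descriptor_t rd_desc;
	int consumed;

	rd_desc.arg.data = queue;
	rd_desc.count = budget;		/* e.g. 64k bytes, or a PDU count */
	lock_sock(sk);
	queue->nr_cqe = 0;
	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
	release_sock(sk);
	return consumed;
}

nvme_tcp_recv_skb() would then subtract whatever it consumed from
desc->count and stop once it reaches zero, instead of comparing jiffies
against queue->deadline.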
>>
>>>> /* send state */
>>>> struct nvme_tcp_request *request;
>>>> @@ -359,14 +360,18 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
>>>> }
>>>> }
>>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>>> + unsigned long deadline)
>>>> {
>>>> int ret;
>>>> /* drain the send queue as much as we can... */
>>>> do {
>>>> ret = nvme_tcp_try_send(queue);
>>>> + if (time_after(jiffies, deadline))
>>>> + break;
>>>> } while (ret > 0);
>>>> + return ret;
>>>
>>> I think you want a different interface, nvme_tcp_send_budgeted(queue,
>>> budget).
>>> I don't know what you pass here, but jiffies is a rather large
>>> granularity...
>>>
>> Hmm. I've been using jiffies as the io_work loop had been counting
>> in jiffies. But let me check what would happen if I move that over to
>> PDU / size counting.
>
> Cool.
>
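For reference, I read that as something like the following (the name
nvme_tcp_send_budgeted() and the counting of send passes are my
interpretation, not existing code):

/*
 * Sketch of a budgeted send path: bound the drain loop by a number of
 * send attempts (roughly PDUs) instead of a jiffies deadline.
 */
static inline int nvme_tcp_send_budgeted(struct nvme_tcp_queue *queue,
					 unsigned int budget)
{
	int ret;

	/* drain the send queue, but no further than 'budget' passes */
	do {
		ret = nvme_tcp_try_send(queue);
	} while (ret > 0 && --budget > 0);

	return ret;
}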
>>
>>>> }
>>>> static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
>>>> @@ -385,6 +390,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>> bool sync, bool last)
>>>> {
>>>> struct nvme_tcp_queue *queue = req->queue;
>>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>> bool empty;
>>>> empty = llist_add(&req->lentry, &queue->req_list) &&
>>>> @@ -397,7 +403,7 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>> */
>>>> if (queue->io_cpu == raw_smp_processor_id() &&
>>>> sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>> - nvme_tcp_send_all(queue);
>>>> + nvme_tcp_send_all(queue, deadline);
>>>> mutex_unlock(&queue->send_mutex);
>>>> }
>>>
>>> Umm, spend up to a millisecond in queue_request? Sounds like way
>>> too much...
>>> Did you ever see this deadline exceeded? sends should be rather quick...
>>>
>> Of _course_ it's too long. That's kinda the point.
>> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
>> so _that_ is the least of my worries ...
>
> This has nothing to do with the network latency afaict. This is just a
> matter of how long does it take to write to the socket buffer (or in our case,
> construct skbs and add them to the socket as TX is zero copy).
>
Not the transfer itself? I would have thought so ...
>>
>>>> @@ -959,9 +965,14 @@ static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>> return result;
>>>> }
>>>> + if (time_after(jiffies, queue->deadline)) {
>>>> + desc->count = 0;
>>>> + break;
>>>> + }
>>>> +
>>>
>>> That is still not right.
>>> You don't want to spend a full deadline reading from the socket, and
>>> then spend a full deadline writing to the socket...
>>>
>> Yes, and no.
>> Problem is that the current code serializes writes and reads.
>> Essentially, for each iteration we first do a write, and then a read.
>> If we spend the full deadline on write we will need to reschedule,
>> but we then _again_ start with writes. This leads to a heavy preference
>> for writing, and negative performance impact.
>
> Yes, the assumption we should take is that the tx/rx limits allow us to
> switch between the two at least once or twice in the io_work while loop,
> and that they get a fair-ish share of the quota.
>
>>
>>> You want the io_work to take a full deadline, and send budgets of
>>> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
>>> 64k bytes. We want to get to some magic value that presents a
>>> sane behavior, that confidently fits inside a deadline, and is fair.
>>>
>> Easier said than done.
>> Biggest problem is that most of the latency increase comes from the
>> actual 'sendmsg()' and 'read_sock()' calls.
>
> This is a bit surprising, because read_sock reads data that has already
> been processed by the TCP stack and is ready to simply be copied back to
> the user pages. The same goes for sendmsg: the socket is non-blocking, so
> I would not expect it to stall for unusually long periods. Maybe there is
> an issue hiding in how we call send/recv?
>
Hmm. That could be.
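Regardless, this is roughly the io_work structure I want to try next:
per-iteration budgets for both directions, so the loop switches between
tx and rx a few times within the deadline. The budget values are made
up, and nvme_tcp_send_budgeted() / the budgeted nvme_tcp_try_recv() are
the sketches from above, not actual driver code:

static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;
		int result;

		if (mutex_trylock(&queue->send_mutex)) {
			/* e.g. at most 8 send passes per iteration */
			result = nvme_tcp_send_budgeted(queue, 8);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
			else if (unlikely(result < 0))
				break;
		}

		/* e.g. at most 64k received bytes per iteration */
		result = nvme_tcp_try_recv(queue, SZ_64K);
		if (result > 0)
			pending = true;
		else if (unlikely(result < 0))
			return;

		if (!pending || !queue->rd_enabled)
			return;
	} while (!time_after(jiffies, deadline)); /* quota is exhausted */

	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}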
>> And the only way of inhibiting that would be to check _prior_ whether
>> we can issue the call in the first place.
>> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).
>
> The socket is non-blocking. I always thought that SOCK_NOSPACE serves
> blocking sockets... AFAIK, there shouldn't be a real difference between
> checking NOSPACE vs. attempting sendmsg on a non-blocking socket... But
> I could be wrong here.
>
Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we even
get the 'write_space()' callback?
It gets triggered quite often, and checking for the SOCK_NOSPACE bit
before sending drops the number of invocations quite significantly.
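For reference, the check in the earlier patchset was roughly this (the
helper name is made up, it just illustrates the idea):

/*
 * Sketch of the SOCK_NOSPACE test: skip the sendmsg() attempt when the
 * socket has already signalled that it is out of space, and let the
 * write_space() callback reschedule io_work once room is available.
 */
static bool nvme_tcp_queue_writeable(struct nvme_tcp_queue *queue)
{
	struct sock *sk = queue->sock->sk;

	if (test_bit(SOCK_NOSPACE, &sk->sk_socket->flags) &&
	    !sk_stream_is_writeable(sk))
		return false;

	return true;
}

Whether that is actually needed for a non-blocking socket is exactly the
question above; it mainly avoids going into sendmsg() just to get
-EAGAIN back.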
[ .. ]
>>>> break;
>>>> }
>>>> - result = nvme_tcp_try_recv(queue);
>>>> + result = nvme_tcp_try_recv(queue, rx_deadline);
>>>
>>> I think you want a more frequent alternation of sends/receives. The
>>> granularity of a 1ms budget may be too coarse?
>>>
>> The problem is not the granularity, the problem is the latency spikes
>> I'm seeing when issuing 'sendmsg' or 'read_sock'.
>
> I think it's possible that there is another underlying issue, because
> these calls should NOT block for a long period of time on non-blocking
> sockets.
>
I did some instrumentation, and that certainly pointed in that
direction. But I will have to redo it to get some meaningful data here.
Another point really might be the granularity; jiffies isn't a good
measurement here (even with HZ=1000 we only get millisecond granularity,
so it might be that we're incurring fluctuations from the jiffies
accounting itself).
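If granularity really is part of the problem, one option would be to
move the budget over to ktime; just a sketch, with a made-up 500us
value:

	/* in nvme_tcp_io_work(): ns-granularity budget instead of jiffies */
	ktime_t deadline = ktime_add_us(ktime_get(), 500);	/* 500us, made up */

	do {
		/* ... send/recv passes as before ... */
	} while (ktime_before(ktime_get(), deadline));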
>>
>>>> if (result > 0)
>>>> pending = true;
>>>> else if (unlikely(result < 0))
>>>> @@ -1302,7 +1315,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>>> if (!pending || !queue->rd_enabled)
>>>> return;
>>>> - } while (!time_after(jiffies, deadline)); /* quota is exhausted */
>>>> + } while (!time_after(jiffies, rx_deadline)); /* quota is exhausted */
>>>> +
>>>> + overrun = jiffies - rx_deadline;
>>>> + if (nvme_tcp_queue_id(queue) > 0 &&
>>>> + overrun > msecs_to_jiffies(10))
>>>> + dev_dbg(queue->ctrl->ctrl.device, "queue %d: queue stall (%u msecs)\n",
>>>> + nvme_tcp_queue_id(queue), jiffies_to_msecs(overrun));
>>>
>>> Umm, ok. why 10? why not 2? or 3?
>>> Do you expect io_work to spend more time executing?
>>>
>> Eg 4000 msecs like on my testbed?
>
> If you see ANYTHING in the IO path stalling for 4 seconds then we have a
> bigger issue here.
>
You don't say ...
>> Yes.
>>
>>>> queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>>> }
>>>> @@ -2666,6 +2685,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>>> {
>>>> struct nvme_tcp_queue *queue = hctx->driver_data;
>>>> struct sock *sk = queue->sock->sk;
>>>> + unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>> if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
>>>> return 0;
>>>> @@ -2673,7 +2693,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>>>> set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>>> if (sk_can_busy_loop(sk) &&
>>>> skb_queue_empty_lockless(&sk->sk_receive_queue))
>>>> sk_busy_loop(sk, true);
>>>> - nvme_tcp_try_recv(queue);
>>>> + nvme_tcp_try_recv(queue, deadline);
>>>
>>> spend a millisecond in nvme_tcp_poll() ??
>>> Isn't it too long?
>> Haven't tried with polling, so can't honestly answer.
>>
>> In the end, it all boils down to numbers.
>> I have 2 controllers with 32 queues and 256 requests.
>> Running on a 10GigE link one should be getting a net throughput
>> of 1GB/s, or, with 4k requests, 256k IOPS.
>> Or a latency of 0.25 milliseconds per command.
>> In the optimal case. As I'm seeing a bandwidth of around
>> 500MB/s, I'm looking at a latency of 0.5 milliseconds _in the ideal case_.
>>
>> So no, I don't think the 1 milliseconds is too long.
>
> nvme_tcp_poll is NOT designed to poll for the entirety of an individual
> IO lifetime. It's a non-selective polling interface wired to io_uring.
> It should peek the socket, consume whatever is there, and break. It most
> definitely should not spend 1ms in this check...
Oh, I know that.
That's why I said: "Haven't tried with polling."
The calculation is independent of polling.
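To spell out the arithmetic: roughly 1GB/s of payload over 10GigE
divided by 4k per request gives the ~256k IOPS; spread over 2
controllers times 32 queues that is about 4k IOPS, i.e. 0.25
milliseconds per command, per queue. At the ~500MB/s I'm actually
seeing, that doubles to the 0.5 milliseconds quoted above.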
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich