[PATCH 1/3] nvme-tcp: improve rx/tx fairness

Mon Jul 8 12:31:48 PDT 2024

On 08/07/2024 18:50, Hannes Reinecke wrote:
> On 7/8/24 16:25, Sagi Grimberg wrote:
>>
>>
>> On 08/07/2024 16:21, Hannes Reinecke wrote:
>>> On 7/8/24 13:57, Sagi Grimberg wrote:
>>>> Hey Hannes, thanks for doing this.
>>>>
> [ .. ]
>>>> That is a significant decline in relative perf. I'm counting 12.5%...
>>>> Can you explain why that is?
>>>>
>>> Not really :-(
>>> But then fairness cuts both ways; so I am not surprised that some 
>>> workloads suffer here.
>>
>> Well, 12.5% is rather steep, don't you agree?
>> I'm even more surprised that seq/rand read make a difference...
>>
> Yeah; but I guess I found the issue now (we should interrupt data 
> transfers if the deadline expires, not just pdu transfers).
>
> [ .. ]
>>>>
>>>> I don't see why you need to keep this in the queue struct. You 
>>>> could have
>>>> easily initialize it in the read_descriptor_t and test against it.
>>>>
>>> Because I wanted to interrupt the receive side for large data 
>>> transfers, and haven't found a way to pass the deadline into 
>>> ->read_sock().
>>
>> desc.count is for you to interpret however you want it to be no?
>>
> Except for the magic 'desc.count == 0' value; but yeah, that should work.
>
>>>
>>>>>       /* send state */
>>>>>       struct nvme_tcp_request *request;
>>>>> @@ -359,14 +360,18 @@ static inline void 
>>>>> nvme_tcp_advance_req(struct nvme_tcp_request *req,
>>>>>       }
>>>>>   }
>>>>> -static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>>>>> +static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue,
>>>>> +                    unsigned long deadline)
>>>>>   {
>>>>>       int ret;
>>>>>       /* drain the send queue as much as we can... */
>>>>>       do {
>>>>>           ret = nvme_tcp_try_send(queue);
>>>>> +        if (time_after(jiffies, deadline))
>>>>> +            break;
>>>>>       } while (ret > 0);
>>>>> +    return ret;
>>>>
>>>> I think you want a different interface, 
>>>> nvme_tcp_send_budgeted(queue, budget).
>>>> I don't know what you pass here, but jiffies is a rather large 
>>>> granularity...
>>>>
>>> Hmm. I've been using jiffies as the io_work loop had been counting
>>> in jiffies. But let me check what would happen if I move that over to
>>> PDU / size counting.
>>
>> Cool.
>>
>>>
>>>>>   }
>>>>>   static inline bool nvme_tcp_queue_has_pending(struct 
>>>>> nvme_tcp_queue *queue)
>>>>> @@ -385,6 +390,7 @@ static inline void 
>>>>> nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>>>           bool sync, bool last)
>>>>>   {
>>>>>       struct nvme_tcp_queue *queue = req->queue;
>>>>> +    unsigned long deadline = jiffies + msecs_to_jiffies(1);
>>>>>       bool empty;
>>>>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>>>>> @@ -397,7 +403,7 @@ static inline void 
>>>>> nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>>>>        */
>>>>>       if (queue->io_cpu == raw_smp_processor_id() &&
>>>>>           sync && empty && mutex_trylock(&queue->send_mutex)) {
>>>>> -        nvme_tcp_send_all(queue);
>>>>> +        nvme_tcp_send_all(queue, deadline);
>>>>>           mutex_unlock(&queue->send_mutex);
>>>>>       }
>>>>
>>>> Umm, spend up to a millisecond in in queue_request ? Sounds like 
>>>> way too much...
>>>> Did you ever see this deadline exceeded? sends should be rather 
>>>> quick...
>>>>
>>> Of _course_ it's too long. That's kinda the point.
>>> But I'm seeing network latency up to 4000 msecs (!) on my test setup,
>>> so _that_ is the least of my worries ...
>>
>> This has nothing to do with the network latency afaict. This is just 
>> a matter of how long does it take to write to the socket buffer (or 
>> in our case, construct skbs and add them to the socket as TX is zero 
>> copy).
>>
> Not the transfer itself? I would have thought so ...

It may trigger push packets down to [tcp/ip]_output under some conditions,
but that is also not something that should block for long.

>
>>>
>>>>> @@ -959,9 +965,14 @@ static int 
>>>>> nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>>>> nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>>>               return result;
>>>>>           }
>>>>> +        if (time_after(jiffies, queue->deadline)) {
>>>>> +            desc->count = 0;
>>>>> +            break;
>>>>> +        }
>>>>> +
>>>>
>>>> That is still not right.
>>>> You don't want to spend a full deadline reading from the socket, 
>>>> and then spend a full deadline writing to the socket...
>>>>
>>> Yes, and no.
>>> Problem is that the current code serializes writes  and reads.
>>> Essentially for each iteration we first to a write, and then a read.
>>> If we spend the full deadline on write we will need to reschedule,
>>> but we then _again_ start with writes. This leads to a heavy preference
>>> for writing, and negative performance impact.
>>
>> Yes, the assumption we should take is that the tx/rx limits allow to 
>> switch between the
>> two at least once or twice in the io_work while loop, and that they 
>> get a fair'ish share
>> of quota.
>>
>>>
>>>> You want the io_work to take a full deadline, and send budgets of 
>>>> try_send and try_recv. And set it to sane counts. Say 8 pdus, or
>>>> 64k bytes. We want to get to some magic value that presents a
>>>> sane behavior, that confidently fits inside a deadline, and is fair.
>>>>
>>> Easier said than done.
>>> Biggest problem is that most of the latency increase comes from the
>>> actual 'sendmsg()' and 'read_sock()' calls.
>>
>> This is a bit surprising. because read_sock reads data that is already
>> processed by the tcp stack, and ready to simply copy back to the user
>> pages. Same for sendmsg, the socket is non-blocking, so I would not
>> expect it to stall for unusually long periods. Maybe there is an issue
>> hiding in how we call send/recv?
>>
> Hmm. That could be.
>
>>> And the only way of inhibiting that would be to check _prior_ whether
>>> we can issue the call in the first place.
>>> (That's why I did the SOCK_NOSPACE tests in the previous patchsets).
>>
>> The socket is non-blocking. I always thought that SOCK_NOSPACE is 
>> serving blocking
>> sockets... afaik, there shouldn't be a real difference between 
>> checking NOSPACE vs.
>> attempting sendmsg on a non-blocking socket... But I could be wrong 
>> here..
>>
>
> Weellll ... if 'SOCK_NOSPACE' is for blocking sockets, why do we even 
> get the 'write_space()' callback?
> It gets triggered quite often, and checking for the SOCK_NOSPACE bit 
> before sending drops the number of invocations quite significantly.

I need to check, but where have you testing it? in the inline path?
Thinking further, I do agree that because the io_work is shared, we may
sendmsg immediately as space is becomes available instead of some 
minimum space.

Can you please quantify this with your testing?
How many times we get write_space() callback? how many times we get 
EAGAIN, and what
is the perf... Also interesting if this is more apparent in specific 
workloads. I'm assuming its
more apparent with large write workloads.