[PATCH 2/4] nvme-tcp: align I/O cpu with blk-mq mapping

Sagi Grimberg sagi at grimberg.me
Wed Jul 3 12:38:43 PDT 2024



>>
>>
>> On 03/07/2024 17:53, Hannes Reinecke wrote:
>>> On 7/3/24 16:19, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 03/07/2024 16:50, Hannes Reinecke wrote:
>>>>> When 'wq_unbound' is selected we should select the
>>>>> first CPU from a given blk-mq hctx mapping to queue
>>>>> the tcp workqueue item. With this we can instruct the
>>>>> workqueue code to keep the I/O affinity and avoid
>>>>> a performance penalty.
>>>>
>>>> wq_unbound is designed to keep io_cpu UNBOUND; my recollection
>>>> was that the person introducing it was trying to make the io_cpu 
>>>> always be on a specific NUMA node, or a subset of cpus within a 
>>>> numa node. So he uses that and tinkers with wq cpumask via sysfs.
>>>>
>>>> I don't see why you are tying this to wq_unbound in the first place.
>>>>
>>> Because in the default case the workqueue is nailed to a cpu, and 
>>> will not move from it. I.e., if you call 'queue_work_on()' it _will_ 
>>> run on that cpu.
>>> But if something else is running on that CPU (printk logging, say), 
>>> you will have to stand in the queue until the scheduler gives you 
>>> some time.
>>>
>>> If the workqueue is unbound the workqueue code is able to switch 
>>> away from the cpu if it finds it busy or otherwise unsuitable, 
>>> leading to a better utilization and avoiding a workqueue stall.
>>> And in the 'unbound' case the 'cpu' argument merely serves as a hint
>>> where to place the workqueue item.
>>> At least, that's how I understood the code.
>>
>> We should make the io_cpu come from blk-mq hctx mapping by default, 
>> and for every controller it should use a different cpu from the hctx 
>> mapping. That is the default behavior. In the wq_unbound case, we 
>> skip all of that and make io_cpu = WORK_CPU_UNBOUND, as it was before.
>>
>> I'm not sure I follow your logic.
>>
> Hehe. That's quite simple: there is none :-)
> I have been tinkering with that approach over the last few weeks, but got 
> consistently _worse_ results than with the original implementation.
> So I gave up on trying to make that the default.

What is the "original implementation" ?
What is you target? nvmet?
What is the fio job file you are using?
what is the queue count? controller count?
What was the queue mapping?

Please let's NOT condition any of this on the wq_unbound option at this 
point. This modparam was introduced to address
a specific issue. If we see I/O timeouts, we should fix them, not tell 
people to flip a modparam as a solution.

>
>>>
>>> And it makes the 'CPU hogged' messages go away, which is a bonus in 
>>> itself...
>>
>> Which messages? aren't these messages saying that the work spent too 
>> much time? why are you describing the case where the work does not get
>> cpu quota to run?
>
> I mean these messages:
>
> workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 32771 
> times, consider switching to WQ_UNBOUND

That means that we are spending too much time in io_work. This is a 
separate bug. If you look at nvme_tcp_io_work, it has
a stop condition after 1 millisecond. However, when we call 
nvme_tcp_try_recv() it just keeps receiving from the socket until
the socket receive buffer has no more payload. So in theory nothing 
prevents io_work from looping there forever.
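
For reference, the structure in question looks roughly like this (a
paraphrased sketch of drivers/nvme/host/tcp.c, not a verbatim copy;
details differ between kernel versions):

/* paraphrased sketch, trimmed to the part relevant here */
static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;

		/* ... nvme_tcp_try_send() handling elided ... */

		/*
		 * nvme_tcp_try_recv() hands ->read_sock() a rd_desc with
		 * count = 1 and never decrements it, so tcp_read_sock()
		 * only returns once the socket receive queue is empty.
		 * The 1ms deadline below is never consulted while we sit
		 * inside this one call.
		 */
		if (nvme_tcp_try_recv(queue) > 0)
			pending = true;

		if (!pending)
			return;
	} while (!time_after(jiffies, deadline));

	/* out of quota with work still pending: self-requeue */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}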

This is indeed a bug that we need to address, probably by setting 
rd_desc.count to some limit, decrementing it for every
skb that we consume, and, if we hit that limit while more skbs are 
still pending, breaking out and self-requeueing.
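
Something along these lines, just to illustrate the idea (a sketch only;
NVME_TCP_RECV_BUDGET is a made-up name and value, and the matching
decrement would live in nvme_tcp_recv_skb()):

/* hypothetical per-invocation skb budget; name and value are placeholders */
#define NVME_TCP_RECV_BUDGET	16

static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
{
	struct socket *sock = queue->sock;
	struct sock *sk = sock->sk;
	read_descriptor_t rd_desc;
	int consumed;

	rd_desc.arg.data = queue;
	/* tcp_read_sock() stops walking the receive queue once count hits 0 */
	rd_desc.count = NVME_TCP_RECV_BUDGET;
	lock_sock(sk);
	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
	release_sock(sk);
	return consumed;
}

/*
 * ... and in nvme_tcp_recv_skb(), after the existing PDU processing for
 * the skb:
 *
 *	if (desc->count)
 *		desc->count--;
 *
 * With a backlogged socket, io_work then sees the receive as still
 * "pending" and self-requeues (or hits its deadline and requeues),
 * instead of looping inside ->read_sock() until the peer stops sending.
 */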

If we indeed spend this much time processing a single queue in io_work, 
it is possible that we have a starvation problem
that is escalating into the timeouts you are seeing.

>
> which I get consistently during testing with the default implementation.

Hannes, let's please separate this specific issue from the performance 
enhancements.
I do not think that we should search for performance enhancements to 
address what appears
to be a logical starvation issue.


