[PATCH 2/4] nvme-tcp: align I/O cpu with blk-mq mapping

Hannes Reinecke hare at suse.de
Wed Jul 3 08:40:55 PDT 2024


On 7/3/24 17:03, Sagi Grimberg wrote:
> 
> 
> On 03/07/2024 17:53, Hannes Reinecke wrote:
>> On 7/3/24 16:19, Sagi Grimberg wrote:
>>>
>>>
>>> On 03/07/2024 16:50, Hannes Reinecke wrote:
>>>> When 'wq_unbound' is selected we should select the
>>>> first CPU from a given blk-mq hctx mapping to queue
>>>> the tcp workqueue item. With this we can instruct the
>>>> workqueue code to keep the I/O affinity and avoid
>>>> a performance penalty.
>>>
>>> wq_unbound is designed to keep io_cpu UNBOUND; my recollection
>>> was that the person introducing it was trying to make the io_cpu
>>> always be on a specific NUMA node, or a subset of cpus within a
>>> numa node. So he uses that and tinkers with the wq cpumask via sysfs.
>>>
>>> I don't see why you are tying this to wq_unbound in the first place.
>>>
>> Because in the default case the workqueue is nailed to a CPU and will
>> not move from it, i.e. if you call 'queue_work_on()' it _will_ run on
>> that CPU.
>> But if something else is running on that CPU (printk logging, say),
>> you will have to stand in the queue until the scheduler gives you some
>> time.
>>
>> If the workqueue is unbound, the workqueue code is able to switch away
>> from the CPU if it finds it busy or otherwise unsuitable, leading to
>> better utilization and avoiding a workqueue stall.
>> And in the 'unbound' case the 'cpu' argument merely serves as a hint
>> for where to place the workqueue item.
>> At least, that's how I understood the code.
> 
> We should make the io_cpu come from the blk-mq hctx mapping by default, 
> and for every controller it should use a different cpu from the hctx 
> mapping. That is the default behavior. In the wq_unbound case, we skip 
> all of that and make io_cpu = WORK_CPU_UNBOUND, as it was before.
> 
> I'm not sure I follow your logic.
> 
Hehe. That's quite simple: there is none :-)
I have been tinkering with that approach over the last few weeks, but got
consistently _worse_ results than with the original implementation.
So I gave up on trying to make that the default.
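
To make this concrete, what the patch does in the wq_unbound case is
roughly the following. This is a sketch from memory, not the exact diff;
the field and helper names follow drivers/nvme/host/tcp.c and may differ
from the posted patch, and the separate read/poll queue maps are glossed
over here:

/*
 * Sketch: when 'wq_unbound' is set, pick the first online CPU that
 * blk-mq maps to this queue's hctx and use it as a placement hint
 * for the unbound io workqueue.
 */
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
	struct blk_mq_tag_set *set = queue->ctrl->ctrl.tagset;
	int hctx_idx = nvme_tcp_queue_id(queue) - 1;
	int cpu;

	if (!wq_unbound)
		return;	/* existing round-robin placement, elided here */

	/* first CPU in the blk-mq mapping of this hctx */
	for_each_online_cpu(cpu) {
		if (set->map[HCTX_TYPE_DEFAULT].mq_map[cpu] == hctx_idx) {
			queue->io_cpu = cpu;
			return;
		}
	}
	queue->io_cpu = WORK_CPU_UNBOUND;	/* no online CPU mapped */
}

Since the workqueue is unbound, io_cpu is only a hint, but it keeps the
work close to the CPUs of the hctx mapping in the common case.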

>>
>> And it makes the 'CPU hogged' messages go away, which is a bonus in 
>> itself...
> 
> Which messages? Aren't these messages saying that the work spent too 
> much time? Why are you describing the case where the work does not get
> CPU quota to run?

I mean these messages:

workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 32771 
times, consider switching to WQ_UNBOUND

which I get consistently during testing with the default implementation.
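
For reference, the interaction I am talking about is basically this; the
workqueue flags are from memory, so take it as a sketch rather than the
exact driver code:

	/* module init: with wq_unbound the io workqueue is created unbound */
	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
			wq_unbound ? WQ_UNBOUND | WQ_MEM_RECLAIM :
				     WQ_HIGHPRI | WQ_MEM_RECLAIM, 0);

	/*
	 * data path: on a bound workqueue the first argument is a strict
	 * placement, on a WQ_UNBOUND workqueue it is merely a preference
	 * and the work can be moved if that CPU is busy.
	 */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

With the unbound workqueue the hogged-CPU warning goes away, and the
blk-mq-derived io_cpu still keeps the work close to the submitting CPUs.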

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



