[PATCH 2/4] nvme-tcp: align I/O cpu with blk-mq mapping
Hannes Reinecke
hare at suse.de
Wed Jul 3 08:40:55 PDT 2024
On 7/3/24 17:03, Sagi Grimberg wrote:
>
>
> On 03/07/2024 17:53, Hannes Reinecke wrote:
>> On 7/3/24 16:19, Sagi Grimberg wrote:
>>>
>>>
>>> On 03/07/2024 16:50, Hannes Reinecke wrote:
>>>> When 'wq_unbound' is selected we should select the first CPU
>>>> from a given blk-mq hctx mapping to queue the tcp workqueue item.
>>>> With this we can instruct the workqueue code to keep the I/O
>>>> affinity and avoid a performance penalty.
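Just to make it concrete, a rough, untested sketch of what "first CPU
from a given blk-mq hctx mapping" means -- not the literal patch; the
tag_set is passed in explicitly only to keep the example short, and
only the default map is considered:

static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue,
                                      struct blk_mq_tag_set *set)
{
        /* only the default map; read/poll maps omitted for brevity */
        struct blk_mq_queue_map *map = &set->map[HCTX_TYPE_DEFAULT];
        int qid = nvme_tcp_queue_id(queue) - 1;    /* hctx index */
        int cpu;

        /* keep the old behaviour if nothing matches */
        queue->io_cpu = WORK_CPU_UNBOUND;

        /* take the first online CPU that blk-mq mapped to this hctx */
        for_each_online_cpu(cpu) {
                if (map->mq_map[cpu] == qid) {
                        queue->io_cpu = cpu;
                        break;
                }
        }
}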
>>>
>>> wq_unbound is designed to keep io_cpu UNBOUND; my recollection is
>>> that the person introducing it was trying to make the io_cpu always
>>> be on a specific NUMA node, or a subset of cpus within a NUMA node.
>>> So he uses that and tinkers with the wq cpumask via sysfs.
>>>
>>> I don't see why you are tying this to wq_unbound in the first place.
>>>
>> Because in the default case the workqueue is nailed to a cpu, and will
>> not move from it. I.e. if you call 'queue_work_on()' it _will_ run on
>> that cpu.
>> But if something else is running on that CPU (printk logging, say),
>> you will have to stand in the queue until the scheduler gives you some
>> time.
>>
>> If the workqueue is unbound, the workqueue code is able to switch away
>> from the cpu if it finds it busy or otherwise unsuitable, leading to
>> better utilization and avoiding a workqueue stall.
>> And in the 'unbound' case the 'cpu' argument merely serves as a hint
>> where to place the workqueue item.
>> At least, that's how I understood the code.
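In code terms, the difference is roughly this (simplified sketch, not
the exact flags or error handling nvme-tcp uses):

static struct workqueue_struct *nvme_tcp_wq;

static int nvme_tcp_alloc_wq(bool wq_unbound)
{
        /*
         * With a per-cpu (bound) workqueue the work item is pinned to
         * the CPU passed to queue_work_on(); with WQ_UNBOUND that CPU
         * is only a locality hint and the worker may be moved if the
         * hinted CPU is busy.
         */
        nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
                        wq_unbound ? WQ_UNBOUND | WQ_MEM_RECLAIM :
                                     WQ_MEM_RECLAIM, 0);
        return nvme_tcp_wq ? 0 : -ENOMEM;
}

/* later, whenever a queue has work to do: */
queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);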
>
> We should make the io_cpu come from the blk-mq hctx mapping by default,
> and for every controller it should use a different cpu from the hctx
> mapping. That is the default behavior. In the wq_unbound case, we skip
> all of that and make io_cpu = WORK_CPU_UNBOUND, as it was before.
>
> I'm not sure I follow your logic.
>
Hehe. That's quite simple: there is none :-)
I have been tinkering with that approach over the last few weeks, but
got consistently _worse_ results than with the original implementation.
So I gave up on trying to make that the default.
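To illustrate what I mean by 'that approach': a hypothetical, untested
sketch of spreading queues over the CPUs of an hctx instead of always
taking the first one (not the exact code I was testing):

/*
 * Rotate over the CPUs blk-mq mapped to an hctx, so queues of
 * different controllers land on different CPUs of the same hctx.
 */
static atomic_t nvme_tcp_cpu_rr = ATOMIC_INIT(0);

static int nvme_tcp_pick_io_cpu(const struct cpumask *hctx_mask)
{
        int nr = cpumask_weight(hctx_mask);
        int n, cpu;

        if (!nr)
                return WORK_CPU_UNBOUND;

        /* every caller gets the next CPU of this hctx, round-robin */
        n = atomic_fetch_inc(&nvme_tcp_cpu_rr) % nr;
        for_each_cpu(cpu, hctx_mask)
                if (n-- == 0)
                        return cpu;
        return WORK_CPU_UNBOUND;
}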
>>
>> And it makes the 'CPU hogged' messages go away, which is a bonus in
>> itself...
>
> Which messages? Aren't these messages saying that the work spent too
> much time? Why are you describing the case where the work does not get
> cpu quota to run?
I mean these messages:
workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 32771
times, consider switching to WQ_UNBOUND
which I get consistently during testing with the default implementation.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich