[PATCH 4/4] nvme-tcp: switch to 'cpu' affinity scope for unbound workqueues

Hannes Reinecke hare at suse.de
Wed Jul 3 08:50:41 PDT 2024


On 7/3/24 17:09, Sagi Grimberg wrote:
> 
> 
> On 03/07/2024 18:01, Hannes Reinecke wrote:
>> On 7/3/24 16:22, Sagi Grimberg wrote:
>>>
>>>
>>> On 03/07/2024 16:50, Hannes Reinecke wrote:
>>>> We should switch to the 'cpu' affinity scope when using the 
>>>> 'wq_unbound'
>>>> parameter as this allows us to keep I/O locality and improve 
>>>> performance.
>>>
>>> Can you please describe more why this is better? locality between what?
>>>
>> Well, the default unbound scope is 'cache', which groups the CPUs 
>> according to the cache hierarchy. I want the cpu locality of the 
>> workqueue items to be preserved as much as possible, so I switched
>> to 'cpu' here.
>>
>> I'll get some performance numbers.
>>
>>> While you mention in your cover letter "comments and reviews are 
>>> welcome"
>>> The change logs in your patches are not designed to assist your 
>>> reviewer.
>>
>> I spent the last few weeks trying to come up with a solution based on my
>> original submission, but in the end I gave up as I hadn't been able to
>> fix the original issue.
> 
> Well, the last submission was a discombobulated set of mostly unrelated 
> patches...
> What was it that did not work?
> 
>> This here is a different approach by massaging the 'wq_unbound' 
>> mechanism, which is not only easier but also has the big advantage that
>> it actually works :-)
>> So I did not include a changelog against the previous patchset as this is a 
>> pretty different approach.
>> Sorry if this is confusing.
> 
> It's just difficult to try and understand what each patch contributes, 
> and most of the time the patches
> are under-documented. I want to see the improvements added, but I also 
> want them to be properly reviewed.

Sure. So here are some performance numbers:
(One subsystem, two paths, 96 queues)
default:
4k seq read: bw=365MiB/s (383MB/s), 11.4MiB/s-20.5MiB/s (11.0MB/s-21.5MB/s), io=16.0GiB (17.2GB), run=24950-44907msec
4k rand read: bw=307MiB/s (322MB/s), 9830KiB/s-13.8MiB/s (10.1MB/s-14.5MB/s), io=16.0GiB (17.2GB), run=37081-53333msec
4k seq write: bw=550MiB/s (577MB/s), 17.2MiB/s-28.7MiB/s (18.0MB/s-30.1MB/s), io=16.0GiB (17.2GB), run=17859-29786msec
4k rand write: bw=453MiB/s (475MB/s), 14.2MiB/s-21.3MiB/s (14.8MB/s-22.3MB/s), io=16.0GiB (17.2GB), run=24066-36161msec

unbound:
4k seq read: bw=232MiB/s (243MB/s), 6145KiB/s-9249KiB/s (6293kB/s-9471kB/s), io=13.6GiB (14.6GB), run=56685-60074msec
4k rand read: bw=249MiB/s (261MB/s), 6335KiB/s-9713KiB/s (6487kB/s-9946kB/s), io=14.6GiB (15.7GB), run=53976-60019msec
4k seq write: bw=358MiB/s (375MB/s), 11.2MiB/s-13.5MiB/s (11.7MB/s-14.2MB/s), io=16.0GiB (17.2GB), run=37918-45779msec
4k rand write: bw=335MiB/s (351MB/s), 10.5MiB/s-14.7MiB/s (10.0MB/s-15.4MB/s), io=16.0GiB (17.2GB), run=34929-48971msec

unbound + 'cpu' affinity:
4k seq read: bw=249MiB/s (261MB/s), 6003KiB/s-13.6MiB/s (6147kB/s-14.3MB/s), io=14.6GiB (15.7GB), run=37636-60065msec
4k rand read: bw=305MiB/s (320MB/s), 9773KiB/s-13.9MiB/s (10.0MB/s-14.6MB/s), io=16.0GiB (17.2GB), run=36791-53644msec
4k seq write: bw=499MiB/s (523MB/s), 15.6MiB/s-18.0MiB/s (16.3MB/s-19.9MB/s), io=16.0GiB (17.2GB), run=27018-32860msec
4k rand write: bw=536MiB/s (562MB/s), 16.7MiB/s-21.1MiB/s (17.6MB/s-22.1MB/s), io=16.0GiB (17.2GB), run=24305-30588msec

As you can see, with unbound and 'cpu' affinity we are basically on par
with the default implementation (all tests were run with per-controller
workqueues, mind).
Running the same workload with 4 subsystems and 8 paths runs into
I/O timeouts with the default implementation, but succeeds perfectly with
unbound and 'cpu' affinity.
So there is definitely an improvement there.
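
To make concrete what "unbound + 'cpu' affinity" refers to, here is a rough
sketch (illustration only, not the actual patch: the helper
nvme_tcp_alloc_wq() is made up for this example, and
alloc_workqueue_attrs()/apply_workqueue_attrs() are core workqueue functions
that are not exported to modules, so the real change has to go through or
extend kernel/workqueue.c):

/*
 * Sketch only: switch an unbound workqueue from the default 'cache'
 * affinity scope to the 'cpu' scope.  Names and flag choices here are
 * illustrative, not taken from the posted patch.
 */
#include <linux/workqueue.h>

static struct workqueue_struct *nvme_tcp_wq;

static int nvme_tcp_alloc_wq(bool unbound)
{
	struct workqueue_attrs *attrs;
	int ret;

	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
				      unbound ? WQ_UNBOUND | WQ_MEM_RECLAIM
					      : WQ_MEM_RECLAIM | WQ_HIGHPRI,
				      0);
	if (!nvme_tcp_wq)
		return -ENOMEM;
	if (!unbound)
		return 0;

	/*
	 * Unbound workqueues default to the 'cache' affinity scope, which
	 * groups CPUs by last-level cache.  With WQ_AFFN_CPU every CPU is
	 * its own pod, so work items are normally processed on the CPU
	 * they were queued from, keeping the I/O completion path local.
	 */
	attrs = alloc_workqueue_attrs();
	if (!attrs) {
		ret = -ENOMEM;
		goto out_destroy;
	}
	attrs->affn_scope = WQ_AFFN_CPU;
	ret = apply_workqueue_attrs(nvme_tcp_wq, attrs);
	free_workqueue_attrs(attrs);
	if (ret)
		goto out_destroy;
	return 0;

out_destroy:
	destroy_workqueue(nvme_tcp_wq);
	return ret;
}

For experiments without code changes, the same scope can be selected
globally with the workqueue.default_affinity_scope=cpu boot parameter, or
per workqueue via /sys/devices/virtual/workqueue/<name>/affinity_scope for
workqueues created with WQ_SYSFS.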

I'll see if I can dig out performance numbers for the current implementation.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



