[PATCH 4/4] nvme-tcp: switch to 'cpu' affinity scope for unbound workqueues
Sagi Grimberg
sagi at grimberg.me
Thu Jul 4 02:11:57 PDT 2024
On 7/3/24 18:50, Hannes Reinecke wrote:
> On 7/3/24 17:09, Sagi Grimberg wrote:
>>
>>
>> On 03/07/2024 18:01, Hannes Reinecke wrote:
>>> On 7/3/24 16:22, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 03/07/2024 16:50, Hannes Reinecke wrote:
>>>>> We should switch to the 'cpu' affinity scope when using the
>>>>> 'wq_unbound' parameter, as this allows us to keep I/O locality
>>>>> and improve performance.
>>>>
>>>> Can you please describe in more detail why this is better? Locality
>>>> between what?
>>>>
>>> Well; the default unbound scope is 'cache', which groups the CPUs
>>> according to the cache hierarchy. I want the CPU locality of the
>>> workqueue items to be preserved as much as possible, so I switched
>>> to 'cpu' here.
>>>
>>> I'll get some performance numbers.
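
Just to make sure we're talking about the same mechanism: I assume the
patch effectively does something like the sketch below, i.e. allocate
the unbound workqueue as before and then override the default 'cache'
scope with WQ_AFFN_CPU via workqueue_attrs. This is only my reading of
the idea, not your actual patch, and I haven't checked whether
alloc_workqueue_attrs()/apply_workqueue_attrs() are exported to modules:

	struct workqueue_attrs *attrs;
	int ret;

	/* allocate the unbound workqueue as before ... */
	nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
				      WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	if (!nvme_tcp_wq)
		return -ENOMEM;

	/* ... then narrow the affinity scope from 'cache' to 'cpu' */
	attrs = alloc_workqueue_attrs();
	if (!attrs) {
		destroy_workqueue(nvme_tcp_wq);
		return -ENOMEM;
	}
	attrs->affn_scope = WQ_AFFN_CPU;
	ret = apply_workqueue_attrs(nvme_tcp_wq, attrs);
	free_workqueue_attrs(attrs);
	if (ret) {
		destroy_workqueue(nvme_tcp_wq);
		return ret;
	}

FWIW, for experiments the same selection should also be reachable
without code changes, either globally via the
workqueue.default_affinity_scope boot parameter or per workqueue
through the sysfs affinity_scope attribute (for workqueues created
with WQ_SYSFS).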
>>>
>>>> While you mention in your cover letter that "comments and reviews
>>>> are welcome", the change logs in your patches are not designed to
>>>> assist your reviewer.
>>>
>>> I spent the last few weeks trying to come up with a solution based
>>> on my original submission, but in the end I gave up as I hadn't been
>>> able to fix the original issue.
>>
>> Well, the last submission was a discombobulated set of mostly
>> unrelated patches...
>> What was it that did not work?
>>
>>> This is a different approach, massaging the 'wq_unbound' mechanism,
>>> which is not only easier but also has the big advantage that it
>>> actually works :-)
>>> So I did not include a changelog relative to the previous patchset,
>>> as this is a pretty different approach.
>>> Sorry if this is confusing.
>>
>> It's just difficult to understand what each patch contributes, and
>> most of the time the patches are under-documented. I want to see the
>> improvements added, but I also want them to be properly reviewed.
>
> Sure. So here are some performance numbers:
> (One subsystem, two paths, 96 queues)
> default:
> 4k seq read: bw=365MiB/s (383MB/s), 11.4MiB/s-20.5MiB/s
> (11.0MB/s-21.5MB/s), io=16.0GiB (17.2GB), run=24950-44907msec
> 4k rand read: bw=307MiB/s (322MB/s), 9830KiB/s-13.8MiB/s
> (10.1MB/s-14.5MB/s), io=16.0GiB (17.2GB), run=37081-53333msec
> 4k seq write: bw=550MiB/s (577MB/s), 17.2MiB/s-28.7MiB/s
> (18.0MB/s-30.1MB/s), io=16.0GiB (17.2GB), run=17859-29786msec
> 4k rand write: bw=453MiB/s (475MB/s), 14.2MiB/s-21.3MiB/s
> (14.8MB/s-22.3MB/s), io=16.0GiB (17.2GB), run=24066-36161msec
>
> unbound:
> 4k seq read: bw=232MiB/s (243MB/s), 6145KiB/s-9249KiB/s
> (6293kB/s-9471kB/s), io=13.6GiB (14.6GB), run=56685-60074msec
> 4k rand read: bw=249MiB/s (261MB/s), 6335KiB/s-9713KiB/s
> (6487kB/s-9946kB/s), io=14.6GiB (15.7GB), run=53976-60019msec
> 4k seq write: bw=358MiB/s (375MB/s), 11.2MiB/s-13.5MiB/s
> (11.7MB/s-14.2MB/s), io=16.0GiB (17.2GB), run=37918-45779msec
> 4k rand write: bw=335MiB/s (351MB/s), 10.5MiB/s-14.7MiB/s
> (10.0MB/s-15.4MB/s), io=16.0GiB (17.2GB), run=34929-48971msec
>
> unbound + 'cpu' affinity:
> 4k seq read: bw=249MiB/s (261MB/s), 6003KiB/s-13.6MiB/s
> (6147kB/s-14.3MB/s), io=14.6GiB (15.7GB), run=37636-60065msec
> 4k rand read: bw=305MiB/s (320MB/s), 9773KiB/s-13.9MiB/s
> (10.0MB/s-14.6MB/s), io=16.0GiB (17.2GB), run=36791-53644msec
> 4k seq write: bw=499MiB/s (523MB/s), 15.6MiB/s-18.0MiB/s
> (16.3MB/s-19.9MB/s), io=16.0GiB (17.2GB), run=27018-32860msec
> 4k rand write: bw=536MiB/s (562MB/s), 16.7MiB/s-21.1MiB/s
> (17.6MB/s-22.1MB/s), io=16.0GiB (17.2GB), run=24305-30588msec
>
> As you can see, with unbound and 'cpu' affinity we are basically on par
> with the default implementation (all tests are run with per-controller
> workqueues, mind).
I'm puzzled that the seq vs. rand results vary this much when you are
running against a brd device.
Are these results stable?
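
Also, since the per-controller workqueues are not part of this series,
I'm assuming they look roughly like the sketch below, i.e. a
hypothetical tcp_wq pointer in nvme_tcp_ctrl replacing the global
nvme_tcp_wq, with the queue_work_on() calls in the I/O path redirected
to it (again, just my reading, not your actual code):

	/*
	 * Hypothetical per-controller workqueue named after the
	 * controller instance; ctrl->tcp_wq does not exist in the
	 * driver today.
	 */
	ctrl->tcp_wq = alloc_workqueue("nvme_tcp_wq_%d",
				       WQ_UNBOUND | WQ_MEM_RECLAIM, 0,
				       ctrl->ctrl.instance);
	if (!ctrl->tcp_wq)
		return -ENOMEM;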
> Running the same workload with 4 subsystems and 8 paths runs into
> I/O timeouts with the default implementation, but succeeds perfectly
> with unbound and 'cpu' affinity.
> So definitely an improvement there.
I tend to think that the I/O timeouts are caused by a bug, not by
"non-optimized" code. I/O timeouts are an eternity for this test, which
makes me think we have a different issue here.