[RFC PATCH v1 00/57] Boot-time page size selection for arm64
Ryan Roberts
ryan.roberts at arm.com
Wed Nov 13 04:56:24 PST 2024
On 13/11/2024 12:40, Petr Tesarik wrote:
> On Tue, 12 Nov 2024 11:50:39 +0100
> Petr Tesarik <ptesarik at suse.com> wrote:
>
>> On Tue, 12 Nov 2024 10:19:34 +0000
>> Ryan Roberts <ryan.roberts at arm.com> wrote:
>>
>>> On 12/11/2024 09:45, Petr Tesarik wrote:
>>>> On Mon, 11 Nov 2024 12:25:35 +0000
>>>> Ryan Roberts <ryan.roberts at arm.com> wrote:
>>>>
>>>>> Hi Petr,
>>>>>
>>>>> On 11/11/2024 12:14, Petr Tesarik wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> On Thu, 17 Oct 2024 13:32:43 +0100
>>>>>> Ryan Roberts <ryan.roberts at arm.com> wrote:
>>>>> [...]
>>>>>> Third, a few micro-benchmarks saw a significant regression.
>>>>>>
>>>>>> Most notably, the getenv and getenvT2 tests from libMicro were 18% and 20%
>>>>>> slower with variable page size. I don't know why, but I'm looking into
>>>>>> it. The system() library call was also about 18% slower, but that might
>>>>>> be related.
>>>>>
>>>>> OK, ouch. I think there are some things we can try to optimize the
>>>>> implementation further. But I'll wait for your analysis before digging in myself.
>>>>
>>>> This turned out to be a false positive. The way this microbenchmark was
>>>> invoked did not collect enough samples, so the result mostly depended on
>>>> whether caches were hot or cold, and the timing on this specific system
>>>> with this specific sequence of benchmarks in the suite happens to favour
>>>> my baseline kernel.
>>>>
>>>> After increasing the batch count, I'm getting pretty much the same
>>>> performance for 6.11 vanilla and patched kernels:
>>>>
>>>>                    prc thr usecs/call samples errors cnt/samp
>>>> getenv (baseline)    1   1    0.14975      99      0   100000
>>>> getenv (patched)     1   1    0.14981      92      0   100000
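For anyone reproducing this: the getenv case boils down to timing a tight
loop of getenv() calls and reporting the per-call average, so with only a
handful of short samples the result is dominated by whether that loop runs
cache-hot or cache-cold. A minimal standalone sketch (not libMicro itself;
the variable name and counts are illustrative only) looks roughly like this:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
        const long iters = 100000;              /* roughly cnt/samp above */
        struct timespec t0, t1;
        long hits = 0;

        setenv("A", "1", 1);                    /* make sure the lookup succeeds */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
                hits += getenv("A") != NULL;    /* keep the call from being optimised away */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("%.5f usecs/call (%ld hits)\n", us / iters, hits);
        return 0;
}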
>>>
>>> Oh that's good news! Does this account for all 3 of the above tests (getenv,
>>> getenvT2 and system())?
>>
>> It does for getenvT2 (a variant of the test with 2 threads), but not
>> for system(). Thanks for asking, I forgot about that one.
>>
>> I'm getting a substantial difference there (+29% on average over 100 runs):
>>
>>                    prc thr usecs/call samples errors cnt/samp command
>> system (baseline)    1   1 6937.18016     102      0      100 A=$$
>> system (patched)     1   1 8959.48032     102      0      100 A=$$
>>
>> So, yeah, this should in fact be my priority #1.
>
> Further testing reveals the workload is bimodal, that is to say the
> distribution of results has two peaks. The first peak around 3.2 ms
> covers 30% of the runs, the second peak around 15.7 ms covers 11%. Two per
> cent are faster than the fast peak, 5% are slower than the slow peak, and the
> rest are distributed almost evenly between them.
FWIW, one source of bimodality I've seen on Ampere systems with 2 NUMA nodes is
placement of the kernel image vs placement of the running thread. If they are
remote from each other, you'll see a slowdown. I've hacked this source away in
the past by effectively using only a single NUMA node (with the help of
'maxcpus' and 'mem' kernel cmdline options).
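One quick way to check whether that is what's happening here is to pin the
benchmark to each node in turn and compare the two distributions; if the fast
and slow peaks separate cleanly by node, placement rather than the series is
the likely culprit. A rough, untested sketch using libnuma (node number taken
from argv, build with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        int node = argc > 1 ? atoi(argv[1]) : 0;

        if (numa_available() < 0) {
                fprintf(stderr, "NUMA not supported on this system\n");
                return 1;
        }

        /* Restrict execution to the CPUs of 'node'... */
        if (numa_run_on_node(node)) {
                perror("numa_run_on_node");
                return 1;
        }
        /* ...and prefer allocating memory from that node. */
        numa_set_preferred(node);

        /* ... run the timed loop / exec the benchmark from here ... */
        return 0;
}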
>
> 100 samples were not sufficient to reveal this distribution, and it was
> sheer bad luck that originally only the patched kernel reported bad
> results. I can now see bad results even with the unpatched kernel.
>
> In short, I don't think there is a difference in system() performance.
>
> I will still have a look at dup() and VMA performance, but so far it
> all looks good to me. Good job! ;-)
Thanks for digging into all this!
>
> I will also try running a more complete set of benchmarks next week.
> week. That's SUSE Hack Week, and I want to make a PoC for the MM
> changes I proposed at LPC24, so I won't need this Ampere system for
> interactive use.
>
> Petr T