fwd: Re: Performance loss with word count benchmark

Liam R. Howlett Liam.Howlett at Oracle.com
Fri Apr 5 12:40:39 PDT 2024


For some reason this email from Julia is missing from the October 2023
archive of the mailing list even though it was sent around the middle of
the month.  I'm forwarding it back to the list so the results won't be
lost.

Thanks again Julia for running the benchmarks and following up with us.
Liam

----- Forwarded message from Julia Lawall <julia.lawall at inria.fr> -----

...
Subject: Re: Performance loss with word count benchmark
Date: Fri, 13 Oct 2023 23:40:41 +0200 (CEST)



On Mon, 2 Oct 2023, Jirka Hladky wrote:

> Thank you, Julia!
>
> The results are interesting:
>
> $ tail -v -n+1 * | grep -P "==>|system"
> ==> wordcount_yeti-4_5.17.0_powersave-active.json <==
>      "system": 21.587097254999996,
> ==> wordcount_yeti-4_5.18.0_powersave-active.json <==
>      "system": 21.593763075000005,
> ==> wordcount_yeti-4_5.19.0_powersave-active.json <==
>      "system": 22.53202131,
> ==> wordcount_yeti-4_6.0.0_powersave-active.json <==
>      "system": 27.673837299999995,
> ==> wordcount_yeti-4_6.1.0_powersave-active.json <==
>      "system": 24.58705548,
> ==> wordcount_yeti-4_6.2.0_powersave-active.json <==
>      "system": 24.951412994999995,
>
> System time went up significantly between 5.19 and 6.0, and has improved a
> little starting from 6.1. I was trying to link the results to the lower
> performance of the "mprotect" system call (which begins in 6.1), but this
> data does not line up with that. Something else is
> happening here...

I tried some more recent versions, with interesting results.  This is on a
four-socket Intel 6130.

wordcount_yeti-4_5.19.0_performance.json:      "mean": 3.12326750034,
wordcount_yeti-4_6.0.0_performance.json:      "mean": 3.08271866984,
wordcount_yeti-4_6.1.0_performance.json:      "mean": 3.9168611305299996,
wordcount_yeti-4_6.2.0_performance.json:      "mean": 3.9323236962599992,
wordcount_yeti-4_6.3.0_performance.json:      "mean": 3.9299195262349995,
wordcount_yeti-4_6.4.0_performance.json:      "mean": 1.6312004133800002,
wordcount_yeti-4_6.5.0_performance.json:      "mean": 1.6477082927600002,
wordcount_yeti-4_6.6.0rc1_performance.json:      "mean": 0.9028324600200002,
wordcount_yeti-4_6.6.0rc3_performance.json:      "mean": 0.8936725624550004,

Bisecting between 6.3 and 6.4 gives the following list of commits:

c7f8f31c00d1 mm: separate vma->lock from vm_area_struct =========> 1.6 seconds
0d2ebf9c3f78 mm/mmap: free vm_area_struct without call_rcu in exit_mmap
70d4cbc80c88 (HEAD) powerpc/mm: try VMA lock-based page fault handling first
cd7f176aea5f arm64/mm: try VMA lock-based page fault handling first
0bff0aaea03e x86/mm: try VMA lock-based page fault handling first
52f238653e45 mm: introduce per-VMA lock statistics ==============> 3.9 seconds

The versions in the middle don't compile.  Commit c7f8f31c00d1 is
performance-related:

commit c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f
Author: Suren Baghdasaryan <surenb at google.com>
Date:   Mon Feb 27 09:36:32 2023 -0800

    mm: separate vma->lock from vm_area_struct

I don't know whether this commit undoes something that was aggravated by the
introduction of maple trees, or whether the performance improvement is simply
so large that it hides the previous degradation.
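
A bisection like this can be redone with the standard git bisect workflow;
roughly (the tags and the benchmark step below are a sketch, not the exact
commands used to get these numbers):

git bisect start
git bisect bad v6.4
git bisect good v6.3
# build and boot the candidate kernel, run the word_count benchmark,
# then mark the result and let git pick the next commit:
git bisect good     # or: git bisect bad
git bisect reset    # when done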

There is another big performance improvement in 6.6-rc1.  After
c7f8f31c00d187a2c71a241c7f2bd6aa102a4e6f, the towers of mprotects are still
present, as shown in the attached graph.  In 6.6.0-rc3, the towers are gone,
and the mprotects come at different times on different cores.  This surely
reduces contention and improves performance, but I haven't bisected that
case to see what causes the changed behavior.

julia



>
> On Mon, Oct 2, 2023 at 11:31 AM Julia Lawall <julia.lawall at inria.fr> wrote:
> >
> >
> >
> > On Mon, 2 Oct 2023, Jirka Hladky wrote:
> >
> > > Thanks for sharing, Julia!
> > >
> > > > Something else seems to happen for 6.1, though.
> > >
> > > Could you please share the user and system times as well? I'm using
> > > the tool "hyperfine" to measure it, including statistics:
> > >
> > > cargo install hyperfine || cargo install --version 8.4.0 hyperfine
> > >
> > > See
> > > https://crates.io/crates/hyperfine
> > >
> > > hyperfine --warmup 1 'phoenix-2.0/tests/word_count/word_count
> > > word_count_datafiles/word_100MB.txt'
> >
> > The hyperfine files for 5.17-6.2 are attached.
> >
> > julia
> >
> > >
> > > Merci!
> > > Jirka
> > >
> > > On Sat, Sep 30, 2023 at 8:03 AM Julia Lawall <julia.lawall at inria.fr> wrote:
> > > >
> > > >
> > > >
> > > > On Sat, 30 Sep 2023, Jirka Hladky wrote:
> > > >
> > > > > Hi Matt and Julia,
> > > > > thanks a lot for adding me to this thread. The performance issues with the word_count test might be connected to a performance drop in the mprotect system call. Libmicro [1] shows performance drops of up to 1.5x for various mprotect tests.
> > > > >
> > > > > ==> 6.0.0-54.eln121.x86_64_mmap_mprotect.log <==
> > > > > TSC for 32768 mprotect calls with len of 128kiB: 202866 K-cycles.  Avg: 6.191 K-cycles/call
> > > > >
> > > > > ==> 6.5.0-57.eln130.x86_64_mmap_mprotect.log <==
> > > > > TSC for 32768 mprotect calls with len of 128kiB: 302327 K-cycles.  Avg: 9.22631 K-cycles/call
> > > > >
> > > > > ==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_mprotect.log <==
> > > > > TSC for 32768 mprotect calls with len of 128kiB: 269886 K-cycles.  Avg: 8.2363 K-cycles/call
> > > > >
> > > > > Matt, Liam: could you please comment if this is expected?
> > > > >
> > > > > I'm attaching a simple .c reproducer. It times the mprotect syscall for a large mmapped region, alternating 128 kiB areas between PROT_NONE and PROT_READ | PROT_WRITE. You can run it like this:
> > > > >
> > > > > gcc -Wall -Wextra -O1 mmap_mprotect.c -o mmap_mprotect
> > > > > ./mmap_mprotect 128 32768
> > > > >
> > > > > It will allocate 128 kiB * 32768 = 4 GiB of memory.
> > > > >
> > > > > Here is the main part where we measure the performance:
> > > > >
> > > > >      // Time the mprotect calls -alternate protection between PROT_NONE and PROT_READ | PROT_WRITE
> > > > >      //
> > > > >      u_int64_t start_rdtsc = start_clock();
> > > > >
> > > > >      for (i = 0; i < iterations; i++) {
> > > > >        if (i % 2 == 0 ) {
> > > > >          prot = PROT_NONE;
> > > > >        } else {
> > > > >          prot = PROT_READ | PROT_WRITE;
> > > > >        }
> > > > >        ret = mprotect((void *)ts_map + i * mmap_len, mmap_len, prot);
> > > > >        if (ret != 0) {
> > > > >          perror("mprotect");
> > > > >          printf(" mprotect error at iteration %d of %ld\n", i, iterations);
> > > > >        }
> > > > >      }
> > > > >
> > > > >      u_int64_t stop_rdtsc = stop_clock();
> > > > >
> > > > >
> > > > >
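
A minimal self-contained sketch along the same lines (this is not the
attached mmap_mprotect.c; the argument handling and the use of __rdtsc() in
place of the start_clock()/stop_clock() helpers are assumptions):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(int argc, char **argv)
{
        /* Chunk size in kiB and number of mprotect calls, e.g. "./a.out 128 32768". */
        size_t chunk_kib  = argc > 1 ? strtoul(argv[1], NULL, 0) : 128;
        long   iterations = argc > 2 ? strtol(argv[2], NULL, 0)  : 32768;
        size_t chunk_len  = chunk_kib * 1024;
        size_t total_len  = chunk_len * (size_t)iterations;

        /* One large anonymous mapping; each iteration changes one chunk. */
        char *map = mmap(NULL, total_len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        unsigned long long start = __rdtsc();
        for (long i = 0; i < iterations; i++) {
                /* Alternate protections so neighbouring VMAs cannot merge. */
                int prot = (i % 2 == 0) ? PROT_NONE : (PROT_READ | PROT_WRITE);
                if (mprotect(map + (size_t)i * chunk_len, chunk_len, prot) != 0) {
                        perror("mprotect");
                        return 1;
                }
        }
        unsigned long long stop = __rdtsc();

        printf("TSC for %ld mprotect calls with len of %zukiB: %llu cycles, avg %.3f K-cycles/call\n",
               iterations, chunk_kib, stop - start,
               (double)(stop - start) / iterations / 1000.0);

        munmap(map, total_len);
        return 0;
}

Built with the gcc line above, it should print a K-cycles/call figure in the
same style as the libMicro numbers quoted earlier.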
> > > > > BTW, I have noticed that the word_count test shows the most significant performance degradation with Maple Tree compared to RB when run with many threads. Below are the results from an Intel Ice Lake server with a 36-core Platinum 8351N CPU:
> > > >
> > > > I also observed this.  I tried the same experiment on a 2-socket Intel
> > > > 6130 (same as the previous machine, but with 2 sockets instead of 4), and
> > > > I don't have any performance degradation for 6.0:
> > > >
> > > > wordcount_dahu-28_5.19.0_performance.json:      "mean": 1.1824737362849995,
> > > > wordcount_dahu-28_6.0.0_performance.json:      "mean": 1.142431961085,
> > > > wordcount_dahu-28_6.1.0_performance.json:      "mean": 1.346207455575,
> > > > wordcount_dahu-28_6.2.0_performance.json:      "mean": 1.3775329248449997,
> > > >
> > > > The calls to mprotect seem to be less aligned than for the larger machine.
> > > >
> > > > Something else seems to happen for 6.1, though.
> > > >
> > > > julia
> > > >
> > > >
> > > >
> > > >
> > > > > 72 threads:
> > > > > ==> 6.0.0-54.eln121.x86_64_allthreads.statistics.txt <==
> > > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > > >  Time (mean ± σ):     797.4 ms ±  47.0 ms    [User: 4868.8 ms, System: 10793.5 ms]
> > > > >  Range (min … max):   707.0 ms … 849.8 ms    10 runs
> > > > >
> > > > >
> > > > > ==> 6.5.0-57.eln130.x86_64_allthreads.statistics.txt <==
> > > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > > >  Time (mean ± σ):      1.673 s ±  0.265 s    [User: 9.574 s, System: 54.777 s]
> > > > >  Range (min … max):    1.282 s …  2.240 s    10 runs
> > > > >
> > > > >
> > > > > ==> 6.6.0-0.rc2.20.eln130.x86_64_allthreads.statistics.txt <==
> > > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > > >  Time (mean ± σ):      1.223 s ±  0.038 s    [User: 13.107 s, System: 20.346 s]
> > > > >  Range (min … max):    1.153 s …  1.273 s    10 runs
> > > > >
> > > > > System time is 5x longer with the 6.5 kernel compared to 6.0. In 6.6, there is a significant improvement.
> > > > > When I repeat the test on the same server but just with four threads, I'm getting much better results:
> > > > >
> > > > > ==> 6.0.0-54.eln121.x86_64_4threads.statistics.txt <==
> > > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > > >  Time (mean ± σ):     689.4 ms ±   3.3 ms    [User: 2350.9 ms, System: 191.9 ms]
> > > > >  Range (min … max):   684.5 ms … 693.8 ms    10 runs
> > > > >
> > > > >
> > > > > ==> 6.5.0-57.eln130.x86_64_4threads.statistics.txt <==
> > > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > > >  Time (mean ± σ):     699.3 ms ±   2.4 ms    [User: 2371.8 ms, System: 205.8 ms]
> > > > >  Range (min … max):   695.4 ms … 702.6 ms    10 runs
> > > > >
> > > > >
> > > > > ==> 6.6.0-0.rc2.20.eln130.x86_64_4threads.statistics.txt <==
> > > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > > >  Time (mean ± σ):     696.5 ms ±   3.0 ms    [User: 2370.1 ms, System: 194.3 ms]
> > > > >  Range (min … max):   691.2 ms … 700.2 ms    10 runs
> > > > >
> > > > > In this scenario, there is barely any difference between kernel 6.0, 6.5, and 6.6.
> > > > >
> > > > > It seems like the test does not scale to this many threads: the actual runtime with four threads is SHORTER than with 72 threads.
> > > > >
> > > > > Thanks
> > > > > Jirka
> > > > >
> > > > > [1] https://github.com/redhat-performance/libMicro/tree/master
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Sep 29, 2023 at 10:09 PM Julia Lawall <julia.lawall at inria.fr> wrote:
> > > > >       Attached are graphs from a small extract of the run of wordcount on the
> > > > >       4-socket Intel 6130 before and after the introduction of maple trees.
> > > > >
> > > > >       The kernel called beforemaple is 6.0.0-rc3-00220-g000a449345bb
> > > > >       The kernel called aftermaple is 6.0.0-rc3-00286-gc154124fe925
> > > > >
> > > > >       trace-cmd is given the argument -e syscalls, so all system calls are
> > > > >       traced.  The only ones that happen in this region are mprotects.
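
For reference, a trace along these lines can be captured with something like
the following (the invocation is a sketch; the benchmark path is the one used
above):

trace-cmd record -e syscalls phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
trace-cmd report | grep mprotect

The report then shows the syscalls:sys_enter_mprotect / sys_exit_mprotect
pairs that form the towers in the graphs.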
> > > > >
> > > > >       My interpretation is that the towers are mprotect entries that
> > > > >       subsequently block on a rwsem, and then wake up later to do their exit.
> > > > >       Other mprotects manage to sneak in along the way, because rwsems are
> > > > >       not fair.  This is probably orthogonal to maple trees, though.
> > > > >
> > > > >       julia
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -Jirka
> > > > >
> > > > >
> > >
> > >
> > >
> > > --
> > > -Jirka
> > >
> > >
>
>
>
> --
> -Jirka
>
>



----- End forwarded message -----


