Performance loss with word count benchmark

Jirka Hladky jhladky at redhat.com
Mon Oct 2 03:20:17 PDT 2023


Thank you, Julia!

The results are interesting:

$ tail -v -n+1 * | grep -P "==>|system"
==> wordcount_yeti-4_5.17.0_powersave-active.json <==
     "system": 21.587097254999996,
==> wordcount_yeti-4_5.18.0_powersave-active.json <==
     "system": 21.593763075000005,
==> wordcount_yeti-4_5.19.0_powersave-active.json <==
     "system": 22.53202131,
==> wordcount_yeti-4_6.0.0_powersave-active.json <==
     "system": 27.673837299999995,
==> wordcount_yeti-4_6.1.0_powersave-active.json <==
     "system": 24.58705548,
==> wordcount_yeti-4_6.2.0_powersave-active.json <==
     "system": 24.951412994999995,

System time went up significantly between 5.19 and 6.0. Starting from
6.1, it has improved a little. I was trying to link these results to
the lower performance of the "mprotect" system call (which begins in
6.1), but the data does not align with that. Something else is
happening here...

On Mon, Oct 2, 2023 at 11:31 AM Julia Lawall <julia.lawall at inria.fr> wrote:
>
>
>
> On Mon, 2 Oct 2023, Jirka Hladky wrote:
>
> > Thanks for sharing, Julia!
> >
> > > Something else seems to happen for 6.1, though.
> >
> > Could you please share the user and system times as well? I'm using
> > the "hyperfine" tool to measure them, with statistics:
> >
> > cargo install hyperfine || cargo install --version 8.4.0 hyperfine
> >
> > See
> > https://crates.io/crates/hyperfine
> >
> > hyperfine --warmup 1 'phoenix-2.0/tests/word_count/word_count
> > word_count_datafiles/word_100MB.txt'
>
> The hyperfine files for 5.17-6.2 are attached.
>
> julia
>
> >
> > Merci!
> > Jirka
> >
> > On Sat, Sep 30, 2023 at 8:03 AM Julia Lawall <julia.lawall at inria.fr> wrote:
> > >
> > >
> > >
> > > On Sat, 30 Sep 2023, Jirka Hladky wrote:
> > >
> > > > Hi Matt and Julia,
> > > > thanks a lot for adding me to this thread. The performance issues with the word_count test might be connected to a performance drop in the mprotect system call. Libmicro [1] shows performance drops of up to 1.5x for various mprotect tests.
> > > >
> > > > ==> 6.0.0-54.eln121.x86_64_mmap_mprotect.log <==
> > > > TSC for 32768 mprotect calls with len of 128kiB: 202866 K-cycles.  Avg: 6.191 K-cycles/call
> > > >
> > > > ==> 6.5.0-57.eln130.x86_64_mmap_mprotect.log <==
> > > > TSC for 32768 mprotect calls with len of 128kiB: 302327 K-cycles.  Avg: 9.22631 K-cycles/call
> > > >
> > > > ==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_mprotect.log <==
> > > > TSC for 32768 mprotect calls with len of 128kiB: 269886 K-cycles.  Avg: 8.2363 K-cycles/call
> > > >
> > > > Matt, Liam: could you please comment if this is expected?
> > > >
> > > > I'm attaching a simple .c reproducer. It times the mprotect syscall for a large mmapped region where we alternate PROT_NONE and PROT_READ | PROT_WRITE over 128kiB areas. You can run it like this:
> > > >
> > > > gcc -Wall -Wextra -O1 mmap_mprotect.c -o mmap_mprotect
> > > > ./mmap_mprotect 128 32768
> > > >
> > > > It will allocate 128kiB * 32768 = 4 GiB of memory.
> > > >
> > > > Here is the main part where we measure the performance:
> > > >
> > > >      // Time the mprotect calls - alternate protection between PROT_NONE and PROT_READ | PROT_WRITE
> > > >      //
> > > >      u_int64_t start_rdtsc = start_clock();
> > > >
> > > >      for (i = 0; i < iterations; i++) {
> > > >        if (i % 2 == 0) {
> > > >          prot = PROT_NONE;
> > > >        } else {
> > > >          prot = PROT_READ | PROT_WRITE;
> > > >        }
> > > >        ret = mprotect((void *)ts_map + i * mmap_len, mmap_len, prot);
> > > >        if (ret != 0) {
> > > >          perror("mprotect");
> > > >          printf("mprotect error at iteration %d of %ld\n", i, iterations);
> > > >        }
> > > >      }
> > > >
> > > >      u_int64_t stop_rdtsc = stop_clock();
> > > >
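> > > > For readers without the attachment, here is a minimal self-contained
> > > > sketch along the same lines. The start_clock/stop_clock helpers below
> > > > are simplified RDTSC wrappers written just for this sketch; the
> > > > attached reproducer may differ in details:
> > > >
> > > > #define _GNU_SOURCE
> > > > #include <stdio.h>
> > > > #include <stdint.h>
> > > > #include <stdlib.h>
> > > > #include <sys/mman.h>
> > > > #include <x86intrin.h>
> > > >
> > > > /* Simplified TSC wrappers (this sketch only). */
> > > > static inline uint64_t start_clock(void) { return __rdtsc(); }
> > > > static inline uint64_t stop_clock(void)
> > > > {
> > > >     unsigned int aux;
> > > >     return __rdtscp(&aux);
> > > > }
> > > >
> > > > int main(int argc, char **argv)
> > > > {
> > > >     long len_kib = argc > 1 ? atol(argv[1]) : 128;      /* area size in kiB */
> > > >     long iterations = argc > 2 ? atol(argv[2]) : 32768; /* number of areas */
> > > >     size_t mmap_len = (size_t)len_kib * 1024;
> > > >     long i;
> > > >
> > > >     /* One large anonymous mapping; mprotect then flips its slices. */
> > > >     char *ts_map = mmap(NULL, mmap_len * iterations,
> > > >                         PROT_READ | PROT_WRITE,
> > > >                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > >     if (ts_map == MAP_FAILED) {
> > > >         perror("mmap");
> > > >         return 1;
> > > >     }
> > > >
> > > >     uint64_t start_rdtsc = start_clock();
> > > >     for (i = 0; i < iterations; i++) {
> > > >         int prot = (i % 2 == 0) ? PROT_NONE : (PROT_READ | PROT_WRITE);
> > > >         if (mprotect(ts_map + i * mmap_len, mmap_len, prot) != 0)
> > > >             perror("mprotect");
> > > >     }
> > > >     uint64_t stop_rdtsc = stop_clock();
> > > >
> > > >     printf("TSC for %ld mprotect calls with len of %ldkiB: %lu K-cycles.  Avg: %g K-cycles/call\n",
> > > >            iterations, len_kib,
> > > >            (unsigned long)((stop_rdtsc - start_rdtsc) / 1000),
> > > >            (double)(stop_rdtsc - start_rdtsc) / 1000.0 / (double)iterations);
> > > >
> > > >     munmap(ts_map, mmap_len * iterations);
> > > >     return 0;
> > > > }
> > > >
> > > > Note that alternating the protection flags prevents neighbouring areas
> > > > from merging, so the kernel ends up managing one VMA per 128kiB slice -
> > > > presumably that is what stresses the VMA tree.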
> > > >
> > > >
> > > > BTW, I have noticed that the word_count test shows the most significant performance degradation with the maple tree compared to the rbtree when run with many threads. Below are the results from an Intel Icelake server with a 36-core Platinum 8351N CPU:
> > >
> > > I also observed this.  I tried the same experiment on a 2-socket Intel
> > > 6130 (same as the previous machine, but with 2 sockets instead of 4), and
> > > I don't see any performance degradation for 6.0:
> > >
> > > wordcount_dahu-28_5.19.0_performance.json:      "mean": 1.1824737362849995,
> > > wordcount_dahu-28_6.0.0_performance.json:      "mean": 1.142431961085,
> > > wordcount_dahu-28_6.1.0_performance.json:      "mean": 1.346207455575,
> > > wordcount_dahu-28_6.2.0_performance.json:      "mean": 1.3775329248449997,
> > >
> > > The mprotect calls seem to be less aligned in time than on the larger machine.
> > >
> > > Something else seems to happen for 6.1, though.
> > >
> > > julia
> > >
> > >
> > >
> > >
> > > > 72 threads:
> > > > ==> 6.0.0-54.eln121.x86_64_allthreads.statistics.txt <==
> > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > >  Time (mean ± σ):     797.4 ms ±  47.0 ms    [User: 4868.8 ms, System: 10793.5 ms]
> > > >  Range (min … max):   707.0 ms … 849.8 ms    10 runs
> > > >
> > > >
> > > > ==> 6.5.0-57.eln130.x86_64_allthreads.statistics.txt <==
> > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > >  Time (mean ± σ):      1.673 s ±  0.265 s    [User: 9.574 s, System: 54.777 s]
> > > >  Range (min … max):    1.282 s …  2.240 s    10 runs
> > > >
> > > >
> > > > ==> 6.6.0-0.rc2.20.eln130.x86_64_allthreads.statistics.txt <==
> > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > >  Time (mean ± σ):      1.223 s ±  0.038 s    [User: 13.107 s, System: 20.346 s]
> > > >  Range (min … max):    1.153 s …  1.273 s    10 runs
> > > >
> > > > System time is 5x longer with the 6.5 kernel compared to 6.0. In 6.6, there is a significant improvement.
> > > > When I repeat the test on the same server but with just four threads, I get much better results:
> > > >
> > > > ==> 6.0.0-54.eln121.x86_64_4threads.statistics.txt <==
> > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > >  Time (mean ± σ):     689.4 ms ±   3.3 ms    [User: 2350.9 ms, System: 191.9 ms]
> > > >  Range (min … max):   684.5 ms … 693.8 ms    10 runs
> > > >
> > > >
> > > > ==> 6.5.0-57.eln130.x86_64_4threads.statistics.txt <==
> > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > >  Time (mean ± σ):     699.3 ms ±   2.4 ms    [User: 2371.8 ms, System: 205.8 ms]
> > > >  Range (min … max):   695.4 ms … 702.6 ms    10 runs
> > > >
> > > >
> > > > ==> 6.6.0-0.rc2.20.eln130.x86_64_4threads.statistics.txt <==
> > > > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > > >  Time (mean ± σ):     696.5 ms ±   3.0 ms    [User: 2370.1 ms, System: 194.3 ms]
> > > >  Range (min … max):   691.2 ms … 700.2 ms    10 runs
> > > >
> > > > In this scenario, there is barely any difference between kernel 6.0, 6.5, and 6.6.
> > > >
> > > > It seems the test is not ready to use so many threads: the actual runtime with four threads is SHORTER than with 72 threads.
> > > >
> > > > Thanks
> > > > Jirka
> > > >
> > > > [1] https://github.com/redhat-performance/libMicro/tree/master
> > > >
> > > >
> > > >
> > > > On Fri, Sep 29, 2023 at 10:09 PM Julia Lawall <julia.lawall at inria.fr> wrote:
> > > >       Attached are graphs from a small extract of the run of wordcount on the
> > > >       4-socket Intel 6130 before and after the introduction of maple trees.
> > > >
> > > >       The kernel called beforemaple is 6.0.0-rc3-00220-g000a449345bb
> > > >       The kernel called aftermaple is 6.0.0-rc3-00286-gc154124fe925
> > > >
> > > >       trace-cmd is given the argument -e syscalls, so all system calls are
> > > >       traced.  The only ones that happen in this region are mprotects.
> > > >
> > > >       My interpretation is that there are towers of mprotect entries that
> > > >       subsequently block on an rwsem and then wake up later to do their exits.
> > > >       Other mprotects manage to sneak in along the way, because rwsems are
> > > >       not fair.  This is probably orthogonal to maple trees, though.
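> > > >
> > > >       To make the serialization concrete: mprotect takes the per-process
> > > >       mmap rwsem for writing, so threads serialize even when they touch
> > > >       disjoint ranges. A toy sketch (the names here are mine, not from
> > > >       word_count) whose system time should grow with NTHREADS rather
> > > >       than staying flat:
> > > >
> > > >       #define _GNU_SOURCE
> > > >       #include <pthread.h>
> > > >       #include <stdio.h>
> > > >       #include <sys/mman.h>
> > > >
> > > >       #define NTHREADS 8
> > > >       #define SLICE (128 * 1024)  /* 128kiB per thread */
> > > >       #define ROUNDS 100000
> > > >
> > > >       static char *map;
> > > >
> > > >       static void *worker(void *arg)
> > > >       {
> > > >           long id = (long)arg;
> > > >           char *slice = map + id * SLICE;  /* ranges never overlap */
> > > >           for (int i = 0; i < ROUNDS; i++)
> > > >               /* Each call still takes the process-wide mmap lock for
> > > >                * writing, so the threads serialize anyway. */
> > > >               mprotect(slice, SLICE,
> > > >                        (i % 2) ? PROT_READ | PROT_WRITE : PROT_NONE);
> > > >           return NULL;
> > > >       }
> > > >
> > > >       int main(void)
> > > >       {
> > > >           pthread_t tid[NTHREADS];
> > > >           long i;
> > > >
> > > >           map = mmap(NULL, (size_t)NTHREADS * SLICE,
> > > >                      PROT_READ | PROT_WRITE,
> > > >                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > >           if (map == MAP_FAILED) {
> > > >               perror("mmap");
> > > >               return 1;
> > > >           }
> > > >           for (i = 0; i < NTHREADS; i++)
> > > >               pthread_create(&tid[i], NULL, worker, (void *)i);
> > > >           for (i = 0; i < NTHREADS; i++)
> > > >               pthread_join(tid[i], NULL);
> > > >           return 0;
> > > >       }
> > > >
> > > >       (Build with gcc -O1 -pthread and compare the system time reported
> > > >       by "time" as NTHREADS varies.)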
> > > >
> > > >       julia
> > > >
> > > >
> > > >
> > > > --
> > > > -Jirka
> > > >
> > > >
> >
> >
> >
> > --
> > -Jirka
> >
> >



-- 
-Jirka



