Performance loss with word count benchmark

Jirka Hladky jhladky at redhat.com
Mon Oct 2 02:20:56 PDT 2023


Thanks for sharing, Julia!

> Something else seems to happen for 6.1, though.

Could you please share the user and system times as well? I'm using
the "hyperfine" tool to measure it; it reports those along with run statistics:

cargo install hyperfine || cargo install --version 8.4.0 hyperfine

See
https://crates.io/crates/hyperfine

hyperfine --warmup 1 'phoenix-2.0/tests/word_count/word_count
word_count_datafiles/word_100MB.txt'
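
If your hyperfine is new enough, --export-json additionally writes the
measurements, including the mean user and system times, to a JSON file:

hyperfine --warmup 1 --export-json word_count.json \
  'phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt'

(word_count.json is just an arbitrary output file name.)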

Thanks!
Jirka

On Sat, Sep 30, 2023 at 8:03 AM Julia Lawall <julia.lawall at inria.fr> wrote:
>
>
>
> On Sat, 30 Sep 2023, Jirka Hladky wrote:
>
> > Hi Matt and Julia,
> > thanks a lot for adding me to this thread. The performance issues with the word_count test might be connected to a performance drop in the mprotect system call. libMicro [1] shows slowdowns of up to 1.5x for various mprotect tests.
> >
> > ==> 6.0.0-54.eln121.x86_64_mmap_mprotect.log <==
> > TSC for 32768 mprotect calls with len of 128kiB: 202866 K-cycles.  Avg: 6.191 K-cycles/call
> >
> > ==> 6.5.0-57.eln130.x86_64_mmap_mprotect.log <==
> > TSC for 32768 mprotect calls with len of 128kiB: 302327 K-cycles.  Avg: 9.22631 K-cycles/call
> >
> > ==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_mprotect.log <==
> > TSC for 32768 mprotect calls with len of 128kiB: 269886 K-cycles.  Avg: 8.2363 K-cycles/call
> >
> > Matt, Liam: could you please comment on whether this is expected?
> >
> > I'm attaching a simple .c reproducer. It times the mprotect syscall for a large mmapped region in which we alternate PROT_NONE and PROT_READ | PROT_WRITE on 128kiB areas. You can run it like this:
> >
> > gcc -Wall -Wextra -O1 mmap_mprotect.c -o mmap_mprotect
> > ./mmap_mprotect 128 32768
> >
> > This will allocate 128kiB * 32768 = 4 GiB of memory.
> >
> > Here is the main part where we measure the performance:
> >
> >      // Time the mprotect calls - alternate protection between
> >      // PROT_NONE and PROT_READ | PROT_WRITE
> >      u_int64_t start_rdtsc = start_clock();
> >
> >      for (i = 0; i < iterations; i++) {
> >        if (i % 2 == 0) {
> >          prot = PROT_NONE;
> >        } else {
> >          prot = PROT_READ | PROT_WRITE;
> >        }
> >        ret = mprotect((void *)ts_map + i * mmap_len, mmap_len, prot);
> >        if (ret != 0) {
> >          perror("mprotect");
> >          printf(" mprotect error at iteration %d of %ld\n", i, iterations);
> >        }
> >      }
> >
> >      u_int64_t stop_rdtsc = stop_clock();
> >
> >
> >
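> > For reference, a minimal self-contained sketch along the same lines
> > (not the exact attachment: start_clock()/stop_clock() are replaced by
> > direct __rdtsc() reads, and the command-line handling is simplified):
> >
> > #include <stdint.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/mman.h>
> > #include <x86intrin.h>   /* __rdtsc() */
> >
> > int main(int argc, char **argv)
> > {
> >     /* slice size in kiB and number of mprotect calls, as above */
> >     size_t mmap_len = (argc > 1 ? atol(argv[1]) : 128) * 1024;
> >     long iterations = argc > 2 ? atol(argv[2]) : 32768;
> >
> >     char *ts_map = mmap(NULL, mmap_len * iterations,
> >                         PROT_READ | PROT_WRITE,
> >                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >     if (ts_map == MAP_FAILED) { perror("mmap"); return 1; }
> >
> >     uint64_t start_rdtsc = __rdtsc();
> >     for (long i = 0; i < iterations; i++) {
> >         /* alternate PROT_NONE and PROT_READ | PROT_WRITE, splitting
> >            the single big VMA into many small ones */
> >         int prot = (i % 2 == 0) ? PROT_NONE : PROT_READ | PROT_WRITE;
> >         if (mprotect(ts_map + i * mmap_len, mmap_len, prot) != 0)
> >             perror("mprotect");
> >     }
> >     uint64_t stop_rdtsc = __rdtsc();
> >
> >     printf("TSC for %ld mprotect calls with len of %zukiB: %llu K-cycles.  Avg: %.3f K-cycles/call\n",
> >            iterations, mmap_len / 1024,
> >            (unsigned long long)((stop_rdtsc - start_rdtsc) / 1000),
> >            (double)(stop_rdtsc - start_rdtsc) / iterations / 1000.0);
> >     return 0;
> > }
> >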
> > BTW, I have noticed that the word_count test shows the most significant performance degradation with the maple tree compared to the rbtree when run with many threads. Below are the results from an Intel Ice Lake server with a 36-core Xeon Platinum 8351N CPU:
>
> I also observed this.  I tried the same experiment on a 2-socket Intel
> 6130 (same as the previous machine, but with 2 sockets instead of 4), and
> I don't have any performance degradation for 6.0:
>
> wordcount_dahu-28_5.19.0_performance.json:      "mean": 1.1824737362849995,
> wordcount_dahu-28_6.0.0_performance.json:      "mean": 1.142431961085,
> wordcount_dahu-28_6.1.0_performance.json:      "mean": 1.346207455575,
> wordcount_dahu-28_6.2.0_performance.json:      "mean": 1.3775329248449997,
>
> The calls to mprotect seem to be less aligned than for the larger machine.
>
> Something else seems to happen for 6.1, though.
>
> julia
>
>
>
>
> > 72 threads:
> > ==> 6.0.0-54.eln121.x86_64_allthreads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> >  Time (mean ± σ):     797.4 ms ±  47.0 ms    [User: 4868.8 ms, System: 10793.5 ms]
> >  Range (min … max):   707.0 ms … 849.8 ms    10 runs
> >
> >
> > ==> 6.5.0-57.eln130.x86_64_allthreads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> >  Time (mean ± σ):      1.673 s ±  0.265 s    [User: 9.574 s, System: 54.777 s]
> >  Range (min … max):    1.282 s …  2.240 s    10 runs
> >
> >
> > ==> 6.6.0-0.rc2.20.eln130.x86_64_allthreads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> >  Time (mean ± σ):      1.223 s ±  0.038 s    [User: 13.107 s, System: 20.346 s]
> >  Range (min … max):    1.153 s …  1.273 s    10 runs
> >
> > System time is 5x longer with the 6.5 kernel than with 6.0. In 6.6, there is a significant improvement.
> > When I repeat the test on the same server with just four threads, I get much better results:
> >
> > ==> 6.0.0-54.eln121.x86_64_4threads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> >  Time (mean ± σ):     689.4 ms ±   3.3 ms    [User: 2350.9 ms, System: 191.9 ms]
> >  Range (min … max):   684.5 ms … 693.8 ms    10 runs
> >
> >
> > ==> 6.5.0-57.eln130.x86_64_4threads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> >  Time (mean ± σ):     699.3 ms ±   2.4 ms    [User: 2371.8 ms, System: 205.8 ms]
> >  Range (min … max):   695.4 ms … 702.6 ms    10 runs
> >
> >
> > ==> 6.6.0-0.rc2.20.eln130.x86_64_4threads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> >  Time (mean ± σ):     696.5 ms ±   3.0 ms    [User: 2370.1 ms, System: 194.3 ms]
> >  Range (min … max):   691.2 ms … 700.2 ms    10 runs
> >
> > In this scenario, there is barely any difference between kernel 6.0, 6.5, and 6.6.
> >
> > It seems the test does not scale to this many threads: the actual runtime with four threads is SHORTER than with 72 threads.
> >
> > Thanks
> > Jirka
> >
> > [1] https://github.com/redhat-performance/libMicro/tree/master
> >
> >
> >
> > On Fri, Sep 29, 2023 at 10:09 PM Julia Lawall <julia.lawall at inria.fr> wrote:
> >       Attached are graphs from a small extract of the run of wordcount on the
> >       4-socket Intel 6130 before and after the introduction of maple trees.
> >
> >       The kernel called beforemaple is 6.0.0-rc3-00220-g000a449345bb
> >       The kernel called aftermaple is 6.0.0-rc3-00286-gc154124fe925
> >
> >       trace-cmd is given the argument -e syscalls, so all system calls are
> >       traced.  The only ones that happen in this region are mprotects.
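> >
> >       (That is, roughly:
> >
> >       trace-cmd record -e syscalls \
> >           phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> >
> >       followed by trace-cmd report; the exact invocation may have
> >       differed.)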
> >
> >       My interpretation is that the towers are mprotect entries that
> >       subsequently block on an rwsem, and then wake up later to do an
> >       exit.  Other mprotects manage to sneak in along the way, because
> >       rwsems are not fair.  This is probably orthogonal to maple trees,
> >       though.
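> >
> >       For illustration, contention of this shape can be provoked with a
> >       small program (a rough sketch, not the word_count code; the thread
> >       count and slice size are arbitrary; build with -pthread).  Each
> >       mprotect() takes the mmap_lock rwsem for writing, so the threads
> >       serialize on it even though they touch disjoint slices:
> >
> >       #include <pthread.h>
> >       #include <stdio.h>
> >       #include <stdlib.h>
> >       #include <sys/mman.h>
> >
> >       #define SLICE (128 * 1024)   /* arbitrary slice size */
> >       #define CALLS 4096           /* mprotect calls per thread */
> >
> >       static char *map;
> >
> >       /* Toggle protections on this thread's own pair of slices. */
> >       static void *hammer(void *arg)
> >       {
> >           char *base = map + (long)arg * 2L * SLICE;
> >           for (int i = 0; i < CALLS; i++)
> >               if (mprotect(base + (i & 1) * SLICE, SLICE,
> >                            (i & 1) ? PROT_READ | PROT_WRITE : PROT_NONE))
> >                   perror("mprotect");
> >           return NULL;
> >       }
> >
> >       int main(int argc, char **argv)
> >       {
> >           long n = argc > 1 ? atol(argv[1]) : 4;   /* thread count */
> >           pthread_t tid[n];
> >
> >           map = mmap(NULL, 2L * SLICE * n, PROT_READ | PROT_WRITE,
> >                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >           if (map == MAP_FAILED) { perror("mmap"); return 1; }
> >
> >           for (long t = 0; t < n; t++)
> >               pthread_create(&tid[t], NULL, hammer, (void *)t);
> >           for (long t = 0; t < n; t++)
> >               pthread_join(tid[t], NULL);
> >           return 0;
> >       }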
> >
> >       julia
> >
> >
> >
> > --
> > -Jirka
> >
> >



-- 
-Jirka



