Performance loss with word count benchmark
Jirka Hladky
jhladky at redhat.com
Mon Oct 2 02:20:56 PDT 2023
Thanks for sharing, Julia!
> Something else seems to happen for 6.1, though.
Could you please share the user and system times as well? I'm using
the tool "hyperfine" to measure them, including statistics:
cargo install hyperfine || cargo install --version 8.4.0 hyperfine
See
https://crates.io/crates/hyperfine
hyperfine --warmup 1 'phoenix-2.0/tests/word_count/word_count
word_count_datafiles/word_100MB.txt'
Thanks!
Jirka
On Sat, Sep 30, 2023 at 8:03 AM Julia Lawall <julia.lawall at inria.fr> wrote:
>
>
>
> On Sat, 30 Sep 2023, Jirka Hladky wrote:
>
> > Hi Matt and Julia,
> > thanks a lot for adding me to this thread. The performance issues with the word_count test might be connected to a performance drop in the mprotect system call. Libmicro [1] shows drops of up to 1.5x for various mprotect tests.
> >
> > ==> 6.0.0-54.eln121.x86_64_mmap_mprotect.log <==
> > TSC for 32768 mprotect calls with len of 128kiB: 202866 K-cycles. Avg: 6.191 K-cycles/call
> >
> > ==> 6.5.0-57.eln130.x86_64_mmap_mprotect.log <==
> > TSC for 32768 mprotect calls with len of 128kiB: 302327 K-cycles. Avg: 9.22631 K-cycles/call
> >
> > ==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_mprotect.log <==
> > TSC for 32768 mprotect calls with len of 128kiB: 269886 K-cycles. Avg: 8.2363 K-cycles/call
> >
> > Matt, Liam: could you please comment on whether this is expected?
> >
> > I'm attaching a simple .c reproducer. It times the mprotect syscall over a large mmapped region, alternating the protection of 128 kiB areas between PROT_NONE and PROT_READ | PROT_WRITE. You can build and run it like this:
> >
> > gcc -Wall -Wextra -O1 mmap_mprotect.c -o mmap_mprotect
> > ./mmap_mprotect 128 32768
> >
> > It will allocate 128 kiB * 32768 = 4 GiB of memory.
> >
> > Here is the main part where we measure the performance:
> >
> > // Time the mprotect calls - alternate protection between
> > // PROT_NONE and PROT_READ | PROT_WRITE
> > u_int64_t start_rdtsc = start_clock();
> >
> > for (i = 0; i < iterations; i++) {
> >     if (i % 2 == 0) {
> >         prot = PROT_NONE;
> >     } else {
> >         prot = PROT_READ | PROT_WRITE;
> >     }
> >     ret = mprotect((void *)ts_map + i * mmap_len, mmap_len, prot);
> >     if (ret != 0) {
> >         perror("mprotect");
> >         printf(" mprotect error at iteration %d of %ld\n", i, iterations);
> >     }
> > }
> >
> > u_int64_t stop_rdtsc = stop_clock();
> >
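> > For context, here is a rough sketch of the setup around that loop; the attached mmap_mprotect.c is authoritative, and the argument handling, the mapping flags, and the start_clock()/stop_clock() rdtsc helpers may look slightly different there:
> >
> > #include <stdint.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/mman.h>
> > #include <x86intrin.h>
> >
> > /* Sketch only: thin TSC wrappers standing in for start_clock()/stop_clock(). */
> > static uint64_t start_clock(void) { return __rdtsc(); }
> > static uint64_t stop_clock(void)  { return __rdtsc(); }
> >
> > int main(int argc, char **argv)
> > {
> >     /* ./mmap_mprotect <len_kiB> <iterations>, e.g. ./mmap_mprotect 128 32768 */
> >     size_t mmap_len = (argc > 1 ? strtoul(argv[1], NULL, 10) : 128) * 1024;
> >     long iterations = (argc > 2 ? strtol(argv[2], NULL, 10) : 32768);
> >     int i, ret, prot;
> >
> >     /* One large anonymous mapping; the timed loop above then flips the
> >        protection of each mmap_len-sized slice of it. */
> >     char *ts_map = mmap(NULL, (size_t)iterations * mmap_len,
> >                         PROT_READ | PROT_WRITE,
> >                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >     if (ts_map == MAP_FAILED) {
> >         perror("mmap");
> >         return 1;
> >     }
> >
> >     /* ... timed mprotect loop as quoted above ... */
> >     return 0;
> > }
> >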
> >
> >
> > BTW, I have noticed that the word_count test shows the most significant performance degradation with the maple tree compared to the RB tree when run with many threads. Below are the results from an Intel Icelake server with a 36-core Platinum 8351N CPU:
>
> I also observed this. I tried the same experiment on a 2-socket Intel
> 6130 (same as the previous machine, but with 2 sockets instead of 4), and
> I don't have any performance degradation for 6.0:
>
> wordcount_dahu-28_5.19.0_performance.json: "mean": 1.1824737362849995,
> wordcount_dahu-28_6.0.0_performance.json: "mean": 1.142431961085,
> wordcount_dahu-28_6.1.0_performance.json: "mean": 1.346207455575,
> wordcount_dahu-28_6.2.0_performance.json: "mean": 1.3775329248449997,
>
> The calls to mprotect seem to be less aligned than for the larger machine.
>
> Something else seems to happen for 6.1, though.
>
> julia
>
>
>
>
> > 72 threads:
> > ==> 6.0.0-54.eln121.x86_64_allthreads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > Time (mean ± σ): 797.4 ms ± 47.0 ms [User: 4868.8 ms, System: 10793.5 ms]
> > Range (min … max): 707.0 ms … 849.8 ms 10 runs
> >
> >
> > ==> 6.5.0-57.eln130.x86_64_allthreads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > Time (mean ± σ): 1.673 s ± 0.265 s [User: 9.574 s, System: 54.777 s]
> > Range (min … max): 1.282 s … 2.240 s 10 runs
> >
> >
> > ==> 6.6.0-0.rc2.20.eln130.x86_64_allthreads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > Time (mean ± σ): 1.223 s ± 0.038 s [User: 13.107 s, System: 20.346 s]
> > Range (min … max): 1.153 s … 1.273 s 10 runs
> >
> > System time is 5x longer with the 6.5 kernel than with 6.0; 6.6 shows a significant improvement.
> > When I repeat the test on the same server but just with four threads, I'm getting much better results:
> >
> > ==> 6.0.0-54.eln121.x86_64_4threads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > Time (mean ± σ): 689.4 ms ± 3.3 ms [User: 2350.9 ms, System: 191.9 ms]
> > Range (min … max): 684.5 ms … 693.8 ms 10 runs
> >
> >
> > ==> 6.5.0-57.eln130.x86_64_4threads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > Time (mean ± σ): 699.3 ms ± 2.4 ms [User: 2371.8 ms, System: 205.8 ms]
> > Range (min … max): 695.4 ms … 702.6 ms 10 runs
> >
> >
> > ==> 6.6.0-0.rc2.20.eln130.x86_64_4threads.statistics.txt <==
> > Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
> > Time (mean ± σ): 696.5 ms ± 3.0 ms [User: 2370.1 ms, System: 194.3 ms]
> > Range (min … max): 691.2 ms … 700.2 ms 10 runs
> >
> > In this scenario, there is barely any difference between kernel 6.0, 6.5, and 6.6.
> >
> > It seems the test does not scale to so many threads: the actual runtime with four threads is SHORTER than with 72 threads.
> >
> > Thanks
> > Jirka
> >
> > [1] https://github.com/redhat-performance/libMicro/tree/master
> >
> >
> >
> > On Fri, Sep 29, 2023 at 10:09 PM Julia Lawall <julia.lawall at inria.fr> wrote:
> > Attached are graphs from a small extract of the run of wordcount on the
> > 4-socket Intel 6130 before and after the introduction of maple trees.
> >
> > The kernel called beforemaple is 6.0.0-rc3-00220-g000a449345bb
> > The kernel called aftermaple is 6.0.0-rc3-00286-gc154124fe925
> >
> > trace-cmd is given the argument -e syscalls, so all system calls are
> > traced. The only ones that happen in this region are mprotects.
> >
> > My interpretation is that the towers of mprotect entries are threads that
> > subsequently block on a rwsem and then wake up later to do the exit.
> > Other mprotects manage to sneak in along the way, because rwsems are
> > not fair. This is probably orthogonal to maple trees, though.
> >
> > julia
> >
> >
> >
> > --
> > -Jirka
> >
> >
--
-Jirka