Performance loss with word count benchmark
Jirka Hladky
jhladky at redhat.com
Fri Sep 29 21:02:44 PDT 2023
Resending in plain-text mode.
Hi Matt and Julia,
thanks a lot for adding me to this thread. The performance issues with
the word_count test might be connected to a performance drop in the
mprotect system call. Libmicro [1] shows slowdowns of up to roughly
1.5x for various mprotect tests:
==> 6.0.0-54.eln121.x86_64_mmap_mprotect.log <==
TSC for 32768 mprotect calls with len of 128kiB: 202866 K-cycles.
Avg: 6.191 K-cycles/call
==> 6.5.0-57.eln130.x86_64_mmap_mprotect.log <==
TSC for 32768 mprotect calls with len of 128kiB: 302327 K-cycles.
Avg: 9.22631 K-cycles/call
==> 6.6.0-0.rc2.20.eln130.x86_64_mmap_mprotect.log <==
TSC for 32768 mprotect calls with len of 128kiB: 269886 K-cycles.
Avg: 8.2363 K-cycles/call
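(The averages are simply total cycles divided by the call count, e.g.
202866 K-cycles / 32768 calls ≈ 6.19 K-cycles per call; relative to the
6.0 baseline, the 6.5 and 6.6 numbers correspond to roughly 1.49x and
1.33x per call.)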
Matt, Liam: could you please comment on whether this is expected?
I'm attaching a simple .c reproducer. It times the mprotect syscall
over a large mmapped region, alternating between PROT_NONE and
PROT_READ | PROT_WRITE on 128 kiB areas. You can run it like this:
gcc -Wall -Wextra -O1 mmap_mprotect.c -o mmap_mprotect
./mmap_mprotect 128 32768
It will allocate 128 kiB * 32768 = 4 GiB of memory.
Here is the main part where we measure the performance:
// Time the mprotect calls - alternate protection between
// PROT_NONE and PROT_READ | PROT_WRITE on successive mmap_len-sized chunks
//
u_int64_t start_rdtsc = start_clock();
for (i = 0; i < iterations; i++) {
        if (i % 2 == 0) {
                prot = PROT_NONE;
        } else {
                prot = PROT_READ | PROT_WRITE;
        }
        ret = mprotect((void *)ts_map + i * mmap_len, mmap_len, prot);
        if (ret != 0) {
                perror("mprotect");
                printf("mprotect error at iteration %d of %ld\n", i, iterations);
        }
}
u_int64_t stop_rdtsc = stop_clock();
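For context, here is a minimal sketch of the setup the excerpt assumes.
The attached mmap_mprotect.c is the authoritative version; the argument
handling and the start_clock()/stop_clock() bodies below (using
__rdtsc() for the TSC read) are my reconstruction, not the attachment:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <x86intrin.h>

/* Illustrative TSC helpers matching the names used in the excerpt above. */
static uint64_t start_clock(void) { return __rdtsc(); }
static uint64_t stop_clock(void)  { return __rdtsc(); }

int main(int argc, char **argv)
{
        /* ./mmap_mprotect <chunk kiB> <iterations>, e.g. 128 32768 */
        size_t mmap_len = (argc > 1 ? strtoul(argv[1], NULL, 0) : 128) * 1024;
        long iterations = (argc > 2 ? strtol(argv[2], NULL, 0) : 32768);
        char *ts_map;
        int i, prot, ret;

        /* One large anonymous mapping covering all chunks:
         * 128 kiB * 32768 = 4 GiB for the example invocation. */
        ts_map = mmap(NULL, mmap_len * iterations, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ts_map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* ... timed mprotect loop from the excerpt above goes here ... */

        munmap(ts_map, mmap_len * iterations);
        return 0;
}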
BTW, I have noticed that the word_count test shows the most
significant performance degradation with the maple tree compared to the
rbtree when run with many threads. Below are the results from an Intel
Ice Lake server with a 36-core Platinum 8351N CPU:
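(The statistics below are in hyperfine's output format; assuming
hyperfine was indeed the tool used, a command like the following would
produce them:

hyperfine --runs 10 'phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt'

How the thread count was selected is not shown by this command.)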
72 threads:
==> 6.0.0-54.eln121.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
Time (mean ± σ): 797.4 ms ± 47.0 ms [User: 4868.8 ms, System: 10793.5 ms]
Range (min … max): 707.0 ms … 849.8 ms 10 runs

==> 6.5.0-57.eln130.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
Time (mean ± σ): 1.673 s ± 0.265 s [User: 9.574 s, System: 54.777 s]
Range (min … max): 1.282 s … 2.240 s 10 runs

==> 6.6.0-0.rc2.20.eln130.x86_64_allthreads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
Time (mean ± σ): 1.223 s ± 0.038 s [User: 13.107 s, System: 20.346 s]
Range (min … max): 1.153 s … 1.273 s 10 runs
System time with the 6.5 kernel is roughly 5x that of 6.0 (54.8 s vs
10.8 s). In 6.6 there is a significant improvement (20.3 s), though it
is still well above 6.0.
When I repeat the test on the same server with just four threads, I get
much better results:
==> 6.0.0-54.eln121.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
Time (mean ± σ): 689.4 ms ± 3.3 ms [User: 2350.9 ms, System: 191.9 ms]
Range (min … max): 684.5 ms … 693.8 ms 10 runs

==> 6.5.0-57.eln130.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
Time (mean ± σ): 699.3 ms ± 2.4 ms [User: 2371.8 ms, System: 205.8 ms]
Range (min … max): 695.4 ms … 702.6 ms 10 runs

==> 6.6.0-0.rc2.20.eln130.x86_64_4threads.statistics.txt <==
Benchmark 1: phoenix-2.0/tests/word_count/word_count word_count_datafiles/word_100MB.txt
Time (mean ± σ): 696.5 ms ± 3.0 ms [User: 2370.1 ms, System: 194.3 ms]
Range (min … max): 691.2 ms … 700.2 ms 10 runs
In this scenario, there is barely any difference between kernels 6.0,
6.5, and 6.6.
It seems the test does not scale to this many threads: the actual
runtime with four threads is SHORTER than with 72 threads.
Thanks
Jirka
[1] https://github.com/redhat-performance/libMicro/tree/master
On Fri, Sep 29, 2023 at 10:09 PM Julia Lawall <julia.lawall at inria.fr> wrote:
>
> Attached are graphs from a small extract of the run of wordcount on the
> 4-socket Intel 6130 before and after the introduction of maple trees.
>
> The kernel called beforemaple is 6.0.0-rc3-00220-g000a449345bb
> The kernel called aftermaple is 6.0.0-rc3-00286-gc154124fe925
>
> trace-cmd is given the argument -e syscalls, so all system calls are
> traced. The only ones that happen in this region are mprotects.
>
> My interpretation is that there are towers of mprotect entries that
> subsequently block on an rwsem and then wake up later to do an exit.
> Other mprotects manage to sneak in along the way, because rwsems are
> not fair. This is probably orthogonal to maple trees, though.
>
> julia
--
-Jirka
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mmap_mprotect.c
Type: text/x-csrc
Size: 3036 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/maple-tree/attachments/20230930/22b326c7/attachment-0001.bin>