[PATCH] arm64: entry: Improve the performance of system calls

Joey Gouly joey.gouly at arm.com
Tue Sep 14 08:17:24 PDT 2021


Hi,

On Tue, Sep 14, 2021 at 10:55:16AM +0100, Mark Rutland wrote:
> Hi,
> 
> At a high-level, I'm not too keen on special-casing things unless
> necessary.
> 
> I wonder if we could get similar results without special-casing by using
> a static const array of handlers indexed by the EC, since (with GCC
> 11.1.0 from the kernel.org crosstool page) that can result in code like:
> 
> 0000000000001010 <el0t_64_sync_handler>:
>     1010:       d503245f        bti     c
>     1014:       d503233f        paciasp
>     1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>     101c:       910003fd        mov     x29, sp
>     1020:       d5385201        mrs     x1, esr_el1
>     1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
>     1028:       531a7c23        lsr     w3, w1, #26
>     102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
>     1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
>     1034:       d63f0040        blr     x2
>     1038:       a8c17bfd        ldp     x29, x30, [sp], #16
>     103c:       d50323bf        autiasp
>     1040:       d65f03c0        ret
> 
> ... which might do better by virtue of reducing a chain of potential
> mispredicts down to a single potential mispredict, and dynamic branch
> prediction hopefully does a good job of predicting the common case at
> runtime. That said, the resulting tables will be pretty big...

I tested Mark's branch which implements this (found at
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/switch-table)

I also took lmbench from https://github.com/intel/lmbench.git and built
`lat_syscall` with:

    gcc lat_syscall.c lib_*.c -l m -o lat_syscall -static

These are the results I got from benchmarking on my MacBook Air M1, with
the following command:

    ./lat_syscall null &> /dev/null ; uname -a ; for i in 0 1 2 3 4 ; do ./lat_syscall null ; done

The kernel was based on arm64_defconfig that was then stripped of as much as possible. 
GCC 11.1.0 from kernel.org crosstool page.
Clang build fom git b041b613e6fff713fc9ad6dbc73024286fb2fc93.

gcc:
        master: 0.14300
  switch-table: 0.14350
        likely: 0.13962

clang:
        master: 0.14354
  switch-table: 0.14642
        likely: 0.14256


The generated code looks similar to what Leizhen has posted, so I didn't
post it again.

So it seems the table approach actually performs worse in my testing,
and Leizhen's approach is slightly better than master (d0ee23f9d78be5531c4b055ea424ed0b489dfe9b).

Thanks,
Joey



More information about the linux-arm-kernel mailing list