[PATCH] arm64: entry: Improve the performance of system calls

Tue Sep 14 08:17:24 PDT 2021

Hi,

On Tue, Sep 14, 2021 at 10:55:16AM +0100, Mark Rutland wrote:
> Hi,
> 
> At a high-level, I'm not too keen on special-casing things unless
> necessary.
> 
> I wonder if we could get similar results without special-casing by using
> a static const array of handlers indexed by the EC, since (with GCC
> 11.1.0 from the kernel.org crosstool page) that can result in code like:
> 
> 0000000000001010 <el0t_64_sync_handler>:
>     1010:       d503245f        bti     c
>     1014:       d503233f        paciasp
>     1018:       a9bf7bfd        stp     x29, x30, [sp, #-16]!
>     101c:       910003fd        mov     x29, sp
>     1020:       d5385201        mrs     x1, esr_el1
>     1024:       90000002        adrp    x2, 0 <el0t_64_sync_handlers>
>     1028:       531a7c23        lsr     w3, w1, #26
>     102c:       91000042        add     x2, x2, #:lo12:<el0t_64_sync_handlers>
>     1030:       f8637842        ldr     x2, [x2, x3, lsl #3]
>     1034:       d63f0040        blr     x2
>     1038:       a8c17bfd        ldp     x29, x30, [sp], #16
>     103c:       d50323bf        autiasp
>     1040:       d65f03c0        ret
> 
> ... which might do better by virtue of reducing a chain of potential
> mispredicts down to a single potential mispredict, and dynamic branch
> prediction hopefully does a good job of predicting the common case at
> runtime. That said, the resulting tables will be pretty big...

I tested Mark's branch which implements this (found at
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/switch-table)

I also took lmbench from https://github.com/intel/lmbench.git and built
`lat_syscall` with:

    gcc lat_syscall.c lib_*.c -l m -o lat_syscall -static

These are the results I got from benchmarking on my MacBook Air M1, with
the following command:

    ./lat_syscall null &> /dev/null ; uname -a ; for i in 0 1 2 3 4 ; do ./lat_syscall null ; done

The kernel was based on arm64_defconfig that was then stripped of as much as possible. 
GCC 11.1.0 from kernel.org crosstool page.
Clang build fom git b041b613e6fff713fc9ad6dbc73024286fb2fc93.

gcc:
        master: 0.14300
  switch-table: 0.14350
        likely: 0.13962

clang:
        master: 0.14354
  switch-table: 0.14642
        likely: 0.14256

The generated code looks similar to what Leizhen has posted, so I didn't
post it again.

So it seems the table approach actually performs worse in my testing,
and Leizhen's approach is slightly better than master (d0ee23f9d78be5531c4b055ea424ed0b489dfe9b).

Thanks,
Joey