[PATCH] arm64: entry: Improve the performance of system calls
Joey Gouly
joey.gouly at arm.com
Tue Sep 14 08:17:24 PDT 2021
Hi,
On Tue, Sep 14, 2021 at 10:55:16AM +0100, Mark Rutland wrote:
> Hi,
>
> At a high-level, I'm not too keen on special-casing things unless
> necessary.
>
> I wonder if we could get similar results without special-casing by using
> a static const array of handlers indexed by the EC, since (with GCC
> 11.1.0 from the kernel.org crosstool page) that can result in code like:
>
> 0000000000001010 <el0t_64_sync_handler>:
> 1010: d503245f bti c
> 1014: d503233f paciasp
> 1018: a9bf7bfd stp x29, x30, [sp, #-16]!
> 101c: 910003fd mov x29, sp
> 1020: d5385201 mrs x1, esr_el1
> 1024: 90000002 adrp x2, 0 <el0t_64_sync_handlers>
> 1028: 531a7c23 lsr w3, w1, #26
> 102c: 91000042 add x2, x2, #:lo12:<el0t_64_sync_handlers>
> 1030: f8637842 ldr x2, [x2, x3, lsl #3]
> 1034: d63f0040 blr x2
> 1038: a8c17bfd ldp x29, x30, [sp], #16
> 103c: d50323bf autiasp
> 1040: d65f03c0 ret
>
> ... which might do better by virtue of reducing a chain of potential
> mispredicts down to a single potential mispredict, and dynamic branch
> prediction hopefully does a good job of predicting the common case at
> runtime. That said, the resulting tables will be pretty big...
I tested Mark's branch, which implements this (found at
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry/switch-table).
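For readers who want the shape of the table idea without opening the branch, here is a minimal userspace sketch, not the code from that branch. The handler names and their return values are hypothetical stand-ins. The EC values (0x15 for SVC from AArch64, 0x24 for a data abort from a lower EL) and the shift by 26 follow the ESR_ELx layout, matching the `lsr w3, w1, #26` in the quoted disassembly. The sketch leaves unused slots NULL and falls back to a default handler; a table with every slot populated would avoid that extra check, as the quoted code does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The exception class (EC) is ESR_ELx bits [31:26]: 6 bits, so 64 slots. */
#define ESR_EC_SHIFT    26
#define ESR_EC_COUNT    64
#define ESR_EC_SVC64    0x15 /* SVC executed in AArch64 state */
#define ESR_EC_DABT_LOW 0x24 /* data abort from a lower exception level */

typedef int (*sync_handler_t)(uint64_t esr);

/* Hypothetical handlers: each returns a tag so the dispatch is observable. */
static int handle_svc(uint64_t esr)     { (void)esr; return 1; }
static int handle_dabt(uint64_t esr)    { (void)esr; return 2; }
static int handle_unknown(uint64_t esr) { (void)esr; return -1; }

/* Static const table indexed by EC; slots not listed here are NULL. */
static const sync_handler_t sync_handlers[ESR_EC_COUNT] = {
    [ESR_EC_SVC64]    = handle_svc,
    [ESR_EC_DABT_LOW] = handle_dabt,
};

/* One table load and one indirect call replace a chain of compares. */
static int dispatch_sync(uint64_t esr)
{
    unsigned int ec = (uint32_t)esr >> ESR_EC_SHIFT;
    sync_handler_t fn;

    assert(ec < ESR_EC_COUNT);
    fn = sync_handlers[ec];
    return fn ? fn(esr) : handle_unknown(esr);
}
```

Calling `dispatch_sync((uint64_t)ESR_EC_SVC64 << ESR_EC_SHIFT)` reaches `handle_svc` through the table, so the only branch that depends on the EC is the indirect call.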
I also took lmbench from https://github.com/intel/lmbench.git and built
`lat_syscall` with:
gcc lat_syscall.c lib_*.c -l m -o lat_syscall -static
These are the results I got from benchmarking on my MacBook Air M1, with
the following command:
./lat_syscall null &> /dev/null ; uname -a ; for i in 0 1 2 3 4 ; do ./lat_syscall null ; done
The kernel was based on arm64_defconfig, stripped down as far as possible.
GCC was 11.1.0 from the kernel.org crosstool page.
Clang was built from git b041b613e6fff713fc9ad6dbc73024286fb2fc93.
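As a rough illustration of what a "null" system call latency test measures, the sketch below times a loop of trivial system calls and reports the mean cost per call. It is not lmbench's code: the choice of getppid() as the trivial call, the function name and the use of CLOCK_MONOTONIC are all assumptions made for this example, and it omits the warm-up and repetition handling a real benchmark needs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: mean wall-clock cost, in nanoseconds, of one
 * getppid() call over `iterations` back-to-back calls. Every call enters
 * and leaves the kernel, so the result is dominated by entry/exit cost.
 */
static double null_syscall_ns(long iterations)
{
    struct timespec start, end;
    double elapsed_ns;
    long i;

    assert(iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iterations; i++)
        (void)getppid();
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
                 (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

A call such as `null_syscall_ns(1000000)` gives a single estimate. Differences between kernels of well under a nanosecond per call are easily lost in noise, so a single run like this is not a substitute for repeated measurements.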
gcc:
    master:       0.14300
    switch-table: 0.14350
    likely:       0.13962

clang:
    master:       0.14354
    switch-table: 0.14642
    likely:       0.14256
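To put the figures above in relative terms, the following sketch computes each variant's change against master for the same compiler. The struct and function names are made up for this example. The numbers are copied from the results above, and nothing here accounts for run-to-run variance.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage change of `value` relative to `baseline`; negative is faster. */
static double percent_change(double baseline, double value)
{
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}

/* Hypothetical container for one compiler's three measurements. */
struct result_set {
    const char *compiler;
    double master;
    double switch_table;
    double likely;
};

static const struct result_set results[] = {
    { "gcc",   0.14300, 0.14350, 0.13962 },
    { "clang", 0.14354, 0.14642, 0.14256 },
};

/* Print the change of each variant against master, per compiler. */
static void print_changes(void)
{
    size_t i;

    for (i = 0; i < sizeof(results) / sizeof(results[0]); i++) {
        const struct result_set *r = &results[i];

        printf("%s: switch-table %+.2f%%, likely %+.2f%%\n", r->compiler,
               percent_change(r->master, r->switch_table),
               percent_change(r->master, r->likely));
    }
}
```

With these inputs the table variant comes out roughly 0.35% slower than master with gcc and 2.01% slower with clang, while the likely variant is roughly 2.36% faster with gcc and 0.68% faster with clang.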
The generated code looks similar to what Leizhen has posted, so I didn't
post it again.
So it seems the table approach actually performs worse in my testing,
and Leizhen's approach is slightly better than master (d0ee23f9d78be5531c4b055ea424ed0b489dfe9b).
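For comparison with the table, the special-casing idea can be sketched as below. This is an assumption about the general shape of such a change, not a copy of Leizhen's patch, which is not reproduced in this message: the system call class is tested first under a branch hint, and everything else falls through to the ordinary switch. The handler names and return values are hypothetical. `__builtin_expect` is the GCC/Clang builtin behind the kernel's `likely()` macro.

```c
#include <assert.h>
#include <stdint.h>

#define likely(x) __builtin_expect(!!(x), 1)

#define ESR_EC_SHIFT    26
#define ESR_EC_SVC64    0x15 /* SVC executed in AArch64 state */
#define ESR_EC_DABT_LOW 0x24 /* data abort from a lower exception level */

/* Hypothetical handlers: each returns a tag so the dispatch is observable. */
static int handle_svc(uint64_t esr)  { (void)esr; return 1; }
static int handle_dabt(uint64_t esr) { (void)esr; return 2; }

/* Slow path: the ordinary switch over the remaining exception classes. */
static int handle_slow_path(uint64_t esr)
{
    unsigned int ec = (uint32_t)esr >> ESR_EC_SHIFT;

    assert(ec != ESR_EC_SVC64); /* the fast path already took SVCs */
    switch (ec) {
    case ESR_EC_DABT_LOW:
        return handle_dabt(esr);
    default:
        return -1;
    }
}

/* Fast path: check for a system call before anything else. */
static int dispatch_sync_likely(uint64_t esr)
{
    unsigned int ec = (uint32_t)esr >> ESR_EC_SHIFT;

    if (likely(ec == ESR_EC_SVC64))
        return handle_svc(esr);
    return handle_slow_path(esr);
}
```

The hint only steers code layout so that the system call case is the fall-through path. Whether that beats a single indirect branch depends on the core's predictors, which is what the measurements above probe.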
Thanks,
Joey