Withdraw [PATCH] tracing: Enable kprobe tracing for Arm64 asm functions
Ben Niu
BenNiu at meta.com
Wed Dec 10 12:16:17 PST 2025
On Mon, Nov 17, 2025 at 10:34:22AM +0000, Mark Rutland wrote:
> On Thu, Oct 30, 2025 at 11:07:51AM -0700, Ben Niu wrote:
> > On Thu, Oct 30, 2025 at 12:35:25PM +0000, Mark Rutland wrote:
> > > Is there something specific you want to trace, but cannot currently
> > > trace (on arm64)?
> >
> > For some reason, we only saw the Arm64 Linux asm functions __arch_copy_to_user
> > and __arch_copy_from_user show up as hot in our workloads, not their x86
> > counterparts, so we are trying to understand and improve the performance of
> > those Arm64 asm functions.
>
> Are you sure that's not an artifact of those being out-of-line on arm64,
> but inline on x86? On x86, the out-of-line forms are only used when the
> CPU doesn't have FSRM, and when the CPU *does* have FSRM, the logic gets
> inlined. See raw_copy_from_user(), raw_copy_to_user(), and
> copy_user_generic() in arch/x86/include/asm/uaccess_64.h.
On x86, INLINE_COPY_TO_USER is not defined in the latest Linux kernel or in our
internal branch, so _copy_to_user is always an out-of-line extern function, which
in turn inlines copy_user_generic(). copy_user_generic() executes an FSRM
"rep movsb" if the CPU supports FSRM (which ours does); otherwise it calls
rep_movs_alternative(), which copies the memory with ordinary mov loops.
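To make the x86 side concrete, here is a small userspace sketch of mine (not the
kernel code) that checks the FSRM CPUID bit and does the same kind of "rep movsb"
copy that copy_user_generic() uses when FSRM is present; with FSRM the kernel's
copy loop is essentially that one instruction:

/* x86-64 userspace sketch: detect FSRM and copy with a single rep movsb. */
#include <stdio.h>
#include <cpuid.h>

/* CPUID.(EAX=7,ECX=0):EDX bit 4 advertises Fast Short REP MOVSB (FSRM). */
static int cpu_has_fsrm(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 0;
	return !!(edx & (1u << 4));
}

/* One rep movsb; with FSRM the CPU picks its own internal copy width. */
static void rep_movsb_copy(void *dst, const void *src, unsigned long len)
{
	asm volatile("rep movsb"
		     : "+D"(dst), "+S"(src), "+c"(len)
		     : : "memory");
}

int main(void)
{
	char src[64] = "hello, fsrm", dst[64] = { 0 };

	printf("FSRM supported: %s\n", cpu_has_fsrm() ? "yes" : "no");
	rep_movsb_copy(dst, src, sizeof(src));
	printf("copied: %s\n", dst);
	return 0;
}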
With FSRM the copy is vectorized internally by the CPU, and Arm's FEAT_MOPS seems
to provide something similar, but Neoverse V2 does not implement FEAT_MOPS, so the
kernel falls back to sttr-based copies.
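As an aside, whether the kernel advertises FEAT_MOPS can be checked from userspace
through the arm64 hwcaps. A minimal sketch (the fallback HWCAP2_MOPS value is the
one from the kernel's uapi hwcap header, only needed when the libc headers predate
it):

/* arm64: report whether the kernel advertises FEAT_MOPS (memory copy/set insns). */
#include <stdio.h>
#include <elf.h>
#include <sys/auxv.h>

#ifndef HWCAP2_MOPS
#define HWCAP2_MOPS	(1UL << 43)	/* from arch/arm64/include/uapi/asm/hwcap.h */
#endif

int main(void)
{
	unsigned long hwcap2 = getauxval(AT_HWCAP2);

	printf("FEAT_MOPS: %s\n", (hwcap2 & HWCAP2_MOPS) ? "yes" : "no");
	return 0;
}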
> Have you checked that inlining is not skewing your results, and
> artificially making those look hotter on am64 by virtue of centralizing
> samples to the same IP/PC range?
As mentioned above, _copy_to_user is not inlined on x86.
> Can you share any information on those workloads? e.g. which callchains
> were hot?
Please reach out to James Greenhalgh and Chris Goodyer at Arm for more details
about those workloads, which I can't share in a public channel.
Using bpftrace, we found that most __arch_copy_to_user calls are for sizes between
2K and 4K. I wrote a simple microbenchmark to compare the performance of
copy_to_user on x64 and arm64; see
https://github.com/mcfi/benchmark/tree/main/ku_copy
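The real benchmark lives in that repo; as a rough illustration of the approach only,
timing repeated pread() calls on a page-cache-warm file approximates kernel-to-user
copy throughput, since most of the time goes to __arch_copy_to_user:

/*
 * Rough illustration only -- the real benchmark is in the ku_copy repo above.
 * Repeated pread() of a page-cache-warm file spends most of its time in the
 * kernel->user copy, so timing it approximates __arch_copy_to_user throughput
 * at a given size.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
	size_t size = argc > 2 ? strtoul(argv[2], NULL, 0) : 4096;
	long iters = (1L << 30) / (long)size;	/* copy roughly 1 GiB in total */
	char *buf = malloc(size);
	double t0, t1;
	long i;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file of at least 'size' bytes> [size]\n", argv[0]);
		return 1;
	}
	if (iters < 1)
		iters = 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || !buf)
		return 1;
	pread(fd, buf, size, 0);		/* warm the page cache */
	t0 = now_sec();
	for (i = 0; i < iters; i++)
		pread(fd, buf, size, 0);	/* each call is one copy_to_user of 'size' bytes */
	t1 = now_sec();
	printf("%zu bytes: %.1f MB/s\n", size, (double)size * iters / (t1 - t0) / 1e6);
	close(fd);
	free(buf);
	return 0;
}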
Below are the data collected. You can see that Arm64 kernel/user copy throughput
dips relative to x64 when the size is in the 2K-16K range.
Routine         Size (bytes)   Arm64 bytes/s / x64 bytes/s
copy_to_user              8    203.4%
copy_from_user            8    248.5%
copy_to_user             16    194.6%
copy_from_user           16    240.6%
copy_to_user             32    195.0%
copy_from_user           32    236.5%
copy_to_user             64    193.2%
copy_from_user           64    240.4%
copy_to_user            128    211.0%
copy_from_user          128    273.0%
copy_to_user            256    185.3%
copy_from_user          256    229.3%
copy_to_user            512    152.9%
copy_from_user          512    182.4%
copy_to_user           1024    121.0%
copy_from_user         1024    141.9%
copy_to_user           2048     95.1%
copy_from_user         2048    108.7%
copy_to_user           4096     78.9%
copy_from_user         4096     85.7%
copy_to_user           8192     69.7%
copy_from_user         8192     73.5%
copy_to_user          16384     65.9%
copy_from_user        16384     68.8%
copy_to_user          32768    117.3%
copy_from_user        32768    118.6%
copy_to_user          65536    115.5%
copy_from_user        65536    117.4%
copy_to_user         131072    114.6%
copy_from_user       131072    113.3%
copy_to_user         262144    114.4%
copy_from_user       262144    109.3%
I sent out a patch for faster __arch_copy_from_user/__arch_copy_to_user; please
see message ID 20251018052237.1368504-2-benniu at meta.com for more details.
> Mark.
In addition, I'm withdrawing this patch because we found a workaround to trace
__arch_copy_to_user/__arch_copy_from_user via watchpoints. Message ID
aTjpzfqTEGPcird5 at meta.com has all the details.
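I won't repeat the whole setup here (and this sketch is not necessarily exactly
what that message describes), but the general idea is to place a hardware execute
breakpoint on the function's entry address, taken from /proc/kallsyms, via
perf_event_open(), so hits on code that kprobes cannot currently instrument can
still be counted:

/*
 * Rough sketch only -- see the message referenced above for the real setup.
 * Counts executions of a kernel text address (e.g. __arch_copy_to_user from
 * /proc/kallsyms) by installing a hardware execute breakpoint through perf.
 * Needs root; counts CPU 0 only, so one event per CPU is needed for coverage.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
	struct perf_event_attr attr;
	unsigned long long addr;
	long long count;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <hex kernel address>\n", argv[0]);
		return 1;
	}
	addr = strtoull(argv[1], NULL, 16);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_type = HW_BREAKPOINT_X;		/* break on instruction execution */
	attr.bp_addr = addr;
	attr.bp_len = HW_BREAKPOINT_LEN_4;	/* one AArch64 instruction */

	fd = perf_event_open(&attr, -1 /* all tasks */, 0 /* CPU 0 */, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	sleep(10);				/* let the workload run */
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("hits on %#llx: %lld\n", addr, count);
	close(fd);
	return 0;
}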
Thank you for reviewing this patch and for all your helpful comments.
Ben