Withdraw [PATCH] tracing: Enable kprobe tracing for Arm64 asm functions

Ben Niu BenNiu at meta.com
Wed Dec 10 12:16:17 PST 2025


On Mon, Nov 17, 2025 at 10:34:22AM +0000, Mark Rutland wrote:
> On Thu, Oct 30, 2025 at 11:07:51AM -0700, Ben Niu wrote:
> > On Thu, Oct 30, 2025 at 12:35:25PM +0000, Mark Rutland wrote:
> > > Is there something specific you want to trace, but cannot currently
> > > trace (on arm64)?
> > 
> > For some reason, we only saw Arm64 Linux asm functions __arch_copy_to_user and
> > __arch_copy_from_user being hot in our workloads, not those counterpart asm
> > functions on x86, so we are trying to understand and improve performance of
> > those Arm64 asm functions.
> 
> Are you sure that's not an artifact of those being out-of-line on arm64,
> but inline on x86? On x86, the out-of-line forms are only used when the
> CPU doesn't have FSRM, and when the CPU *does* have FSRM, the logic gets
> inlined. See raw_copy_from_user(), raw_copy_to_user(), and
> copy_user_generic() in arch/x86/include/asm/uaccess_64.h.

On x86, INLINE_COPY_TO_USER is not defined in the latest Linux kernel or in our
internal branch, so _copy_to_user is always an out-of-line extern function,
which in turn inlines copy_user_generic. copy_user_generic executes a rep movs
when the CPU supports FSRM (our case); otherwise it calls rep_movs_alternative,
which copies memory with plain movs.
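
For reference, the dispatch I'm describing looks roughly like this (a
simplified sketch, not verbatim kernel source; the real code in
arch/x86/include/asm/uaccess_64.h selects the path via the alternatives
mechanism and also toggles SMAP with stac()/clac(), and cpu_has_fsrm below is
just a stand-in for that feature check):

/*
 * Simplified sketch of the copy_user_generic() dispatch described above.
 * Not verbatim kernel source; cpu_has_fsrm stands in for the kernel's
 * alternatives-based FSRM check.
 */
#include <stdbool.h>

extern bool cpu_has_fsrm;			/* stand-in, not a real kernel symbol */
extern unsigned long rep_movs_alternative(void *to, const void *from,
					  unsigned long len);

static unsigned long copy_user_sketch(void *to, const void *from,
				      unsigned long len)
{
	if (cpu_has_fsrm) {
		/* FSRM: a single rep movsb is fast even for short copies */
		asm volatile("rep movsb"
			     : "+D" (to), "+S" (from), "+c" (len)
			     : : "memory");
		return len;	/* bytes left uncopied; 0 on success */
	}
	/* No FSRM: fall back to the plain-mov copy loop */
	return rep_movs_alternative(to, from, len);
}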

FSRM is vectorized internally by the CPU, and Arm FEAT_MOPS appears to provide
something similar, but Neoverse V2 does not support FEAT_MOPS, so the kernel
falls back to sttr instructions to copy memory.
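
(For completeness, whether a given machine exposes FEAT_MOPS can be checked
from userspace via the hwcaps; a minimal check, assuming a kernel recent
enough to report HWCAP2_MOPS:)

/*
 * Quick userspace check for FEAT_MOPS (the Arm memcpy/memset instructions).
 * On Neoverse V2 this prints "no", which is why the arm64 copy routines
 * fall back to sttr.
 */
#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP2_MOPS
#define HWCAP2_MOPS (1UL << 43)	/* value from the arm64 uapi <asm/hwcap.h> */
#endif

int main(void)
{
	unsigned long hwcap2 = getauxval(AT_HWCAP2);

	printf("FEAT_MOPS: %s\n", (hwcap2 & HWCAP2_MOPS) ? "yes" : "no");
	return 0;
}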

> Have you checked that inlining is not skewing your results, and
> artificially making those look hotter on arm64 by virtue of centralizing
> samples to the same IP/PC range?

As mentioned above, _copy_to_user is not inlined on x86.

> Can you share any information on those workloads? e.g. which callchains
> were hot?

Please reach out to James Greenhalgh and Chris Goodyer at Arm for more details
about those workloads, which I can't share in a public channel.

Using bpftrace, we found that most __arch_copy_to_user calls are for sizes of
2 KB-4 KB. I wrote a simple microbenchmark to compare the performance of
copy_to_user/copy_from_user on x64 and arm64; see
https://github.com/mcfi/benchmark/tree/main/ku_copy
The data collected is below (a minimal sketch of the kind of measurement loop
follows the table). You can see that arm64 kernel/user copy throughput dips
once the size reaches 2 KB-16 KB.

Routine		Size (bytes)	Arm64 bytes/s / x64 bytes/s
copy_to_user	8	203.4%
copy_from_user	8	248.5%
copy_to_user	16	194.6%
copy_from_user	16	240.6%
copy_to_user	32	195.0%
copy_from_user	32	236.5%
copy_to_user	64	193.2%
copy_from_user	64	240.4%
copy_to_user	128	211.0%
copy_from_user	128	273.0%
copy_to_user	256	185.3%
copy_from_user	256	229.3%
copy_to_user	512	152.9%
copy_from_user	512	182.4%
copy_to_user	1024	121.0%
copy_from_user	1024	141.9%
copy_to_user	2048	95.1%
copy_from_user	2048	108.7%
copy_to_user	4096	78.9%
copy_from_user	4096	85.7%
copy_to_user	8192	69.7%
copy_from_user	8192	73.5%
copy_to_user	16384	65.9%
copy_from_user	16384	68.8%
copy_to_user	32768	117.3%
copy_from_user	32768	118.6%
copy_to_user	65536	115.5%
copy_from_user	65536	117.4%
copy_to_user	131072	114.6%
copy_from_user	131072	113.3%
copy_to_user	262144	114.4%
copy_from_user	262144	109.3%
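
The actual benchmark is in the repository above; the basic idea of the
measurement loop is repeated pread() of a page-cache-resident file, which is
dominated by __arch_copy_to_user (or the x86 equivalent). A minimal sketch of
that idea, not the code from the repo:

/*
 * Minimal sketch of one way to exercise copy_to_user from userspace:
 * pread() of a page-cache-resident file copies the data to the user
 * buffer through __arch_copy_to_user on arm64. Only an illustration;
 * the actual benchmark is in the linked repository and may differ.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t size = argc > 1 ? strtoul(argv[1], NULL, 0) : 4096;
	long iters = 200000;
	char *buf = malloc(size);
	int fd = open("/tmp/ku_copy_probe", O_RDWR | O_CREAT | O_TRUNC, 0600);
	struct timespec t0, t1;

	if (!buf || fd < 0)
		return 1;

	/* Populate the file once so later preads hit the page cache. */
	memset(buf, 0x5a, size);
	if (pwrite(fd, buf, size, 0) != (ssize_t)size)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		if (pread(fd, buf, size, 0) != (ssize_t)size)
			return 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("copy_to_user, %zu bytes: %.1f MB/s\n",
	       size, size * (double)iters / secs / 1e6);
	return 0;
}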

I sent out a patch for faster __arch_copy_from_user/__arch_copy_to_user; please
see message ID 20251018052237.1368504-2-benniu at meta.com for the details.

> Mark.

In addition, I'm withdrawing this patch because we found a workaround for
tracing __arch_copy_to_user/__arch_copy_from_user through watchpoints; message
ID aTjpzfqTEGPcird5 at meta.com has all the details. Thank you for reviewing
this patch and for all the helpful comments.
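
For anyone who hits the same limitation, the general idea is to use the
hardware breakpoint/watchpoint facility via perf_event_open() instead of a
kprobe. A minimal sketch of that idea follows; the message referenced above
has the details of the actual workaround, and the placeholder address would
come from /proc/kallsyms (which needs appropriate privileges/kptr_restrict):

/*
 * Minimal sketch: count executions of a kernel asm function with a
 * hardware execute breakpoint instead of a kprobe. Only an illustration
 * of the idea, not the workaround described in the message above.
 */
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Entry address of __arch_copy_to_user, parsed from /proc/kallsyms. */
	uint64_t addr = 0xffff800080012345ULL;	/* placeholder, not a real address */
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_type = HW_BREAKPOINT_X;		/* break on instruction fetch */
	attr.bp_addr = addr;
	attr.bp_len = sizeof(long);		/* required length for execute breakpoints */

	/* System-wide on CPU 0; repeat per CPU for full coverage. */
	int fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	sleep(1);

	long long count = 0;
	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 1;
	printf("__arch_copy_to_user hit %lld times on CPU 0 in 1s\n", count);
	close(fd);
	return 0;
}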

Ben


