Withdraw [PATCH] tracing: Enable kprobe tracing for Arm64 asm functions

Ben Niu BenNiu at meta.com
Wed Dec 10 12:16:17 PST 2025


On Mon, Nov 17, 2025 at 10:34:22AM +0000, Mark Rutland wrote:
> On Thu, Oct 30, 2025 at 11:07:51AM -0700, Ben Niu wrote:
> > On Thu, Oct 30, 2025 at 12:35:25PM +0000, Mark Rutland wrote:
> > > Is there something specific you want to trace, but cannot currently
> > > trace (on arm64)?
> > 
> > For some reason, we only saw Arm64 Linux asm functions __arch_copy_to_user and
> > __arch_copy_from_user being hot in our workloads, not those counterpart asm
> > functions on x86, so we are trying to understand and improve performance of
> > those Arm64 asm functions.
> 
> Are you sure that's not an artifact of those being out-of-line on arm64,
> but inline on x86? On x86, the out-of-line forms are only used when the
> CPU doesn't have FSRM, and when the CPU *does* have FSRM, the logic gets
> inlined. See raw_copy_from_user(), raw_copy_to_user(), and
> copy_user_generic() in arch/x86/include/asm/uaccess_64.h.

On x86, INLINE_COPY_TO_USER is not defined in the latest Linux kernel or in our
internal branch, so _copy_to_user is always an out-of-line extern function,
which in turn inlines copy_user_generic. copy_user_generic executes a rep movs
when the CPU supports FSRM (our case); otherwise it calls rep_movs_alternative,
which copies memory with plain movs.
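
For reference, the dispatch I'm describing looks roughly like this (a
simplified sketch, not verbatim kernel source; the real code in
arch/x86/include/asm/uaccess_64.h selects the path via the alternatives
mechanism and also toggles SMAP with stac()/clac(), and cpu_has_fsrm below is
just a stand-in for that feature check):

/*
 * Simplified sketch of the copy_user_generic() dispatch described above.
 * Not verbatim kernel source; cpu_has_fsrm stands in for the kernel's
 * alternatives-based FSRM check.
 */
#include <stdbool.h>

extern bool cpu_has_fsrm;			/* stand-in, not a real kernel symbol */
extern unsigned long rep_movs_alternative(void *to, const void *from,
					  unsigned long len);

static unsigned long copy_user_sketch(void *to, const void *from,
				      unsigned long len)
{
	if (cpu_has_fsrm) {
		/* FSRM: a single rep movsb is fast even for short copies */
		asm volatile("rep movsb"
			     : "+D" (to), "+S" (from), "+c" (len)
			     : : "memory");
		return len;	/* bytes left uncopied; 0 on success */
	}
	/* No FSRM: fall back to the plain-mov copy loop */
	return rep_movs_alternative(to, from, len);
}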

FSRM is vectorized internally by the CPU, and Arm FEAT_MOPS appears to provide
something similar, but Neoverse V2 does not support FEAT_MOPS, so the kernel
falls back to sttr instructions to copy memory.
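
(For completeness, whether a given machine exposes FEAT_MOPS can be checked
from userspace via the hwcaps; a minimal check, assuming a kernel recent
enough to report HWCAP2_MOPS:)

/*
 * Quick userspace check for FEAT_MOPS (the Arm memcpy/memset instructions).
 * On Neoverse V2 this prints "no", which is why the arm64 copy routines
 * fall back to sttr.
 */
#include <stdio.h>
#include <sys/auxv.h>

#ifndef HWCAP2_MOPS
#define HWCAP2_MOPS (1UL << 43)	/* value from the arm64 uapi <asm/hwcap.h> */
#endif

int main(void)
{
	unsigned long hwcap2 = getauxval(AT_HWCAP2);

	printf("FEAT_MOPS: %s\n", (hwcap2 & HWCAP2_MOPS) ? "yes" : "no");
	return 0;
}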

> Have you checked that inlining is not skewing your results, and
> artificially making those look hotter on arm64 by virtue of centralizing
> samples to the same IP/PC range?

As mentioned above, _copy_to_user is not inlined on x86.

> Can you share any information on those workloads? e.g. which callchains
> were hot?

Please reach out to James Greenhalgh and Chris Goodyer at Arm for more details
about those workloads, which I can't share in a public channel.

Using bpftrace, we found that most __arch_copy_to_user calls are for sizes of
2 KB-4 KB. I wrote a simple microbenchmark to compare the performance of
copy_to_user/copy_from_user on x64 and arm64; see
https://github.com/mcfi/benchmark/tree/main/ku_copy
The data collected is below (a minimal sketch of the kind of measurement loop
follows the table). You can see that arm64 kernel/user copy throughput dips
once the size reaches 2 KB-16 KB.

Routine		Size (bytes)	Arm64 bytes/s / x64 bytes/s
copy_to_user	8	203.4%
copy_from_user	8	248.5%
copy_to_user	16	194.6%
copy_from_user	16	240.6%
copy_to_user	32	195.0%
copy_from_user	32	236.5%
copy_to_user	64	193.2%
copy_from_user	64	240.4%
copy_to_user	128	211.0%
copy_from_user	128	273.0%
copy_to_user	256	185.3%
copy_from_user	256	229.3%
copy_to_user	512	152.9%
copy_from_user	512	182.4%
copy_to_user	1024	121.0%
copy_from_user	1024	141.9%
copy_to_user	2048	95.1%
copy_from_user	2048	108.7%
copy_to_user	4096	78.9%
copy_from_user	4096	85.7%
copy_to_user	8192	69.7%
copy_from_user	8192	73.5%
copy_to_user	16384	65.9%
copy_from_user	16384	68.8%
copy_to_user	32768	117.3%
copy_from_user	32768	118.6%
copy_to_user	65536	115.5%
copy_from_user	65536	117.4%
copy_to_user	131072	114.6%
copy_from_user	131072	113.3%
copy_to_user	262144	114.4%
copy_from_user	262144	109.3%
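
The actual benchmark is in the repository above; the basic idea of the
measurement loop is repeated pread() of a page-cache-resident file, which is
dominated by __arch_copy_to_user (or the x86 equivalent). A minimal sketch of
that idea, not the code from the repo:

/*
 * Minimal sketch of one way to exercise copy_to_user from userspace:
 * pread() of a page-cache-resident file copies the data to the user
 * buffer through __arch_copy_to_user on arm64. Only an illustration;
 * the actual benchmark is in the linked repository and may differ.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t size = argc > 1 ? strtoul(argv[1], NULL, 0) : 4096;
	long iters = 200000;
	char *buf = malloc(size);
	int fd = open("/tmp/ku_copy_probe", O_RDWR | O_CREAT | O_TRUNC, 0600);
	struct timespec t0, t1;

	if (!buf || fd < 0)
		return 1;

	/* Populate the file once so later preads hit the page cache. */
	memset(buf, 0x5a, size);
	if (pwrite(fd, buf, size, 0) != (ssize_t)size)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < iters; i++)
		if (pread(fd, buf, size, 0) != (ssize_t)size)
			return 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("copy_to_user, %zu bytes: %.1f MB/s\n",
	       size, size * (double)iters / secs / 1e6);
	return 0;
}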

I sent out a patch for faster __arch_copy_from_user/__arch_copy_to_user; please
see message ID 20251018052237.1368504-2-benniu at meta.com for the details.

> Mark.

In addition, I'm withdrawing this patch because we found a workaround for
tracing __arch_copy_to_user/__arch_copy_from_user through watchpoints; message
ID aTjpzfqTEGPcird5 at meta.com has all the details. Thank you for reviewing
this patch and for all the helpful comments.
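
For anyone who hits the same limitation, the general idea is to use the
hardware breakpoint/watchpoint facility via perf_event_open() instead of a
kprobe. A minimal sketch of that idea follows; the message referenced above
has the details of the actual workaround, and the placeholder address would
come from /proc/kallsyms (which needs appropriate privileges/kptr_restrict):

/*
 * Minimal sketch: count executions of a kernel asm function with a
 * hardware execute breakpoint instead of a kprobe. Only an illustration
 * of the idea, not the workaround described in the message above.
 */
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Entry address of __arch_copy_to_user, parsed from /proc/kallsyms. */
	uint64_t addr = 0xffff800080012345ULL;	/* placeholder, not a real address */
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_type = HW_BREAKPOINT_X;		/* break on instruction fetch */
	attr.bp_addr = addr;
	attr.bp_len = sizeof(long);		/* required length for execute breakpoints */

	/* System-wide on CPU 0; repeat per CPU for full coverage. */
	int fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	sleep(1);

	long long count = 0;
	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 1;
	printf("__arch_copy_to_user hit %lld times on CPU 0 in 1s\n", count);
	close(fd);
	return 0;
}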

Ben


