[PATCH] arm64: kprobe: Enable OPTPROBE for arm64

Song Bao Hua (Barry Song) song.bao.hua at hisilicon.com
Fri Jul 30 03:04:06 PDT 2021



> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Friday, July 23, 2021 2:43 PM
> To: 'Masami Hiramatsu' <mhiramat at kernel.org>
> Cc: liuqi (BA) <liuqi115 at huawei.com>; catalin.marinas at arm.com;
> will at kernel.org; naveen.n.rao at linux.ibm.com; anil.s.keshavamurthy at intel.com;
> davem at davemloft.net; linux-arm-kernel at lists.infradead.org; Zengtao (B)
> <prime.zeng at hisilicon.com>; robin.murphy at arm.com; Linuxarm
> <linuxarm at huawei.com>; linux-kernel at vger.kernel.org
> Subject: RE: [PATCH] arm64: kprobe: Enable OPTPROBE for arm64
> 
> 
> 
> > -----Original Message-----
> > From: Masami Hiramatsu [mailto:mhiramat at kernel.org]
> > Sent: Friday, July 23, 2021 3:03 AM
> > To: Song Bao Hua (Barry Song) <song.bao.hua at hisilicon.com>
> > Cc: liuqi (BA) <liuqi115 at huawei.com>; catalin.marinas at arm.com;
> > will at kernel.org; naveen.n.rao at linux.ibm.com;
> anil.s.keshavamurthy at intel.com;
> > davem at davemloft.net; linux-arm-kernel at lists.infradead.org; Zengtao (B)
> > <prime.zeng at hisilicon.com>; robin.murphy at arm.com; Linuxarm
> > <linuxarm at huawei.com>; linux-kernel at vger.kernel.org
> > Subject: Re: [PATCH] arm64: kprobe: Enable OPTPROBE for arm64
> >
> > Hi Song,
> >
> > On Thu, 22 Jul 2021 10:24:54 +0000
> > "Song Bao Hua (Barry Song)" <song.bao.hua at hisilicon.com> wrote:
> >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Masami Hiramatsu [mailto:mhiramat at kernel.org]
> > > > Sent: Wednesday, July 21, 2021 8:42 PM
> > > > To: liuqi (BA) <liuqi115 at huawei.com>
> > > > Cc: catalin.marinas at arm.com; will at kernel.org;
> naveen.n.rao at linux.ibm.com;
> > > > anil.s.keshavamurthy at intel.com; davem at davemloft.net;
> > > > linux-arm-kernel at lists.infradead.org; Song Bao Hua (Barry Song)
> > > > <song.bao.hua at hisilicon.com>; Zengtao (B) <prime.zeng at hisilicon.com>;
> > > > robin.murphy at arm.com; Linuxarm <linuxarm at huawei.com>;
> > > > linux-kernel at vger.kernel.org
> > > > Subject: Re: [PATCH] arm64: kprobe: Enable OPTPROBE for arm64
> > > >
> > > > Hi Qi,
> > > >
> > > > Thanks for your effort!
> > > >
> > > > On Mon, 19 Jul 2021 20:24:17 +0800
> > > > Qi Liu <liuqi115 at huawei.com> wrote:
> > > >
> > > > > This patch introduce optprobe for ARM64. In optprobe, probed
> > > > > instruction is replaced by a branch instruction to detour
> > > > > buffer. Detour buffer contains trampoline code and a call to
> > > > > optimized_callback(). optimized_callback() calls opt_pre_handler()
> > > > > to execute kprobe handler.
> > > >
> > > > OK so this will replace only one instruction.
> > > >
> > > > >
> > > > > Limitations:
> > > > > - We only support !CONFIG_RANDOMIZE_MODULE_REGION_FULL case to
> > > > > guarantee the offset between probe point and kprobe pre_handler
> > > > > is not larger than 128MiB.
> > > >
> > > > Hmm, shouldn't we depends on !CONFIG_ARM64_MODULE_PLTS? Or,
> > > > allocate an intermediate trampoline area similar to arm optprobe
> > > > does.
> > >
> > > Depending on !CONFIG_ARM64_MODULE_PLTS will totally disable
> > > RANDOMIZE_BASE according to arch/arm64/Kconfig:
> > > config RANDOMIZE_BASE
> > > 	bool "Randomize the address of the kernel image"
> > > 	select ARM64_MODULE_PLTS if MODULES
> > > 	select RELOCATABLE
> >
> > Yes, but why it is required for "RANDOMIZE_BASE"?
> > Does that imply the module call might need to use PLT in
> > some cases?
> >
> > >
> > > Depending on !RANDOMIZE_MODULE_REGION_FULL seems to be still
> > > allowing RANDOMIZE_BASE via avoiding long jump according to:
> > > arch/arm64/Kconfig:
> > >
> > > config RANDOMIZE_MODULE_REGION_FULL
> > > 	bool "Randomize the module region over a 4 GB range"
> > > 	depends on RANDOMIZE_BASE
> > > 	default y
> > > 	help
> > > 	  Randomizes the location of the module region inside a 4 GB window
> > > 	  covering the core kernel. This way, it is less likely for modules
> > > 	  to leak information about the location of core kernel data structures
> > > 	  but it does imply that function calls between modules and the core
> > > 	  kernel will need to be resolved via veneers in the module PLT.
> > >
> > > 	  When this option is not set, the module region will be randomized over
> > > 	  a limited range that contains the [_stext, _etext] interval of the
> > > 	  core kernel, so branch relocations are always in range.
> >
> > Hmm, this dependency looks strange. If it always in range, don't we need
> > PLT for modules?
> >
> > Cataline, would you know why?
> > Maybe it's a KASLR's Kconfig issue?
> 
> I actually didn't see any problem after making this change:
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e07e7de9ac49..6440671b72e0 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1781,7 +1781,6 @@ config RELOCATABLE
> 
>  config RANDOMIZE_BASE
>         bool "Randomize the address of the kernel image"
> -       select ARM64_MODULE_PLTS if MODULES
>         select RELOCATABLE
>         help
>           Randomizes the virtual address at which the kernel image is
> @@ -1801,6 +1800,7 @@ config RANDOMIZE_BASE
>  config RANDOMIZE_MODULE_REGION_FULL
>         bool "Randomize the module region over a 4 GB range"
>         depends on RANDOMIZE_BASE
> +       select ARM64_MODULE_PLTS if MODULES
>         default y
>         help
>           Randomizes the location of the module region inside a 4 GB window
> 
> and having this config:
> # zcat /proc/config.gz | grep RANDOMIZE_BASE
> CONFIG_RANDOMIZE_BASE=y
> 
> # zcat /proc/config.gz | grep RANDOMIZE_MODULE_REGION_FULL
> # CONFIG_RANDOMIZE_MODULE_REGION_FULL is not set
> 
> # zcat /proc/config.gz | grep ARM64_MODULE_PLTS
> # CONFIG_ARM64_MODULE_PLTS is not set
> 
> Modules work all good:
> # lsmod
> Module                  Size  Used by
> btrfs                1355776  0
> blake2b_generic        20480  0
> libcrc32c              16384  1 btrfs
> xor                    20480  1 btrfs
> xor_neon               16384  1 xor
> zstd_compress         163840  1 btrfs
> raid6_pq              110592  1 btrfs
> ctr                    16384  0
> md5                    16384  0
> ip_tunnel              32768  0
> ipv6                  442368  28
> 
> 
> I am not quite sure if there is a corner case. If no,
> I would think the kconfig might be some improper.

The corner case is that even CONFIG_RANDOMIZE_MODULE_REGION_FULL
is not enabled, but if CONFIG_ARM64_MODULE_PLTS is enabled, when
we can't get memory from the 128MB area in case the area is exhausted,
we will fall back in module_alloc() to a 2GB area as long as either
of the below two conditions is met:

1. KASAN is not enabled
2. KASAN is enabled and CONFIG_KASAN_VMALLOC is also enabled.

void *module_alloc(unsigned long size)
{
	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
	gfp_t gfp_mask = GFP_KERNEL;
	void *p;

	/* Silence the initial allocation */
	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
		gfp_mask |= __GFP_NOWARN;

	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
		/* don't exceed the static module region - see below */
		module_alloc_end = MODULES_END;

	p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
				module_alloc_end, gfp_mask, PAGE_KERNEL, 0,
				NUMA_NO_NODE, __builtin_return_address(0));

	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
		/*
		 * KASAN without KASAN_VMALLOC can only deal with module
		 * allocations being served from the reserved module region,
		 * since the remainder of the vmalloc region is already
		 * backed by zero shadow pages, and punching holes into it
		 * is non-trivial. Since the module region is not randomized
		 * when KASAN is enabled without KASAN_VMALLOC, it is even
		 * less likely that the module region gets exhausted, so we
		 * can simply omit this fallback in that case.
		 */
		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
				module_alloc_base + SZ_2G, GFP_KERNEL,
				PAGE_KERNEL, 0, NUMA_NO_NODE,
				__builtin_return_address(0));

	if (p && (kasan_module_alloc(p, size) < 0)) {
		vfree(p);
		return NULL;
	}

	return p;
}

This should be happening quite rarely. But maybe arm64's document
needs some minor fixup, otherwise, it is quite confusing.

> >
> > >
> > > and
> > >
> > > arch/arm64/kernel/kaslr.c:
> > > 	if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
> > > 		/*
> > > 		 * Randomize the module region over a 2 GB window covering the
> > > 		 * kernel. This reduces the risk of modules leaking information
> > > 		 * about the address of the kernel itself, but results in
> > > 		 * branches between modules and the core kernel that are
> > > 		 * resolved via PLTs. (Branches between modules will be
> > > 		 * resolved normally.)
> > > 		 */
> > > 		module_range = SZ_2G - (u64)(_end - _stext);
> > > 		module_alloc_base = max((u64)_end + offset - SZ_2G,
> > > 					(u64)MODULES_VADDR);
> > > 	} else {
> > > 		/*
> > > 		 * Randomize the module region by setting module_alloc_base to
> > > 		 * a PAGE_SIZE multiple in the range [_etext - MODULES_VSIZE,
> > > 		 * _stext) . This guarantees that the resulting region still
> > > 		 * covers [_stext, _etext], and that all relative branches can
> > > 		 * be resolved without veneers.
> > > 		 */
> > > 		module_range = MODULES_VSIZE - (u64)(_etext - _stext);
> > > 		module_alloc_base = (u64)_etext + offset - MODULES_VSIZE;
> > > 	}
> > >
> > > So depending on ! ARM64_MODULE_PLTS seems to narrow the scenarios
> > > while depending on ! RANDOMIZE_MODULE_REGION_FULL  permit more
> > > machines to use optprobe.
> >
> > OK, I see that the code ensures the range will be in the MODULE_VSIZE (=128MB).
> >
> > >
> > > I am not quite sure I am 100% right but tests seem to back this.
> > > hopefully Catalin and Will can correct me.
> > >
> > > >
> > > > >
> > > > > Performance of optprobe on Hip08 platform is test using kprobe
> > > > > example module[1] to analyze the latency of a kernel function,
> > > > > and here is the result:
> > > > >
> > > > > [1]
> > > >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/sa
> > > > mples/kprobes/kretprobe_example.c
> > > > >
> > > > > kprobe before optimized:
> > > > > [280709.846380] do_empty returned 0 and took 1530 ns to execute
> > > > > [280709.852057] do_empty returned 0 and took 550 ns to execute
> > > > > [280709.857631] do_empty returned 0 and took 440 ns to execute
> > > > > [280709.863215] do_empty returned 0 and took 380 ns to execute
> > > > > [280709.868787] do_empty returned 0 and took 360 ns to execute
> > > > > [280709.874362] do_empty returned 0 and took 340 ns to execute
> > > > > [280709.879936] do_empty returned 0 and took 320 ns to execute
> > > > > [280709.885505] do_empty returned 0 and took 300 ns to execute
> > > > > [280709.891075] do_empty returned 0 and took 280 ns to execute
> > > > > [280709.896646] do_empty returned 0 and took 290 ns to execute
> > > > > [280709.902220] do_empty returned 0 and took 290 ns to execute
> > > > > [280709.907807] do_empty returned 0 and took 290 ns to execute
> > > > >
> > > > > optprobe:
> > > > > [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> > > > > [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> > > > > [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> > > > > [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> > > > > [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> > > > > [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> > > > > [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> > > > > [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> > > > > [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> > > > > [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> > > > > [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> > > > > [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> > > > > [ 2966.029158] do_empty returned 0 and took 70 ns to execute
> > > >
> > > > Great result!
> > > > I have other comments on the code below.
> > > >
> > > > [...]
> > > > > diff --git a/arch/arm64/kernel/probes/kprobes.c
> > > > b/arch/arm64/kernel/probes/kprobes.c
> > > > > index 6dbcc89f6662..83755ad62abe 100644
> > > > > --- a/arch/arm64/kernel/probes/kprobes.c
> > > > > +++ b/arch/arm64/kernel/probes/kprobes.c
> > > > > @@ -11,6 +11,7 @@
> > > > >  #include <linux/kasan.h>
> > > > >  #include <linux/kernel.h>
> > > > >  #include <linux/kprobes.h>
> > > > > +#include <linux/moduleloader.h>
> > > > >  #include <linux/sched/debug.h>
> > > > >  #include <linux/set_memory.h>
> > > > >  #include <linux/slab.h>
> > > > > @@ -113,9 +114,21 @@ int __kprobes arch_prepare_kprobe(struct kprobe
> *p)
> > > > >
> > > > >  void *alloc_insn_page(void)
> > > > >  {
> > > > > -	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START,
> > VMALLOC_END,
> > > > > -			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
> > > > > -			NUMA_NO_NODE, __builtin_return_address(0));
> > > > > +	void *page;
> > > > > +
> > > > > +	page = module_alloc(PAGE_SIZE);
> > > > > +	if (!page)
> > > > > +		return NULL;
> > > > > +
> > > > > +	set_vm_flush_reset_perms(page);
> > > > > +	/*
> > > > > +	 * First make the page read-only, and only then make it executable
> > to
> > > > > +	 * prevent it from being W+X in between.
> > > > > +	 */
> > > > > +	set_memory_ro((unsigned long)page, 1);
> > > > > +	set_memory_x((unsigned long)page, 1);
> > > > > +
> > > > > +	return page;
> > > >
> > > > Isn't this a separated change? Or any reason why you have to
> > > > change this function?
> > >
> > > As far as I can tell, this is still related with the 128MB
> > > short jump limitation.
> > > VMALLOC_START, VMALLOC_END is an fixed virtual address area
> > > which isn't necessarily modules will be put.
> > > So this patch is moving to module_alloc() which will get
> > > memory between module_alloc_base and module_alloc_end.
> >
> > Ah, I missed that point. Yes, VMALLOC_START and VMALLOC_END
> > are not correct range.
> >
> > >
> > > Together with depending on !RANDOMIZE_MODULE_REGION_FULL,
> > > this makes all kernel, module and trampoline in short
> > > jmp area.
> > >
> > > As long as we can figure out a way to support long jmp
> > > for optprobe, the change in alloc_insn_page() can be
> > > dropped.
> >
> > No, I think above change is rather readable, so it is OK.
> >
> > >
> > > Masami, any reference code from any platform to support long
> > > jump for optprobe? For long jmp, we need to put jmp address
> > > to a memory and then somehow load the target address
> > > to PC. Right now, we are able to replace an instruction
> > > only. That is the problem.
> >
> > Hmm, I had read a paper about 2-stage jump idea 15years ago. That
> > paper allocated an intermediate trampoline (like PLT) which did a long
> > jump to the real trampoline on SPARC.
> > (something like, "push x0; ldr x0, [pc+8]; br x0; <immediate-addr>" for
> > a slot of the intermediate trampoline.)
> >
> > For the other (simpler) solution example is optprobe in powerpc
> > (arch/powerpc/kernel/optprobes_head.S). That reserves a buffer page
> > in the text section, and use it.
> >
> > But I think your current implementation is good enough for the
> > first step. If someone needs CONFIG_RANDOMIZE_MODULE_REGION_FULL
> > and optprobe, we can revisit this point.
> >
> > Thank you,


Thanks
Barry




More information about the linux-arm-kernel mailing list