[PATCH] arm64: module: Widen module region to 2 GiB

Mark Rutland mark.rutland at arm.com
Wed Apr 5 04:37:57 PDT 2023


On Tue, Apr 04, 2023 at 06:07:04PM +0200, Ard Biesheuvel wrote:
> On Tue, 4 Apr 2023 at 17:49, Mark Rutland <mark.rutland at arm.com> wrote:
> >
> > On Tue, Apr 04, 2023 at 03:54:37PM +0200, Ard Biesheuvel wrote:
> > > Shanker reports that, after loading a 110+ MiB kernel module, no other
> > > modules can be loaded, in spite of the fact that module PLTs have been
> > > enabled in the build.
> > >
> > > This is due to the fact that, even with module PLTs enabled, the
> > > module region is dimensioned to be a fixed 128 MiB region, as we
> > > simply never anticipated the need to support modules that huge.
> > >
> > > So let's increase the size of the statically allocated module region to
> > > 2 GiB, and update the module loading logic so that we prefer loading
> > > modules in the vicinity of the kernel text, where relative branches can
> > > be resolved without the need for PLTs. Only when we run out of space
> > > here (or when CONFIG_RANDOMIZE_MODULE_REGION_FULL is enabled) will we
> > > fall back to the larger window and allocate from there.
> > >
> > > While at it, let's try to simplify the logic a bit, to make it easier to
> > > follow:
> > > - remove the special cases for KASAN, which are no longer needed now
> > >   that KASAN_VMALLOC is always enabled when KASAN is configured;
> > > - instead of defining a global module base address, define a global
> > >   module limit, and work our way down from it.
> >
> > I find this last change a bit confusing, and I think it'd be preferable to have
> > explicit start and end variables; more on that below.
> >
> > >
> > > Cc: Shanker Donthineni <sdonthineni at nvidia.com>
> > > Signed-off-by: Ard Biesheuvel <ardb at kernel.org>
> > > ---
> > >  Documentation/arm64/memory.rst  |  8 ++---
> > >  arch/arm64/include/asm/memory.h |  2 +-
> > >  arch/arm64/include/asm/module.h | 10 ++++--
> > >  arch/arm64/kernel/kaslr.c       | 38 +++++++++++------------
> > >  arch/arm64/kernel/module.c      | 54 ++++++++++++++++-----------------
> > >  5 files changed, 59 insertions(+), 53 deletions(-)
> > >
> > > diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst
> > > index 2a641ba7be3b717a..55a55f30eed8a6ce 100644
> > > --- a/Documentation/arm64/memory.rst
> > > +++ b/Documentation/arm64/memory.rst
> > > @@ -33,8 +33,8 @@ AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
> > >    0000000000000000   0000ffffffffffff         256TB          user
> > >    ffff000000000000   ffff7fffffffffff         128TB          kernel logical memory map
> > >   [ffff600000000000   ffff7fffffffffff]         32TB          [kasan shadow region]
> > > -  ffff800000000000   ffff800007ffffff         128MB          modules
> > > -  ffff800008000000   fffffbffefffffff         124TB          vmalloc
> > > +  ffff800000000000   ffff80007fffffff           2GB          modules
> > > +  ffff800080000000   fffffbffefffffff         124TB          vmalloc
> > >    fffffbfff0000000   fffffbfffdffffff         224MB          fixed mappings (top down)
> > >    fffffbfffe000000   fffffbfffe7fffff           8MB          [guard region]
> > >    fffffbfffe800000   fffffbffff7fffff          16MB          PCI I/O space
> > > @@ -50,8 +50,8 @@ AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):
> > >    0000000000000000   000fffffffffffff           4PB          user
> > >    fff0000000000000   ffff7fffffffffff          ~4PB          kernel logical memory map
> > >   [fffd800000000000   ffff7fffffffffff]        512TB          [kasan shadow region]
> > > -  ffff800000000000   ffff800007ffffff         128MB          modules
> > > -  ffff800008000000   fffffbffefffffff         124TB          vmalloc
> > > +  ffff800000000000   ffff80007fffffff           2GB          modules
> > > +  ffff800080000000   fffffbffefffffff         124TB          vmalloc
> > >    fffffbfff0000000   fffffbfffdffffff         224MB          fixed mappings (top down)
> > >    fffffbfffe000000   fffffbfffe7fffff           8MB          [guard region]
> > >    fffffbfffe800000   fffffbffff7fffff          16MB          PCI I/O space
> > > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> > > index 78e5163836a0ab95..b58c3127323e16c8 100644
> > > --- a/arch/arm64/include/asm/memory.h
> > > +++ b/arch/arm64/include/asm/memory.h
> > > @@ -46,7 +46,7 @@
> > >  #define KIMAGE_VADDR         (MODULES_END)
> > >  #define MODULES_END          (MODULES_VADDR + MODULES_VSIZE)
> > >  #define MODULES_VADDR                (_PAGE_END(VA_BITS_MIN))
> > > -#define MODULES_VSIZE                (SZ_128M)
> > > +#define MODULES_VSIZE                (SZ_2G)
> > >  #define VMEMMAP_START                (-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
> > >  #define VMEMMAP_END          (VMEMMAP_START + VMEMMAP_SIZE)
> > >  #define PCI_IO_END           (VMEMMAP_START - SZ_8M)
> > > diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
> > > index 18734fed3bdd7609..98dae9f87b521f07 100644
> > > --- a/arch/arm64/include/asm/module.h
> > > +++ b/arch/arm64/include/asm/module.h
> > > @@ -31,9 +31,15 @@ u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
> > >                               void *loc, u64 val);
> > >
> > >  #ifdef CONFIG_RANDOMIZE_BASE
> > > -extern u64 module_alloc_base;
> > > +extern u64 module_alloc_limit;
> > >  #else
> > > -#define module_alloc_base    ((u64)_etext - MODULES_VSIZE)
> > > +#define module_alloc_limit   MODULE_REF_END
> > > +#endif
> > > +
> > > +#ifdef CONFIG_ARM64_MODULE_PLTS
> > > +#define MODULE_REF_END               ((u64)_end)
> > > +#else
> > > +#define MODULE_REF_END               ((u64)_etext)
> > >  #endif
> >
> > I was initially a bit confused by this. I think it's a bit misleading for the
> > !CONFIG_ARM64_MODULE_PLTS case, since modules can still have data references
> > between _etext and _end; it's just that we're (hopefully) unlikely to have the
> > text be <128M with >2G of subsequent data.
> >
> 
> So the reason for this is that otherwise, with PLTs disabled, we lose
> (_end - _etext) worth of module space for no good reason, and this
> could potentially be a substantial chunk of space. However, when PLTs
> are enabled, we cannot safely use _etext as the upper bound, as _etext
> - SZ_2G may produce an address that is out of range for a PREL32
> relocation. 

I understood those two constraints, which are:

(1) For PREL32 references, the module must be within 2GiB of _end regardless of
    whether we use PLTs.

(2) For direct branches without PLTs, the module must be within 128MiB of
    _etext.
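
Or, restating those as lower bounds on where a module may be placed (modules
live below the kernel image, so the lower bound is the interesting side; the
'mod_addr' name is just for illustration):

	mod_addr >= (u64)_end   - SZ_2G		/* (1) PREL32 data references */
	mod_addr >= (u64)_etext - SZ_128M	/* (2) direct branches without PLTs */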

What I find confusing is having a single conditional MODULE_REF_END definition:
it implicitly relies on other properties being maintained elsewhere when we try
to allocate modules relative to it, and I don't think those always align as
we'd prefer.

Consider a config:

	CONFIG_ARM64_MODULE_PLTS=y
	CONFIG_RANDOMIZE_BASE=n
	CONFIG_RANDOMIZE_MODULE_REGION_FULL=n

In that config, with your patch we'd have:

	#define module_alloc_limit	MODULE_REF_END
	#define MODULE_REF_END		((u64)_end)

In module_alloc(), our first attempt at allocating the module would be:

	p = __vmalloc_node_range(size, MODULE_ALIGN,
				 module_alloc_limit - SZ_128M,
				 module_alloc_limit, gfp_mask, PAGE_KERNEL,
				 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
				 __builtin_return_address(0));

In this case, module_alloc_limit is '(u64)_end', so if there's a significant
quantity of data between _etext and _end, we will fail to allocate in the
preferred 128M region that avoids PLTs more often than necessary, before
falling back to the 2G region that may require PLTs.

That's not a functional problem, since we'll fall back to using PLTs, but I
don't think it's what was intended.
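
For comparison, a sketch of what I'd expect the preferred (no-PLT) attempt to
look like, with the bounds spelled out explicitly ('base' and 'limit' are just
illustrative locals, and the clamping details are only a sketch):

	/*
	 * Prefer the window where direct branches to/from the kernel text
	 * don't need PLTs (within 128M of _etext), clamped so that PREL32
	 * data references against the kernel image stay in range (within
	 * 2G of _end).
	 */
	u64 limit = (u64)_etext;
	u64 base = max((u64)_end - SZ_2G, limit - SZ_128M);

	p = __vmalloc_node_range(size, MODULE_ALIGN, base, limit, gfp_mask,
				 PAGE_KERNEL, VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
				 __builtin_return_address(0));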

Consider another config with:

	CONFIG_ARM64_MODULE_PLTS=n
	CONFIG_RANDOMIZE_BASE=n
	CONFIG_RANDOMIZE_MODULE_REGION_FULL=n

In that config we'd have:

	#define module_alloc_limit	MODULE_REF_END
	#define MODULE_REF_END		((u64)_etext)

In module_alloc(), our only attempt at allocating the module would be:

	p = __vmalloc_node_range(size, MODULE_ALIGN,
				 module_alloc_limit - SZ_128M,
				 module_alloc_limit, gfp_mask, PAGE_KERNEL,
				 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
				 __builtin_return_address(0));

Say I've built an incredibly unlikely kernel with 64M of text and 2G-96M of
data between _etext and _end. In this case, 'module_alloc_limit - SZ_128M'
would be 32M below '_end - SZ_2G', so PREL32 relocations could be out-of-range.
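
Spelling out that arithmetic (using the made-up 64M / 2G-96M figures above):

	_end                          = _etext + (2G - 96M)
	_end - SZ_2G                  = _etext - 96M      (PREL32 lower bound)
	module_alloc_limit            = _etext            (PLTs disabled)
	module_alloc_limit - SZ_128M  = _etext - 128M     (allocation base)

i.e. the bottom of the allocation window sits 32M below the lowest address
that PREL32 relocations against data near _end can reach.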

> However, in that case, we can tolerate the waste, so we can just use _end
> instead.

I get the idea, but as above, I don't think that's always correct.

> > I'd find this clearer if we could express the two constraints separately, e.g.
> > have something like:
> >
> > #define MODULE_DATA_REF_END             ((u64)_end)
> > #define MODULE_TEXT_REF_END             ((u64)_etext)
> >
> > That'd allow us to do something like the below, which I think would be clearer.
> >
> > u64 module_alloc_end;
> > u64 module_alloc_base_noplt;
> > u64 module_alloc_base_plt;
> >
> > /*
> >  * Call this somewhere after choosing the KASLR limits
> >  */
> > void module_limits_init(void)
> > {
> >         module_alloc_end = (u64)_stext;
> >
> >         /*
> >          * 32-bit relative data references must always fall within 2G of the
> >          * end of the kernel image.
> >          */
> >         module_alloc_base_plt = max(MODULES_VADDR, MODULE_DATA_REF_END - SZ_2G);
> >
> >         /*
> >          * Direct branches must be within 128M of the end of the kernel text.
> >          */
> >         module_alloc_base_noplt = max(module_alloc_base_plt,
> >                                       MODULE_TEXT_REF_END - SZ_128M);
> > }
> >
> 
> Currently, the randomization of the module region is independent from
> the randomization of the kernel itself, and once we incorporate that
> here, I don't think it will be any clearer tbh.

Ok; I've clearly missed that aspect, and I'll have to go page that in.

Thanks,
Mark.

> > void *module_alloc(unsigned long size)
> > {
> >         gfp_t gfp_mask = GFP_KERNEL;
> >         void *p = NULL;
> >
> >         if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> >                 gfp_mask |= __GFP_NOWARN;
> >
> >         /*
> >          * By default, prefer to place modules such that they don't require
> >          * PLTs. With CONFIG_RANDOMIZE_MODULE_REGION_FULL=y we'll prefer to use
> >          * the entire range possible with PLTs, to get as much entropy as possible.
> >          */
> >         if (!IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
> >                 p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                          module_alloc_base_noplt, module_alloc_end,
> >                                          gfp_mask, PAGE_KERNEL,
> >                                          VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                          __builtin_return_address(0));
> >         }
> >
> >         if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS)) {
> >                 p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                          module_alloc_base_plt, module_alloc_end,
> >                                          gfp_mask, PAGE_KERNEL,
> >                                          VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                          __builtin_return_address(0));
> >         }
> >
> >         [... play with KASAN here ...]
> > }
> >
> > Is that viable, or am I missing something?
> >
> 
> It looks correct to me, but I'll defer to others to decide which is preferable.


