[PATCH] arm64: module: Widen module region to 2 GiB

Tue May 2 04:31:37 PDT 2023

On Wed, Apr 05, 2023 at 03:16:25PM +0200, Ard Biesheuvel wrote:
> On Wed, 5 Apr 2023 at 13:38, Mark Rutland <mark.rutland at arm.com> wrote:
> >
> > On Tue, Apr 04, 2023 at 06:07:04PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 4 Apr 2023 at 17:49, Mark Rutland <mark.rutland at arm.com> wrote:
> > > >
> > > > On Tue, Apr 04, 2023 at 03:54:37PM +0200, Ard Biesheuvel wrote:
> ...
> > > > >  #ifdef CONFIG_RANDOMIZE_BASE
> > > > > -extern u64 module_alloc_base;
> > > > > +extern u64 module_alloc_limit;
> > > > >  #else
> > > > > -#define module_alloc_base    ((u64)_etext - MODULES_VSIZE)
> > > > > +#define module_alloc_limit   MODULE_REF_END
> > > > > +#endif
> > > > > +
> > > > > +#ifdef CONFIG_ARM64_MODULE_PLTS
> > > > > +#define MODULE_REF_END               ((u64)_end)
> > > > > +#else
> > > > > +#define MODULE_REF_END               ((u64)_etext)
> > > > >  #endif
> > > >
> > > > I was initially a bit confused by this. I think it's a bit misleading for the
> > > > !CONFIG_ARM64_MODULE_PLTS case, since modules can still have data references
> > > > between _etext and _end, it's just that we're (hopefully) unlikely to have the
> > > > text be <128M with >2G of subsequent data.
> > > >
> > >
> > > So the reason for this is that otherwise, with PLTs disabled, we lose
> > > (_end - _etext) worth of module space for no good reason, and this
> > > could potentially be a substantial chunk of space. However, when PLTs
> > > are enabled, we cannot safely use _etext as the upper bound, as _etext
> > > - SZ_2G may produce an address that is out of range for a PREL32
> > > relocation.
> >
> > I understood those two constraints, which are:
> >
> > (1) For PREL32 references, the module must be within 2GiB of _end regardless of
> >     whether we use PLTs.
> >
> > (2) For direct branches without PLTs, the module must be within 128MiB of
> >     _etext.
> >
> 
> To be pedantic, let's define it as
> 
> (1) for PREL32 references, the module region must be at most 2 GiB in
> size and include the kernel range [_stext, _end), so that PREL32
> references from any module to any other module or the kernel are
> always within -/+ 2 GiB
> 
> (2) for CALL26 references, the module region must be at most 128 MiB
> in size and include the kernel range [_stext, _etext) so that CALL26
> references from any module to any other module or the kernel are
> always within -/+ 128 MiB

Yes, agreed.

As discussed in-person at Linaro connect, I'd prefer if we could have the
module logic in one place, and explicitly handle the (rare/unlikely) cases
where the kernel is too big to allow us to select a 128M or 2G region.

I've pushed a branch (atop v6.3 for now) to:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/module-space-rework

Which does that (and I think handles the randomization aspects correctly). I
intend to rebase and post that come v6.4-rc1.

Thanks,
Mark.

> > What I find confusing is having a single conditional MODULE_REF_END definition,
> > as it implicitly relies on other properties being maintained elsewhere, when we
> > try to allocate modules relative to this, and I don't think those always align
> > as we'd prefer.
> >
> > Consider a config:
> >
> >         CONFIG_MODULE_PLTS=y
> >         RANDOMIZE_BASE=n
> >         CONFIG_RANDOMIZE_MODULE_REGION_FULL=n
> >
> > In that config, with your patch we'd have:
> >
> >         #define module_alloc_limit      MODULE_REF_END
> >         #define MODULE_REF_END          ((u64)_end)
> >
> > In module alloc(), our first attempt at allocating the module would be:
> >
> >         p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                  module_alloc_limit - SZ_128M,
> >                                  module_alloc_limit, gfp_mask, PAGE_KERNEL,
> >                                  VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                  __builtin_return_address(0));
> >
> > In this case, module_alloc_limit is '(u64)_end', so if there's a signficiant
> > quantity of data between _etext and _end we will fail to allocate in the
> > preferred 128M region that avoids PLTs more often than necessary, before
> > falling back to the 2G region that may require PLTs.
> >
> > That's not a functional problem since we'll fall back to using PLTs, but I
> > don't think that's as intended.
> >
> 
> The allocations occur bottom up, so we will fall back earlier than
> strictly necessary, but only after exhausting a significant chunk of
> the module region. I don't see that as a problem.
> 
> 
> > Consider another config with:
> >
> >         CONFIG_MODULE_PLTS=n
> >         RANDOMIZE_BASE=n
> >         CONFIG_RANDOMIZE_MODULE_REGION_FULL=n
> >
> > In that config we'd have:
> >
> >         #define module_alloc_limit      MODULE_REF_END
> >         #define MODULE_REF_END          ((u64)_etext)
> >
> > In module alloc(), our only attempt at allocating the module would be:
> >
> >         p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                  module_alloc_limit - SZ_128M,
> >                                  module_alloc_limit, gfp_mask, PAGE_KERNEL,
> >                                  VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                  __builtin_return_address(0));
> >
> > Say I've built an incredibly unlikely kernel with 64M of text and 2G-96M of
> > data between _etext and _end. In this case, 'module_alloc_limit - SZ_128M'
> > would be 32M below '_end - SZ_2G', so PREL32 relocations could be out-of-range.
> >
> 
> I think that ~2 GiB kernel images have their own special set of
> challenges, and this is probably not the nastiest one.
> 
> > > However, in that case, we can tolerate the waste, so we can just use _end
> > > instead.
> >
> > I get the idea, but as above, I don't think that's always correct.
> >
> > > > I'd find this clearer if we could express the two constaints separately. e.g.
> > > > have something like:
> > > >
> > > > #define MODULE_DATA_REF_END             ((u64)_end)
> > > > #define MODULE_TEXT_REF_END             ((u64)_etext)
> > > >
> > > > That'd allow us to do something like the below, which I think would be clearer.
> > > >
> > > > u64 module_alloc_end;
> > > > u64 module_alloc_base_noplt;
> > > > u64 module_alloc_base_plt;
> > > >
> > > > /*
> > > >  * Call this somewhere after choosing hte KASLR limits
> > > >  */
> > > > void module_limits_init(void)
> > > > {
> > > >         module_alloc_end = (u64)_stext;
> > > >
> > > >         /*
> > > >          * 32-bit relative data references must always fall within 2G of the
> > > >          * end of the kernel image.
> > > >          */
> > > >         module_alloc_base_plt = max(MODULE_VADDR, MODULE_DATA_REF_END - SZ_2G);
> > > >
> > > >         /*
> > > >          * Direct branches must be within 128M of the end of the kernel text.
> > > >          */
> > > >         module_alloc_base_noplt = max(module_alloc_base_plt,
> > > >                                       MODULE_TEXT_REF_END - SZ_128M);
> > > > }
> > > >
> > >
> > > Currently, the randomization of the module region is independent from
> > > the randomization of the kernel itself, and once we incorporate that
> > > here, I don't think it will be any clearer tbh.
> >
> > Ok; I've clearly missed that aspect, and I'll have to go page that in.
> >
> 
> ok