[PATCH 6/6] arm64: module: rework module VA range selection

Tue May 9 07:00:19 PDT 2023

On Tue, May 09, 2023 at 01:40:12PM +0200, Ard Biesheuvel wrote:
> Hi Mark,
> 
> Thanks for cleaning this up.
> 
> On Tue, 9 May 2023 at 13:15, Mark Rutland <mark.rutland at arm.com> wrote:
> >
> > Currently, the modules region is 128M in size, which is a problem for
> > some large modules. Shanker reports [1] that the NVIDIA GPU driver alone
> > can consume 110M of module space in some configurations. We'd like to
> > make the modules region a full 2G such that we can always make use of a
> > 2G range.
> >
> > It's possible to build kernel images which are larger than 128M in some
> > configurations, such as when many debug options are selected and many
> > drivers are built in. In these configurations, we can't legitimately
> > select a base for a 128M module region, though we currently select a
> > value for which allocation will fail. It would be nicer to have a
> > diagnostic message in this case.
> >
> > Similarly, in theory it's possible to build a kernel image which is
> > larger than 2G and which cannot support modules. While this isn't likely
> > to be the case for any realistic kernel deployed in the field, it would
> > be nice if we could print a diagnostic in this case.
> >
> > This patch reworks the module VA range selection to use a 2G range, and
> > improves handling of cases where we cannot select legitimate module
> > regions. We now attempt to select a 128M region and a 2G region:
> >
> > * The 128M region is selected such that modules can use direct branches
> >   (with JUMP26/CALL26 relocations) to branch to kernel code and other
> >   modules, and so that modules can use direct data references (with
> >   PREL32 relocations) to access data in the kernel image and other
> >   modules.
> >
> >   This region covers the entire kernel image (rather than just the text)
> >   to ensure that all PREL32 relocations are in range even if the kernel
> >   data section is absurdly large. Where we cannot allocate from this
> >   region, we'll fall back to the full 2G region.
> >
> > * The 2G region is selected such that modules can use direct branches
> >   with PLTs to branch to kernel code and other modules, and so that
> >   modules can use direct data references (with PREL32 relocations) to
> >   access data in the kernel image and other modules.
> >
> >   This region covers the entire kernel image, and the 128M region (if
> >   one is selected).
> >
> > The two module regions are randomized independently while ensuring the
> > constraints described above.
> >
> > [1] https://lore.kernel.org/linux-arm-kernel/159ceeab-09af-3174-5058-445bc8dcf85b@nvidia.com/
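
As a rough illustration of the "covers the entire kernel image" constraint
described above (this is not the code in the patch), the set of legal bases
for a window of a given size can be sketched as below. The struct va_range
type and the window_bases() helper are hypothetical names, and the sketch
ignores randomization and any clamping to the architectural module VA space:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type: inclusive range [lo, hi] of candidate window bases. */
struct va_range {
	uint64_t lo;
	uint64_t hi;
	bool valid;
};

/*
 * Hypothetical helper: bases for a window of `size` bytes such that the
 * window [base, base + size) contains the kernel image [kstart, kend).
 */
static struct va_range window_bases(uint64_t kstart, uint64_t kend,
				    uint64_t size)
{
	struct va_range r = { 0, 0, false };

	/* No legal base if the image is empty or larger than the window. */
	if (kend <= kstart || kend - kstart > size)
		return r;

	/* Lowest base still reaches kend; guard against underflow. */
	r.lo = (kend >= size) ? kend - size : 0;
	/* Highest base still starts at or below kstart. */
	r.hi = kstart;
	r.valid = true;
	return r;
}
```

Under these assumptions a 32M image leaves 96M of slack in which a 128M
window can be placed, and an image larger than the window yields no valid
base at all, which is the case the diagnostics are meant to report.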

[...]

> > +/*
> > + * Modules may directly reference data anywhere within the kernel image and
> > + * other modules. These data references will use PREL32 relocations with a
> > + * +/-2G range, and so we need to ensure that the entire kernel image and all
> > + * modules fall within a 2G window such that these are always within range.
> > + *
> 
> 'Data references' is slightly inaccurate here - data references from
> code use ADRP/LDR with a -/+ 4G range, whereas the PREL32 references
> in question are references *from* data to both data and code symbols.
> 
> The conclusion is the same of course, PREL32 having the smaller range
> and needing to cover the entire kernel image, including code symbols
> living in .text

Indeed; I'll replace the above with:

/*
 * Modules may directly reference data and text anywhere within the kernel image
 * and other modules. References using PREL32 relocations have a +/-2G range,
 * and so we need to ensure that the entire kernel image and all modules fall
 * within a 2G window such that these are always within range.
 */

... and I'll update the commit message similarly where it refers to PREL32
relocations.
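
To make the range argument concrete, here is a small userspace sketch (not
part of the patch) of the check that a PREL32 reference implies. The helper
name prel32_in_range() is made up for illustration, and it treats the field
as a signed 32-bit "symbol minus place" difference, matching the +/-2G figure
used above:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if the offset from `place` (the location being
 * relocated) to `sym` (the referenced symbol) fits in a signed 32-bit field.
 */
static bool prel32_in_range(uint64_t place, uint64_t sym)
{
	/* Wrapping subtraction, then reinterpret as a signed distance. */
	int64_t off = (int64_t)(sym - place);

	return off >= INT32_MIN && off <= INT32_MAX;
}
```

If the kernel image and every module sit inside one 2G window, any two
addresses in that window differ by less than 2G in either direction, so a
check like this passes regardless of whether the target is data or text.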

[...]

> > +       if (kernel_size >= SZ_2G) {
> > +               pr_warn("Kernel is too large to support modules (%llu bytes)\n",
> > +                       kernel_size);
> > +               return 0;
> > +       }
> >
> >         if (IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
> > -               /*
> > -                * Randomize the module region over a 2 GB window covering the
> > -                * kernel. This reduces the risk of modules leaking information
> > -                * about the address of the kernel itself, but results in
> > -                * branches between modules and the core kernel that are
> > -                * resolved via PLTs. (Branches between modules will be
> > -                * resolved normally.)
> > -                */
> > -               module_range = SZ_2G - (u64)(_end - _stext);
> > -               module_alloc_base = max((u64)_end - SZ_2G, (u64)MODULES_VADDR);
> > +               pr_info("2G module region forced by RANDOMIZE_MODULE_REGION_FULL\n");
> > +       } else if (kernel_size >= SZ_128M) {
> 
> I suppose this bound is somewhat arbitrary? I mean, if kernel_size
> were SZ_128M-SZ_4K, we'd have the exact same problem, and end up using
> the 2G region all the same, just with a different diagnostic message?

That's a fair point, and that's also true for the 2G boundary.

Since the useful bound is arbitrary, it's probably better to log how many pages
we could potentially use.

I'll have a go at doing that instead.
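
As a rough sketch of what that might compute (an assumption on my part about
the final form, not the code I'll post), the usable space can be taken as the
window size minus the kernel image size. The names spare_pages() and SKETCH_*
are hypothetical, and the 4K page size is an assumption since arm64 kernels
may also be built with 16K or 64K pages:

```c
#include <stdint.h>

/* Illustrative constants only; the kernel has its own definitions. */
#define SKETCH_PAGE_SIZE	4096ULL
#define SKETCH_SZ_128M		(128ULL << 20)
#define SKETCH_SZ_2G		(2ULL << 30)

/*
 * Hypothetical helper: number of whole pages left for modules in a window
 * of `window` bytes once the kernel image of `kernel_size` bytes is covered.
 */
static uint64_t spare_pages(uint64_t kernel_size, uint64_t window)
{
	if (kernel_size >= window)
		return 0;

	return (window - kernel_size) / SKETCH_PAGE_SIZE;
}
```

With this, a kernel of SZ_128M - SZ_4K leaves exactly one page in the 128M
window, which is why a hard ">= SZ_128M" cut-off isn't a meaningful boundary
and a page count in the message is more informative.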

> > +               pr_info("2G module region forced by kernel size (%llu bytes)\n",
> > +                       kernel_size);
> > +       } else if (IS_ENABLED(CONFIG_RANOMIZE_BASE)) {
> 
> Typo here ^^^

Thanks; I've fixed that now and I'll go re-test...

Mark.