[PATCH v8 42/43] mm: add arch hook to validate mmap() prot flags

Wed Mar 13 04:45:22 PDT 2024

On Wed, 13 Mar 2024 at 11:47, Catalin Marinas <catalin.marinas at arm.com> wrote:
>
> On Wed, Mar 13, 2024 at 12:23:24AM +0100, Ard Biesheuvel wrote:
> > On Tue, 12 Mar 2024 at 20:53, Catalin Marinas <catalin.marinas at arm.com> wrote:
> > > On Wed, Feb 14, 2024 at 01:29:28PM +0100, Ard Biesheuvel wrote:
> > > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > > index d89770eaab6b..977a8c3fd9f5 100644
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -1229,6 +1229,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > > >               if (!(file && path_noexec(&file->f_path)))
> > > >                       prot |= PROT_EXEC;
> > > >
> > > > +     if (!arch_validate_mmap_prot(prot, addr))
> > > > +             return -EACCES;
> > > > +
> > > >       /* force arch specific MAP_FIXED handling in get_unmapped_area */
> > > >       if (flags & MAP_FIXED_NOREPLACE)
> > > >               flags |= MAP_FIXED;
> > >
> > > While writing the pull request for Linus (and looking to justify this
> > > change), I realised that we already have arch_validate_flags() that can
> > > do a similar check but on VM_* flags instead of PROT_* (we use it for
> > > VM_MTE checks). What was the reason for adding a new hook?
> [...]
> > > The only
> > > difference is that here it returns -EACCES while on
> > > arch_validate_flags() failure it would return -EINVAL. It probably makes
> > > more sense to return -EACCES as it matches map_deny_write_exec() but with
> > > your patches we are inconsistent between mmap() and mprotect() errors
> > > (the latter is still -EINVAL).
> >
> > Yes. Looking at it now, it would be better to add a single arch hook
> > to map_deny_write_exec(), and use that instead.
>
> This would work and matches the API better. Another way to look at the
> arm64 WXN feature is to avoid bothering with the checks knowing
> that the hardware enforces XN when a permission is W. So it can be seen
> as a choice irrespective of the user PROT_EXEC|PROT_WRITE permission.
> But it's still an ABI change, so I guess better to be upfront with the
> user and reject such mmap()/mprotect() permission combinations.
>

Yes, that was the idea in the original patch.

> However, I've been looking through specs and realised that SCTLR_ELx.WXN
> is RES0 when the permission indirection is enabled (FEAT_PIE from the
> 2022 specs, hopefully you have access to it).

The latest public version of the ARM ARM does not cover FEAT_PIE at all.

> And while apparently WXN
> gets better as it allows separate EL0/EL1 controls, it seems to only
> apply when the base permission is RWX and the XN is toggled based on the
> overlay permission (pkeys which Joey is working on). So it looks like
> what the architects had in mind is optimising RW/RX switching via
> overlays (no syscalls) but keeping the base permission RWX. The
> traditional WXN hardening via SCTLR_EL1 disappeared.
>
> (adding Joey to the thread, he contributed the PIE support)
>

PIE sounds useful to implement things like JITs in user space, where
you want a certain mapping to transition to RW while all other CPUs
retain RX access concurrently.

WXN is intended to be static, where a single bit sets the system-wide
policy for all kernel and user space code.

It's rather unfortunate that FEAT_PIE relies on RWX mappings and
therefore needs to deprecate WXN entirely. It would have been nice to
have something like this for the kernel, which never has a need for
RWX mappings or transitioning mappings between RX and RW like that,
and so overlays don't seem to be a great fit.

> > > It also got me thinking on whether we could use this as a hardened
> > > version of the MDWE feature instead of a CPU feature (though we'd end up
> > > context-switching this SCTLR_EL1 bit). I think your patches have been
> > > around before the MDWE feature was added to the kernel.
> >
> > Indeed.
> >
> > MDWE looks like a good match in principle, but combining the two seems tricky:
> > - EL1 is going to flip between WXN protection on and off depending on
> > which setting it is using for EL0;
> > - context switching SCTLR_EL1.WXN may become costly in terms of TLB
> > maintenance, unless we cheat and ignore the kernel mappings (which
> > should work as expected regardless of the value of SCTLR_EL1.WXN);
> >
> > If we can find a reasonable strategy to manage the TLB maintenance
> > that does not leave EL1 behind, I'm happy to explore this further. But
> > perhaps WXN should simply be treated as MDWE always-on for all
> > processes.
>
> Ah, I did not realise this bit can be cached in the TLB. So this doesn't
> really work (and the Arm ARM is also vague about whether it's cached per
> ASID or not).
>
> > > Sorry for not catching this early.
> >
> > No worries - it's more likely to be useful if we get this right.
>
> Given that with PIE this feature no longer works, I'll revert these two
> patches for now. We should revisit it in combination with PIE and POE
> (overlays) but it does look like the semantics are slightly different
> (unless I misread the specs). If we want a global MDWE=on on the command
> line, this can be generic.
>

I looked into this a bit more, and MDWE is a bit stricter than WXN,
and therefore less suitable for enabling system-wide. It disallows
adding executable permissions entirely, as well as adding write
permissions to a mapping that is already executable. WXN just
disallows setting both at the same time.

So using the same hook makes sense, but combining the logic beyond
that seems problematic too.
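
To make the difference concrete, here is a rough userspace model of the two
policies and of a shared check that consults both. It is only a sketch, not
kernel code: the SK_* bits and the names wxn_denies(), mdwe_denies() and
sketch_deny_write_exec() are hypothetical, and the MDWE part is a simplified
reading of the rules described above. For a fresh mmap() the "old" flags are
assumed to equal the requested ones, so only the W+X rule can trigger.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's VM_WRITE / VM_EXEC bits. */
#define SK_WRITE 0x2u
#define SK_EXEC  0x4u

/*
 * WXN-style policy: reject only a request that asks for write and
 * execute at the same time. The previous permissions are irrelevant.
 */
static bool wxn_denies(unsigned int req_flags)
{
	return (req_flags & SK_WRITE) && (req_flags & SK_EXEC);
}

/*
 * MDWE-style policy (simplified): reject W+X requests, and also reject
 * adding execute permission to a mapping that did not have it before.
 */
static bool mdwe_denies(unsigned int old_flags, unsigned int req_flags)
{
	if ((req_flags & SK_WRITE) && (req_flags & SK_EXEC))
		return true;
	if (!(old_flags & SK_EXEC) && (req_flags & SK_EXEC))
		return true;
	return false;
}

/*
 * One shared check for both features. wxn_on models a system-wide
 * setting, mdwe_on a per-process one; both are plain parameters here.
 */
static bool sketch_deny_write_exec(bool wxn_on, bool mdwe_on,
				   unsigned int old_flags,
				   unsigned int req_flags)
{
	if (wxn_on && wxn_denies(req_flags))
		return true;
	if (mdwe_on && mdwe_denies(old_flags, req_flags))
		return true;
	return false;
}
```

In this model an RW mapping that is later changed to RX passes the WXN check
but fails the MDWE one, which is the case that makes MDWE awkward as a
system-wide default.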

I'll code it up in any case to see what it looks like.