[musl] Re: [PATCH v8 00/38] arm64/gcs: Provide support for GCS in userspace

Tue Feb 20 15:59:58 PST 2024

On Tue, Feb 20, 2024, at 6:30 PM, Edgecombe, Rick P wrote:
> On Tue, 2024-02-20 at 13:57 -0500, Rich Felker wrote:
>> On Tue, Feb 20, 2024 at 06:41:05PM +0000, Edgecombe, Rick P wrote:
>> > Hmm, could the shadow stack underflow onto the real stack then? Not
>> > sure how bad that is. INCSSP (incrementing the SSP register on x86)
>> > loops are not rare so it seems like something that could happen.
>> 
>> Shadow stack underflow should fault on attempt to access
>> non-shadow-stack memory as shadow-stack, no?
>
> Maybe I'm misunderstanding. I thought the proposal included allowing
> shadow stack access to convert normal address ranges to shadow stack,
> and normal writes to convert shadow stack to normal.

Ideally, for riscv, only writes would cause conversion; an incssp underflow,
which performs shadow stack reads, would then be able to fault early.

For arm, since a syscall is needed anyway to set up the token in a new
shadow stack region, it would make sense for conversion from non-shadow
to shadow usage to never be automatic.
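A toy model may make the two policies easier to compare. The sketch below is purely illustrative: the page states, access kinds and `model_setup_token()` are hypothetical names invented here, not a kernel, riscv or arm interface. It also assumes one reading of "only writes would cause conversion": shadow-stack writes convert normal memory to shadow, and ordinary writes convert it back.

```c
/* Hypothetical model of the proposed conversion rules; not a real kernel
 * or hardware interface. */
enum page_state { PAGE_NORMAL, PAGE_SHADOW };
enum access_kind { NORMAL_WRITE, SHADOW_READ, SHADOW_WRITE };
enum conv_policy { POLICY_RISCV, POLICY_ARM };
enum access_result { ACC_OK, ACC_CONVERTED, ACC_FAULT };

enum access_result model_access(enum page_state *p, enum access_kind k,
                                enum conv_policy pol)
{
    switch (k) {
    case NORMAL_WRITE:
        /* An ordinary store reverts shadow-stack memory to normal memory. */
        if (*p == PAGE_SHADOW) {
            *p = PAGE_NORMAL;
            return ACC_CONVERTED;
        }
        return ACC_OK;
    case SHADOW_READ:
        /* Shadow-stack reads never convert, so an underflow that reads
         * beyond the shadow stack faults as soon as it hits normal memory. */
        return *p == PAGE_SHADOW ? ACC_OK : ACC_FAULT;
    case SHADOW_WRITE:
        if (*p == PAGE_SHADOW)
            return ACC_OK;
        if (pol == POLICY_RISCV) {
            /* riscv-style: a shadow-stack write claims normal memory. */
            *p = PAGE_SHADOW;
            return ACC_CONVERTED;
        }
        /* arm-style: normal -> shadow is never automatic. */
        return ACC_FAULT;
    }
    return ACC_FAULT;
}

/* Hypothetical stand-in for the syscall that writes the token into a new
 * shadow stack region (the only normal -> shadow path under POLICY_ARM). */
void model_setup_token(enum page_state *p)
{
    *p = PAGE_SHADOW;
}
```

Under either policy a shadow-stack read of normal memory faults, which is the early underflow detection described above; the policies differ only in whether a shadow-stack write may claim normal memory.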

>> > 
>> > Won't this prevent catching stack overflows when they happen? An
>> > overflow will just turn the shadow stack into normal stack and only
>> > get detected when the shadow stack unwinds?
>> 
>> I don't think that's as big a problem as it sounds like. It might make
>> pinpointing the spot at which things went wrong take a little bit more
>> work, but it should not admit any wrong-execution.
>
> Right, it's a point about debugging. I'm just trying to analyze the
> pros and cons and not calling it a showstopper.

It's certainly undesirable, so I'd like to have both mechanisms available
(shadow stacks in ordinary memory to support several problematic APIs,
and in dedicated mappings with guard pages otherwise).
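For comparison, a dedicated mapping with a guard page is the same thing userspace (typically libc) already builds for ordinary thread stacks. A minimal sketch using standard POSIX calls, assuming a downward-growing stack; `struct guarded_stack` and the helper names are hypothetical, chosen for illustration:

```c
#define _GNU_SOURCE /* MAP_ANONYMOUS */
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

struct guarded_stack {
    char *map;      /* start of the whole mapping (the guard page) */
    size_t map_len; /* guard page + usable stack */
    void *base;     /* lowest usable stack address */
    size_t size;    /* usable stack size, rounded up to whole pages */
};

int guarded_stack_alloc(struct guarded_stack *gs, size_t size)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size = (size + page - 1) / page * page;
    char *map = mmap(NULL, size + page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (map == MAP_FAILED)
        return -1;
    /* The lowest page is the guard: overflowing into it faults. */
    if (mprotect(map, page, PROT_NONE) != 0) {
        munmap(map, size + page);
        return -1;
    }
    gs->map = map;
    gs->map_len = size + page;
    gs->base = map + page;
    gs->size = size;
    return 0;
}

void guarded_stack_free(struct guarded_stack *gs)
{
    munmap(gs->map, gs->map_len);
}

static void *echo(void *arg) { return arg; }

/* Run one thread on a caller-supplied guarded stack; 0 on success. */
int run_thread_on_guarded_stack(void)
{
    struct guarded_stack gs;
    pthread_attr_t attr;
    pthread_t t;
    void *ret = NULL;
    int rc = -1;

    if (guarded_stack_alloc(&gs, 256 * 1024) != 0)
        return -1;
    if (pthread_attr_init(&attr) == 0) {
        if (pthread_attr_setstack(&attr, gs.base, gs.size) == 0 &&
            pthread_create(&t, &attr, echo, &gs) == 0) {
            pthread_join(t, &ret);
            rc = (ret == &gs) ? 0 : -1;
        }
        pthread_attr_destroy(&attr);
    }
    guarded_stack_free(&gs);
    return rc;
}
```

Any access just below `gs.base` hits the `PROT_NONE` page and faults immediately instead of silently running into an adjacent mapping.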

>> > 
>> > Shadow stacks currently have automatic guard gaps to try to prevent one
>> > thread from overflowing onto another thread's shadow stack. This would
>> > somewhat open that up, as the stack guard gaps are usually maintained
>> > by userspace for new threads. It would have to be thought through if
>> > these could still be enforced with checking at additional spots.
>> 
>> I would think the existing guard pages would already do that if a
>> thread's shadow stack is contiguous with its own data stack.
>
> The difference is that the kernel provides the guard gaps, where this
> would rely on userspace to do it. It is not a showstopper either.
>
> I think my biggest question on this is how does it change the
> capability for two threads to share a shadow stack. It might require
> some special rules around the syscall that writes restore tokens. So
> I'm not sure. It probably needs a POC.

I'm not quite understanding what the property you're looking for here is.

>> From the musl side, I have always looked at the entirety of shadow
>> stack stuff with very heavy skepticism, and anything that breaks
>> existing interface contracts, introduces places where apps can get
>> auto-killed because a late resource allocation fails, or requires
>> applications to code around the existence of something that should be
>> an implementation detail, is a non-starter. To even consider shadow
>> stack support, it must truly be fully non-breaking.
>
> The manual assembly stack switching and JIT code in the apps needs to
> be updated. I don't think there is a way around it.

Naturally.  If an application uses nonportable functionality like JIT
and inline assembly, it's fine (within reason) for those nonportable
components to need changes for shadow stack support.

The objective of this proposal is to allow applications that do _not_
use inline assembly but rather only C APIs defined in POSIX.1-2004 to
execute correctly in an environment where shadow stacks are enabled
by default.
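A minimal example of the kind of application meant here, using only the XSI `<ucontext.h>` interfaces (the helper names are illustrative). The stack is plain `malloc()` memory and the application's only cleanup is `free()`. Because `makecontext()` returns `void`, a libc that had to allocate a separate shadow stack at that point would have no way to report failure and no natural point to release it, which is the makecontext leak and unreportable-error problem discussed in this thread.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static int coro_result;

/* Runs on the malloc'd stack; returning resumes coro_ctx.uc_link. */
static void coro_entry(void)
{
    coro_result = 42;
}

/* Run coro_entry on an application-allocated stack and return its
 * result, or -1 on failure. */
int run_on_private_stack(void)
{
    size_t size = 64 * 1024;
    void *stack = malloc(size);
    if (stack == NULL)
        return -1;
    if (getcontext(&coro_ctx) != 0) {
        free(stack);
        return -1;
    }
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = size;
    coro_ctx.uc_link = &main_ctx;        /* where to go when coro_entry returns */
    makecontext(&coro_ctx, coro_entry, 0); /* void: no way to report errors */
    coro_result = -1;
    if (swapcontext(&main_ctx, &coro_ctx) != 0) {
        free(stack);
        return -1;
    }
    free(stack); /* the application's only deallocation hook */
    return coro_result;
}
```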

> I agree though that the late allocation failures are not great. Mark is
> working on clone3 support which should allow moving the shadow stack
> allocation to happen in userspace with the normal stack. Even for riscv
> though, doesn't it need to update a new register in stack switching?
>
> BTW, x86 shadow stack has a mode where the shadow stack is writable
> with a special instruction (WRSS). It enables the SSP to be set
> arbitrarily by writing restore tokens. We discussed this as an option
> to make the existing longjmp() and signal stuff work more transparently
> for glibc.
>
>> 
>> > > _Without_ doing this, sigaltstack cannot be used to recover from
>> > > stack overflows if the shadow stack limit is reached first, and
>> > > makecontext cannot be supported without memory leaks and
>> > > unreportable error conditions.
>> > 
>> > FWIW, I think the makecontext() shadow stack leaking is a bad idea.
>> > I would prefer the existing makecontext() interface just didn't
>> > support shadow stack, rather than the leaking solution glibc does
>> > today.
>> 
>> AIUI the proposal by Stefan makes it non-leaking because it's just
>> using normal memory that reverts to normal usage on any
>> non-shadow-stack access.
>> 
>
> Right, but does it break any existing apps anyway (because of small
> ucontext stack sizes)?

Possibly, but that's what SIGSTKSZ/MINSIGSTKSZ is for.  This is already
variable on several platforms due to variable-length vector extensions.
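Portable code that installs an alternate signal stack is expected to size it from `SIGSTKSZ` rather than a hard-coded constant; under this proposal, that is what would let a platform raise the requirement to cover shadow stack usage. A minimal sketch using only standard calls (the helper names are illustrative), which checks that the handler really ran on the alternate stack by comparing the address of a local against the stack's bounds:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static char *alt_lo, *alt_hi; /* bounds of the alternate stack */
static volatile sig_atomic_t ran_on_alt;

static void usr1_handler(int sig)
{
    char probe; /* lives in the handler's stack frame */
    uintptr_t p = (uintptr_t)&probe;
    (void)sig;
    ran_on_alt = p >= (uintptr_t)alt_lo && p < (uintptr_t)alt_hi;
}

/* 1 if the handler ran on the alternate stack, 0 if not, -1 on error. */
int handler_ran_on_altstack(void)
{
    size_t size = SIGSTKSZ; /* a sysconf() value on glibc >= 2.34 with _GNU_SOURCE */
    stack_t ss, old_ss;
    struct sigaction sa, old_sa;
    int rc = -1;

    ss.ss_sp = malloc(size);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_size = size;
    ss.ss_flags = 0;
    alt_lo = ss.ss_sp;
    alt_hi = alt_lo + size;

    if (sigaltstack(&ss, &old_ss) == 0) {
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = usr1_handler;
        sa.sa_flags = SA_ONSTACK; /* deliver on the alternate stack */
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGUSR1, &sa, &old_sa) == 0) {
            ran_on_alt = 0;
            if (raise(SIGUSR1) == 0) /* handler completes before raise returns */
                rc = ran_on_alt;
            sigaction(SIGUSR1, &old_sa, NULL);
        }
        sigaltstack(&old_ss, NULL); /* restore the previous setting */
    }
    free(ss.ss_sp);
    return rc;
}
```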

> BTW, when I talk about "not supporting" I don't mean the app should
> crash. I mean it should instead run normally, just without shadow stack
> enabled. Not sure if that was clear. Since shadow stack is not
> essential for an application to function, it is only security hardening
> on top.

I appreciate that.  How far can we go in that direction?  If we can
automatically disable shadow stacks on any call to makecontext, sigaltstack,
or pthread_attr_setstack without causing other threads to crash if they were
in the middle of shadow stack maintenance, we can probably simplify this
proposal, although I need to think more about what's possible.

> Although determining if an application supports shadow stack has turned
> out to be difficult in practice. Handling dlopen() is especially hard.

How so?  Is the hard part figuring out if you need to do something, or
doing it?

-s