[DISCUSSION] kstack offset randomization: bugs and performance

Mark Rutland mark.rutland at arm.com
Tue Nov 18 03:25:05 PST 2025


On Tue, Nov 18, 2025 at 10:28:29AM +0000, Ryan Roberts wrote:
> On 17/11/2025 20:27, Kees Cook wrote:
> > On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
> >> On 17/11/2025 11:30, Ryan Roberts wrote:
> The original rationale for a separate choose_random_kstack_offset() at the end
> of the syscall is described as:
> 
>  * This position in the syscall flow is done to
>  * frustrate attacks from userspace attempting to learn the next offset:
>  * - Maximize the timing uncertainty visible from userspace: if the
>  *   offset is chosen at syscall entry, userspace has much more control
>  *   over the timing between choosing offsets. "How long will we be in
>  *   kernel mode?" tends to be more difficult to predict than "how long
>  *   will we be in user mode?"
>  * - Reduce the lifetime of the new offset sitting in memory during
>  *   kernel mode execution. Exposure of "thread-local" memory content
>  *   (e.g. current, percpu, etc) tends to be easier than arbitrary
>  *   location memory exposure.
> 
> I'm not totally convinced by the first argument; for arches that use the tsc,
> sampling the tsc at syscall entry would mean that userspace can figure out the
> random value that will be used for syscall N by sampling the tsc and adding a
> bit just before calling syscall N. Sampling the tsc at syscall exit would mean
> that userspace can figure out the random value that will be used for syscall N
> by sampling the tsc and subtracting a bit just after syscall N-1 returns. I
> don't really see any difference in protection?
> 
> If you're trying to force the kernel-sampled tsc to be a specific value, then
> for the sample-on-exit case, userspace can just make a syscall with an invalid
> id as its syscall N-1; in that case the duration between entry and exit is
> tiny and fixed, so it's still pretty simple to force the value.

FWIW, I agree. I don't think we're gaining much based on the placement
of choose_random_kstack_offset() at the start/end of the entry/exit
sequences.

As an aside, it looks like x86 calls choose_random_kstack_offset() for
*any* return to userspace, including non-syscall returns (e.g. from
IRQ), in arch_exit_to_user_mode_prepare(). That adds some extra
randomness/perturbation, but logically it isn't necessary to do this
for *all* returns to userspace.

> So what do you think of this approach? :
> 
> #define add_random_kstack_offset(rand) do {				\
> 	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
> 				&randomize_kstack_offset)) {		\
> 		u32 offset = raw_cpu_read(kstack_offset);		\
> 		u8 *ptr;						\
> 									\
> 		offset = ror32(offset, 5) ^ (rand);			\
> 		raw_cpu_write(kstack_offset, offset);			\
> 		ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
> 		/* Keep allocation even after "ptr" loses scope. */	\
> 		asm volatile("" :: "r"(ptr) : "memory");		\
> 	}								\
> } while (0)
> 
> This ignores "Maximize the timing uncertainty" (but that's ok because the
> current version doesn't really do that either), but strengthens "Reduce the
> lifetime of the new offset sitting in memory".

Is this assuming that 'rand' can be generated in a non-preemptible
context? If so (and this is non-preemptible), that's fine.

I'm not sure whether that was the intent, or this was ignoring the
rescheduling problem.

If we do this per-task, then that concern disappears, and this can all
be preemptible.

Mark.


