[RFC] arm64: syscall: Direct PRNG kstack randomization

Arnd Bergmann arnd at arndb.de
Wed Feb 21 01:48:12 PST 2024


On Wed, Feb 21, 2024, at 03:02, Jeremy Linton wrote:
> The existing arm64 stack randomization uses the kernel rng to acquire
> 5 bits of address space randomization. This is problematic because it
> creates non-determinism in the syscall path when the rng needs to be
> generated or reseeded. This shows up as large tail latencies in some
> benchmarks and directly affects the minimum RT latencies as seen by
> cyclictest.

Hi Jeremy,

I think from your description it's clear that reseeding the
rng is a problem for predictable RT latencies, but at the same
time we have too many things going on to fix this by
special-casing kstack randomization on one architecture:

- if reseeding latency is a problem, can we be sure that
  none of the other ~500 files containing a call to
  get_random_{bytes,long,u8,u16,u32,u64} are in an equally
  critical path for RT? Maybe those are just harder to hit?

- CONFIG_RANDOMIZE_KSTACK_OFFSET can already be disabled at
  compile time or at boot time to avoid the overhead entirely
  (see the note below this list), which may be the right thing
  to do for users who care more deeply about syscall latencies
  than about the fairly weak stack randomization. Most
  architectures don't implement it at all.

- It looks like the unpredictable latency from reseeding
  started with f5b98461cb81 ("random: use chacha20 for
  get_random_int/long"), which was intended to make
  get_random() faster and better, but it could be seen
  as a regression for real-time latency guarantees. If this
  turns out to be a general problem for RT workloads,
  the answer might be to bring back an option to make
  get_random() have predictable overhead everywhere
  rather than special-casing the stack randomization.
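
To expand on the second point: if I remember the knob names correctly
(please double-check), the overhead can be avoided without a rebuild
by booting with

	randomize_kstack_offset=0

or configured out entirely with

	# CONFIG_RANDOMIZE_KSTACK_OFFSET is not set

(or left built in but defaulted off via
CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT=n). Since the generic helpers
sit behind a static branch, the disabled case should cost next to
nothing on the syscall path.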

> Other architectures are using timers/cycle counters for this function,
> which is sketchy from a randomization perspective because it should be
> possible to estimate this value from knowledge of the syscall return
> time, and from reading the current value of the timer/counters.
>
> So, a poor rng should be better than the cycle counter if it is hard
> to extract the stack offsets sufficiently to be able to detect the
> PRNG's period.

I'm not convinced by the argument that the implementation you
have here is less predictable than the cycle counter, but I
have not done any particular research here and would rely on
others to take a closer look. The 32-bit global state variable
does appear weak to me, at least at first glance.
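
To make that concern concrete, the kind of construction I am wary of
looks roughly like the strawman below (my own invention, not the code
in your patch): a single 32-bit xorshift state shared globally, with a
few low bits handed out per syscall:

	/* strawman only, not the code under review */
	static u32 kstack_rng_state = 1;	/* 32 bits of global state */

	static u32 kstack_rng(void)
	{
		u32 x = kstack_rng_state;

		/* xorshift32: full 2^32 - 1 period, but trivially invertible */
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		kstack_rng_state = x;

		return x & 0x1f;	/* 5 bits end up as the stack offset */
	}

With only 2^32 possible states, anyone who can observe or infer a
handful of consecutive 5-bit offsets can brute-force the state offline
and then predict every later offset, which does not look obviously
better than reading a cycle counter.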

OTOH if we can show that a particular implementation is in fact
better than a cycle counter, I strongly think we should
use the same one across all architectures that currently
use the cycle counter.
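
For reference, my mental model of the cycle counter variant is roughly
the sketch below, built on the generic helpers from
include/linux/randomize_kstack.h (simplified from memory, not verbatim
kernel code):

	static void syscall_path_sketch(void)
	{
		/* apply the offset chosen at the end of the previous syscall */
		add_random_kstack_offset();

		/* ... dispatch the actual syscall here ... */

		/*
		 * Stash fresh bits in the per-cpu offset for the next
		 * syscall. x86 feeds its cycle counter in here, arm64
		 * currently passes output from the kernel rng, and a
		 * PRNG-based variant would simply pass its own output.
		 */
		choose_random_kstack_offset(get_cycles());
	}

So whatever source we settle on, it should mostly be a question of what
each architecture passes into choose_random_kstack_offset(), or of
doing that once in generic code.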

      Arnd


