[RFC/RFT PATCH 0/6] Improve get_random_u8() for use in randomize kstack

Ryan Roberts ryan.roberts at arm.com
Fri Nov 28 02:32:02 PST 2025


On 28/11/2025 10:07, Ard Biesheuvel wrote:
> On Thu, 27 Nov 2025 at 16:57, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>
>> On 27/11/2025 15:03, Ard Biesheuvel wrote:
>>> On Thu, 27 Nov 2025 at 15:18, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>>>
>>>> On 27/11/2025 12:28, Ard Biesheuvel wrote:
>>>>> On Thu, 27 Nov 2025 at 13:12, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>>>>>
>>>>>> On 27/11/2025 09:22, Ard Biesheuvel wrote:
>>>>>>> From: Ard Biesheuvel <ardb at kernel.org>
>>>>>>>
>>>>>>> Ryan reports that get_random_u16() is dominant in the performance
>>>>>>> profiling of syscall entry when kstack randomization is enabled [0].
>>>>>>>
>>>>>>> This is the reason many architectures rely on a counter instead, and
>>>>>>> that, in turn, is the reason for the convoluted way the (pseudo-)entropy
>>>>>>> is gathered and recorded in a per-CPU variable.
>>>>>>>
>>>>>>> Let's try to make the get_random_uXX() fast path faster, and switch to
>>>>>>> get_random_u8() so that we'll hit the slow path 2x less often. Then,
>>>>>>> wire it up in the syscall entry path, replacing the per-CPU variable,
>>>>>>> making the logic at syscall exit redundant.
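[ For readers not familiar with the existing machinery: very roughly, the
  current scheme keeps a per-CPU kstack_offset that syscall entry consumes
  via add_random_kstack_offset() and syscall exit refreshes via
  choose_random_kstack_offset(rand). Stripped of the static-key and arch
  details, the shape is something like this (simplified sketch, not the
  actual source of include/linux/randomize_kstack.h):

        /* entry: pad the stack by a few pseudo-random bits */
        u32 offset = raw_cpu_read(kstack_offset);
        u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));
        asm volatile("" :: "r"(ptr) : "memory");   /* keep the alloca alive */

        /* exit: fold fresh (pseudo-)entropy into the per-CPU variable */
        raw_cpu_write(kstack_offset,
                      ror32(raw_cpu_read(kstack_offset), 5) ^ rand);

  The series instead feeds the entry side directly from get_random_u8(),
  which is what makes the exit-side mixing redundant. ]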
>>>>>>
>>>>>> I ran the same set of syscall benchmarks for this series as I've done for my
>>>>>> series.
>>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>> The baseline is v6.18-rc5 with stack randomization turned *off*. So I'm showing
>>>>>> the performance cost of turning it on without any changes to the implementation,
>>>>>> then the reduced cost of turning it on with my changes applied, and finally the
>>>>>> cost of turning it on with Ard's changes applied:
>>>>>>
>>>>>> arm64 (AWS Graviton3):
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | Benchmark       | Result Class |   v6.18-rc5 | per-task-prng | fast-get-random |
>>>>>> |                 |              | rndstack-on |               |                 |
>>>>>> +=================+==============+=============+===============+=================+
>>>>>> | syscall/getpid  | mean (ns)    |  (R) 15.62% |     (R) 3.43% |      (R) 11.93% |
>>>>>> |                 | p99 (ns)     | (R) 155.01% |     (R) 3.20% |      (R) 11.00% |
>>>>>> |                 | p99.9 (ns)   | (R) 156.71% |     (R) 2.93% |      (R) 11.39% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/getppid | mean (ns)    |  (R) 14.09% |     (R) 2.12% |      (R) 10.44% |
>>>>>> |                 | p99 (ns)     | (R) 152.81% |         1.55% |       (R) 9.94% |
>>>>>> |                 | p99.9 (ns)   | (R) 153.67% |         1.77% |       (R) 9.83% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>> | syscall/invalid | mean (ns)    |  (R) 13.89% |     (R) 3.32% |      (R) 10.39% |
>>>>>> |                 | p99 (ns)     | (R) 165.82% |     (R) 3.51% |      (R) 10.72% |
>>>>>> |                 | p99.9 (ns)   | (R) 168.83% |     (R) 3.77% |      (R) 11.03% |
>>>>>> +-----------------+--------------+-------------+---------------+-----------------+
>>>>>>
>>>>>
>>>>> What does the (R) mean?
>>>>>
>>>>>> So this fixes the tail problem. I guess get_random_u8() only takes the slow path
>>>>>> every 768 calls, whereas get_random_u16() took it every 384 calls. I'm not sure
>>>>>> that fully explains it though.
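[ For context: the batched get_random_uXX() helpers share one pattern -- a
  per-CPU buffer of ChaCha output is handed out a few bytes at a time (the
  fast path), and the ChaCha generator is only invoked to refill it once the
  buffer runs dry (the slow path). Ignoring the locking and generation
  counter in the real drivers/char/random.c, the shape is roughly:

        u8 get_random_u8_sketch(void)   /* illustrative only, not the real code */
        {
                struct batch_u8 *b = this_cpu_ptr(&batched_entropy_u8);

                if (b->position >= ARRAY_SIZE(b->entropy))
                        refill_batch(b);        /* slow path: run ChaCha */

                return b->entropy[b->position++];
        }

  The per-type buffers are the same number of bytes, so the u8 batch holds
  twice as many values as the u16 one and refills half as often -- hence the
  "2x less often" in the cover letter. ]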
>>>>>>
>>>>>> But it's still a 10% cost on average.
>>>>>>
>>>>>> Personally I think 10% syscall cost is too much to pay for 6 bits of stack
>>>>>> randomisation. 3% is better, but still higher than we would all prefer, I'm sure.
>>>>>>
>>>>>
>>>>> Interesting!
>>>>>
>>>>> So the only thing that get_random_u8() does that could explain the
>>>>> delta is calling into the scheduler on preempt_enable(), given that it
>>>>> does very little beyond that.
>>>>>
>>>>> Would you mind repeating this experiment after changing the
>>>>> put_cpu_var() to preempt_enable_no_resched(), to test this theory?
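[ i.e., the suggested experiment is essentially this swap at the end of the
  fast path -- hypothetical diff, assuming the branch wraps the per-CPU
  batch in get_cpu_var()/put_cpu_var():

-       put_cpu_var(batched_entropy_u8);   /* preempt_enable(): may reschedule */
+       preempt_enable_no_resched();       /* re-enable preemption, no resched check */
  ]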
>>>>
>>>> This has no impact on performance.
>>>>
>>>
>>> Thanks. But this is really rather surprising: what else could be
>>> taking up that time, given that on the fast path there are only some
>>> loads and stores to the buffer and a cmpxchg64_local()? Could it be
>>> the latter that is causing so much latency? I suppose the local
>>> cmpxchg() semantics don't really exist on arm64, and this uses the
>>> exact same LSE instruction that would be used for an ordinary
>>> cmpxchg(), unlike on x86 where it appears to omit the LOCK prefix.
>>>
>>> In any case, there is no debate that your code is faster on arm64.
>>
>> The results I have for x86 show it's faster than the existing rdtsc-based approach
>> too, although that's also somewhat surprising. I'll run your series on x86 to get
>> the equivalent data.
>>
> 
> OK, brown paper bag time ...
> 
> I swapped the order of the 'old' and 'new' cmpxchg64_local()
> arguments, resulting in some very odd behavior. I think this explains
> why the tail latency was eliminated entirely, which is bizarre.
> 
> The speedup is also more modest now (~2x), which may still be
> worthwhile, but likely insufficient for the kstack randomization case.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=lockless-random-v2

Ahh, so we were never taking the slow path. That would definitely explain it.
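(For anyone puzzling over how an argument swap has that effect: the
semantics are cmpxchg64_local(ptr, old, new) -- write 'new' only if *ptr
still equals 'old', returning whatever value was found there. With the last
two arguments swapped it still compiles, but the compare is done against
the value you intended to write, so the claim/refill bookkeeping quietly
misbehaves. Very roughly, the intended fast-path pattern is:

        /* advance the packed position only if nobody changed it under us */
        if (cmpxchg64_local(&b->pos, old, new) == old)
                return byte;            /* fast path: byte successfully claimed */

-- not the code from the branch, just the general shape of how
cmpxchg64_local() is meant to be used.)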

I had a go at running this on x86 but couldn't even get the kernel to boot on my
AWS Sapphire Rapids instance. Unfortunately I don't have access to the serial
console, so I can't tell why it failed. But I used the exact same procedure and
baseline for the other runs, so the only difference is your change.

I wonder if this issue somehow breaks the boot on that platform?

Anyway, I'll chuck this update at the benchmarks, but it probably won't be until
next week...



