[REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere

Dmitry Vyukov dvyukov at google.com
Fri Apr 24 00:56:54 PDT 2026


On Thu, 23 Apr 2026 at 21:31, Thomas Gleixner <tglx at linutronix.de> wrote:
>
> On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote:
> > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx at linutronix.de> wrote:
> >> The kernel clears rseq_cs reliably when user space was interrupted and:
> >>
> >>     the task was preempted
> >> or
> >>     the return from interrupt delivers a signal
> >>
> >> If the task invoked a syscall then there is absolutely no reason to do
> >> either of these because syscalls from within a critical section are a
> >> bug and are caught when rseq debugging is enabled.
> >>
> >> The original code did this along with unconditionally updating CPU/MMCID
> >> which resulted in ~15% performance regression on a syscall heavy
> >> database benchmark once glibc started to register rseq.
> >
> > Just to be clear TCMalloc does not need either rseq_cs to be cleared
> > or cpu_id_start to be written to on syscalls because it doesn't do
> > syscalls from critical sections. It will actually benefit (slightly)
> > from not updating cpu_id_start on syscalls.
>
> I know that it does not do syscalls from within critical sections, but
> it relies on cpu_id_start being unconditionally updated in one way or
> the other.
>
> > It is specifically in the cases where an rseq would need to be aborted
> > (preemption, signals, migration, and membarrier IPI with the rseq
> > flag) that TCMalloc relies on cpu_id_start being written. It does rely
> > on that write even when not inside the critical section, because it
> > effectively uses that to detect if there were any would-cause-abort
> > events in between two critical sections. But since it leaves the
> > rseq_cs pointer non-null between critical sections, you don't need
> > to add _any_ overhead for programs that never make use of rseq after
> > registration, or add any overhead to syscalls even for those who do.
>
> Well. According to the comment in the tcmalloc code:
>
> // Calculation of the address of the current CPU slabs region is needed for
> // allocation/deallocation fast paths, but is quite expensive. Due to variable
> // shift and experimental support for "virtual CPUs", the calculation involves
> // several additional loads and dependent calculations. Pseudo-code for the
> // address calculation is as follows:
> //
> //   cpu_offset = TcmallocSlab.virtual_cpu_id_offset_;
> //   cpu = *(&__rseq_abi + virtual_cpu_id_offset_);
> //   slabs_and_shift = TcmallocSlab.slabs_and_shift_;
> //   shift = slabs_and_shift & kShiftMask;
> //   shifted_cpu = cpu << shift;
> //   slabs = slabs_and_shift & kSlabsMask;
> //   slabs += shifted_cpu;
> //
> // To remove this calculation from fast paths, we cache the slabs address
> // for the current CPU in thread local storage. However, when a thread is
> // rescheduled to another CPU, we somehow need to understand that the cached
>
>                   ^^^^^^^^^^^
>
> // address is not valid anymore. To achieve this, we overlap the top 4 bytes
> // of the cached address with __rseq_abi.cpu_id_start. When a thread is
> // rescheduled the kernel overwrites cpu_id_start with the current CPU number,
> // which gives us the signal that the cached address is not valid anymore.
>
> The kernel still as of today (the arm64 bug aside) updates the
> cpu_id_start and cpu_id fields in rseq when a task is rescheduled to
> another CPU.
>
> So if the code only requires to know when it got rescheduled to another
> CPU then it still should work, no?

This was my first thought too:
https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/
The only problem is with membarrier: it used to force a write to
__rseq_abi.cpu_id_start for all threads, but it no longer does.
Otherwise the caching scheme works.
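
To make the invalidation trick concrete, here is a minimal user-space
simulation of it. It is only a sketch under assumptions: the 64-bit word
stands in for the thread-local cached slabs pointer whose top 4 bytes
overlap __rseq_abi.cpu_id_start, CACHED_BIT and all function names are
hypothetical, and the "kernel write" is an ordinary function, not the
real rseq update path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a marker in the top bit says "this word holds a valid
 * cached slabs address". CPU numbers are small, so a kernel write of
 * the CPU number into the top 4 bytes always clears it. */
#define CACHED_BIT (1ULL << 63)

/* Store a slabs address in the cached word and mark it valid. */
static uint64_t cache_slabs(uint64_t slabs_addr)
{
    assert((slabs_addr & CACHED_BIT) == 0);
    return slabs_addr | CACHED_BIT;
}

/* The fast path may use the cached address only while the marker is set. */
static bool cache_valid(uint64_t word)
{
    return (word & CACHED_BIT) != 0;
}

/* Recover the slabs address from a valid cached word. */
static uint64_t cached_addr(uint64_t word)
{
    return word & ~CACHED_BIT;
}

/* Simulates the kernel overwriting cpu_id_start (the top 4 bytes of the
 * word) with the current CPU number when the thread is rescheduled. */
static uint64_t simulate_kernel_cpu_write(uint64_t word, uint32_t cpu)
{
    return (word & 0xffffffffULL) | ((uint64_t)cpu << 32);
}
```

After simulate_kernel_cpu_write() the marker bit is gone, so the fast
path falls back to recomputing the slabs address. The same holds for any
event that makes the kernel rewrite cpu_id_start, which is why a
membarrier that no longer forces that write breaks the scheme.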

I have a tentative fix for tcmalloc:
https://github.com/dvyukov/tcmalloc/commit/58d0eca91503f539b26d20b6f55fb2f6f8bc0c37

The crux is as follows.
TCMalloc needs to make all threads stop using old cached slab
pointers. The stopping procedure is now:

slab->stopped = true;
membarrier();

and all rseq critical sections now check the stopped flag in the
cached slab pointer. If it's set, the thread does not proceed to use
the slab.
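
A rough C sketch of that protocol, under stated assumptions: struct slab,
stop_slab(), fast_path_enter() and barrier_all_threads() are hypothetical
names, not the ones in the commit above. In the real code the flag check
sits inside the rseq critical section, and the barrier is a membarrier()
syscall with the rseq flag so that running critical sections get aborted;
here a plain seq_cst fence stands in for it so the sketch runs anywhere.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slab header: only the stop flag matters for the sketch. */
struct slab {
    atomic_bool stopped;
    int payload;
};

/* Stand-in for membarrier(). Assumption: the real call restarts every
 * thread that is inside an rseq critical section; a fence does not. */
static void barrier_all_threads(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Stopper side: publish the flag first, then force all threads past any
 * critical section that might have read the old flag value. */
static void stop_slab(struct slab *s)
{
    atomic_store_explicit(&s->stopped, true, memory_order_relaxed);
    barrier_all_threads();
}

/* Fast-path side: returns the slab if it may still be used, or NULL if
 * the caller has to take the slow path and refresh its cached pointer. */
static struct slab *fast_path_enter(struct slab *cached)
{
    assert(cached != NULL);
    if (atomic_load_explicit(&cached->stopped, memory_order_relaxed))
        return NULL;
    return cached;
}
```

The point of the design is that correctness no longer depends on the
kernel rewriting cpu_id_start: a thread either observes stopped == true
on entry, or it was already inside a critical section and is restarted
by the barrier, after which it observes the flag.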

> But it does not, which makes it clear that it relies on this
> undocumented behaviour of the kernel to rewrite rseq::cpu_id_start
> unconditionally. I'm not yet convinced that it relies on it only when
> interrupted between two subsequent critical sections. We'll see.
>
> ....
>
> Now we come to the best part of this comment:
>
> // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.
>
> So any code sequence which ends up in:
>
>    x = tcmalloc();
>    dostuff(x)
>      evaluate(rseq::cpu_id_start, rseq::cpu_id)
>
> is doomed. This might be acceptable for Google internal usage where they
> control the full stack and can prevent anyone else to utilize rseq, but
> in an open ecosystem that's obviously a non-starter.
>
> And they definitely forgot to add this to the comment:
>
> // Never enable CONFIG_RSEQ_DEBUG in the kernel when you use tcmalloc as
> // it will expose the blatant ABI abuse and therefore will kill your
> // application.
>
> If your assumption that the rewrite is only required when rseq::rseq_cs
> is non NULL and user space was interrupted is correct, then the obvious
> no-brainer would have been to add:
>
>         __u64   rseq_usr_data;
>
> to struct rseq and clear that unconditionally when rseq::rseq_cs is
> cleared.
>
> But that would have been too simple, would work independent of endianness
> and not in the way of anybody else.
>
> But I know that's incompatible with the features first, correctness
> later and we own the world anyway mindset.
>
> Just for giggles I asked Google Gemini about the implications of
> tcmalloc's rseq abuse. The answer is pretty clear:
>
>    "In short, TCMalloc treats RSEQ as a private optimization rather than
>     a shared system resource, which compromises the stability and
>     extensibility of any application that needs RSEQ for anything other
>     than memory allocation."
>
> It's also very clear about the wilful ignorance of the tcmalloc people:
>
>    "In summary, the developers have known for at least 6 years that the
>     implementation was non-standard and conflicting with other rseq
>     usage. The github issue which requested glibc compatibility was
>     opened in 2022 and has been unresolved since then."
>
> Thanks,
>
>         tglx
