[REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere

Mathias Stearn mathias at mongodb.com
Thu Apr 23 05:11:42 PDT 2026


On Thu, Apr 23, 2026 at 1:48 PM Thomas Gleixner <tglx at linutronix.de> wrote:
> That would work and not bring the performance issues back, but:
>
>   1) Did you validate that adding the reset into rseq_update_user_cs() is
>      actually sufficient?

Not yet, although I confirmed with the tcmalloc maintainers that they
thought it would be sufficient before suggesting it. I'm currently
building your patch from upthread to test that out. I can try this one
afterwards, although I don't think I'll be able to get to it today;
I'll try to get a coworker to test it.

>      Which means, that tcmalloc is holding everybody else hostage.
>      That's just not acceptable. Not even under the no regression rule.

Agreed. I don't love the situation either, or that we need to advise
setting an environment variable to tell glibc not to use rseq. But I
also want our users to be able to run existing mongo binaries on new
kernels.
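(For reference, the glibc tunable in question, available since glibc
2.35, disables glibc's own rseq registration so the allocator's
registration keeps working; the binary name below is just
illustrative.)

```shell
# Tell glibc (>= 2.35) not to register rseq itself, leaving
# registration to tcmalloc. "mongod" is an illustrative binary name.
GLIBC_TUNABLES=glibc.pthread.rseq=0 mongod
```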

>   3) The fact that tcmalloc prevents a user from enabling rseq debugging
>      is equally unacceptable as it does not allow me to validate my own
>      rseq magic code in my mongodb client because enabling it will make
>      the DB I want to test against go away.

Glad to hear you use mongodb :)

> The most amazing part is that tcmalloc uses this to spare two
> instruction cycles, but nobody noticed in 8 years how much performance
> the unconditional rseq nonsense in the kernel left on the table.

I am looking into a change to our copy of tcmalloc to have it stop
squatting on cpu_id_start, and will run that through our correctness
and performance tests. I can't promise anything (and I certainly can't
speak for what Google may choose to do), but I share your expectation
that it should be possible with minimal impact. It _is_ more than 2
instruction cycles though, since it extends the load dependency chain
by one or two pointer chases plus a few ALU ops. I'd guesstimate it
will cost on the order of 5-10 cycles per call to malloc or free. I
think we can absorb that, but we'll need to test.

Of course, even if we make that change, it will only apply to _future_
binaries. That's why we prefer a kernel fix so that users will be able
to run our existing releases (or any containers that use them) on a
modern kernel.
