[REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
Mathias Stearn
mathias at mongodb.com
Thu Apr 23 03:38:56 PDT 2026
On Wed, Apr 22, 2026 at 3:13 PM Peter Zijlstra <peterz at infradead.org> wrote:
>
> On Wed, Apr 22, 2026 at 02:56:47PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote:
> >
> > > Additionally, it breaks tcmalloc specifically by failing to overwrite
> > > the cpu_id_start field at points where it was relied on for
> > > correctness.
> >
> > This specific behaviour was documented as being wrong and running with
> > DEBUG_RSEQ would have flagged it.
> >
> > The tcmalloc issue has been contentious for a long time. The tcmalloc
> > folks relied on something that was documented to be wrong. It has been
> > reported to the tcmalloc people many years ago and if you were to run
> > tcmalloc on most any kernel (very much including 6.19) with
> > DEBUG_RSEQ=y, it would have yelled.
> >
> > The tcmalloc people didn't care. There was a proposal for an RSEQ
> > extension for what they need, and they didn't care. All this should be
> > in their bugzilla or whatever.
> >
> > The RSEQ rework improved performance significantly for everyone, and
> > kept all the documented behaviour (+- arm64 bug). Tcmalloc got screwed
> > over because they relied on implementation behaviour that was
> > specifically documented to be broken. And they didn't care. Google was
> > very much aware of this. And hasn't lifted a finger to remedy it.
>
> Also: https://lore.kernel.org/all/874io5andc.ffs@tglx/
(Sorry for the resend to folks who got this already. I got an alert
that it was rejected by the mailing lists because it contained HTML,
so I am resending it as plain text.)
I won't claim that tcmalloc _should_ be abusing cpu_id_start as it is.
I agree that it seems questionable at best. However, I will strongly
disagree with the following comment in that message:
> What it not longer does is updating the
> CPU number for the preemption case on the same CPU
> because that's just a massive waste of CPU cycles.
I don't think it will cost _any_ cycles to implement what I proposed.
In particular, it should have no impact on threads that merely enable
rseq, as glibc now does. It should only result in different
instructions being executed when the program actually _uses_ rseq by
setting the rseq_cs variable to a non-null pointer. I will repeat the
proposal with a bit more commentary in case you missed some of the
details that make it free:
Any time a critical section might be aborted (migration, preemption,
signal delivery, or membarrier IPI), the kernel _already_ must check
the rseq_cs field to see if the thread is in a critical section [and
if it is null because the program isn't using rseq critical sections,
no further action is taken]. The kernel is documented as nulling the
pointer afterwards (I assume to make later checks cheaper) [if this
changed, then it *is* a change in _documented behavior_, not just an
implementation detail]. It would be sufficient for tcmalloc's internal
usage if every time the kernel nulled out rseq_cs, it also wrote the
CPU id to cpu_id_start. [This is one additional store to a cacheline
you are already writing to, so it should be ~free on modern OoO CPUs
and cheap on others. There might be a small cost to loading the
current CPU, but since nothing depends on that other than the store, I
still expect it to be ~free.]
To make this more concrete, I am proposing adding

    unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault);

after each place where you currently do

    unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);

in rseq_update_user_cs. Is that something that you would expect to
cause a performance issue?
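
For readers who want to see the intended semantics outside the kernel,
here is a minimal userspace model of the proposal. It is only a sketch:
struct model_rseq and model_abort_cs() are hypothetical names invented
for illustration, the struct models only the fields discussed here and
is not the real rseq ABI layout, and nothing below is kernel code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the user-visible rseq area.
 * Only the fields discussed in this thread are modelled. */
struct model_rseq {
    uint32_t cpu_id_start;
    uint32_t cpu_id;
    uint64_t rseq_cs;   /* non-zero while inside a critical section */
};

/* Models the abort path under the proposal.
 * Returns 1 if a critical section was aborted, 0 otherwise. */
static int model_abort_cs(struct model_rseq *r, uint32_t cur_cpu)
{
    if (r->rseq_cs == 0)
        return 0;               /* no critical section: no extra work */

    r->rseq_cs = 0;             /* the store that is already done today */
    r->cpu_id_start = cur_cpu;  /* the one additional store proposed */
    return 1;
}
```

In this model, a thread that never sets rseq_cs takes the early return
and sees no additional stores, while a thread that was inside a
critical section gets cpu_id_start rewritten even if it stays on the
same CPU.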
Again, I'm not claiming that it is "good" that this needs to be done.
But it does seem like a small price to pay to keep existing binaries
working on new kernels. Quoting the first paragraph of
https://docs.kernel.org/admin-guide/reporting-regressions.html:
> “We don’t cause regressions” is the first rule of Linux kernel development; Linux founder and lead developer Linus Torvalds established it himself and ensures it’s obeyed.
I don't see anything on that page that says it doesn't count as a
regression if the userspace program "relied on implementation
behaviour that was specifically documented to be broken".