[RFC please help] membarrier: Rewrite sync_core_before_usermode()

Mon Dec 28 22:09:12 EST 2020

Excerpts from Andy Lutomirski's message of December 29, 2020 10:56 am:
> On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin at gmail.com> wrote:
>>
>> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
>> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
>> > <mathieu.desnoyers at efficios.com> wrote:
>> >>
>> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto at kernel.org wrote:
>> >>
>> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
>> >> > <linux at armlinux.org.uk> wrote:
>> >> >>
>> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
>> >> >> > After chatting with rmk about this (but without claiming that any of
>> >> >> > this is his opinion), based on the manpage, I think membarrier()
>> >> >> > currently doesn't really claim to be synchronizing caches? It just
>> >> >> > serializes cores. So arguably if userspace wants to use membarrier()
>> >> >> > to synchronize code changes, userspace should first do the code
>> >> >> > change, then flush icache as appropriate for the architecture, and
>> >> >> > then do the membarrier() to ensure that the old code is unused?
>> >>
>> >> ^ exactly, yes.
>> >>
>> >> >> >
>> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
>> >> >> > syscall. That might cause you to end up with two IPIs instead of one
>> >> >> > in total, but we probably don't care _that_ much about extra IPIs on
>> >> >> > 32-bit arm?
>> >>
>> >> This was the original thinking, yes. The cacheflush IPI will flush specific
>> >> regions of code, and the membarrier IPI issues context synchronizing
>> >> instructions.
>>
>> APIs should be written in terms of the service they provide to
>> userspace, and in highest level terms as possible, rather than directing
>> hardware to do some low level operation. Unfortunately we're stuck with
>> this for now. We could deprecate it and replace it though.
>>
>> If userspace wants to modify code and ensure that after the system call
>> returns then no other thread will be executing the previous code, then
>> there should be an API for that. It could actually combine the two IPIs
>> into one for architectures that require both too.
> 
> I agree.  The membarrier API for SYNC_CORE is pretty nasty.  I would
> much prefer a real API for JITs to use.
> 
>>
>> >>
>> >> Architectures with coherent i/d caches don't need the cacheflush step.
>> >
>> > There are different levels of coherency -- VIVT architectures may have
>> > differing requirements compared to PIPT, etc.
>> >
>> > In any case, I feel like the approach taken by the documentation is
>> > fundamentally confusing.  Architectures don't all speak the same
>> > language  How about something like:
>> >
>> > The SYNC_CORE operation causes all threads in the caller's address
>> > space (including the caller) to execute an architecture-defined
>> > barrier operation.  membarrier() will ensure that this barrier is
>> > executed at a time such that all data writes done by the calling
>> > thread before membarrier() are made visible by the barrier.
>> > Additional architecture-dependent cache management operations may be
>> > required to use this for JIT code.
>>
>> As said this isn't what SYNC_CORE does, and it's not what powerpc
>> context synchronizing instructions do either, it will very much
>> re-order visibility of stores around such an instruction.
> 
> Perhaps the docs should be entirely arch-specific.  It may well be
> impossible to state what it does in an arch-neutral way.

I think what I wrote above -- after the call returns, none of the
threads in the process will be executing instructions overwritten
previously by the caller (provided their i-caches are made coherent
with the caller's modifications).

>> A thread completes store instructions into a store queue, which is
>> as far as a context synchronizing event goes. Visibility comes at
>> some indeterminite time later.
> 
> As currently implemented, it has the same visibility semantics as
> regular membarrier, too.

Ah I actually missed that SYNC_CORE is in _addition_ to the memory
barriers, and that's documented API too, not just implementation
sorry.

> So if I do:
> 
> a = 1;
> membarrier(SYNC_CORE);
> b = 1;
> 
> and another thread does:
> 
> while (READ_ONCE(b) != 1)
>   ;
> barrier();
> assert(a == 1);

Right that's true, due to the MEMBARRIER_CMD_PRIVATE_EXPEDITED. Neither
that barrier or the added SYNC_CORE behaviour imply visibility though.

> 
> then the assertion will pass.  Similarly, one can do this, I hope:
> 
> memcpy(codeptr, [some new instructions], len);
> arch_dependent_cache_flush(codeptr, len);
> membarrier(SYNC_CORE);
> ready = 1;
> 
> and another thread does:
> 
> while (READ_ONCE(ready) != 1)
>   ;
> barrier();
> (*codeptr)();
> 
> arch_dependent_cache_flush is a nop on x86.  On arm and arm64, it
> appears to be a syscall, although maybe arm64 can do it from
> userspace.  I still don't know what it is on powerpc.
> 
> Even using the term "cache" here is misleading.  x86 chips have all
> kinds of barely-documented instruction caches, and they have varying
> degrees of coherency.  The architecture actually promises that, if you
> do a certain incantation, then you get the desired result.
> membarrier() allows a user to do this incantation.  But trying to
> replicate the incantation verbatim on an architecture like ARM is
> insufficient, and trying to flush the things that are documented as
> being caches on x86 is expensive and a complete waste of time on x86.
> When you write some JIT code, you do *not* want to flush it all the
> way to main memory, especially on CPUs don't have the ability to write
> back invalidating.  (That's most CPUs.)
> 
> Even on x86, I suspect that the various decoded insn caches are rather
> more coherent than documented, and I have questions in to Intel about
> this.  No answers yet.
> 
> So perhaps the right approach is to say that membarrier() helps you
> perform the architecture-specific sequence of steps needed to safely
> modify code.  On x86, you use it like this.  On arm64, you do this
> other thing.  On powerpc, you do something else.

I think it should certainly be documented in terms of what guarantees
it provides to application, _not_ the kinds of instructions it may or
may not induce the core to execute. And if existing API can't be
re-documented sanely, then deprecatd and new ones added that DTRT.
Possibly under a new system call, if arch's like ARM want a range
flush and we don't want to expand the multiplexing behaviour of
membarrier even more (sigh).

> 
>>
>> I would be surprised if x86's serializing instructions were different
>> than powerpc. rdtsc ordering or flushing stores to cache would be
>> surprising.
>>
> 
> At the very least, x86 has several levels of what ARM might call
> "context synchronization" AFAICT.  STAC, CLAC, and POPF do a form of
> context synchronization in that the changes they cause to the MMU take
> effect immediately, but they are not documented as synchronizing the
> instruction stream.  "Serializing" instructions do all kinds of
> things, not all of which may be architecturally visible at all.
> MFENCE and LFENCE do various complicated things, and LFENCE has magic
> retroactive capabilities on old CPUs that were not documented when
> those CPUs were released.  SFENCE does a different form of
> synchronization entirely.  LOCK does something else.  (The
> relationship between LOCK and MFENCE is confusing at best.)  RDTSC
> doesn't serialize anything at all, but RDTSCP does provide a form of
> serialization that's kind of ilke LFENCE.  It's a mess.  Even the
> manuals are inconsistent about what "serialize" means.  JMP has its
> own magic on x86, but only on very very old CPUs.
> 
> The specific instruction that flushes everything into the coherency
> domain is SFENCE, and SFENCE is not, for normal purposes, needed for
> self- or cross-modifying code.
> 

Good reason to avoid such language in the system call interface!

Thanks,
Nick