[RFC PATCH] membarrier: riscv: Provide core serializing command

Fri Aug 4 13:06:05 PDT 2023

On 8/4/23 15:16, Andrea Parri wrote:
> On Fri, Aug 04, 2023 at 02:05:55PM -0400, Mathieu Desnoyers wrote:
>> On 8/4/23 10:59, Andrea Parri wrote:
>>>> What is the relationship between FENCE.I and instruction cache flush on
>>>> RISC-V?
>>>
>>> The exact nature of this relationship is implementation-dependent.  From
>>> commentary included in the ISA portion referred to in the changelog:
>>>
>>>     A simple implementation can flush the local instruction cache and
>>>     the instruction pipeline when the FENCE.I is executed.  A more
>>>     complex implementation might snoop the instruction (data) cache on
>>>     every data (instruction) cache miss, or use an inclusive unified
>>>     private L2 cache to invalidate lines from the primary instruction
>>>     cache when they are being written by a local store instruction.  If
>>>     instruction and data caches are kept coherent in this way, or if
>>>     the memory system consists of only uncached RAMs, then just the
>>>     fetch pipeline needs to be flushed at a FENCE.I.  [..]
>>>
>>> Mmh, does this help?
>>
>> Quoting
>>
>> https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf
>>
>> Chapter 3 "“Zifencei” Instruction-Fetch Fence, Version 2.0"
>>
>> "First, it has been recognized that on some systems, FENCE.I will be expensive to implement
>> and alternate mechanisms are being discussed in the memory model task group. In particular,
>> for designs that have an incoherent instruction cache and an incoherent data cache, or where
>> the instruction cache refill does not snoop a coherent data cache, both caches must be completely
>> flushed when a FENCE.I instruction is encountered. This problem is exacerbated when there are
>> multiple levels of I and D cache in front of a unified cache or outer memory system.
>>
>> Second, the instruction is not powerful enough to make available at user level in a Unix-like
>> operating system environment. The FENCE.I only synchronizes the local hart, and the OS can
>> reschedule the user hart to a different physical hart after the FENCE.I. This would require the
>> OS to execute an additional FENCE.I as part of every context migration. For this reason, the
>> standard Linux ABI has removed FENCE.I from user-level and now requires a system call to
>> maintain instruction-fetch coherence, which allows the OS to minimize the number of FENCE.I
>> executions required on current systems and provides forward-compatibility with future improved
>> instruction-fetch coherence mechanisms.
>>
>> Future approaches to instruction-fetch coherence under discussion include providing more
>> restricted versions of FENCE.I that only target a given address specified in rs1, and/or allowing
>> software to use an ABI that relies on machine-mode cache-maintenance operations."
>>
>> I am starting to suspect that even the people working on the RISC-V memory model have noticed
>> that letting a single instruction such as FENCE.I take care of both cache coherency
>> *and* flush the instruction pipeline will be a performance bottleneck, because it
>> can only clear the whole instruction cache.
>>
>> Other architectures are either cache-coherent, or have cache flushing which can be
>> performed on a range of addresses. This is kept apart from whatever instruction
>> flushes the instruction pipeline of the processor.
>>
>> By keeping instruction cache flushing separate from instruction pipeline flush, we can
>> let membarrier (and context switches, including thread migration) only care about the
>> instruction pipeline part, and leave instruction cache flush to either a dedicated
>> system call, or to specialized instructions which are available from user-mode.
>>
>> Considering that FENCE.I is forced to invalidate the whole i-cache, I don't think you
>> will get away with executing it from switch_mm without making performance go down the
>> drain on cache incoherent implementations.
>>
>> In my opinion, what we would need from RISC-V for membarrier (and context switch) is a
>> lightweight version of FENCE.I which only flushes the instruction pipeline of the local
>> processor. This should ideally come with a way for architectures with incoherent caches
>> to flush the relevant address ranges of the i-cache which are modified by a JIT. This
>> i-cache flush would not be required to flush the instruction pipeline, as it is typical
>> to batch invalidation of various address ranges together and issue a single instruction
>> pipeline flush on each CPU at the end. The i-cache flush could either be done by new
>> instructions available from user-space (similar to aarch64), or through privileged
>> instructions available through system calls (similar to arm cacheflush).
> 
> Thanks for the remarks, Mathieu.  I think it will be very helpful to
> RISC-V architects (and memory model people) to have this context and
> reasoning written down.

One more noteworthy detail: if a system call similar to ARM cacheflush(2) is implemented for
RISC-V, an iovec-based ABI (similar to readv(2)/writev(2)) could be useful to batch cache
flushes when the address ranges are not contiguous. It could get a new name, such as
"cacheflushv(2)", so that other architectures could eventually implement it as well.
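
To make the iovec idea a bit more concrete, here is a rough user-space sketch
of the shape such an interface could take. Everything named here is an
assumption for illustration only: no cacheflushv(2) system call exists, and
cacheflushv_sketch() is a hypothetical stand-in that walks the ranges and
calls the GCC/Clang builtin __builtin___clear_cache() on each one. A real
implementation would presumably live in the kernel and could batch the
per-range invalidations before issuing a single pipeline flush per CPU.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/uio.h>	/* struct iovec, as used by readv(2)/writev(2) */

/*
 * HYPOTHETICAL user-space stand-in for a "cacheflushv" interface.
 * No such system call exists; the name, signature and return value
 * are assumptions made for this sketch.
 *
 * Each iovec entry describes one (possibly non-contiguous) range of
 * freshly written code. Returns the number of entries processed, or
 * -1 with errno set to EINVAL on invalid arguments.
 */
static int cacheflushv_sketch(const struct iovec *iov, int iovcnt)
{
	if (iovcnt < 0 || (iov == NULL && iovcnt > 0)) {
		errno = EINVAL;
		return -1;
	}

	for (int i = 0; i < iovcnt; i++) {
		char *start = iov[i].iov_base;

		/* Skip empty ranges rather than treating them as errors. */
		if (iov[i].iov_len == 0)
			continue;

		/*
		 * Compiler builtin: makes stores to [start, end) visible
		 * to instruction fetch, using whatever mechanism the
		 * target provides (it may be a no-op on coherent targets).
		 */
		__builtin___clear_cache(start, start + iov[i].iov_len);
	}

	return iovcnt;
}
```

A JIT would fill one struct iovec per patched code region and pass the whole
array in a single call, the same way it would hand a scatter list to writev(2).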

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com