[RFC PATCH] membarrier: riscv: Provide core serializing command

Fri Oct 13 10:29:34 PDT 2023

On Mon, 07 Aug 2023 06:19:18 PDT (-0700), parri.andrea at gmail.com wrote:
>> One more noteworthy detail: if a system call similar to ARM cacheflush(2) is implemented for
>> RISC-V, perhaps an iovec ABI (similar to readv(2)/writev(2)) would be relevant to handle
>> batching of cache flushing when address ranges are not contiguous. Maybe with a new name
>> like "cacheflushv(2)", so eventually other architectures could implement it as well ?
>
> I believe that's a sensible idea.  But the RISC-V maintainers can provide
> a more reliable feedback.

Sorry I missed this, I'm still a bit backlogged from COVID.  A few of us 
were having a meeting, just to try and summarize (many of these points 
came up in the thread, so sorry for rehashing things):

We don't have a fence.i in the scheduling path, as fence.i is very slow 
on systems that implement it by flushing the icache.  Instead we have a 
mechanism for deferring the fences (see flush_icache_deferred, though 
I'm no longer sure that's correct which I'll mention below).  As a 
result userspace can't do a fence.i directly, but instead needs to make 
a syscall/vdsocall so the kernel can do this bookkeeping.  There's some 
proposals for ISA extensions that replace fence.i, but they're still WIP 
and there's a lot of fence.i-only hardware so we'll have to deal with 
it.

When we did this we had a feeling this may be sub-optimal for systems 
that have faster fence.i implementations (ie, coherent instruction 
caches), but nobody's gotten around to doing that yet -- and maybe 
there's no hardware that behaves this way.  The rough plan was along the 
lines of adding a prctl() where userspace can request the ability to 
directly emit fence.i, which would then result in the kernel eagerly 
emitting fence.i when scheduling.  Some of the Java people have been 
asking for this sort of feature.

>From looking at the membarrier arch/scheduler hooks, I think we might 
have a bug in our deferred icache flushing mechanism: specifically we 
hook into switch_mm(), which this comment has me worried about

         * When switching through a kernel thread, the loop in
         * membarrier_{private,global}_expedited() may have observed that
         * kernel thread and not issued an IPI. It is therefore possible to
         * schedule between user->kernel->user threads without passing though
         * switch_mm(). Membarrier requires a barrier after storing to
         * rq->curr, before returning to userspace, so provide them here:

Even if there's not a bug in the RISC-V stuff, it seems that we've ended 
up with pretty similar schemes here and we could remove some 
arch-specific code by de-duplicating things -- IIRC there was no 
membarrier when we did the original port, so I think we've just missed a 
cleanup opportunity.

So I'd propose doing the following:

* Pick up a patch like this.  Mmaybe exactly this, I'm going to give it 
  a proper review to make sure.
* Remove the RISC-V implemenation of deferred icache flushes and instead 
  just call into membarrier.  We might need to add some more bookkeeping 
  here, but from a quick look it seems membarrier is doing pretty much 
  the same thing.
* Implement that prctl that allows userspace to ask for permission to do 
  direct fence.i instructions -- sort of a different project, but if 
  we're going to be tearing into all this code we might as well do it 
  now.

Charlie is volunteering to do the work here, so hopefully we'll have 
something moving forward.

>
>   Andrea