[tech-speculation-barriers] [PATCH] riscv, bpf: Emit fence.i for BPF_NOSPEC

Mon Jan 12 08:19:54 PST 2026

"Stefan O'Rear" <sorear at fastmail.com> writes:

> On Thu, Jan 8, 2026, at 10:37 PM, Bo Gan via lists.riscv.org wrote:
>> Hi Lukas,
>>
>> Stefan and I have some doubts on fence.i's effectiveness as speculation
>> barrier. Flushing entire local instruction cache and instruction pipeline
>> is not absolutely necessary on impl having coherent I/D caches. Quoting
>> from Unprivileged SPEC ver. 20250508:
>>
>> "The FENCE.I instruction was designed to support a wide variety of
>>   implementations. A simple implementation can flush the local instruction
>>   cache and the instruction pipeline when the FENCE.I is executed. A more
>>   complex implementation might snoop the instruction (data) cache on every
>>   data (instruction) cache miss, or use an inclusive unified private L2
>>   cache to invalidate lines from the primary instruction cache when they
>>   are being written by a local store instruction. If instruction and data
>>   caches are kept coherent in this way, or if the memory system consists of
>>   only uncached RAMs, then just the fetch pipeline needs to be flushed at a
>>   FENCE.I"
>
> Note that this is non-normative text and the actual range of allowed
> implementations is wider than this.
>
> I'm particularly concerned that the security property we appear to need (I am
> more familiar with u-arch vulnerabilities than BPF ISA details) is an _issue
> barrier_, but correctness for FENCE.I as currently specified only requires
> a _retirement barrier_.
>
> FENCE.I requires that instructions after the FENCE.I in program order not
> retire unless the hart can verify that they were not overwritten in memory
> between the time they were fetched and the memory-order point of the FENCE.I.
> Simple implementations will probably achieve this by re-fetching and
> preventing retirement of all instructions after the FENCE.I,

Are you aware of any CPU implementation that works like this?

> but if the
> instructions speculatively execute then it is not useful for a Spectre v1
> barrier.

You are correct. However, I believe a retirement barrier would still be
useful as a stopgap, as it should reduce the success rate and therefore
the bandwidth of the exploit. Also, because the verifier currently adds
barriers very early (not only before the non-constant-time (non-CT)
operation on a secret, but as soon as anything happens that is not
allowed architecturally), it might actually prevent some exploits. This
does of course not offer full protection, but it still makes it harder
to develop working exploits.
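
For reference, a minimal sketch of the instruction word under discussion.
FENCE.I is an I-type instruction in the MISC-MEM opcode space with
funct3 = 1 and all other fields zero, which gives 0x0000100f. The helper
names below (rv_i_insn_sketch, rv_fence_i_sketch) are hypothetical and
only illustrate the encoding; they are not taken from the patch or from
the kernel's JIT headers.

```c
#include <stdint.h>

/*
 * Generic RISC-V I-type encoder (hypothetical helper, for illustration):
 * imm[11:0] << 20 | rs1 << 15 | funct3 << 12 | rd << 7 | opcode.
 */
static inline uint32_t rv_i_insn_sketch(uint16_t imm11_0, uint8_t rs1,
					uint8_t funct3, uint8_t rd,
					uint8_t opcode)
{
	return ((uint32_t)(imm11_0 & 0xfff) << 20) |
	       ((uint32_t)(rs1 & 0x1f) << 15) |
	       ((uint32_t)(funct3 & 0x7) << 12) |
	       ((uint32_t)(rd & 0x1f) << 7) |
	       (uint32_t)(opcode & 0x7f);
}

/* FENCE.I: MISC-MEM opcode (0x0f), funct3 = 1, rd = rs1 = imm = 0. */
static inline uint32_t rv_fence_i_sketch(void)
{
	return rv_i_insn_sketch(0, 0, 1, 0, 0x0f);
}
```

A JIT would emit this 32-bit word wherever the verifier placed a
BPF_NOSPEC. Whether the hart then treats it as an issue barrier or only
as a retirement barrier is exactly the open question in this thread.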

> Issuing an instruction which cannot possibly retire wastes energy,
> but if FENCE.I is assumed to be an extremely rare operation this may not be
> a priority.
>
> Particularly complex implementations can go as far as to treat FENCE.I as a
> no-op if they snoop the reorder buffer as well as the instruction cache.
>
>> There's the question on overhead, too. Perhaps there's a more accurate and
>> lightweight insn available? I'm not an expert in u-arch. My gut feeling is
>> that we should not be dependent on specific impl's behavior and the riscv
>> SPEC should provide guidelines on speculation barrier instructions and how
>> to use them. Thus, I'm forwarding this to the Speculation Barriers Task-
>> Group, which I hope should be the perfect place to discuss such kind of
>> issues. @Speculation Barriers TG Please share your thoughts. Note that we
>> are dealing with existing HW, so we expect something to be working with
>> current SPEC and actual silicon. I'd be happy if I'm proven wrong, and
>> fence.i can actually be a speculation barrier. That's also a relief. Thank
>> you everyone.
>
> The JH7110 has 512 I-cache lines per core, all of which must be invalidated
> on a FENCE.I. I'm not sure how many cycles that takes for the invalidation,
> but some fraction of those will subsequently be needed before they would
> otherwise be evicted, which could add up to several thousand cycles of
> overhead depending on the cache miss latency, for a BPF program with a single
> BPF_NOSPEC. Compared to roughly one thousand cycles for a kernel entry and
> exit, it may be more practical to disable BPF and rely on userspace event
> processing for affected hardware, even if FENCE.I is otherwise useful as a
> speculation barrier.

Thank you very much. For JH7110 I think it would then be best to avoid
adding anything unless there is strong evidence that this CPU is
vulnerable to Spectre v1. (A quick search yielded no results, but maybe
someone else has more information on this.) Even if it is vulnerable,
people might want performant unprivileged eBPF simply for compatibility
reasons. As we cannot know whether untrusted unprivileged users are
actually a concern, a dmesg warning might be best then. Or maybe there
is some other construct that achieves the desired effect on JH7110.
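
To make the overhead estimate quoted above easier to reason about, here
is a back-of-the-envelope sketch. Only the 512 I-cache lines come from
the quoted text; the fraction of lines that get re-fetched and the miss
latency are placeholder assumptions, not measurements of the JH7110, and
the function name is hypothetical.

```c
#include <stdint.h>

/*
 * Rough refill cost of one FENCE.I that invalidates the whole I-cache:
 * (lines that are needed again) * (cycles per I-cache miss).
 * refetch_percent and miss_latency_cycles are assumed inputs.
 */
static inline uint64_t fence_i_refill_cycles(uint32_t icache_lines,
					     uint32_t refetch_percent,
					     uint32_t miss_latency_cycles)
{
	uint64_t refetched = (uint64_t)icache_lines * refetch_percent / 100;

	return refetched * miss_latency_cycles;
}
```

For example, fence_i_refill_cycles(512, 25, 40) evaluates to 5120, i.e.
"several thousand cycles" if a quarter of the lines are needed again at
an assumed 40 cycles per miss. That would be several times the roughly
one thousand cycles quoted for a kernel entry and exit, but the result
scales linearly with both assumed inputs.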

> I don't think it's possible to define a set of speculation barriers that apply
> to all possible existing and future hardware with the current specifications,
> because the ratified specifications cannot be changed and other than Zkt/Zvkt
> do not constrain uarch side channels.
>
> The TG is defining a new specification which includes semantically rich
> barriers that can have optimal overhead on new hardware. If we find a set of
> speculation barriers that work on some or (optimistically) all existing
> hardware, we would need to define a retroactive specification which documents
> that behavior, much like Zkt/Zvkt or many of the extensions defined in the
> profiles specification.

That would be very useful to eBPF.

Are you aware of any existing RISC-V CPU implementations where FENCE.I,
or any other instruction, delays issuing or retirement by a few cycles
(e.g., until concurrent stores have completed)? (I am aware that a
retirement barrier is formally not sufficient, but it is better than
nothing.) Anything like this might be preferable to simply filling up
the reorder buffer with NOP instructions. And even if part of the CPU
state may still be speculative, any such instruction would already be
useful as a (partial) stopgap solution.

Or does FENCE.I do this on any concrete CPU other than JH7110?