[PATCH v4 3/4] riscv: Stop emitting preventive sfence.vma for new vmalloc mappings

Mon Jun 1 01:45:24 PDT 2026

Hi Alexandre,

We have been looking at commit 503638e0babf ("riscv: Stop emitting
preventive sfence.vma for new vmalloc mappings") and ran into a case
that we do not think the current early new_vmalloc_check covers.

The commit removes the preventive sfence.vma after installing a vmalloc
mapping and relies on a later kernel page fault to run
new_vmalloc_check. That works only if the first stale-translation fault
is taken while sscratch is still zero. On our core, however, the MMU may
cache or prefetch an invalid PTE result before the vmalloc PTE is
installed. Without an sfence.vma, that stale negative translation for the
vmalloc VA can remain cached for ASID A.

A sequence we can reproduce is:

  1. Before the vmalloc PTE is installed, the walker/prefetcher observes
     the VA and caches an invalid result for ASID A.
  2. The vmalloc PTE is installed as a global kernel mapping, but no
     sfence.vma is issued.
  3. A different task/ASID B accesses the same VA and installs a valid
     global translation.
  4. The same hart later has both the stale invalid result for ASID A and
     the valid global translation available for that VA. Which one is used
     is implementation-dependent until an sfence.vma is executed.
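
To make the sequence above concrete, here is a toy C model of it. This is
only a sketch, not kernel code and not a description of our hardware: the
array-based TLB, the names tlb_fill, tlb_match and model_sfence_vma, and
the choice to tag the negative entry with ASID A as a non-global entry are
all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; real TLB organization is implementation-specific. */
struct tlb_entry {
    bool in_use;
    unsigned long vpn;
    unsigned int asid;
    bool global;  /* G bit: entry matches every ASID */
    bool valid;   /* false = cached invalid-PTE (negative) result */
};

#define TLB_SLOTS 4
static struct tlb_entry tlb[TLB_SLOTS];

/* Cache one translation result in the first free slot. */
static void tlb_fill(unsigned long vpn, unsigned int asid,
                     bool global, bool valid)
{
    size_t i = 0;

    while (i < TLB_SLOTS && tlb[i].in_use)
        i++;
    assert(i < TLB_SLOTS);  /* the toy model never evicts */
    tlb[i] = (struct tlb_entry){ true, vpn, asid, global, valid };
}

/* Count the cached results that match (vpn, asid). */
static void tlb_match(unsigned long vpn, unsigned int asid,
                      int *n_valid, int *n_invalid)
{
    *n_valid = 0;
    *n_invalid = 0;
    for (size_t i = 0; i < TLB_SLOTS; i++) {
        if (!tlb[i].in_use || tlb[i].vpn != vpn)
            continue;
        if (!tlb[i].global && tlb[i].asid != asid)
            continue;
        if (tlb[i].valid)
            (*n_valid)++;
        else
            (*n_invalid)++;
    }
}

/* Stand-in for "sfence.vma <va>, x0": drop every entry for this VPN. */
static void model_sfence_vma(unsigned long vpn)
{
    for (size_t i = 0; i < TLB_SLOTS; i++)
        if (tlb[i].vpn == vpn)
            tlb[i].in_use = false;
}
```

In this model, filling a non-global invalid entry for ASID A (step 1) and
then a global valid entry (step 3) makes tlb_match for ASID A report one
valid and one invalid match for the same VPN, which is the ambiguity in
step 4. After model_sfence_vma for that VPN, neither entry matches.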

The failure we see happens in ret_from_exception while returning to user
mode:

  ret_from_exception:
      REG_L s0, PT_STATUS(sp)   /* succeeds via the valid global entry */
      ...
      csrw CSR_SCRATCH, tp      /* returning to user; sscratch is now non-zero */
  1:
      REG_L a0, PT_STATUS(sp)   /* hits the stale invalid entry and faults */

The second load takes a kernel page fault on the vmalloc stack. However,
because sscratch has already been restored to tp for the upcoming return
to user mode, handle_exception takes the bnez tp, .Lsave_context path and
never executes new_vmalloc_check, so the fault is not converted into the
intended sfence.vma + retry path.
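
The window can be stated as a small C model. Again this is only a sketch
under assumptions: the enum, the function names and the three-step view of
ret_from_exception are hypothetical simplifications of the excerpt above,
and the only behavior modeled is that new_vmalloc_check is reached when
sscratch is zero at trap entry.

```c
#include <stdbool.h>

/*
 * Models only the entry test in handle_exception: the bnez tp,
 * .Lsave_context branch is taken when sscratch was non-zero, and that
 * path does not run new_vmalloc_check.
 */
static bool trap_runs_new_vmalloc_check(unsigned long sscratch_at_trap)
{
    return sscratch_at_trap == 0;
}

/* Simplified (hypothetical) view of the return-to-user sequence. */
enum ret_step {
    STEP_FIRST_LOAD,      /* REG_L s0, PT_STATUS(sp) */
    STEP_WRITE_SSCRATCH,  /* csrw CSR_SCRATCH, tp    */
    STEP_SECOND_LOAD,     /* REG_L a0, PT_STATUS(sp) */
};

/*
 * Would a vmalloc-stack fault taken at faulting_step reach
 * new_vmalloc_check? tp is assumed non-zero, as it is for a task
 * returning to user mode.
 */
static bool fault_is_covered(enum ret_step faulting_step, unsigned long tp)
{
    unsigned long sscratch = 0;  /* zero while running in the kernel */

    /* Replay the steps that complete before the faulting one. */
    for (int s = STEP_FIRST_LOAD; s < (int)faulting_step; s++)
        if (s == STEP_WRITE_SSCRATCH)
            sscratch = tp;

    return trap_runs_new_vmalloc_check(sscratch);
}
```

With any non-zero tp, fault_is_covered(STEP_FIRST_LOAD, tp) is true and
fault_is_covered(STEP_SECOND_LOAD, tp) is false, which matches the failure
we observe.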

Our understanding is that this CPU behavior is allowed: without an
sfence.vma after changing the PTE, software cannot require the MMU to stop
using a stale invalid translation result. Caching invalid translation
results can also be useful for performance.

Is the delayed-vmalloc-fence scheme intended to rely on stale invalid
entries not coexisting with a later valid global entry for the same VA?
Or should new_vmalloc_check also cover the short return-to-user window
after sscratch is made non-zero but before the final sret?

If the latter, would you prefer a fix that delays writing sscratch until
after the last access through the vmalloc stack in ret_from_exception, or
a fix that detects this S-mode vmalloc fault even when sscratch is
non-zero?

Thanks,
Yaxing