[PATCH v3 0/4] riscv: optimize Vector context restore on syscall

Thu May 21 12:15:07 PDT 2026

Hi Andy,

On Thu, May 21, 2026 at 11:25:16AM -0500, Andy Chiu wrote:
> This patch series optimizes riscv vector state handling across syscall
> boundaries and context switches. The kernel now keeps track of the
> INITIAL state in sstatus.vs to optimize unnecessary context management
> operations.
> 
> This version merges daichengrong's RFC patch [1] for the state tracking
> code as it looks cleaner than my v2/v1.
> 
> [1]: https://lore.kernel.org/linux-riscv/7ba2f4b7-8475-4ec3-ab31-58b332bda47e@iscas.ac.cn/#r
> Link to v2: https://lore.kernel.org/linux-riscv/20260402043414.2421916-1-andybnac@gmail.com/

A patchset like this would really benefit from some numbers in the cover
letter showing how much performance moved, given that it claims to be an
optimization.

Just for kicks I tried a simple microbenchmark for syscalls from
a vector-enabled process:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <time.h>
#include <stdint.h>

static inline uint64_t ns_now(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1000000000ull + t.tv_nsec;
}

int main(int argc, char **argv) {
    int iters = argc > 1 ? atoi(argv[1]) : 10000000;
    int use_v = argc > 2 ? atoi(argv[2]) : 1;

    if (use_v) {
        asm volatile(
            ".option push\n\t.option arch, +v\n\t"
            "vsetivli x0, 1, e32, m1, ta, ma\n\t"
            "vmv.v.i v0, 1\n\t"
            ".option pop\n\t" ::: "memory");
    }

    for (int i = 0; i < 10000; i++) syscall(SYS_getppid);  // warmup

    uint64_t t0 = ns_now();
    for (int i = 0; i < iters; i++) syscall(SYS_getppid);
    uint64_t t1 = ns_now();

    printf("V=%d %.1f ns/call (%llu ns / %d iters)\n",
           use_v, (double)(t1 - t0) / iters,
           (unsigned long long)(t1 - t0), iters);
    return 0;
}

I compiled with gcc -O3, using the default GCC 14.2 on Debian 13. The host
is an x280 (Blackhole). The base kernel source is 7.1.0-rc4-next-20260520
with defconfig. I ran it under taskset to pin it to one of the CPUs.

The testcase doesn't use vector in between syscalls, but it will obviously
have initialized the vector state (if started with '1' as the second
argument).

Without this patchset:
V=1 242.9 ns/call (12144527848 ns / 50000000 iters)

With this patchset:
V=1 264.5 ns/call (13226852900 ns / 50000000 iters)

Interestingly enough, the V=0 test sped up slightly (194.3 -> 189.5 ns).
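
To put those deltas in relative terms, here is a minimal sketch of the
arithmetic. pct_change() is a hypothetical helper, not part of the
benchmark above; it only restates the numbers already quoted.

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent.
 * Positive means the call got slower, negative means faster. */
static double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

pct_change(242.9, 264.5) comes out at roughly +8.9 (V=1 got slower), and
pct_change(194.3, 189.5) at roughly -2.5 (V=0 got faster).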

I repeated the runs a few times with similar results, so I don't think this
is explainable as noise.
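
One way to make "not noise" checkable would be to collect the ns/call
figure from each repeated run and compare the median and spread between
kernels. A rough sketch follows, assuming the per-run samples are gathered
into an array by hand; median_ns() and spread_pct() are hypothetical
helpers and were not used to produce the numbers above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts `samples` in place and returns the median. */
static double median_ns(double *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof(*samples), cmp_double);
    return (n % 2) ? samples[n / 2]
                   : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

/* (max - min) relative to the median, in percent. Sorts in place. */
static double spread_pct(double *samples, size_t n)
{
    double med = median_ns(samples, n);
    return (samples[n - 1] - samples[0]) / med * 100.0;
}
```

If the spread within each kernel is well below the difference between the
two medians, run-to-run noise is an unlikely explanation.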

Given that more code will be vector enabled in the new shiny RVA23 world
we are entering, I'm uncertain whether this is the right trade-off. You only
win back the added syscall cost if you need the vector context swapped in
without the lazy fault between calls.

I suspect running userspace workloads on an RVA23 platform (SpaceMIT
K3) with Ubuntu 26.04 would be the most meaningful data to collect. The
board I ordered is still in transit, unfortunately.

PS: There's a new build warning due to an unused 'uvstate' variable in
riscv_v_start_kernel_context() that you might want to fix.

-Olof