[PATCH v3 0/4] riscv: optimize Vector context restore on syscall
Olof Johansson
olof at lixom.net
Thu May 21 20:30:56 PDT 2026
On Thu, May 21, 2026 at 04:09:40PM -0500, Andy Chiu wrote:
> On Thu, May 21, 2026 at 12:15:07PM -0700, Olof Johansson wrote:
> > Hi Andy,
> >
> > On Thu, May 21, 2026 at 11:25:16AM -0500, Andy Chiu wrote:
> > > This patch series optimizes riscv vector state handling across syscall
> > > boundaries and context switches. The kernel now keeps track of the
> > > INITIAL state in sstatus.vs to optimize unnecessary context management
> > > operations.
> > >
> > > This version merges daichengrong's RFC patch [1] for the state tracking
> > > code as it looks cleaner than my v2/v1.
> > >
> > > [1]: https://lore.kernel.org/linux-riscv/7ba2f4b7-8475-4ec3-ab31-58b332bda47e@iscas.ac.cn/#r
> > > Link to v2: https://lore.kernel.org/linux-riscv/20260402043414.2421916-1-andybnac@gmail.com/
> >
> > A patchset like this would be really helped by some kind of numbers in the
> > cover letter to indicate how much performance moved, given a claim of
> > optimization.
> >
>
> Thanks for pointing it out, I totally agree with you. I had included a
> test result on sifive's hardware in v2[1]. But that was on FPGA, I will
> test it on a real silicon as soon as I have an access. Sorry for the
> confusing claim here.
>
> My test was running on a vector enabled version of lat_ctx. I modified
> the main function to make sure the process touches vector, then run with
> 2 threads. Since lat_ctx uses syscall interface to notify another
> process, the kernel will trash their vector registers instead of wasting
> cycles on saving/restoring them.
>
> >
> > Just for kicks I tried a simple microbenchmark for syscalls from
> > a vector-enabled process:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/syscall.h>
> > #include <unistd.h>
> > #include <time.h>
> > #include <stdint.h>
> >
> > static inline uint64_t ns_now(void) {
> > struct timespec t;
> > clock_gettime(CLOCK_MONOTONIC, &t);
> > return t.tv_sec * 1000000000ull + t.tv_nsec;
> > }
> >
> > int main(int argc, char **argv) {
> > int iters = argc > 1 ? atoi(argv[1]) : 10000000;
> > int use_v = argc > 2 ? atoi(argv[2]) : 1;
> >
> > if (use_v) {
> > asm volatile(
> > ".option push\n\t.option arch, +v\n\t"
> > "vsetivli x0, 1, e32, m1, ta, ma\n\t"
> > "vmv.v.i v0, 1\n\t"
> > ".option pop\n\t" ::: "memory");
> > }
> >
> > for (int i = 0; i < 10000; i++) syscall(SYS_getppid); // warmup
> >
> > uint64_t t0 = ns_now();
> > for (int i = 0; i < iters; i++) syscall(SYS_getppid);
> > uint64_t t1 = ns_now();
> >
> > printf("V=%d %.1f ns/call (%lu ns / %d iters)\n",
> > use_v, (double)(t1 - t0) / iters, t1 - t0, iters);
> > return 0;
> > }
> >
> >
> > I compiled with gcc -O3, default GCC 14.2 on Debian 13. Host is x280
> > (Blackhole). Base kernel sources is 7.1.0-rc4-next-20260520 defconfig. Ran
> > with taskset to pin to one of the CPUs.
> >
> > The testcase doesn't use vector inbetween each syscall, but will obviously
> > have initiated the state (if started with '1' as second argument).
> >
> > Without this patchset:
> > V=1 242.9 ns/call (12144527848 ns / 50000000 iters)
> >
> > With this patchset:
> > V=1 264.5 ns/call (13226852900 ns / 50000000 iters)
>
> This 9% regression is suprising to me as this patch set (without the fix
> below) should be equivalent in performance on this code path (and better
> if getpid takes one process switch).
>
> Before the patch, nulling v resgisters happens at the entry point. After
> this patchset we mark sstatus.vs to INITIAL at the entry and nulling
> happens right before getting back to the user space.
>
> >
> > Interestingly enough, with V=0 test it sped up slightly (194.3 -> 189.5 ns).
> >
>
> This result is expected as with V=0 the kernel doesn't have to maintain
> vstate at all. But we are also looking into ways to improve mode switch
> latencies.
>
> > I repeated the runs a few times, with similar results so I don't think it's
> > explainable as noise.
> >
>
> Thanks for carrying out the experiment, it's very sound! I actually
> missed one thing on this patch for it to be optimized on this specific
> case:
>
> diff --git a/arch/riscv/include/asm/vector.h b/arch/riscv/include/asm/vector.h
> index 8c1e64e0dd0b..5d1282870a20 100644
> --- a/arch/riscv/include/asm/vector.h
> +++ b/arch/riscv/include/asm/vector.h
> @@ -40,6 +40,15 @@
> _res; \
> })
>
> +#define __riscv_v_vstate_check_gt(_val, TYPE) ({ \
> + bool _res; \
> + if (has_xtheadvector()) \
> + _res = ((_val) & SR_VS_THEAD) > SR_VS_##TYPE##_THEAD; \
> + else \
> + _res = ((_val) & SR_VS) > SR_VS_##TYPE; \
> + _res; \
> +})
> +
> extern unsigned long riscv_v_vsize;
> int riscv_v_setup_vsize(void);
> bool insn_is_vector(u32 insn_buf);
> @@ -323,7 +332,7 @@ static inline void riscv_v_vstate_set_restore(struct task_struct *task,
>
> static inline void riscv_v_vstate_discard(struct pt_regs *regs)
> {
> - if (riscv_v_vstate_query(regs)) {
> + if (__riscv_v_vstate_check_gt(regs->status, INITIAL)) {
> riscv_v_vstate_set_restore(current, regs);
> riscv_v_vstate_init(regs);
> }
>
> In this way if the user never touch vregs again then we will not null
> out the context at syscall exit.
Significantly better indeed, now ~191ns without vector, ~192ns with -- so
a proper optimization.
> Again, I only have it functionally tested at the moment. I appreciate it
> if you could get the number with the above diff. Meanwhile, I am going
> souce and run on an actual hardware and hopefully find the reason for
> the above regression, before rolling out v4.
Sounds good.
-Olof
More information about the linux-riscv
mailing list