[PATCH] riscv: Add sysctl to control discard of vstate during syscall

Fri Aug 1 14:41:51 PDT 2025

On Wed, Jul 30, 2025 at 06:05:59PM -0700, Palmer Dabbelt wrote:
> My first guess here would be that trashing the V register state is still
> faster on the machines that triggered this patch, it's just that the way
> we're trashing it is slow.  We're doing some wacky things in there (VILL,
> LMUL, clearing to -1), so it's not surprising that some implementations are
> slow on these routines.
> 
> This came up during the original patch and we decided to just go with this
> way (which is recommended by the ISA) until someone could demonstrate it's
> slow, so sounds like it's time to go revisit those.
> 
> So I'd start with something like
> 
>    diff --git a/arch/riscv/include/asm/vector.h b/arch/riscv/include/asm/vector.h
>    index b61786d43c20..1fba33e62d2b 100644
>    --- a/arch/riscv/include/asm/vector.h
>    +++ b/arch/riscv/include/asm/vector.h
>    @@ -287,7 +287,6 @@ static inline void __riscv_v_vstate_discard(void)
>                    "vmv.v.i        v8, -1\n\t"
>                    "vmv.v.i        v16, -1\n\t"
>                    "vmv.v.i        v24, -1\n\t"
>    -               "vsetvl         %0, x0, %1\n\t"
>                    ".option pop\n\t"
>                    : "=&r" (vl) : "r" (vtype_inval));
> 
> to try and see if we're tripping over bad implementation behavior, in which
> case we can just hide this all in the kernel.  Then we can split out these
> performance issues from other things like lazy save/restore and a
> V-preserving uABI, as it stands this is all sort of getting mixed up.

Thank you for your insights and the suggestion of removing vsetvl.

On our v6.16-rc1 branch [1], the average latency of getppid() is 198 ns
with the existing upstream behavior in __riscv_v_vstate_discard():

debian@tt-blackhole:~$ ./null_syscall --vsetvli
vsetvli complete
 iterations: 1000000000
   duration: 198 seconds
avg latency: 198.10 ns
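
For context, the measurement is conceptually along these lines. This is
only a minimal sketch and not the actual null_syscall source: the helper
names now_ns() and avg_getppid_ns() are made up for illustration, and it
leaves out the --vsetvli step (which, judging by the output above,
executes a vsetvli before the timed loop).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Monotonic clock reading in nanoseconds (hypothetical helper). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Average latency in ns of `iters` back-to-back getppid() syscalls.
 * The raw syscall() wrapper is used so that every iteration really
 * enters the kernel (hypothetical helper).
 */
static double avg_getppid_ns(uint64_t iters)
{
	uint64_t start, end;

	assert(iters > 0);

	start = now_ns();
	for (uint64_t i = 0; i < iters; i++)
		syscall(SYS_getppid);
	end = now_ns();

	return (double)(end - start) / (double)iters;
}
```

Calling avg_getppid_ns(1000000000) from main() and printing the result
would correspond to the "avg latency" line above.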

I removed 'vsetvl' as you suggested, but the average latency only
dropped from 198.1 ns to 197.5 ns, so it seems the other instructions
account for most of the time on the X280 cores:

debian@tt-blackhole:~$ ./null_syscall --vsetvli
vsetvli complete
 iterations: 1000000000
   duration: 197 seconds
avg latency: 197.53 ns

For comparison, the average latency is 150 ns with this patch applied
and abi.riscv_v_vstate_discard=0, which skips all of the clobbering
assembly. That puts the cost of the discard at roughly 48 ns per syscall.
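
For anyone wanting to flip the knob from a test program rather than with
sysctl(8), something like the sketch below should do. The helper name
write_sysctl_int() is hypothetical, and the path
/proc/sys/abi/riscv_v_vstate_discard is an assumption derived from the
usual mapping of sysctl names onto /proc/sys; it only exists with this
patch applied.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write a single integer to a sysctl file.
 * Returns 0 on success, -1 on failure (hypothetical helper).
 */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", value) > 0;
	/* The write may only hit the kernel on flush, so check fclose() too. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

With the patch applied, write_sysctl_int("/proc/sys/abi/riscv_v_vstate_discard", 0)
run as root would correspond to the abi.riscv_v_vstate_discard=0 setting
used for the 150 ns measurement.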

Do you have any other suggestions for the __riscv_v_vstate_discard()
inline assembly that would be worth testing on the X280 cores?

Thanks,
Drew

[1] https://github.com/tenstorrent/linux/tree/tt-blackhole-v6.16-rc1