[PATCH] riscv: Add sysctl to control discard of vstate during syscall
Drew Fustini
fustini at kernel.org
Fri Aug 1 14:41:51 PDT 2025
On Wed, Jul 30, 2025 at 06:05:59PM -0700, Palmer Dabbelt wrote:
> My first guess here would be that trashing the V register state is still
> faster on the machines that triggered this patch, it's just that the way
> we're trashing it is slow. We're doing some wacky things in there (VILL,
> LMUL, clearing to -1), so it's not surprising that some implementations are
> slow on these routines.
>
> This came up during the original patch and we decided to just go with this
> way (which is recommended by the ISA) until someone could demonstrate it's
> slow, so sounds like it's time to go revisit those.
>
> So I'd start with something like
>
> diff --git a/arch/riscv/include/asm/vector.h b/arch/riscv/include/asm/vector.h
> index b61786d43c20..1fba33e62d2b 100644
> --- a/arch/riscv/include/asm/vector.h
> +++ b/arch/riscv/include/asm/vector.h
> @@ -287,7 +287,6 @@ static inline void __riscv_v_vstate_discard(void)
> "vmv.v.i v8, -1\n\t"
> "vmv.v.i v16, -1\n\t"
> "vmv.v.i v24, -1\n\t"
> - "vsetvl %0, x0, %1\n\t"
> ".option pop\n\t"
> : "=&r" (vl) : "r" (vtype_inval));
>
> to try and see if we're tripping over bad implementation behavior, in which
> case we can just hide this all in the kernel. Then we can split out these
> performance issues from other things like lazy save/restore and a
> V-preserving uABI, as it stands this is all sort of getting mixed up.
Thank you for your insights and the suggestion of removing vsetvl.
Using our v6.16-rc1 branch [1], the avg duration of getppid() is 198 ns
with the existing upstream behavior in __riscv_v_vstate_discard():
debian@tt-blackhole:~$ ./null_syscall --vsetvli
vsetvli complete
iterations: 1000000000
duration: 198 seconds
avg latency: 198.10 ns
I removed 'vsetvl' as you suggested, but the average latency only
dropped by about 0.6 ns, to 197.5 ns, so it seems that the other
instructions are what take most of the time on the X280 cores:
debian@tt-blackhole:~$ ./null_syscall --vsetvli
vsetvli complete
iterations: 1000000000
duration: 197 seconds
avg latency: 197.53 ns
For comparison, the average latency is 150 ns when using this patch with
abi.riscv_v_vstate_discard=0, which skips all of the clobbering assembly.
Do you have any other suggestions for the __riscv_v_vstate_discard()
inline assembly that might be worth me testing on the X280 cores?
Thanks,
Drew
[1] https://github.com/tenstorrent/linux/tree/tt-blackhole-v6.16-rc1