[PATCH v5] riscv: hwprobe: Fix stale vDSO data for late-initialized keys at boot
Palmer Dabbelt
palmer at dabbelt.com
Wed Jul 16 11:48:03 PDT 2025
On Sat, 05 Jul 2025 08:08:28 PDT (-0700), wangjingwei at iscas.ac.cn wrote:
> The hwprobe vDSO data for some keys, like MISALIGNED_VECTOR_PERF,
> is determined by an asynchronous kthread. This can create a race
> condition where the kthread finishes after the vDSO data has
> already been populated, causing userspace to read stale values.
>
> To fix this, a completion-based framework is introduced to robustly
> synchronize the async probes with the vDSO data population. The
> waiting function, init_hwprobe_vdso_data(), now blocks on
> wait_for_completion() until all probes signal they are done.
>
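(For anyone following along: the "sentinel-count" pattern being described
boils down to something like the minimal sketch below, using the same
kernel primitives as the patch but with illustrative names.)

#include <linux/atomic.h>
#include <linux/completion.h>

/* Start at 1: the extra "sentinel" reference belongs to the waiter. */
static atomic_t pending = ATOMIC_INIT(1);
static DECLARE_COMPLETION(all_done);

/* Called once for every asynchronous probe that gets kicked off. */
static void probe_register(void)
{
	atomic_inc(&pending);
}

/* Called by each asynchronous probe when it finishes. */
static void probe_finished(void)
{
	if (atomic_dec_and_test(&pending))
		complete(&all_done);	/* we dropped the last reference */
}

/* The single waiter (the vDSO data population path). */
static void wait_for_probes(void)
{
	/*
	 * Drop the sentinel.  If that takes the count to zero, every probe
	 * has already finished and there is nothing to wait for; otherwise
	 * the last probe to finish will hit zero and signal the completion.
	 */
	if (!atomic_dec_and_test(&pending))
		wait_for_completion(&all_done);
}

The ATOMIC_INIT(1) sentinel is what lets the waiter skip the wait
entirely when no async probe was ever registered, or when they all
finished before the waiter arrived.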
> Furthermore, to prevent this potential blocking from impacting boot
> performance, the initialization is deferred to late_initcall. This
> is safe as the data is only required by userspace (which starts
> much later) and moves the synchronization delay off the critical
> boot path.
>
> Reported-by: Tsukasa OI <research_trasio at irq.a4lg.com>
> Closes: https://lore.kernel.org/linux-riscv/760d637b-b13b-4518-b6bf-883d55d44e7f@irq.a4lg.com/
> Fixes: e7c9d66e313b ("RISC-V: Report vector unaligned access speed hwprobe")
> Cc: Palmer Dabbelt <palmer at dabbelt.com>
> Cc: Alexandre Ghiti <alexghiti at rivosinc.com>
> Cc: Olof Johansson <olof at lixom.net>
> Cc: stable at vger.kernel.org
> Signed-off-by: Jingwei Wang <wangjingwei at iscas.ac.cn>
> ---
> Changes in v5:
> - Reworked the synchronization logic to a robust "sentinel-count"
> pattern based on feedback from Alexandre.
> - Fixed a "multiple definition" linker error for nommu builds by changing
> the header-file stub functions to `static inline`, as pointed out by Olof.
> - Updated the commit message to better explain the rationale for moving
> the vDSO initialization to `late_initcall`.
>
> Changes in v4:
> - Reworked the synchronization mechanism based on feedback from Palmer
> and Alexandre.
> - Instead of a post-hoc refresh, this version introduces a robust
> completion-based framework using an atomic counter to ensure async
> probes are finished before populating the vDSO.
> - Moved the vdso data initialization to a late_initcall to avoid
> impacting boot time.
>
> Changes in v3:
> - Retained existing blank line.
>
> Changes in v2:
> - Addressed Yixun's feedback regarding #ifdef CONFIG_MMU usage.
> - Updated commit message to provide a high-level summary.
> - Added Fixes tag for commit e7c9d66e313b.
>
> v1: https://lore.kernel.org/linux-riscv/20250521052754.185231-1-wangjingwei@iscas.ac.cn/T/#u
>
> arch/riscv/include/asm/hwprobe.h | 8 +++++++-
> arch/riscv/kernel/sys_hwprobe.c | 20 +++++++++++++++++++-
> arch/riscv/kernel/unaligned_access_speed.c | 9 +++++++--
> 3 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/arch/riscv/include/asm/hwprobe.h b/arch/riscv/include/asm/hwprobe.h
> index 7fe0a379474ae2c6..3b2888126e659ea1 100644
> --- a/arch/riscv/include/asm/hwprobe.h
> +++ b/arch/riscv/include/asm/hwprobe.h
> @@ -40,5 +40,11 @@ static inline bool riscv_hwprobe_pair_cmp(struct riscv_hwprobe *pair,
>
> return pair->value == other_pair->value;
> }
> -
> +#ifdef CONFIG_MMU
> +void riscv_hwprobe_register_async_probe(void);
> +void riscv_hwprobe_complete_async_probe(void);
> +#else
> +static inline void riscv_hwprobe_register_async_probe(void) {}
> +static inline void riscv_hwprobe_complete_async_probe(void) {}
> +#endif
> #endif
> diff --git a/arch/riscv/kernel/sys_hwprobe.c b/arch/riscv/kernel/sys_hwprobe.c
> index 0b170e18a2beba57..ee02aeb03e7bd3d8 100644
> --- a/arch/riscv/kernel/sys_hwprobe.c
> +++ b/arch/riscv/kernel/sys_hwprobe.c
> @@ -5,6 +5,8 @@
> * more details.
> */
> #include <linux/syscalls.h>
> +#include <linux/completion.h>
> +#include <linux/atomic.h>
> #include <asm/cacheflush.h>
> #include <asm/cpufeature.h>
> #include <asm/hwprobe.h>
> @@ -467,6 +469,20 @@ static int do_riscv_hwprobe(struct riscv_hwprobe __user *pairs,
>
> #ifdef CONFIG_MMU
>
> +static DECLARE_COMPLETION(boot_probes_done);
> +static atomic_t pending_boot_probes = ATOMIC_INIT(1);
> +
> +void riscv_hwprobe_register_async_probe(void)
> +{
> + atomic_inc(&pending_boot_probes);
> +}
> +
> +void riscv_hwprobe_complete_async_probe(void)
> +{
> + if (atomic_dec_and_test(&pending_boot_probes))
> + complete(&boot_probes_done);
> +}
> +
> static int __init init_hwprobe_vdso_data(void)
> {
> struct vdso_arch_data *avd = vdso_k_arch_data;
> @@ -474,6 +490,8 @@ static int __init init_hwprobe_vdso_data(void)
> struct riscv_hwprobe pair;
> int key;
>
> + if (unlikely(!atomic_dec_and_test(&pending_boot_probes)))
> + wait_for_completion(&boot_probes_done);
I'm not actually sure this is safe.
That said, it will end up bringing us back to the situation where we're
waiting for CPUs to probe before we boot. So even if it is safe, it's
not great, as we'd be reintroducing the boot time regression.
We'd be better off if we could defer waiting on the completion until we
actually need the values. I think we could do that with something like
diff --git a/arch/riscv/include/asm/vdso/arch_data.h b/arch/riscv/include/asm/vdso/arch_data.h
index da57a3786f7a..88b37af55175 100644
--- a/arch/riscv/include/asm/vdso/arch_data.h
+++ b/arch/riscv/include/asm/vdso/arch_data.h
@@ -12,6 +12,12 @@ struct vdso_arch_data {
/* Boolean indicating all CPUs have the same static hwprobe values. */
__u8 homogeneous_cpus;
+
+ /*
+ * A gate to check and see if the hwprobe data is actually ready, as
+ * probing is deferred to avoid boot slowdowns.
+ */
+ __u8 ready;
};
#endif /* __RISCV_ASM_VDSO_ARCH_DATA_H */
diff --git a/arch/riscv/kernel/sys_hwprobe.c b/arch/riscv/kernel/sys_hwprobe.c
index ee02aeb03e7b..9c809c24a1d9 100644
--- a/arch/riscv/kernel/sys_hwprobe.c
+++ b/arch/riscv/kernel/sys_hwprobe.c
@@ -454,19 +454,6 @@ static int hwprobe_get_cpus(struct riscv_hwprobe __user *pairs,
return 0;
}
-static int do_riscv_hwprobe(struct riscv_hwprobe __user *pairs,
- size_t pair_count, size_t cpusetsize,
- unsigned long __user *cpus_user,
- unsigned int flags)
-{
- if (flags & RISCV_HWPROBE_WHICH_CPUS)
- return hwprobe_get_cpus(pairs, pair_count, cpusetsize,
- cpus_user, flags);
-
- return hwprobe_get_values(pairs, pair_count, cpusetsize,
- cpus_user, flags);
-}
-
#ifdef CONFIG_MMU
static DECLARE_COMPLETION(boot_probes_done);
@@ -483,15 +470,20 @@ void riscv_hwprobe_complete_async_probe(void)
complete(&boot_probes_done);
}
-static int __init init_hwprobe_vdso_data(void)
+static int complete_hwprobe_vdso_data(void)
{
struct vdso_arch_data *avd = vdso_k_arch_data;
u64 id_bitsmash = 0;
struct riscv_hwprobe pair;
int key;
+ /* We've probably already produced these values. */
+ if (likely(avd->ready))
+ return 0;
+
if (unlikely(!atomic_dec_and_test(&pending_boot_probes)))
wait_for_completion(&boot_probes_done);
+
/*
* Initialize vDSO data with the answers for the "all CPUs" case, to
* save a syscall in the common case.
@@ -519,13 +511,50 @@ static int __init init_hwprobe_vdso_data(void)
* vDSO should defer to the kernel for exotic cpu masks.
*/
avd->homogeneous_cpus = id_bitsmash != 0 && id_bitsmash != -1;
+
+ /*
+ * Make sure all the VDSO values are visible before we look at them.
+ * This pairs with the implicit "no speculatively visible accesses"
+ * barrier in the VDSO hwprobe code.
+ */
+ smp_wmb();
+ avd->ready = true;
return 0;
}
+static int __init init_hwprobe_vdso_data(void)
+{
+ struct vdso_arch_data *avd = vdso_k_arch_data;
+
+ /*
+ * Prevent the vDSO cached values from being used, as they're not ready
+ * yet.
+ */
+ avd->ready = false;
+
+ return 0;
+}
+
late_initcall(init_hwprobe_vdso_data);
+#else
+static int complete_hwprobe_vdso_data(void) { return 0; }
#endif /* CONFIG_MMU */
+static int do_riscv_hwprobe(struct riscv_hwprobe __user *pairs,
+ size_t pair_count, size_t cpusetsize,
+ unsigned long __user *cpus_user,
+ unsigned int flags)
+{
+ complete_hwprobe_vdso_data();
+
+ if (flags & RISCV_HWPROBE_WHICH_CPUS)
+ return hwprobe_get_cpus(pairs, pair_count, cpusetsize,
+ cpus_user, flags);
+
+ return hwprobe_get_values(pairs, pair_count, cpusetsize,
+ cpus_user, flags);
+}
+
+
SYSCALL_DEFINE5(riscv_hwprobe, struct riscv_hwprobe __user *, pairs,
size_t, pair_count, size_t, cpusetsize, unsigned long __user *,
cpus, unsigned int, flags)
diff --git a/arch/riscv/kernel/vdso/hwprobe.c b/arch/riscv/kernel/vdso/hwprobe.c
index 2ddeba6c68dd..bf77b4c1d2d8 100644
--- a/arch/riscv/kernel/vdso/hwprobe.c
+++ b/arch/riscv/kernel/vdso/hwprobe.c
@@ -27,7 +27,7 @@ static int riscv_vdso_get_values(struct riscv_hwprobe *pairs, size_t pair_count,
* homogeneous, then this function can handle requests for arbitrary
* masks.
*/
- if ((flags != 0) || (!all_cpus && !avd->homogeneous_cpus))
+ if ((flags != 0) || (!all_cpus && !avd->homogeneous_cpus) || unlikely(!avd->ready))
return riscv_hwprobe(pairs, pair_count, cpusetsize, cpus, flags);
/* This is something we can handle, fill out the pairs. */
(which I haven't even built yet, as I don't have a compiler set up on
this Mac yet...).
There's probably a better way to do this that would require a bit more
work, as there's sort of a double-completion here (each probe's
workqueue and then the top-level completion). In the long run we
should probably fold that all together so we can just block on probing
the values we need, but this was quick...
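Just to make "fold that all together" concrete, one possible shape
(entirely hypothetical -- none of these names exist today) would be a
per-key completion, so a request only blocks on the keys it actually
asks about:

#include <linux/bitmap.h>
#include <linux/completion.h>
#include <asm/hwprobe.h>	/* RISCV_HWPROBE_MAX_KEY */

/* Keys whose values come from an asynchronous probe. */
static DECLARE_BITMAP(hwprobe_deferred_keys, RISCV_HWPROBE_MAX_KEY + 1);
static struct completion hwprobe_key_done[RISCV_HWPROBE_MAX_KEY + 1];

/* Called at early boot, before the async probe for @key is kicked off. */
static void riscv_hwprobe_defer_key(int key)
{
	init_completion(&hwprobe_key_done[key]);
	set_bit(key, hwprobe_deferred_keys);
}

/* Called by the async probe once @key's value has been written. */
static void riscv_hwprobe_key_ready(int key)
{
	complete_all(&hwprobe_key_done[key]);
}

/* Called from the hwprobe resolution path before reading @key's value. */
static void riscv_hwprobe_wait_for_key(int key)
{
	if (test_bit(key, hwprobe_deferred_keys))
		wait_for_completion(&hwprobe_key_done[key]);
}

That would let both the syscall and the vDSO fill path block on exactly
the values they need, and the global sentinel count (and the
double-completion) would go away.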
> /*
> * Initialize vDSO data with the answers for the "all CPUs" case, to
> * save a syscall in the common case.
> @@ -504,7 +522,7 @@ static int __init init_hwprobe_vdso_data(void)
> return 0;
> }
>
> -arch_initcall_sync(init_hwprobe_vdso_data);
> +late_initcall(init_hwprobe_vdso_data);
>
> #endif /* CONFIG_MMU */
>
> diff --git a/arch/riscv/kernel/unaligned_access_speed.c b/arch/riscv/kernel/unaligned_access_speed.c
> index ae2068425fbcd207..4b8ad2673b0f7470 100644
> --- a/arch/riscv/kernel/unaligned_access_speed.c
> +++ b/arch/riscv/kernel/unaligned_access_speed.c
> @@ -379,6 +379,7 @@ static void check_vector_unaligned_access(struct work_struct *work __always_unus
> static int __init vec_check_unaligned_access_speed_all_cpus(void *unused __always_unused)
> {
> schedule_on_each_cpu(check_vector_unaligned_access);
Do we need something like
diff --git a/arch/riscv/kernel/unaligned_access_speed.c b/arch/riscv/kernel/unaligned_access_speed.c
index 4b8ad2673b0f..0f4eb95072ea 100644
--- a/arch/riscv/kernel/unaligned_access_speed.c
+++ b/arch/riscv/kernel/unaligned_access_speed.c
@@ -370,6 +370,12 @@ static void check_vector_unaligned_access(struct work_struct *work __always_unus
(speed == RISCV_HWPROBE_MISALIGNED_VECTOR_FAST) ? "fast" : "slow");
per_cpu(vector_misaligned_access, cpu) = speed;
+ /*
+ * Ensure the store that sets up the misaligned access value is
+ * visible before it is used by the other CPUs. This orders with the
+ * atomic_dec_and_test() in riscv_hwprobe_complete_async_probe().
+ */
+ smp_wmb();
free:
__free_pages(page, MISALIGNED_BUFFER_ORDER);
to make sure we're ordered here? I think we're probably safe in
practice because the scheduler code has orderings. Either way it's an
existing bug, so I'll just send it along as another patch.
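For completeness, the pairing I have in mind is the usual
message-passing pattern, roughly as sketched below (illustrative only --
in practice complete()/wait_for_completion() and the workqueue flush
almost certainly provide this ordering already):

#include <linux/completion.h>
#include <linux/percpu.h>
#include <asm/barrier.h>

static DEFINE_PER_CPU(long, probed_value);
static DECLARE_COMPLETION(probe_done);

/* Producer: runs on the probing CPU. */
static void publish_result(long value)
{
	this_cpu_write(probed_value, value);
	/* Make the value visible before the "done" signal can be observed. */
	smp_wmb();
	complete(&probe_done);
}

/* Consumer: runs on the CPU that uses the probed value. */
static long read_result(int cpu)
{
	wait_for_completion(&probe_done);
	/* Pairs with the producer's smp_wmb() above. */
	smp_rmb();
	return per_cpu(probed_value, cpu);
}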
> + riscv_hwprobe_complete_async_probe();
>
> return 0;
> }
> @@ -473,8 +474,12 @@ static int __init check_unaligned_access_all_cpus(void)
> per_cpu(vector_misaligned_access, cpu) = unaligned_vector_speed_param;
> } else if (!check_vector_unaligned_access_emulated_all_cpus() &&
> IS_ENABLED(CONFIG_RISCV_PROBE_VECTOR_UNALIGNED_ACCESS)) {
> - kthread_run(vec_check_unaligned_access_speed_all_cpus,
> - NULL, "vec_check_unaligned_access_speed_all_cpus");
> + riscv_hwprobe_register_async_probe();
> + if (IS_ERR(kthread_run(vec_check_unaligned_access_speed_all_cpus,
> + NULL, "vec_check_unaligned_access_speed_all_cpus"))) {
> + pr_warn("Failed to create vec_unalign_check kthread\n");
> + riscv_hwprobe_complete_async_probe();
> + }
> }
>
> /*