vector status when vlen doesn't match

Wed Feb 28 15:32:47 PST 2024

On Wed, Feb 28, 2024 at 01:56:51PM -0800, Palmer Dabbelt wrote:
> On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
> > Yo Andy,
> > 
> > I was wondering if you could clarify something that came up today on
> > IRC.
> > 
> > When we enable vector we check vlenb and if it doesn't match the kernel
> > is suppose to not support vector. We do this in a few places, the first
> > is in riscv_fill_hwcap():
> > 
> > 	if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
> > 		riscv_v_setup_vsize();
> > 		/*
> > 		 * ISA string in device tree might have 'v' flag, but
> > 		 * CONFIG_RISCV_ISA_V is disabled in kernel.
> > 		 * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
> > 		 */
> > 		if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > 	}
> > 
> > Why does this riscv_v_setup_vsize() failing have no impact? Because this
> > is called only once on the boot CPU, right? I feel like it deserves a
> > comment as to why.
> 
> Unless I'm missing something, that first call can't fail: there's no
> previous riscv_v_size set, so there's nothing to trigger the failure.
> 
> Might warrant a comment, though...

Yeah. I did write (or maybe moreso remodel) that function so I know it
only gets called on the boot CPU, I was just pointing out what is not
immediately obvious about that particular callsite.

> > But the main thing I was wondering about was the other time it is called,
> > in smp_callin():
> > 	if (has_vector()) {
> > 		if (riscv_v_setup_vsize())
> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > 	}
> > 
> > If this fails, we clear the HWCAP bit, but I am not immediately grasping
> > how this affects the rest of the kernel.
> 
> I think it's actually a bug: we're not clearing the bit in riscv_isa, so
> this only ends up disabling the V bit in AT_HWCAP and thus doesn't do
> anything in the kernel.

Yeah, that's what I was getting at - I could not see how disabling it in
hwcap would make a difference either in the kernel or in hwprobe. I was
not sure if I was missing something in terms of the
userspace-using-vector side of things, but I was pretty sure that
hwprobe and in-kernel vector had this bug.

> > has_vector() just checks if the extension is supported by all cores on
> > the cpu using either an alternative or the extension support bitmap.
> > There are lots of sites in the kernel that use has_vector() as the
> > gating check (as far as I can tell), so if vlen does not match between
> > CPUs should we not be returning an error for has_vector()?
> > 
> > The alternative that has_vector() relies on is patched before the non-boot
> > CPUs are enabled, so we cannot modify the result once the non-boot CPUs
> > are in their callin functions and detect a vlen mismatch while at at the
> > same time using it in smp_calling(), so should this code be changed to
> > something along the lines of:
> > 
> > 	if (riscv_has_extension_unlikely(v)) {
> > 		if (riscv_v_setup_vsize()) {
> > 			elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > 			riscv_v_vlen_mismatch = true;
> > 		}
> > 	}
> > and then has_vector() becomes something like
> > static __always_inline bool has_vector(void)
> > {
> > 	return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
> > }
> > 
> > Probably there's value to be gained in static branches etc here, but I
> > was just trying to explain what I was getting at. Maybe I am missing
> > something though? I do remember talking about this back when the vector
> > patches were still in review.
> 
> I actually don't think that alone fixes it: even if we make has_vector()
> respect the changes from the new CPU (whether we re-apply the alternatives
> or make a more dynamic check), we'd need to make sure the code that has
> already looked at has_vector() doesn't get scheduled on one of the
> shorter-VLENB CPUs.

Is it possible to have that happen given this is in smp_callin()?
I guess the concern is some sort of hotplug situations, given that a CPU
could be hotplugged into a fully running system with a vlen mismatch.

> So maybe the right answer here is to flip things around, and just refuse to
> enable CPUs that come in with a VLENB different than the boot CPU?  That way
> we could avoid being broken by systems that behave this way,

smp_callin() doesn't return an error to report a failure, but it does
have a completion. I suppose we would just return early from
smp_callin() so that the completion never gets completed, in turn
causing __cpu_up() to fail?

> if someone ends
> up building one then they can figure out how to make it work better.

Apparently the next gen sophgo product does. Not sure if people will wnt
to run across all the cores though, apparently they different by a lot
more than vlen.
A DT property would allow us to get the vlen info before the CPUs have
themselves been brought up. Apparently there are Zvl extensions for
this, but I'd rather just have a property like we have for the cbo
sizes if it comes to DT properties.

Cheers,
Conor.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-riscv/attachments/20240228/a03684aa/attachment.sig>