vector status when vlen doesn't match

Andy Chiu andy.chiu at sifive.com
Thu Feb 29 05:32:56 PST 2024


On Thu, Feb 29, 2024 at 7:32 AM Conor Dooley <conor at kernel.org> wrote:
>
> On Wed, Feb 28, 2024 at 01:56:51PM -0800, Palmer Dabbelt wrote:
> > On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
> > > Yo Andy,
> > >
> > > I was wondering if you could clarify something that came up today on
> > > IRC.
> > >
> > > When we enable vector we check vlenb and if it doesn't match the kernel
> > > is suppose to not support vector. We do this in a few places, the first
> > > is in riscv_fill_hwcap():
> > >
> > >     if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
> > >             riscv_v_setup_vsize();
> > >             /*
> > >              * ISA string in device tree might have 'v' flag, but
> > >              * CONFIG_RISCV_ISA_V is disabled in kernel.
> > >              * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
> > >              */
> > >             if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
> > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > >     }
> > >
> > > Why does this riscv_v_setup_vsize() failing have no impact? Because this
> > > is called only once on the boot CPU, right? I feel like it deserves a
> > > comment as to why.
> >
> > Unless I'm missing something, that first call can't fail: there's no
> > previous riscv_v_size set, so there's nothing to trigger the failure.
> >
> > Might warrant a comment, though...
>
> Yeah. I did write (or maybe moreso remodel) that function so I know it
> only gets called on the boot CPU, I was just pointing out what is not
> immediately obvious about that particular callsite.

Yes, this call can't fall. Ok, I'll update the comment.

>
> > > But the main thing I was wondering about was the other time it is called,
> > > in smp_callin():
> > >     if (has_vector()) {
> > >             if (riscv_v_setup_vsize())
> > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > >     }
> > >
> > > If this fails, we clear the HWCAP bit, but I am not immediately grasping
> > > how this affects the rest of the kernel.
> >
> > I think it's actually a bug: we're not clearing the bit in riscv_isa, so
> > this only ends up disabling the V bit in AT_HWCAP and thus doesn't do
> > anything in the kernel.
>
> Yeah, that's what I was getting at - I could not see how disabling it in
> hwcap would make a difference either in the kernel or in hwprobe. I was
> not sure if I was missing something in terms of the
> userspace-using-vector side of things, but I was pretty sure that
> hwprobe and in-kernel vector had this bug.

hwprobe still reports the hardware has Vector. Shouldn't this be ok
because hwprobe reflects what is supported by the hardware? I think
we'd need another bit in hwprobe to reflect the value in hwcap though.

Userspace context switch code is gated by VS == 0. I am not sure if we
need something more obvious.

The in-kernel vector part was not well-considered. It sounds ok for
the non-preemptible vector. Userspace won't be able to execute any
user vector on a non-symmetric vector platform. So all in-kernel
vector code doesn't have to save/restore userspace vector registers.
The task executing non-preemptible in-kernel vector code has no chance
to be rescheduled on other cores during vector execution. So it should
be fine as long as multiple parts of kernel_vector_{begin,end}() do
not assume using the same vlen. Most vector code doesn't assume
running on a particular vlen (vlen agnostic), so ideally it should be
safe.

Parts of the kernel blindly assume riscv_v_size != 0. Those should be
changed. However, race conditions may happen if userspace is already
up and running at this stage (cpu hotplug?). We must not reschedule a
userspace process which is actively executing vector to a core that
runs in a different vlen.

It is not possible to execute the preemptible vector in a
non-symmetric vector platform. This should be gated by failing the
allocation of kernel_vstate.datap.

>
> > > has_vector() just checks if the extension is supported by all cores on
> > > the cpu using either an alternative or the extension support bitmap.
> > > There are lots of sites in the kernel that use has_vector() as the
> > > gating check (as far as I can tell), so if vlen does not match between
> > > CPUs should we not be returning an error for has_vector()?

No, has_vector() still returns true because it is backed by
alternative and determined during boot.

> > >
> > > The alternative that has_vector() relies on is patched before the non-boot
> > > CPUs are enabled, so we cannot modify the result once the non-boot CPUs
> > > are in their callin functions and detect a vlen mismatch while at at the
> > > same time using it in smp_calling(), so should this code be changed to
> > > something along the lines of:
> > >
> > >     if (riscv_has_extension_unlikely(v)) {
> > >             if (riscv_v_setup_vsize()) {
> > >                     elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > >                     riscv_v_vlen_mismatch = true;
> > >             }
> > >     }
> > > and then has_vector() becomes something like
> > > static __always_inline bool has_vector(void)
> > > {
> > >     return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
> > > }
> > >
> > > Probably there's value to be gained in static branches etc here, but I
> > > was just trying to explain what I was getting at. Maybe I am missing
> > > something though? I do remember talking about this back when the vector
> > > patches were still in review.
> >
> > I actually don't think that alone fixes it: even if we make has_vector()
> > respect the changes from the new CPU (whether we re-apply the alternatives
> > or make a more dynamic check), we'd need to make sure the code that has
> > already looked at has_vector() doesn't get scheduled on one of the
> > shorter-VLENB CPUs.

I agree that flipping off the alternative for has_vector() could be
dangerous at this point.

>
> Is it possible to have that happen given this is in smp_callin()?
> I guess the concern is some sort of hotplug situations, given that a CPU
> could be hotplugged into a fully running system with a vlen mismatch.
>
> > So maybe the right answer here is to flip things around, and just refuse to
> > enable CPUs that come in with a VLENB different than the boot CPU?  That way
> > we could avoid being broken by systems that behave this way,
>
> smp_callin() doesn't return an error to report a failure, but it does
> have a completion. I suppose we would just return early from
> smp_callin() so that the completion never gets completed, in turn
> causing __cpu_up() to fail?

I agree disabling the entire vector or fail bringing up the cpu is the
safe way though.

>
> > if someone ends
> > up building one then they can figure out how to make it work better.
>
> Apparently the next gen sophgo product does. Not sure if people will wnt
> to run across all the cores though, apparently they different by a lot
> more than vlen.

The vector patchset assumes running on an SMP system. Supporting
non-symmetric vlen or non-symmetric isa may require some efforts.

> A DT property would allow us to get the vlen info before the CPUs have
> themselves been brought up. Apparently there are Zvl extensions for
> this, but I'd rather just have a property like we have for the cbo
> sizes if it comes to DT properties.

Yes, perhaps having a DT property is the first step to go. Because we
can prevent race set/get-ing riscv_v_vsize in such case.

>
> Cheers,
> Conor.

Thanks,
Andy



More information about the linux-riscv mailing list