vector status when vlen doesn't match
Andy Chiu
andy.chiu at sifive.com
Thu Feb 29 05:32:56 PST 2024
On Thu, Feb 29, 2024 at 7:32 AM Conor Dooley <conor at kernel.org> wrote:
>
> On Wed, Feb 28, 2024 at 01:56:51PM -0800, Palmer Dabbelt wrote:
> > On Wed, 28 Feb 2024 08:02:47 PST (-0800), Conor Dooley wrote:
> > > Yo Andy,
> > >
> > > I was wondering if you could clarify something that came up today on
> > > IRC.
> > >
> > > When we enable vector we check vlenb and if it doesn't match the kernel
> > > is suppose to not support vector. We do this in a few places, the first
> > > is in riscv_fill_hwcap():
> > >
> > > if (elf_hwcap & COMPAT_HWCAP_ISA_V) {
> > > riscv_v_setup_vsize();
> > > /*
> > > * ISA string in device tree might have 'v' flag, but
> > > * CONFIG_RISCV_ISA_V is disabled in kernel.
> > > * Clear V flag in elf_hwcap if CONFIG_RISCV_ISA_V is disabled.
> > > */
> > > if (!IS_ENABLED(CONFIG_RISCV_ISA_V))
> > > elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > > }
> > >
> > > Why does this riscv_v_setup_vsize() failing have no impact? Because this
> > > is called only once on the boot CPU, right? I feel like it deserves a
> > > comment as to why.
> >
> > Unless I'm missing something, that first call can't fail: there's no
> > previous riscv_v_size set, so there's nothing to trigger the failure.
> >
> > Might warrant a comment, though...
>
> Yeah. I did write (or maybe moreso remodel) that function so I know it
> only gets called on the boot CPU, I was just pointing out what is not
> immediately obvious about that particular callsite.
Yes, this call can't fall. Ok, I'll update the comment.
>
> > > But the main thing I was wondering about was the other time it is called,
> > > in smp_callin():
> > > if (has_vector()) {
> > > if (riscv_v_setup_vsize())
> > > elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > > }
> > >
> > > If this fails, we clear the HWCAP bit, but I am not immediately grasping
> > > how this affects the rest of the kernel.
> >
> > I think it's actually a bug: we're not clearing the bit in riscv_isa, so
> > this only ends up disabling the V bit in AT_HWCAP and thus doesn't do
> > anything in the kernel.
>
> Yeah, that's what I was getting at - I could not see how disabling it in
> hwcap would make a difference either in the kernel or in hwprobe. I was
> not sure if I was missing something in terms of the
> userspace-using-vector side of things, but I was pretty sure that
> hwprobe and in-kernel vector had this bug.
hwprobe still reports the hardware has Vector. Shouldn't this be ok
because hwprobe reflects what is supported by the hardware? I think
we'd need another bit in hwprobe to reflect the value in hwcap though.
Userspace context switch code is gated by VS == 0. I am not sure if we
need something more obvious.
The in-kernel vector part was not well-considered. It sounds ok for
the non-preemptible vector. Userspace won't be able to execute any
user vector on a non-symmetric vector platform. So all in-kernel
vector code doesn't have to save/restore userspace vector registers.
The task executing non-preemptible in-kernel vector code has no chance
to be rescheduled on other cores during vector execution. So it should
be fine as long as multiple parts of kernel_vector_{begin,end}() do
not assume using the same vlen. Most vector code doesn't assume
running on a particular vlen (vlen agnostic), so ideally it should be
safe.
Parts of the kernel blindly assume riscv_v_size != 0. Those should be
changed. However, race conditions may happen if userspace is already
up and running at this stage (cpu hotplug?). We must not reschedule a
userspace process which is actively executing vector to a core that
runs in a different vlen.
It is not possible to execute the preemptible vector in a
non-symmetric vector platform. This should be gated by failing the
allocation of kernel_vstate.datap.
>
> > > has_vector() just checks if the extension is supported by all cores on
> > > the cpu using either an alternative or the extension support bitmap.
> > > There are lots of sites in the kernel that use has_vector() as the
> > > gating check (as far as I can tell), so if vlen does not match between
> > > CPUs should we not be returning an error for has_vector()?
No, has_vector() still returns true because it is backed by
alternative and determined during boot.
> > >
> > > The alternative that has_vector() relies on is patched before the non-boot
> > > CPUs are enabled, so we cannot modify the result once the non-boot CPUs
> > > are in their callin functions and detect a vlen mismatch while at at the
> > > same time using it in smp_calling(), so should this code be changed to
> > > something along the lines of:
> > >
> > > if (riscv_has_extension_unlikely(v)) {
> > > if (riscv_v_setup_vsize()) {
> > > elf_hwcap &= ~COMPAT_HWCAP_ISA_V;
> > > riscv_v_vlen_mismatch = true;
> > > }
> > > }
> > > and then has_vector() becomes something like
> > > static __always_inline bool has_vector(void)
> > > {
> > > return riscv_has_extension_unlikely(RISCV_ISA_EXT_v) && unlikely(riscv_v_vlen_mismatch);
> > > }
> > >
> > > Probably there's value to be gained in static branches etc here, but I
> > > was just trying to explain what I was getting at. Maybe I am missing
> > > something though? I do remember talking about this back when the vector
> > > patches were still in review.
> >
> > I actually don't think that alone fixes it: even if we make has_vector()
> > respect the changes from the new CPU (whether we re-apply the alternatives
> > or make a more dynamic check), we'd need to make sure the code that has
> > already looked at has_vector() doesn't get scheduled on one of the
> > shorter-VLENB CPUs.
I agree that flipping off the alternative for has_vector() could be
dangerous at this point.
>
> Is it possible to have that happen given this is in smp_callin()?
> I guess the concern is some sort of hotplug situations, given that a CPU
> could be hotplugged into a fully running system with a vlen mismatch.
>
> > So maybe the right answer here is to flip things around, and just refuse to
> > enable CPUs that come in with a VLENB different than the boot CPU? That way
> > we could avoid being broken by systems that behave this way,
>
> smp_callin() doesn't return an error to report a failure, but it does
> have a completion. I suppose we would just return early from
> smp_callin() so that the completion never gets completed, in turn
> causing __cpu_up() to fail?
I agree disabling the entire vector or fail bringing up the cpu is the
safe way though.
>
> > if someone ends
> > up building one then they can figure out how to make it work better.
>
> Apparently the next gen sophgo product does. Not sure if people will wnt
> to run across all the cores though, apparently they different by a lot
> more than vlen.
The vector patchset assumes running on an SMP system. Supporting
non-symmetric vlen or non-symmetric isa may require some efforts.
> A DT property would allow us to get the vlen info before the CPUs have
> themselves been brought up. Apparently there are Zvl extensions for
> this, but I'd rather just have a property like we have for the cbo
> sizes if it comes to DT properties.
Yes, perhaps having a DT property is the first step to go. Because we
can prevent race set/get-ing riscv_v_vsize in such case.
>
> Cheers,
> Conor.
Thanks,
Andy
More information about the linux-riscv
mailing list