[PATCH v3] RISC-V: Probe misaligned access speed in parallel

Tue Nov 7 09:45:10 PST 2023

On Tue, 07 Nov 2023 09:26:03 PST (-0800), Evan Green wrote:
> On Tue, Nov 7, 2023 at 12:34 AM Sebastian Andrzej Siewior
> <bigeasy at linutronix.de> wrote:
>>
>> On 2023-11-06 14:58:55 [-0800], Evan Green wrote:
>> > Probing for misaligned access speed takes about 0.06 seconds. On a
>> > system with 64 cores, doing this in smp_callin() means it's done
>> > serially, extending boot time by 3.8 seconds. That's a lot of boot time.
>> >
>> > Instead of measuring each CPU serially, let's do the measurements on
>> > all CPUs in parallel. If we disable preemption on all CPUs, the
>> > jiffies stop ticking, so we can do this in stages of 1) everybody
>> > except core 0, then 2) core 0. The allocations are all done outside of
>> > on_each_cpu() to avoid calling alloc_pages() with interrupts disabled.
>> >
>> > For hotplugged CPUs that come in after the boot time measurement,
>> > register CPU hotplug callbacks, and do the measurement there. Interrupts
>> > are enabled in those callbacks, so they're fine to do alloc_pages() in.
>>
>> I think this is dragged out of proportion. I would do this (if needed
>> can can't be identified by CPU-ID or so) on boot CPU only. If there is
>> evidence/ proof/ blessing from the high RiscV council that different
>> types of CPU cores are mixed together then this could be extended.
>> You brought Big-Little up in the other thread. This is actually known.
>> Same as with hyper-threads on x86, you know which CPU is the core and
>> which hyper thread (CPU) belongs to it.
>> So in terms of BigLittle you _could_ limit this to one Big and one
>> Little core instead running it on all.
>
> Doing it on one per cluster might also happen to work, but I still see
> nothing that prevents variety within a cluster, so I'm not comfortable
> with that assumption. It also doesn't buy much. I'm not sure what kind
> of guidance RVI is providing on integrating multiple CPUs into a
> system. I haven't seen any myself, but am happy to reassess if there's
> documentation banning the scenarios I'm imagining.

IIUC there's pretty much no rules here, and vendors are already building 
wacky systems (the K230 just showed up with heterogenous-ISA cores, 
we've got a handful now).  I guess we could write up some guidance in 
Documentation/riscv describing what sort of systems we generally test 
on, but given how RISC-V generally goes vendors are just going to build 
the crazy stuff anyway and we'll have to deal with it.

>
>>
>> But this is just my few on this. From PREEMPT_RT's point of view, the
>> way you restructured the memory allocation should work now.
>
> Thanks!
>
>>
>> > Reported-by: Jisheng Zhang <jszhang at kernel.org>
>> > Closes: https://lore.kernel.org/all/mhng-9359993d-6872-4134-83ce-c97debe1cf9a@palmer-ri-x1c9/T/#mae9b8f40016f9df428829d33360144dc5026bcbf
>> > Fixes: 584ea6564bca ("RISC-V: Probe for unaligned access speed")
>> > Signed-off-by: Evan Green <evan at rivosinc.com>
>> >
>> >
>> > diff --git a/arch/riscv/kernel/cpufeature.c b/arch/riscv/kernel/cpufeature.c
>> > index 6a01ded615cd..fe59e18dbd5b 100644
>> > --- a/arch/riscv/kernel/cpufeature.c
>> > +++ b/arch/riscv/kernel/cpufeature.c
>> …
>> >
>> > -static int __init check_unaligned_access_boot_cpu(void)
>> > +/* Measure unaligned access on all CPUs present at boot in parallel. */
>> > +static int check_unaligned_access_all_cpus(void)
>> >  {
>> > -     check_unaligned_access(0);
>> > +     unsigned int cpu;
>> > +     unsigned int cpu_count = num_possible_cpus();
>> > +     struct page **bufs = kzalloc(cpu_count * sizeof(struct page *),
>> > +                                  GFP_KERNEL);
>>
>> kcalloc(). For beauty reasons you could try a reverse xmas tree.
>>
>> > +
>> > +     if (!bufs) {
>> > +             pr_warn("Allocation failure, not measuring misaligned performance\n");
>> > +             return 0;
>> > +     }
>> > +
>> > +     /*
>> > +      * Allocate separate buffers for each CPU so there's no fighting over
>> > +      * cache lines.
>> > +      */
>> > +     for_each_cpu(cpu, cpu_online_mask) {
>> > +             bufs[cpu] = alloc_pages(GFP_KERNEL, MISALIGNED_BUFFER_ORDER);
>> > +             if (!bufs[cpu]) {
>> > +                     pr_warn("Allocation failure, not measuring misaligned performance\n");
>> > +                     goto out;
>> > +             }
>> > +     }
>> > +
>> > +     /* Check everybody except 0, who stays behind to tend jiffies. */
>> > +     on_each_cpu(check_unaligned_access_nonboot_cpu, bufs, 1);
>>
>> comments! _HOW_ do you ensure that CPU0 is left out? You don't. CPU0
>> does this and the leaves which is a waste. Using on_each_cpu_cond()
>> could deal with this. And you have the check within the wrapper
>> (check_unaligned_access_nonboot_cpu()) anyway.
>>
>> > +     /* Check core 0. */
>> > +     smp_call_on_cpu(0, check_unaligned_access, bufs[0], true);
>>
>> Now that comment is obvious. If you want to add a comment, why not state
>> why CPU0 has to be done last?
>>
>> > +
>> > +     /* Setup hotplug callback for any new CPUs that come online. */
>> > +     cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN, "riscv:online",
>> > +                               riscv_online_cpu, NULL);
>> Instead riscv:online you could use riscv:unaliged_check or something
>> that pin points the callback to something obvious. This is exported via
>> sysfs.
>>
>> Again, comment is obvious. For that to make sense would require RiscV to
>> support physical-hotplug. For KVM like environment (where you can plug in
>> CPUs later) this probably doesn't make sense at all. Why not? Because
>>
>> - without explicit CPU pinning your slow/ fast CPU mapping (host <->
>>   guest) could change if the scheduler on the host moves the threads
>>   around.
>
> Taking a system with non-identical cores and allowing vcpus to bounce
> between them sounds like a hypervisor configuration issue to me,
> regardless of this patch.
>
>>
>> - without explicit task offload and resource partitioning on the host
>>   your guest thread might get interrupt during the measurement. This is
>>   done during boot so chances are high that it runs 100% of its time
>>   slice and will be preempted once other tasks on the host ask for CPU
>>   run time.
>
> The measurement takes the best (lowest time) iteration. So unless
> every iteration gets interrupted, I should get a good read in there
> somewhere.
> -Evan