[PATCH -next] arch_topology: Fix cache attributes detection in the CPU hotplug path
Conor.Dooley at microchip.com
Thu Jul 14 07:17:33 PDT 2022
On 13/07/2022 14:33, Sudeep Holla wrote:
Hey Sudeep,
I could not get this patch to actually apply; I tried a couple of
different versions of -next :/
It is in -next already though, which I suspect is part of why it
does not apply... Surely you can fast-forward your arch_topology
for-next branch to Greg's merge commit rather than generating this
from the pre-merge branch & re-merging it into your branch that
Stephen picks up?
Either way, I tested it directly in -next since that's back to
booting for me today and ...
> init_cpu_topology() is called only once at boot, and all the cache
> attributes are detected early for all possible CPUs. However, when
> CPUs are hotplugged out, their cacheinfo is removed. While the
> attributes are added back when the CPUs are hotplugged back in as part
> of the CPU hotplug state machine, that happens quite late, after
> update_siblings_masks() is called in secondary_start_kernel(),
> resulting in wrong llc_sibling masks.
>
> Move the call to detect_cache_attributes() inside update_siblings_masks()
> to ensure the cacheinfo is updated before the LLC sibling masks are
> updated. This will fix the incorrect LLC sibling masks generated when
> the CPUs are hotplugged out and hotplugged back in again.
>
> Reported-by: Ionela Voinescu <ionela.voinescu at arm.com>
> Signed-off-by: Sudeep Holla <sudeep.holla at arm.com>
> ---
> drivers/base/arch_topology.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
>
> Hi Conor,
>
> Ionela reported an issue with CPU hotplug, and as a fix I need to
> move the call to detect_cache_attributes(), which I had originally
> intended to keep there but for no particular reason moved to
> init_cpu_topology().
>
> I wonder if this fixes the -ENOMEM on RISC-V, as the call now runs on
> the CPU itself in the secondary CPU init path, whereas
> init_cpu_topology() executed detect_cache_attributes() for all
> possible CPUs much earlier. I think this might help, as the percpu
> memory might already be initialised in this case.
Actually, we are now worse off than before:
[    0.009813] smp: Bringing up secondary CPUs ...
[ 0.011530] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:274
[ 0.011550] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
[ 0.011566] preempt_count: 1, expected: 0
[ 0.011580] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.19.0-rc6-next-20220714-dirty #1
[ 0.011599] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
[ 0.011608] Call Trace:
[ 0.011620] [<ffffffff80005070>] dump_backtrace+0x1c/0x24
[ 0.011661] [<ffffffff8066b0c4>] show_stack+0x2c/0x38
[ 0.011699] [<ffffffff806704a2>] dump_stack_lvl+0x40/0x58
[ 0.011725] [<ffffffff806704ce>] dump_stack+0x14/0x1c
[ 0.011745] [<ffffffff8002f42a>] __might_resched+0x100/0x10a
[ 0.011772] [<ffffffff8002f472>] __might_sleep+0x3e/0x66
[ 0.011793] [<ffffffff8014d774>] __kmalloc+0xd6/0x224
[ 0.011825] [<ffffffff803d631c>] detect_cache_attributes+0x37a/0x448
[ 0.011855] [<ffffffff803e8fbe>] update_siblings_masks+0x24/0x246
[ 0.011885] [<ffffffff80005f32>] smp_callin+0x38/0x5c
[ 0.015990] smp: Brought up 1 node, 4 CPUs
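If I am reading that right, the issue is that update_siblings_masks() now
runs on the secondary CPU with interrupts off and preemption disabled,
while detect_cache_attributes() ends up doing a sleeping allocation.
A rough sketch of the chain as I understand it (the annotations and the
GFP flag are my assumptions from skimming drivers/base/cacheinfo.c, and
the helper at the end is made up, not the cacheinfo API):

/*
 *   smp_callin()                        <- secondary CPU: irqs off,
 *     update_siblings_masks(cpuid)         preempt_count == 1
 *       detect_cache_attributes(cpuid)
 *         __kmalloc(..., GFP_KERNEL)    <- may sleep, so __might_sleep()
 *                                          complains
 *
 * Anything that has to allocate from this path would need to avoid
 * sleeping, i.e. something shaped like the (hypothetical) helper below,
 * or the allocation has to move back to a context that can sleep.
 */
static struct cacheinfo *alloc_cache_leaves(unsigned int nr_leaves,
					    bool can_sleep)
{
	/* GFP_ATOMIC keeps the secondary-bringup path from sleeping */
	return kcalloc(nr_leaves, sizeof(struct cacheinfo),
		       can_sleep ? GFP_KERNEL : GFP_ATOMIC);
}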
>
> Anyway, give this a try; also test CPU hotplug and check that nothing
> is broken on RISC-V. We noticed this bug only on one platform while
Our system monitor that runs OpenSBI does not actually support
any hotplug features yet, so:
# echo 0 > /sys/devices/system/cpu/cpu3/online
[ 47.233627] CPU3: off
[ 47.236018] CPU3 may not have stopped: 3
# echo 1 > /sys/devices/system/cpu/cpu3/online
[ 54.911868] CPU3: failed to start
And this one confused the hell out of it...
# echo 0 > /sys/devices/system/cpu/cpu1/online
[ 2903.057706] CPU1: off
HSS_OpenSBI_Reboot() called
[ 2903.062447] CPU1 may not have stopped: 3
#
# [8.218591] HSS_Boot_PMPSetupHandler(): Hart1 setup complete
This is the hart that brought up OpenSBI, so when the request to
offline it comes through it causes a system reboot, haha.
Either way, I think both imply that the hotplug code on the
Linux side is sane.
FWIW Sudeep, if you want to add me as a reviewer for generic
arch topology stuff, since I do care about testing it etc., please
feel free (although for the sake of my filters, use the email
conor at kernel.org if you do).
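Something like the below is what I have in mind, by the way -- the
section name and existing fields are from memory, so treat them as an
assumption and double-check against MAINTAINERS:

GENERIC ARCH TOPOLOGY
M:	Sudeep Holla <sudeep.holla at arm.com>
R:	Conor Dooley <conor at kernel.org>
S:	Maintained
F:	drivers/base/arch_topology.c
F:	include/linux/arch_topology.h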
Thanks,
Conor.
>
> Regards,
> Sudeep
>
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 441e14ac33a4..0424b59b695e 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -732,7 +732,11 @@ const struct cpumask *cpu_clustergroup_mask(int cpu)
>  void update_siblings_masks(unsigned int cpuid)
>  {
>  	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> -	int cpu;
> +	int cpu, ret;
> +
> +	ret = detect_cache_attributes(cpuid);
> +	if (ret)
> +		pr_info("Early cacheinfo failed, ret = %d\n", ret);
>  
>  	/* update core and thread sibling masks */
>  	for_each_online_cpu(cpu) {
> @@ -821,7 +825,7 @@ __weak int __init parse_acpi_topology(void)
>  #if defined(CONFIG_ARM64) || defined(CONFIG_RISCV)
>  void __init init_cpu_topology(void)
>  {
> -	int ret, cpu;
> +	int ret;
>  
>  	reset_cpu_topology();
>  	ret = parse_acpi_topology();
> @@ -836,13 +840,5 @@ void __init init_cpu_topology(void)
>  		reset_cpu_topology();
>  		return;
>  	}
> -
> -	for_each_possible_cpu(cpu) {
> -		ret = detect_cache_attributes(cpu);
> -		if (ret) {
> -			pr_info("Early cacheinfo failed, ret = %d\n", ret);
> -			break;
> -		}
> -	}
>  }
>  #endif
> -- 
> 2.37.1
>