arm64 torture test hotplug failures (offlining causes -EBUSY)
Joel Fernandes
joel at joelfernandes.org
Thu Jan 19 00:26:16 PST 2023
> On Jan 18, 2023, at 10:21 PM, Zhouyi Zhou <zhouzhouyi at gmail.com> wrote:
>
> On Thu, Jan 19, 2023 at 6:39 AM Joel Fernandes <joel at joelfernandes.org> wrote:
>>
>>> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel at joelfernandes.org> wrote:
>>>
>>>> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
>>> [...]
>>>>>>>> Is there a plan to make CPU hotplug failures more frequent?
>>>>>>>
>>>>>>> I am not aware of such a plan but I was going by "There are quite some
>>>>>>> reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
>>>>>>> not a fatal problem, really." in [1].
>>>>>>>
>>>>>>> What about an rcutorture parameter to skip hotplug for a certain CPU id,
>>>>>>> e.g. rcutorture.skip_hotplug_cpus="0"? It can be a last resort. But we/I
>>>>>>> should debug this issue more before getting to that.
>>>>>>
>>>>>> Yes, in fact there already are some checks along those lines, for example,
>>>>>> the torture_offline() function's check of cpu_is_hotpluggable(). So for
>>>>>> example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
>>>>>> the housekeeping CPU as !cpu_is_hotpluggable().
>>>>>
>>>>> I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
>>>>> not seeing it). Even on x86, if you enable
>>>>> CONFIG_BOOTPARAM_HOTPLUG_CPU0=y and CONFIG_NO_HZ_FULL=y, and run
>>>>> rcutorture with boot args:
>>>>>
>>>>> nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
>>>>> rcutorture.shutdown_secs=30
>>>>>
>>>>> You will see this in the kernel logs:
>>>>> [ 2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>> [ 2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>>
>>>>> So the RCU torture test clearly thought the CPUs were hot-pluggable, when
>>>>> there was a chance for them to return -EBUSY (due to housekeeping and
>>>>> whatnot). So this issue seems to be architecture independent, in that
>>>>> sense.
>>>>>
>>>>> So the 2 ways forward I see are:
>>>>> - Make the torture test aware of which CPUs are 'house keeping'
>>>>> - Make it possible to turn off CPU0 hotplugging on ARM64 by default
>>>>> (via CONFIG or boot option).
>>>>>
>>>>> Another option could be, forgive -EBUSY on CPU0 for
>>>>> CONFIG_NO_HZ_FULL=y. Is it possible to assign a non-0 CPU id as a
>>>>> housekeeping CPU?
>>>>
>>>> I would be happier to forgive failure to offline housekeeping CPUs than
>>>> blanket forgiveness of CPU 0. Especially given that I recently got
>>>> burned by a non-zero boot cpu. ;-)
>>>>
>>>> But wouldn't it be even better for cpu_is_hotpluggable() to know the
>>>> NO_HZ_FULL rules of the road?
>>>
>>> That's a great idea. I found a way to do that without having to do the
>>> EXPORT_SYMBOL (like in Zhouyi's patch).
>>>
>>> Would the following be acceptable (only build-tested)?
>>>
>>> I can run more tests and submit a patch:
>>>
>>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
>>> index 55405ebf23ab..f73bc520b70e 100644
>>> --- a/drivers/base/cpu.c
>>> +++ b/drivers/base/cpu.c
>>> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
>>>  bool cpu_is_hotpluggable(unsigned int cpu)
>>>  {
>>>  	struct device *dev = get_cpu_device(cpu);
>>> -	return dev && container_of(dev, struct cpu, dev)->hotpluggable;
>>> +	return dev && container_of(dev, struct cpu, dev)->hotpluggable
>>> +		&& !tick_nohz_cpu_hotpluggable(cpu);
>>
>> Oops, I should lose that "!" , but otherwise should be ok.
> Looks plausible to me. With your fantastic fix applied, I will run
> a new round of tests on the PPC VM at the Open Source Lab of Oregon State
> University.
Thank you! And if it passes, I will add your Tested-by tag for attribution if you do not mind.
> I learned a lot during this process
Cool!!
- Joel
>
> Thanks
> Zhouyi
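
For readers following the thread, the check proposed in the quoted diff, with the stray "!" dropped as noted in the follow-up, can be sketched as a small userspace model. This is only an illustration and not kernel code: the model_* names, the model_cpus table and its tick_pinned field are hypothetical stand-ins, and the sketch assumes that tick_nohz_cpu_hotpluggable() returns true when the tick code permits the CPU to go offline.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical userspace stand-in for the kernel's per-CPU device state. */
struct model_cpu {
	bool present;      /* models get_cpu_device(cpu) returning non-NULL */
	bool hotpluggable; /* models the ->hotpluggable flag in struct cpu */
	bool tick_pinned;  /* assumption: CPU holds a NO_HZ_FULL duty */
};

static const struct model_cpu model_cpus[MODEL_NR_CPUS] = {
	{ true,  true,  true  }, /* CPU 0: housekeeping, offlining gives -EBUSY */
	{ true,  true,  false }, /* CPU 1: ordinary hotpluggable CPU */
	{ true,  false, false }, /* CPU 2: marked not hotpluggable */
	{ false, false, false }, /* CPU 3: no CPU device registered */
};

/* Models tick_nohz_cpu_hotpluggable(): true if the tick code allows offlining. */
static bool model_tick_nohz_cpu_hotpluggable(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	return !model_cpus[cpu].tick_pinned;
}

/* Models the corrected cpu_is_hotpluggable(): all three conditions must hold. */
static bool model_cpu_is_hotpluggable(unsigned int cpu)
{
	assert(cpu < MODEL_NR_CPUS);
	return model_cpus[cpu].present &&
	       model_cpus[cpu].hotpluggable &&
	       model_tick_nohz_cpu_hotpluggable(cpu);
}
```

With a check of this shape, a caller such as torture_offline() would skip CPU 0 in the model above instead of attempting the offline and logging "errno -16".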