arm64 torture test hotplug failures (offlining causes -EBUSY)

Zhouyi Zhou zhouzhouyi at gmail.com
Thu Jan 19 04:17:43 PST 2023


On Thu, Jan 19, 2023 at 4:26 PM Joel Fernandes <joel at joelfernandes.org> wrote:
>
>
>
> > On Jan 18, 2023, at 10:21 PM, Zhouyi Zhou <zhouzhouyi at gmail.com> wrote:
> >
> > On Thu, Jan 19, 2023 at 6:39 AM Joel Fernandes <joel at joelfernandes.org> wrote:
> >>
> >>> On Wed, Jan 18, 2023 at 10:37 PM Joel Fernandes <joel at joelfernandes.org> wrote:
> >>>
> >>>> On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> >>> [...]
> >>>>>>>> Is there a plan to make CPU hotplug failures more frequent?
> >>>>>>>
> >>>>>>> I am not aware of such a plan but I was going by "There are quite some
> >>>>>>> reasons why a CPU-hotplug or a hot-unplug operation can fail, which is
> >>>>>>> not a fatal problem, really." in [1].
> >>>>>>>
> >>>>>>> What about an rcutorture parameter to skip hotplug for a certain
> >>>>>>> CPU id, e.g. rcutorture.skip_hotplug_cpus="0"? It can be a last
> >>>>>>> resort. But we/I should debug this issue more before getting to that.
> >>>>>>
> >>>>>> Yes, in fact there already are some checks along those lines, for example,
> >>>>>> the torture_offline() function's check of cpu_is_hotpluggable().  So for
> >>>>>> example, as I understand it, a CONFIG_NO_HZ_FULL=y system should mark
> >>>>>> the housekeeping CPU as !cpu_is_hotpluggable().
> >>>>>
> >>>>> I don't think CONFIG_NO_HZ_FULL does any such marking (at least I am
> >>>>> not seeing it). Even on x86, if you enable
> >>>>> CONFIG_BOOTPARAM_HOTPLUG_CPU0=y , and CONFIG_NO_HZ_FULL=y, and run
> >>>>> rcutorture with boot args:
> >>>>>
> >>>>> nohz_full=0-3 rcutorture.onoff_interval=100 rcutorture.onoff_holdoff=2
> >>>>> rcutorture.shutdown_secs=30
> >>>>>
> >>>>> You will see this in the kernel logs:
> >>>>> [    2.816022] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>> [    2.975913] rcu-torture:torture_onoff task: offline 0 failed: errno -16
> >>>>>
> >>>>> So the RCU torture test clearly thought the CPUs were hot-pluggable, when
> >>>>> there was a chance for them to return -EBUSY (due to housekeeping and
> >>>>> what not). So this issue seems to be architecture independent, in that
> >>>>> sense.
> >>>>>
> >>>>> So the 2 ways forward I see are:
> >>>>> - Make the torture test aware of which CPUs are 'house keeping'
> >>>>> - Make it possible to turn off CPU0 hotplugging on ARM64 by default
> >>>>> (via CONFIG or boot option).
> >>>>>
> >>>>> Another option could be, forgive -EBUSY on CPU0 for
> >>>>> CONFIG_NO_HZ_FULL=y.  Is it possible to assign a non-0 CPU id as a
> >>>>> housekeeping CPU?
> >>>>
> >>>> I would be happier to forgive failure to offline housekeeping CPUs than
> >>>> blanket forgiveness of CPU 0.  Especially given that I recently got
> >>>> burned by a non-zero boot cpu.  ;-)
> >>>>
> >>>> But wouldn't it be even better for cpu_is_hotpluggable() to know the
> >>>> NO_HZ_FULL rules of the road?
> >>>
> >>> That's a great idea. I found a way to do that without having to do the
> >>> EXPORT_SYMBOL (like in Zhouyi's patch).
> >>>
> >>> Would the following be acceptable (only build-tested)?
> >>>
> >>> I can run more tests and submit a patch:
> >>>
> >>> diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> >>> index 55405ebf23ab..f73bc520b70e 100644
> >>> --- a/drivers/base/cpu.c
> >>> +++ b/drivers/base/cpu.c
> >>> @@ -487,7 +487,8 @@ static const struct attribute_group *cpu_root_attr_groups[] = {
> >>> bool cpu_is_hotpluggable(unsigned int cpu)
> >>> {
> >>>        struct device *dev = get_cpu_device(cpu);
> >>> -       return dev && container_of(dev, struct cpu, dev)->hotpluggable;
> >>> +       return dev && container_of(dev, struct cpu, dev)->hotpluggable
> >>> +               && !tick_nohz_cpu_hotpluggable(cpu);
> >>
> >> Oops, I should lose that "!" , but otherwise should be ok.
> > Looks plausible to me. With your fantastic fix applied, I will run
> > a new round of tests on the PPC VM at the Open Source Lab of Oregon
> > State University.
>
> Thank you! And if it passes, I will add your Tested-by tag for attribution if you do not mind.
Thank you very much in advance for the Tested-by tag, I would appreciate
it very much ;-)
After backporting commit 8e82c28ea2b4 ("torture: Make thread detection
more robust by using lscpu") to linux-5.15.y on the PPC64 VM,
I can now proceed with the torture test.

The test on the original linux-5.15.y still needs an hour or two to
finish. After that I will apply your fix and run another 20+ hours of
torture testing (it is a little slow because it runs on a virtual
machine). Thank you for your patience.
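
For reference, here is how I read the corrected check (with the "!"
dropped, as you noted). This is only a user-space sketch, not kernel
code: the model_* functions and the three static variables are
hypothetical stand-ins for cpu_is_hotpluggable(),
tick_nohz_cpu_hotpluggable(), the per-CPU ->hotpluggable flag and the
NO_HZ_FULL timekeeper state. I am assuming CPU 0 is the timekeeping CPU
and that the tick helper refuses only that CPU.

```c
/* User-space model only; every name below is a stand-in, not kernel API. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Stand-in for struct cpu's ->hotpluggable flag (assumed set for all CPUs). */
static bool model_hotpluggable[MODEL_NR_CPUS] = { true, true, true, true };

/* Stand-ins for the NO_HZ_FULL state; CPU 0 as timekeeper is an assumption. */
static bool model_nohz_full_running = true;
static unsigned int model_timekeeper_cpu = 0;

/* Models tick_nohz_cpu_hotpluggable(): false only for the timekeeping CPU. */
static bool model_tick_nohz_cpu_hotpluggable(unsigned int cpu)
{
	return !(model_nohz_full_running && cpu == model_timekeeper_cpu);
}

/* Models the proposed cpu_is_hotpluggable() without the stray "!". */
static bool model_cpu_is_hotpluggable(unsigned int cpu)
{
	return cpu < MODEL_NR_CPUS && model_hotpluggable[cpu]
		&& model_tick_nohz_cpu_hotpluggable(cpu);
}

/* Small self-check: the timekeeper is skipped, other CPUs stay eligible. */
static void model_self_check(void)
{
	assert(!model_cpu_is_hotpluggable(0));
	assert(model_cpu_is_hotpluggable(1));
}
```

With the original "!" left in place, the model would report the opposite:
only the timekeeping CPU would look hotpluggable, so torture_offline()
would keep picking exactly the CPU that returns -EBUSY.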

Cheers
Zhouyi
>
> > I learned a lot during this process
>
> Cool!!
>
>  - Joel
>
>
> >
> > Thanks
> > Zhouyi



More information about the linux-arm-kernel mailing list