arm64 torture test hotplug failures (offlining causes -EBUSY)

Joel Fernandes joel at joelfernandes.org
Tue Jan 17 11:50:59 PST 2023


On Tue, Jan 17, 2023 at 11:43 AM Zhouyi Zhou <zhouzhouyi at gmail.com> wrote:
[...]
> > >>>>
> > >>>> How about something simple like the following? (untested)
> > >>>>
> > >>>> ---8<-----------------------
> > >>>>
> > >>>> diff --git a/kernel/torture.c b/kernel/torture.c
> > >>>> index bc8fb361efc0..cd64110694c0 100644
> > >>>> --- a/kernel/torture.c
> > >>>> +++ b/kernel/torture.c
> > >>>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > >>>>                        // PCI probe frequently disables hotplug during boot.
> > >>>>                        (*n_offl_attempts)--;
> > >>>>                        s = " (-EBUSY forgiven during boot)";
> > >>>> +               } else if (tick_nohz_full_running && ret == -EBUSY) {
> > >>>> +                       (*n_offl_attempts)--;
> > >>>> +                       s = " (-EBUSY forgiven if nohz_full is running)";
> > >>> Fantastic fix!! With this we can fix the timekeeper CPU torture
> > >>> problem without touching the timekeeper code.
> > >>
> > >> Thanks. Unfortunately this does not fix the issue for TRACE02 and the patch
> > >> you shared does not fix it either -- because TRACE02 is not a no-hz-full
> > >> test. :-(
> > >>
> > >> We will need to do a bit of tracing to figure out where the -EBUSY is coming
> > >> from for TRACE02.
> > > Agreed, TRACE02 is another issue. Unfortunately I can't reproduce the
> > > bug either with your original Image [1]
> > > or with my cross-compiled kernel using [2].
> > >
> > > I guess there may be two reasons:
> > > 1) my testbed is X86_64 based.
> > > 2) the command that I invoke qemu is not right:
> > > 2-1) the newly compiled linux-5.15.89-rc1
> > > qemu-system-aarch64 -machine virt -cpu cortex-a57 -nographic -smp 4
> >
> > Does 8 CPUs make any difference? That is my setup.
> 8 CPUs make no difference ;-(

Ah, it was worth a try! Hmm.

> > Not sure what else is different. It could be a CPU model specific issue,
> > or something. But why don't you just use the same setup you used in
> > November and check TRACE02? That is actually what I was requesting you
> > to test, since you saw the same issue on that setup.
> I guess it may be a CPU model specific issue, but I can't invoke
> qemu-system-aarch64 with "-machine virt,gic-version=host -cpu host"
> because I don't have an aarch64 bare metal host.
>
> OK, I am doing the same setup on linux-5.15.y as I did last November
> in the PPC VM of Open Source Lab of Oregon State University. This will
> take about 20 hours, and I will report what I find after the test finishes.

Sounds good, Thanks!
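
For reference, the accounting that the quoted (untested) patch proposes can be
modeled in user space roughly as below. This is only a sketch: the struct and
the function account_offline() are hypothetical names, not kernel code, and the
two bool parameters merely stand in for the kernel's boot-time check and for
tick_nohz_full_running. It shows why a non-nohz-full test such as TRACE02 would
still count its -EBUSY as a failed attempt.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space stand-in for the torture offline counters. */
struct offline_stats {
	long n_attempts;
	long n_successes;
};

/*
 * Account for one offline attempt that returned 'ret'.
 * 'booting' and 'nohz_full_running' are assumptions that model the
 * kernel's boot-state check and tick_nohz_full_running respectively.
 * Returns the annotation that would be appended to the log message.
 */
static const char *account_offline(struct offline_stats *st, int ret,
				   bool booting, bool nohz_full_running)
{
	st->n_attempts++;

	if (ret == 0) {
		st->n_successes++;
		return "";
	}
	if (ret == -EBUSY && booting) {
		/* Existing behaviour: do not count the attempt. */
		st->n_attempts--;
		return " (-EBUSY forgiven during boot)";
	}
	if (ret == -EBUSY && nohz_full_running) {
		/* Proposed behaviour from the quoted patch. */
		st->n_attempts--;
		return " (-EBUSY forgiven if nohz_full is running)";
	}
	/* Any other failure stays counted as an unsuccessful attempt. */
	return "";
}
```

With nohz_full_running false, as in TRACE02, a post-boot -EBUSY leaves the
attempt counted with no matching success, so the failure is still reported.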

Thanks,

 - Joel


