arm64 torture test hotplug failures (offlining causes -EBUSY)

Joel Fernandes joel at joelfernandes.org
Mon Jan 16 20:36:57 PST 2023



> On Jan 16, 2023, at 11:30 PM, Paul E. McKenney <paulmck at kernel.org> wrote:
> 
> On Tue, Jan 17, 2023 at 12:15:07AM +0000, Joel Fernandes wrote:
>>> On Mon, Jan 16, 2023 at 05:38:00PM -0500, Joel Fernandes wrote:
>>> Hi Zhouyi,
>>> 
>>> On Mon, Jan 16, 2023 at 1:33 PM Zhouyi Zhou <zhouzhouyi at gmail.com> wrote:
>>>> 
>>> [..]
>>>> On Tue, Jan 17, 2023 at 1:27 AM Joel Fernandes <joel at joelfernandes.org> wrote:
>>>>> 
>>>>> Hello,
>>>>> I am seeing -EBUSY returned a lot during torture_onoff() when running
>>>>> rcutorture on arm64. This causes hotplug failure 30% of the time. I am
>>>>> also seeing this in 6.1-rc kernels. I believe I see this only for CPU0.
>>>>> 
>>>>> This causes warnings in torture tests:
>>>>> [  217.582290] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>> [  221.866362] rcu-torture:torture_onoff task: offline 0 failed: errno -16
>>>>> 
>>>>> Full kernel log here:
>>>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TREE04/console.log
>>>>> 
>>>>> Any ideas on why this is happening and only for CPU 0 (presumably the
>>>>> boot CPU)? I'd personally need these warnings to go away for my tests
>>>>> as this causes rcutorture's tests to not cleanly pass for me. It
>>>>> appears remove_cpu() -> device_offline() is what returns the error.
>>>>> 
>>>> I guess this is probably because CPU 0 is the tick_do_timer_cpu in
>>>> nohz_full mode, which prevents that CPU from
>>>> going offline [1]. We have discussed this topic, but there is no
>>>> agreement on how to solve it yet.
>>> 
>>> But I am seeing the issue in TRACE02 config which is:
>>> CONFIG_NO_HZ_IDLE=y
>>> # CONFIG_NO_HZ_FULL is not set
>>> 
>>> So that is not NO_HZ_FULL:
>>> http://box.joelfernandes.org:9080/job/rcutorture_stable_arm/job/linux-5.15.y/7/artifact/tools/testing/selftests/rcutorture/res/2023.01.15-14.51.11/TRACE02/console.log.diags/
>>> However, I can't seem to find the full kernel logs for that.
>>> 
>>> Also, other than the TRACE02 failure, I only see the issue with configs
>>> that have CONFIG_NO_HZ_FULL=y.
>>> 
>>> Can you try TRACE02 specifically, and see if you can reproduce the
>>> same issue on your setup? Meanwhile, I'll try to trace what is
>>> returning the -EBUSY.
>> 
>> How about something simple like the following? (untested)
>> 
>> ---8<-----------------------
>> 
>> diff --git a/kernel/torture.c b/kernel/torture.c
>> index bc8fb361efc0..cd64110694c0 100644
>> --- a/kernel/torture.c
>> +++ b/kernel/torture.c
>> @@ -220,6 +220,9 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
>>            // PCI probe frequently disables hotplug during boot.
>>            (*n_offl_attempts)--;
>>            s = " (-EBUSY forgiven during boot)";
>> +        } else if (tick_nohz_full_running && ret == -EBUSY) {
>> +            (*n_offl_attempts)--;
>> +            s = " (-EBUSY forgiven if nohz_full is running)";
> 
> But this should be forgiven for the timekeeping CPU, not everyone,
> correct?
> 
> Yes, I know that CPU-hotplug operations can fail, but in my testing
> they almost never do.  This means that a new failure might well be a
> real bug somewhere that needs attention.

Sure. We may need to expose some API to reveal that. 
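
To make that concrete, here is a rough user-space sketch of the check I have in mind. It is only a model, not kernel code: struct tick_state, cpu_is_timekeeper() and classify_offline_failure() are hypothetical names made up for illustration, standing in for whatever API would expose tick_nohz_full_running and tick_do_timer_cpu to torture.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for tick state the kernel would have to expose. */
struct tick_state {
	bool nohz_full_running;	/* models tick_nohz_full_running */
	int timekeeping_cpu;	/* models tick_do_timer_cpu */
};

/* Hypothetical helper: true only for the CPU holding the timekeeping duty. */
static bool cpu_is_timekeeper(const struct tick_state *ts, int cpu)
{
	return ts->nohz_full_running && cpu == ts->timekeeping_cpu;
}

/*
 * Decide whether a failed offline attempt should be forgiven.
 * Returns the suffix for the log message; if the failure is forgiven,
 * the attempt is also removed from the attempt counter.
 */
static const char *classify_offline_failure(const struct tick_state *ts,
					    int cpu, int ret,
					    long *n_offl_attempts)
{
	assert(ts != NULL && n_offl_attempts != NULL);

	if (ret == -EBUSY && cpu_is_timekeeper(ts, cpu)) {
		(*n_offl_attempts)--;
		return " (-EBUSY forgiven for nohz_full timekeeping CPU)";
	}
	return "";	/* any other failure is still reported as a real one */
}
```

With this shape, -EBUSY on any CPU other than the timekeeper still counts as a failed attempt, so a new hotplug bug elsewhere would not be hidden.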

It appeared, though, that Thomas, in the other thread about Zhouyi's patch, was suggesting that rcutorture should tolerate hotplug failures because they are not abnormal, right?

Thanks,

 - Joel


> 
>                            Thanx, Paul
> 
>>        }
>>        if (verbose)
>>            pr_alert("%s" TORTURE_FLAG


