arm64 torture test hotplug failures (offlining causes -EBUSY)

Mark Rutland mark.rutland at arm.com
Thu Jan 19 01:12:48 PST 2023


Hi Joel, Will,

On Wed, Jan 18, 2023 at 10:01:07PM +0000, Joel Fernandes wrote:
> On Wed, Jan 18, 2023 at 4:51 PM Will Deacon <will at kernel.org> wrote:
> > On Tue, Jan 17, 2023 at 08:00:58PM -0800, Paul E. McKenney wrote:
> > > On Wed, Jan 18, 2023 at 02:17:06AM +0000, Joel Fernandes wrote:
> > >
> > > I would be happier to forgive failure to offline housekeeping CPUs than
> > > blanket forgiveness of CPU 0.  Especially given that I recently got
> > > burned by a non-zero boot cpu.  ;-)
> > >
> > > But wouldn't it be even better for cpu_is_hotpluggable() to know the
> > > NO_HZ_FULL rules of the road?
> > >
> > > > Adding Frederic to CC as well as we are talking about
> > > > housekeeping/isolation stuff.
> > >
> > > But as you say, perhaps Frederic has a better idea.
> > >
> > > > > And topology_init() sets this based on platform_can_hotplug_cpu(cpu).
> > > > > And this function sets CPU 0 as !cpu_is_hotpluggable() unless the
> > > > > architecture specifies a .cpu_can_disable() function.
> > > >
> > > > Ah, that is 32-bit ARM code only. This issue is on 64-bit ARM (arch/arm64/).
> > >
> > > Apologies!  I will look more carefully at the pathnames next time!
> > >
> > > But maybe arm64 needs something similar?
> >
> > Just chiming in quickly from the arm64 side here, but there's nothing in the
> > architecture that precludes offlining CPU 0 and it certainly works on some
> > platforms, so I'd be hesitant to rule it out entirely for testing.
> >
> > One reason why hotplug can fail in practice is if a trusted OS (i.e. code
> > running on the secure side of the fence outside of Linux's view of the
> > world) is resident on a core and rejects firmware requests to power it
> > off. The PSCI code (drivers/firmware/psci/) should detect this and return
> > -EPERM, although earlier in this thread there was mention of -EBUSY so it
> > sounds like something else...
> 
> Thank you for the heads up on that. To give you context, I am
> currently testing rcutorture on stable kernels 5.10, 5.15, 6.1 on my
> ARM64 QC7180 board. I certainly don't want to hit the -EPERM in the
> future on this or other ARM64 hardware. It would be great if
> cpu_psci_cpu_can_disable() in arm64 can return false if hotplugging
> causes -EPERM indefinitely. Then we do not need to make any changes.

That should already be the case, and I think we're good on that front.

A trusted OS that blocks offlining a CPU will always be resident on one
specific CPU: we have no code to migrate a trusted OS across CPUs (as this is
not standardised), and no code to instantiate a trusted OS from Linux. Where a
non-migratable trusted OS is present, it will have been instantiated before
Linux booted, and will therefore be on CPU0 (or on a CPU that Linux is not
using at all).

Given the above, the return value of cpu_psci_cpu_can_disable() should not
change for a given CPU, and it should only be able to return false on CPU0.
Most systems don't have a trusted OS blocking PSCI CPU_OFF, and on those
systems CPU0 can be offlined.
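
As a rough illustration of the reasoning above, here is a small userspace model
in Python. It is not the kernel code: can_disable(), hotpluggable_cpus() and
tos_resident_cpu are hypothetical names, and the model simply assumes firmware
reports at most one CPU hosting a non-migratable trusted OS.

```python
from typing import Iterable, List, Optional


def can_disable(cpu: int, tos_resident_cpu: Optional[int]) -> bool:
    """Model of the check: a CPU may be offlined unless the trusted OS is on it.

    tos_resident_cpu is a hypothetical stand-in for whatever firmware reports;
    None means no non-migratable trusted OS is present.
    """
    return cpu != tos_resident_cpu


def hotpluggable_cpus(cpus: Iterable[int],
                      tos_resident_cpu: Optional[int] = None) -> List[int]:
    """Return the CPUs that the model considers safe to offline.

    The answer depends only on fixed boot-time state, so it never changes
    for a given CPU while the system is running.
    """
    return [cpu for cpu in cpus if can_disable(cpu, tos_resident_cpu)]
```

With tos_resident_cpu=0 on a four-CPU system, the model leaves CPUs 1-3
offlinable; with no resident trusted OS, all four (including CPU0) are.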

Thanks,
Mark.


