BUG?: kernel does not (re)set irq smp_affinity to reboot_cpu

Hans de Goede hdegoede at redhat.com
Mon Jun 27 05:53:16 PDT 2016


Hi Russell,

On 27-06-16 13:31, Russell King - ARM Linux wrote:
> On Mon, Jun 27, 2016 at 12:55:26PM +0200, Hans de Goede wrote:
>> Hi Russel,
>>
>> On 27-06-16 11:45, Russell King - ARM Linux wrote:
>>> On Mon, Jun 27, 2016 at 10:13:05AM +0100, Marc Zyngier wrote:
>>>> I'm wondering if that's not an effect of this patch:
>>>>
>>>> https://lkml.org/lkml/2015/9/24/138
>>>>
>>>> missing on the ARM side (the corresponding arm64 patch is 217d453d473c).
>>>
>>> No, because we don't take the other CPUs offline through CPU hotplug at
>>> reboot - we stop them.  That's because CPU hotplug involves scheduling,
>>> and a reboot can't be scheduled as it can happen from IRQ contexts.
>>>
>>> For a long time, we have not supported IRQs on any CPU after the system
>>> has gone down for halt/reboot/poweroff etc:
>>>
>>> ipi_cpu_stop() disables IRQs and FIQs before entering an infinite loop.
>>> machine_{halt,power_off,restart}() in arch/arm/kernel/reboot.c disables
>>> IRQs on the requesting CPU.
>>>
>>> So, IRQs get disabled on _all_ CPUs.  Code after this point should not
>>> re-enable IRQs to be able to use drivers, which it sounds like what's
>>> happening in Hans scenario.  Remember, as I've said above, these paths
>>> should not even be scheduling, and should never be reliant on receiving
>>> interrupts.  *Especially* as they can themselves be called from IRQ
>>> context.
>>
>> First of all thanks for your input.
>>
>> Note this is not reboot, this is poweroff.
>
> I think I covered that - all the paths are indentical in the ARM
> architecture code, and have been identical in this respect well before
> any of the drivers you've pointed out.

They may be identical, but there is no _need_ for them to be identical
AFAICT.

>> And for poweroff many (ARM) boards depend on working i2c, which
>> depends on irqs, for example all these mfd drivers:
>>
>> drivers/mfd/rn5t618.c
>> drivers/mfd/twl4030-power.c
>> drivers/mfd/palmas.c
>> drivers/mfd/dm355evm_msp.c
>> drivers/mfd/tps6586x.c
>> drivers/mfd/retu-mfd.c
>> drivers/mfd/max8907.c
>> drivers/mfd/tps65910.c
>> drivers/mfd/tps80031.c
>> drivers/mfd/rk808.c
>> drivers/mfd/axp20x.c
>>
>> Define pm_power_off and use i2c.
>
> Right, so these drivers are all buggy, and need fixing.

Maybe if so many drivers need fixing, the conditions set
down for poweroff are wrong, and we need to fix those instead ?

Also a little correction on my previous maik, previously I
listed: drivers/memory/emif.c (omap4 / omap5) as calling
kernel_power_off() from an interrupt context, but it is
actually doing a return IRQ_WAKE_THREAD and running it
from a threaded interrupt handler.

That only leaves:

arch/arm/mach-ixp4xx/dsmg600-setup.c
arch/arm/mach-ixp4xx/nas100d-setup.c
arch/arm/mach-ixp4xx/nslu2-setup.c

Which directly call machine_power_off in interrupt context,
and they all 3 define their own pm_power_off which
directly toggles a gpio.

AFAIK these 3 are all single core machines, so they
might just as well directly call their own pm_power_off
implementation, at which point there are no callers
which call machine_power_off from a non-schedulable context.

Note this does not mean that all pm_power_off implementations
are going to be happy with machine_power_off leaving irqs
enabled, I would esp. expect the efi and psci implementations
to potentially be unhappy about this. See below for a proposal
to deal with this.

>> So although you may very well be right that using irqs to implement poweroff
>> is not how things should be, in practice we've been using them for this for
>> quite a while now and this usually works fine.
>
> ... and they're all violating the conditions set down for by the
> architecture for an orderly poweroff

This sounds a lot like "we're doing things this way because we always
have been doings things this way", which does not really sound like
a good argument to me.

> presumably the reason this
> works for !SMP cases is because somewhere along the path, they're
> re-enabling IRQs behind the back of architecture code.
>
>> So it seems that the assumption that machine_power_off may be called
>> from irq context is not always true, specifically it is only true on
>> certain platforms (mach-ixp4xx, omap4, omap5 and whatever uses
>> ab8500.c). I would expect the pm_power_off implementations on these
>> platforms to indeed not use irqs themselves, that would indeed be
>> bad.
>
> Right, but the overriding thing here is that it _may_ be called from
> IRQ context _and_ pm_power_off() is called with IRQs disabled.  That
> second one is the more important point - pm_power_off() handlers are
> called with a non-schedulable context.
>
>> Which brings us back to our original problem, how do we fix
>> irq smp_affinity on power off ?
>
> Only if we accept that pm_power_off() should be called with IRQs
> enabled, which we haven't ascertained yet.

<snip>

> Now, we could do as you are suggesting, and route IRQs to the
> remaining CPU via all shutdown paths, but that would be papering over
> the fundamental bug here: if a function is called with IRQs disabled,
> it (or any called function) has no business re-enabling IRQs.

Which brings us to the fundamental question why are we disabling
irqs in machine_poweroff ?

Maybe this is necessary on some boards (efi, psci?), but at the
same time it breaks things badly on other boards. Thinking about
this more I actually believe that on boards with a i2c pmic
pm_power_off MUST be called with irqs enabled:

(from your second reply:)

 > More to that, the I2C core layer is setup to allow i2c_transfer() to
 > be called from non-schedulable contexts:
 >
 >                 if (in_atomic() || irqs_disabled()) {
 >                         ret = adap->trylock_bus(adap, I2C_LOCK_SEGMENT);
 >                         if (!ret)
 >                                 /* I2C activity is ongoing. */
 >                                 return -EAGAIN;
 >
 > prior to calling into the adapters ->master_xfer() function.  This
 > acknowledges that, if i2c_transfer() is called in a context which
 > is not schedulable or IRQs are disabled, the adapters ->master_xfer()
 > needs to handle this situation.

So what happens if we disable irqs as you suggest and machine_power_off
gets called when an i2c transfer is ongoing?

Then the i2c write in the pmic's pm_power_off implementation would
fail with EAGAIN, now we can but a retry loop with busy waiting around
this, but since irqs are disabled, the ongoing transfer will never
complete, so we will just sit there forever.

IOW if we hit a race where we do machine_power_off while an ongoing
i2c transfer is happening on the same i2c bus as the pmic, the only
way we can reliable poweroff is to actually _allow_ irqs so that
that transfer can complete and we can then do the poweroff.

IOW for i2c pmics pm_power_off MUST be called with irqs enabled.


So I see 2 solutions for this:

1) Never disable irqs in arm's machine_power_off, this assumes that
all pm_power_off implementations are safe to run with irqs enabled,
which is a big unknown really; or

2) Add a pm_power_off_needs_irqs flag and make machine_power_off
behave accordingly when that is set


2 would allow us to properly deal with the i2c pmic case (which currently
seems to work by chance as long as irq-balanced does not mess things up
wrt irq affinity), while at the same time ensuring that we do not break
any non i2c pm_power_off methods which may rely on irqs being turned off.

Regards,

Hans



p.s.

Something related to this is that most pmic drivers do:

         if (!pm_power_off)
                 pm_power_off = foo_power_off;

Which seems hardly race free, either we assume there is only one
driver per platform implementing pm_power_off and the
if() is not necessary, or we need some locking here. Maybe
a pm_set_power_off_func() helper would be a good idea ?



More information about the linux-arm-kernel mailing list