[PATCH] arm64: kdump: fix interrupt handling done during machine_crash_shutdown

Fri Mar 2 08:57:39 PST 2018

On Fri, Mar 02, 2018 at 04:44:13PM +0000, Mark Rutland wrote:
> On Fri, Mar 02, 2018 at 02:52:07PM +0100, Grzegorz Jaszczyk wrote:
> > 2018-03-02 14:15 GMT+01:00 Mark Rutland <mark.rutland at arm.com>:
> > > Do you see this for a panic() in *any* interrupt handler?
> > 
> > I only tested with these two interrupt handlers, watchdog and i2c, but
> > I think it will behave the same with others - I can try with others if
> > you want; any suggestion which? Maybe with some PPI interrupt instead?
> > >
> > > Can you trigger the issue with magic-sysrq c, for example?
> > 
> > There is no problem when I trigger it via 'echo c >
> > /proc/sysrq-trigger' - it works well all the time. The problem appears
> > only when the kexec/kdump procedure is triggered from interrupt
> > context.
> 
> I'd meant that you'd send sysrq + c over serial, rather than writing to
> /proc/sysrq-trigger. That way, the panic will be in the context of the
> UART IRQ handler.
> 
> If that shows the issue, that's likely to be the easiest way for
> someone else to reproduce and investigate this.

FWIW, I've just given this a go on my Juno R1 with v4.16-rc3 defconfig
(sysrq + c sent over serial), and the UART IRQs work fine in the crash
kernel. The crash happened in IRQ context:

[  384.653153] Call trace:
[  384.655581]  sysrq_handle_crash+0x20/0x30
[  384.659559]  __handle_sysrq+0xa8/0x1a0
[  384.663278]  handle_sysrq+0x28/0x38
[  384.666738]  pl011_fifo_to_tty+0x150/0x1a8
[  384.670801]  pl011_int+0x30c/0x430
[  384.674177]  __handle_irq_event_percpu+0x5c/0x148
[  384.678843]  handle_irq_event_percpu+0x34/0x88
[  384.683250]  handle_irq_event+0x48/0x78
[  384.687056]  handle_fasteoi_irq+0xa8/0x180
[  384.691119]  generic_handle_irq+0x24/0x38
[  384.695095]  __handle_domain_irq+0x5c/0xb0
[  384.699158]  gic_handle_irq+0x58/0xa8
[  384.702790]  el1_irq+0xb0/0x128
[  384.705907]  cpuidle_enter_state+0x138/0x220
[  384.710142]  cpuidle_enter+0x18/0x20
[  384.713690]  call_cpuidle+0x1c/0x38
[  384.717151]  do_idle+0x1b0/0x1e8
[  384.720354]  cpu_startup_entry+0x20/0x28
[  384.724246]  rest_init+0xd0/0xe0
[  384.727450]  start_kernel+0x3e4/0x410
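
For anyone who wants to check a trace like this mechanically, below is a
rough sketch that scans a dmesg-style call trace for IRQ entry frames. It
is not taken from any kernel tool: the helper names are hypothetical, and
the symbol list is an assumption based only on the el1_irq and
gic_handle_irq frames visible in the trace above, so it is not exhaustive.

```python
import re

# Assumed set of frames that indicate IRQ entry; taken from the trace
# above, not an exhaustive list for every platform or irqchip.
IRQ_ENTRY_SYMBOLS = {"el1_irq", "gic_handle_irq"}

# Matches frames such as "[  384.702790]  el1_irq+0xb0/0x128".
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)


def trace_symbols(log_text):
    """Return the symbol names of all call-trace frames, in order."""
    symbols = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            symbols.append(match.group(1))
    return symbols


def crashed_in_irq_context(log_text):
    """Hypothetical helper: True if any frame is a known IRQ entry symbol."""
    return any(sym in IRQ_ENTRY_SYMBOLS for sym in trace_symbols(log_text))


# A few frames copied from the trace above.
SAMPLE = """\
[  384.653153] Call trace:
[  384.655581]  sysrq_handle_crash+0x20/0x30
[  384.699158]  gic_handle_irq+0x58/0xa8
[  384.702790]  el1_irq+0xb0/0x128
[  384.705907]  cpuidle_enter_state+0x138/0x220
"""

print(trace_symbols(SAMPLE))
print(crashed_in_irq_context(SAMPLE))
```

The "Call trace:" header line is skipped because it does not have the
symbol+offset/size shape of a frame.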

On a separate note, the crashkernel complained:

[    0.224730] CPU: CPUs started in inconsistent modes

... which is a separate disaster. I suspect the kexec code failed to punt the
crash CPU back to EL2 as it should have.
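
To illustrate what that warning means, here is a simplified model of the
idea, not the kernel's actual implementation: each CPU enters the kernel
at some exception level, and the boot is flagged as inconsistent when
they do not all agree. The function name and the example scenario (crash
CPU entering at EL1 while the other CPUs enter at EL2) are assumptions
made for illustration, following the suspicion above.

```python
def classify_boot_modes(boot_els):
    """Classify a list of per-CPU boot exception levels (1 or 2).

    Hypothetical helper: returns "all EL2", "all EL1", or "inconsistent".
    """
    modes = set(boot_els)
    if not modes or not modes <= {1, 2}:
        raise ValueError("expected a non-empty list of exception levels 1 or 2")
    if modes == {2}:
        return "all EL2"
    if modes == {1}:
        return "all EL1"
    return "inconsistent"


# Assumed scenario: the crash CPU was left at EL1 by the kexec path,
# while the remaining CPUs came up at EL2.
crash_kernel_cpus = [1, 2, 2, 2]
print(classify_boot_modes(crash_kernel_cpus))

# A normal boot where every CPU enters at EL2.
print(classify_boot_modes([2, 2, 2, 2]))
```

If the crash CPU had been returned to EL2 before entering the crash
kernel, the first case would look like the second one.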

Thanks,
Mark.