Lockdep warnings on kexec (virtio_blk, hrtimers)
Thomas Gleixner
tglx at linutronix.de
Thu Dec 12 16:14:54 PST 2024
On Thu, Dec 12 2024 at 19:19, David Woodhouse wrote:
> On Thu, 2024-12-12 at 19:04 +0100, Thomas Gleixner wrote:
> Build current master (231825b2e1ff here). The config I'm using is at
> http://david.woodhou.se/config-x86-kjump-irqs although I don't think
> there's anything special other than CONFIG_KEXEC_JUMP and enough
> lockdep to trigger the complaints.
>
> I've been running in qemu with the test case shoved into an initrd for
> faster testing, but it works just as well done manually. If it matters,
> the QEMU command line on my Haswell box is
>
> qemu-system-x86_64 -accel kvm,kernel-irqchip=split -display none \
> -serial mon:stdio -kernel arch/x86/boot/bzImage -smp 2 -m 2g \
> -append "console=ttyS0 root=/dev/vda1 no_console_suspend earlyprintk=serial" \
> -drive file=/var/lib/libvirt/images/fedora.qcow2,if=virtio \
> -cpu host --no-reboot -nic user,model=virtio
>
> Probably the only important part of that is the no_console_suspend.
I misread that and somehow thought it does _not_ happen with
no_console_suspend. Duh!
So that made it reproduce and after adding some more lockdep hacks here
is the culprit:
[ 15.177945] DEBUG_LOCKS_WARN_ON(suspend_syscore_active)
[ 15.178980] CPU: 0 UID: 0 PID: 410 Comm: k.1 Not tainted 6.13.0-rc2+ #118
[ 15.185613] Call Trace:
[ 15.185760] <TASK>
[ 15.187678] __cond_resched+0x21/0x60
[ 15.187923] down_timeout+0x18/0x60
[ 15.188159] acpi_os_wait_semaphore+0x4c/0x80
[ 15.188424] acpi_ut_acquire_mutex+0x3d/0x100
[ 15.188665] acpi_ns_get_node+0x27/0x60
[ 15.188879] acpi_ns_evaluate+0x1cb/0x2d0
[ 15.189106] acpi_rs_set_srs_method_data+0x156/0x190
[ 15.189382] acpi_pci_link_set+0x11c/0x290
[ 15.189618] irqrouter_resume+0x54/0x60
[ 15.189826] syscore_resume+0x6a/0x200
[ 15.190033] kernel_kexec+0x145/0x1c0
[ 15.190239] __do_sys_reboot+0xeb/0x240
[ 15.190561] do_syscall_64+0x95/0x180
This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a wakeup and the scheduler
sets the NEED_RESCHED flag.
cond_resched() sees it set and invokes schedule() from a completely bogus
context.
Nothing knowns about the current system state and therefore happily
assumes that everything is cool and shiny.
I also found that kexec jump fails to set system_state = SYSTEM_SUSPEND
before syscore_suspend() and back to SYSTEM_RUNNING after
syscore_resume(). See suspend/hibernate...
But setting this does not solve the problem because NEED_RESCHED still
gets set and cond_resched() is oblivious of the state.
Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operation
it's just a question of time.
And of course then you need a code path which is invoked after that,
which actually invokes cond_resched().
Adding more debug, there are indeed a couple of ways to set NEED_RESCHED
between invoking syscore_suspend() and returning from syscore_resume().
Some stuff schedules work [un]conditonally.
So the real question is how to deal with the general problem, which
obviously affects suspend as well.
The only thing I came up with so far is the hack below. It cures the
problem for PREEMPT_NONE and PREEMPT_VOLUNTARY. PREEMPT_FULL is not
affected because preemptible() checks whether the preemption count is
zero and interrupts are enabled and cond_resched() not active.
We could check for interrupts enabled in cond_resched() too, but that's
not pretty either.
With that applied the problem goes away, but after a lot of repetitions
of the reproducer in a tight loop the whole machinery stops dead:
[ 29.913179] Disabling non-boot CPUs ...
[ 29.930328] smpboot: CPU 1 is now offline
[ 29.930593] crash hp: kexec_trylock() failed, kdump image may be inaccurate
B[ 29.940588] Enabling non-boot CPUs ...
[ 29.940856] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[ 29.941242] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 29.942654] CPU1 is up
[ 29.945856] virtio_blk virtio1: 2/0/0 default/read/poll queues
[ 29.948556] OOM killer enabled.
[ 29.948750] Restarting tasks ... done.
Success
[ 29.960044] Freezing user space processes
[ 29.961447] Freezing user space processes completed (elapsed 0.001 seconds)
[ 29.961861] OOM killer disabled.
[ 30.102485] ata2: found unknown device (class 0)
[ 30.107387] Disabling non-boot CPUs ...
That happens without 'no_console_suspend' on the command line as
well, but that's for tomorrow ...
Thanks,
tglx
---
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1025,6 +1025,7 @@ int kernel_kexec(void)
if (error)
goto Enable_cpus;
local_irq_disable();
+ system_state = SYSTEM_SUSPEND;
error = syscore_suspend();
if (error)
goto Enable_irqs;
@@ -1054,6 +1055,7 @@ int kernel_kexec(void)
if (kexec_image->preserve_context) {
syscore_resume();
Enable_irqs:
+ system_state = SYSTEM_RUNNING;
local_irq_enable();
Enable_cpus:
suspend_enable_secondary_cpus();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
#if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
int __sched __cond_resched(void)
{
- if (should_resched(0)) {
+ if (should_resched(0) && system_state != SYSTEM_SUSPEND) {
preempt_schedule_common();
return 1;
}
More information about the kexec
mailing list