Memory corruption after resume from hibernate with Arm GICv3 ITS
Rafael J. Wysocki
rafael at kernel.org
Thu Jul 24 02:51:42 PDT 2025
On Thu, Jul 24, 2025 at 11:26 AM David Woodhouse <dwmw2 at infradead.org> wrote:
>
> On Wed, 2025-07-23 at 12:04 +0200, David Woodhouse wrote:
> > We have seen guests crashing when, after they resume from hibernate,
> > the hypervisor serializes their state for live update or live
> > migration.
> >
> > The Arm Generic Interrupt Controller is a complicated beast, and it
> > does scattershot DMA to little tables all across the guest's address
> > space, without even living behind an IOMMU.
> >
> > Rather than simply turning it off overall, the guest has to explicitly
> > tear down *every* one of the individual tables which were previously
> > configured, in order to ensure that the memory is no longer used.
> >
> > KVM's implementation of the virtual GIC only uses this guest memory
> > when asked to serialize its state. Instead of passing the information
> > up to userspace as most KVM devices will do for serialization, KVM
> > *only* supports scribbling it to guest memory.
> >
> > So, when the transition from boot to resumed kernel leaves the vGIC
> > pointing at the *wrong* addresses, that's why a subsequent LU/LM of
> > that guest triggers the memory corruption by writing the KVM state to a
> > guest address that the now-running kernel did *not* expect.
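
For context, KVM only writes those tables out when the VMM explicitly asks it
to through the ITS KVM device; a minimal sketch of that trigger is below (the
helper name and the its_fd handling are illustrative, the attribute constants
are the ones from the KVM UAPI):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Ask KVM to flush the ITS device/ITT/collection tables into guest RAM
 * at whatever addresses the guest last programmed via GITS_BASER/MAPD.
 * its_fd is assumed to be the fd returned by KVM_CREATE_DEVICE for
 * KVM_DEV_TYPE_ARM_VGIC_ITS; error handling is omitted.
 */
static int vgic_its_save_tables(int its_fd)
{
	struct kvm_device_attr attr = {
		.group = KVM_DEV_ARM_VGIC_GRP_CTRL,
		.attr  = KVM_DEV_ARM_ITS_SAVE_TABLES,
	};

	return ioctl(its_fd, KVM_SET_DEVICE_ATTR, &attr);
}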
> >
> > I tried this, just to get some more information:
> >
> > --- a/drivers/irqchip/irq-gic-v3-its.c
> > +++ b/drivers/irqchip/irq-gic-v3-its.c
> > @@ -720,7 +720,7 @@ static struct its_collection *its_build_mapd_cmd(struct its_node *its,
> > its_encode_valid(cmd, desc->its_mapd_cmd.valid);
> >
> > its_fixup_cmd(cmd);
> > -
> > + printk("%s dev 0x%x valid %d addr 0x%lx\n", __func__, desc->its_mapd_cmd.dev->device_id, desc->its_mapd_cmd.valid, itt_addr);
> > return NULL;
> > }
> >
> > @@ -4996,10 +4996,15 @@ static int its_save_disable(void)
> > struct its_node *its;
> > int err = 0;
> >
> > + printk("%s\n", __func__);
> > raw_spin_lock(&its_lock);
> > list_for_each_entry(its, &its_nodes, entry) {
> > + struct its_device *its_dev;
> > void __iomem *base;
> >
> > + list_for_each_entry(its_dev, &its->its_device_list, entry) {
> > + its_send_mapd(its_dev, 0);
> > + }
> > base = its->base;
> > its->ctlr_save = readl_relaxed(base + GITS_CTLR);
> > err = its_force_quiescent(base);
> > @@ -5032,8 +5037,10 @@ static void its_restore_enable(void)
> > struct its_node *its;
> > int ret;
> >
> > + printk("%s\n", __func__);
> > raw_spin_lock(&its_lock);
> > list_for_each_entry(its, &its_nodes, entry) {
> > + struct its_device *its_dev;
> > void __iomem *base;
> > int i;
> >
> > @@ -5083,6 +5090,10 @@ static void its_restore_enable(void)
> > if (its->collections[smp_processor_id()].col_id <
> > GITS_TYPER_HCC(gic_read_typer(base + GITS_TYPER)))
> > its_cpu_init_collection(its);
> > +
> > + list_for_each_entry(its_dev, &its->its_device_list, entry) {
> > + its_send_mapd(its_dev, 1);
> > + }
> > }
> > raw_spin_unlock(&its_lock);
> > }
> >
> >
> > Running on a suitable host with qemu, I reproduce with
> > # echo reboot > /sys/power/disk
> > # echo disk > /sys/power/state
> >
> > Example qemu command line:
> > qemu-system-aarch64 -serial mon:stdio -M virt,gic-version=host -cpu max -enable-kvm \
> >   -drive file=~/Fedora-Cloud-Base-Generic-42-1.1.aarch64.qcow2,id=nvm,if=none,snapshot=off,format=qcow2 \
> >   -device nvme,drive=nvm,serial=1 -m 8g -nographic -nic user,model=virtio \
> >   -kernel vmlinuz-6.16.0-rc7-dirty -initrd initramfs-6.16.0-rc7-dirty.img \
> >   -append 'root=UUID=6c7b9058-d040-4047-a892-d2f1c7dee687 ro rootflags=subvol=root no_timer_check console=tty1 console=ttyAMA0,115200n8 systemd.firstboot=off rootflags=subvol=root no_console_suspend=1 resume_offset=366703 resume=/dev/nvme0n1p3' \
> >   -trace gicv3_its\*
> >
> > As the kernel boots up for the first time, it sends a normal MAPD command:
> >
> > [ 1.292956] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> >
> > On hibernation, my newly added code unmaps and then *remaps* the same:
> >
> > [root at localhost ~]# echo disk > /sys/power/state
> > [ 42.118573] PM: hibernation: hibernation entry
> > [ 42.134574] Filesystems sync: 0.015 seconds
> > [ 42.134899] Freezing user space processes
> > [ 42.135566] Freezing user space processes completed (elapsed 0.000 seconds)
> > [ 42.136040] OOM killer disabled.
> > [ 42.136307] PM: hibernation: Preallocating image memory
> > [ 42.371141] PM: hibernation: Allocated 297401 pages for snapshot
> > [ 42.371163] PM: hibernation: Allocated 1189604 kbytes in 0.23 seconds (5172.19 MB/s)
> > [ 42.371170] Freezing remaining freezable tasks
> > [ 42.373465] Freezing remaining freezable tasks completed (elapsed 0.002 seconds)
> > [ 42.378350] Disabling non-boot CPUs ...
> > [ 42.378363] its_save_disable
> > [ 42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> > [ 42.378363] PM: hibernation: Creating image:
> > [ 42.378363] PM: hibernation: Need to copy 153098 pages
> > [ 42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
> > [ 42.378363] its_restore_enable
> > [ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> > [ 42.383601] nvme nvme0: 1/0/0 default/read/poll queues
> > [ 42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
> > [ 42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
> > [ 42.387742] PM: Using 1 thread(s) for lzo compression
> > [ 42.387748] PM: Compressing and saving image data (115654 pages)...
> > [ 42.387757] PM: Image saving progress: 0%
> > [ 43.485794] PM: Image saving progress: 10%
> > [ 44.739662] PM: Image saving progress: 20%
> > [ 46.617453] PM: Image saving progress: 30%
> > [ 48.437644] PM: Image saving progress: 40%
> > [ 49.857855] PM: Image saving progress: 50%
> > [ 52.156928] PM: Image saving progress: 60%
> > [ 53.344810] PM: Image saving progress: 70%
> > [ 54.472998] PM: Image saving progress: 80%
> > [ 55.083950] PM: Image saving progress: 90%
> > [ 56.406480] PM: Image saving progress: 100%
> > [ 56.407088] PM: Image saving done
> > [ 56.407100] PM: hibernation: Wrote 462616 kbytes in 14.01 seconds (33.02 MB/s)
> > [ 56.407106] PM: Image size after compression: 148041 kbytes
> > [ 56.408210] PM: S|
> > [ 56.642393] Flash device refused suspend due to active operation (state 20)
> > [ 56.642871] Flash device refused suspend due to active operation (state 20)
> > [ 56.643432] reboot: Restarting system
> > [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd4f1]
> >
> > Then the *boot* kernel comes up, does its own MAPD using a slightly different address:
> >
> > [ 1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
> >
> > ... and then transfers control to the hibernated kernel, which again
> > tries to unmap and remap the ITT at its original address due to my
> > suspend/resume hack (which is clearly hooking the wrong thing, but is
> > at least giving us useful information):
> >
> > Starting systemd-hibernate-resume.service - Resume from hibernation...
> > [ 1.391340] PM: hibernation: resume from hibernation
> > [ 1.391861] random: crng reseeded on system resumption
> > [ 1.391927] Freezing user space processes
> > [ 1.392984] Freezing user space processes completed (elapsed 0.001 seconds)
> > [ 1.393473] OOM killer disabled.
> > [ 1.393486] Freezing remaining freezable tasks
> > [ 1.395012] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> > [ 1.400817] PM: Using 1 thread(s) for lzo decompression
> > [ 1.400832] PM: Loading and decompressing image data (115654 pages)...
> > [ 1.400836] hibernate: Hibernated on CPU 0 [mpidr:0x0]
> > [ 1.438621] PM: Image loading progress: 0%
> > [ 1.554623] PM: Image loading progress: 10%
> > [ 1.594714] PM: Image loading progress: 20%
> > [ 1.639317] PM: Image loading progress: 30%
> > [ 1.683055] PM: Image loading progress: 40%
> > [ 1.720726] PM: Image loading progress: 50%
> > [ 1.768878] PM: Image loading progress: 60%
> > [ 1.800203] PM: Image loading progress: 70%
> > [ 1.822833] PM: Image loading progress: 80%
> > [ 1.840985] PM: Image loading progress: 90%
> > [ 1.871253] PM: Image loading progress: 100%
> > [ 1.871611] PM: Image loading done
> > [ 1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)
> > [ 42.378350] Disabling non-boot CPUs ...
> > [ 42.378363] its_save_disable
> > [ 42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> > [ 42.378363] PM: hibernation: Creating image:
> > [ 42.378363] PM: hibernation: Need to copy 153098 pages
> > [ 42.378363] hibernate: Restored 0 MTE pages
> > [ 42.378363] its_restore_enable
> > [ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> > [ 42.417445] OOM killer enabled.
> > [ 42.417455] Restarting tasks: Starting
> > [ 42.419915] nvme nvme0: 1/0/0 default/read/poll queues
> > [ 42.420407] Restarting tasks: Done
> > [ 42.420781] PM: hibernation: hibernation exit
> > [ 42.421149] nvme nvme0: Ignoring bogus Namespace Identifiers
>
> Rafael points out that the resumed kernel isn't doing the unmap/remap
> again; it's merely printing the *same* messages again from the printk
> buffer.
>
> Before writing the hibernate image, the kernel calls the suspend op:
>
> [ 42.378350] Disabling non-boot CPUs ...
> [ 42.378363] its_save_disable
> [ 42.378363] its_build_mapd_cmd dev 0x10 valid 0 addr 0x10f010000
> [ 42.378363] PM: hibernation: Creating image:
>
> Those messages are stored in the printk buffer in the image. Then the
> hibernating kernel calls the resume op, and writes the image:
>
> [ 42.378363] PM: hibernation: Image created (115354 pages copied, 37744 zero pages)
> [ 42.378363] its_restore_enable
> [ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [ 42.383601] nvme nvme0: 1/0/0 default/read/poll queues
> [ 42.384411] nvme nvme0: Ignoring bogus Namespace Identifiers
> [ 42.384924] hibernate: Hibernating on CPU 0 [mpidr:0x0]
> [ 42.387742] PM: Using 1 thread(s) for lzo compression
> [ 42.387748] PM: Compressing and saving image data (115654 pages)...
> [ 42.387757] PM: Image saving progress: 0%
> [ 43.485794] PM: Image saving progress: 10%
> ...
>
> Then the boot kernel comes up and maps an ITT:
>
> [ 1.270652] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f009000
>
> The boot kernel never seems to *unmap* that because the suspend method
> doesn't get called before resuming the image.
>
> On resume, the previous kernel flushes the messages which were in its
> printk buffer to the serial port again, and then prints these *new*
> messages...
>
> [ 42.378363] hibernate: Restored 0 MTE pages
> [ 42.378363] its_restore_enable
> [ 42.378363] its_build_mapd_cmd dev 0x10 valid 1 addr 0x10f010000
> [ 42.417445] OOM killer enabled.
> [ 42.417455] Restarting tasks: Starting
>
> So the hibernated kernel seems to be doing the right thing in both
> suspend and resume phases but it looks like the *boot* kernel doesn't
> call the suspend method before transitioning;
No, it does call the suspend method, but the messages are missing from the log.

The last message you see from the boot/restore kernel is about loading
the image; a lot of stuff happens afterwards.

This message:

[ 1.871617] PM: hibernation: Read 462616 kbytes in 0.47 seconds (984.28 MB/s)

is printed by load_compressed_image(), which gets called by
swsusp_read(), which is invoked by load_image_and_restore().
It is successful, so hibernation_restore() gets called and it does
quite a bit of work, including calling resume_target_kernel(), which
among other things calls syscore_suspend(), from where your messages
should be printed if I'm not mistaken.
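
For reference, those are the hooks hacked up above; they are registered as
syscore ops in drivers/irqchip/irq-gic-v3-its.c (via register_syscore_ops()),
so syscore_suspend() on this path should run its_save_disable() and hence
print the added MAPD messages:

static struct syscore_ops its_syscore_ops = {
	.suspend = its_save_disable,
	.resume = its_restore_enable,
};
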
I have no idea why those messages don't get into the log (that would
happen if your boot kernel were different from the image kernel and it
didn't actually print them).
> is that intentional? I think we *should* unmap all the ITTs from the boot kernel.
Yes, it's better to unmap them, even though ->
> At least for the vGIC, when the hibernated image resumes it will
> *change* the mapping for every device that it knows about, but there's
> a *possibility* that the boot kernel might have set up one that the
> hibernated kernel didn't know about (if a new PCI device exists now?).
-> HW configuration is not supposed to change across hibernation/restore.
> And I'm not sure what the real hardware will do if it gets a subsequent
> MAPD without the previous one being unmapped.