[Hypervisor Live Update] Notes from April 7, 2025
David Rientjes
rientjes at google.com
Sun Apr 13 18:57:46 PDT 2025
Hi everybody,
Here are the notes from the last Hypervisor Live Update call that happened
on Monday, April 7. Thanks to everybody who was involved!
These notes are intended to bring people who could not attend the call up
to speed, as well as to keep the conversation going between meetings.
----->o-----
We debriefed the discussions at LSF/MM/BPF. The general understanding
was that the core MM community didn't have any major concerns or feedback
for the approach discussed, as long as no intrusive changes are made.
This would likely only become a concern once extensions are made to
preserve hugetlbfs or tmpfs.
----->o-----
LUO and fdbox were discussed at LSF/MM/BPF. Jason suggested having
everything preserved using fds, including a single char device interface.
This could require some significant changes to the VMM: Pasha noted we'd
have a VM suspended to memory but some KVM-specific state would need to be
preserved for Confidential VM use cases. The VMM would still do the same
call pattern as today (open /dev/kvm, lots of ioctls) but would also note
to the kernel that some specific state would need to be restored for the
VM, rather than retrieving the full fd for /dev/kvm that is preserved.
Jason said VFIO and IOMMU need the /dev/kvm fd so there is no option
other than to preserve the full KVM as well -- otherwise we cannot
restore the full iommufd. Pasha noted an alternative would be to
preserve memory using the fd and recreate the IOMMU with the memory that
was preserved. Jason noted KVM would have to be involved once we start
to preserve vIOMMU and for Confidential Computing. Pasha was concerned
with the amount of code changes that would be required for qemu and other
VMMs.
Jason stressed that starting up a VM in this case will inevitably be
different from starting up a clean VM. This will especially be required
for vIOMMU, but not necessarily only for vIOMMU; for example, the VMID
must be the same one that KVM uses on the IOMMU and CPU side for ARM, and
this can't be disrupted during the KHO. James Gowans asked if this state
could all be serialized to/from userspace, which would not be transparent.
There was general debate about preserving all fds; Jason argued that it
will be complex but likely there is not an alternative. The underlying
hardware state would be destroyed when attempting to restore the IOMMUFD.
We have to preserve the hardware state, which is a different challenge
from those KVM faces, because KVM does not have the underlying hardware
state. He offered an example of preserving eight VMs with
corresponding IOMMU hardware state and how to map this to the correct VM
on the other side of the kexec. He was also concerned about what
permissions would be required to open an fd and take over a KHO; in this
case, a security token would be needed.
Jason noted the only thing VFIO needs to preserve is the fact that it
does not need to FLR the device and which iommufd is controlling the
translation. Preferably, there would be a consistent way of doing this
throughout the kernel, such as preserving fds, rather than anything
hacky; for this, we have freedom to determine what is supported with KHO
and what is not.
----->o-----
We discussed open questions for KHO, fdbox, and LUO after LSF/MM/BPF.
Pratyush wanted a feel for where this goes so that the next version of
fdbox could be worked on; clarity was needed in establishing fdbox's role
and where it overlaps with KHO. Pasha noted LUO was handling the state
machine and the dependency chain for devices -- this starts to fully
overlap fdbox. Pratyush noted it would be fine for fdbox to be part of
LUO and he would follow up by looking at the latest LUO series.
----->o-----
Changyuan Lyu discussed what should be saved in the KHO FDT. Alex's
original patches allowed for copying smaller amounts of memory directly
into the FDT, or for specifying a pointer to larger chunks of memory that
the new kernel would fetch by way of the FDT. Changyuan suggested only
allowing KHO users to save pointers to memory into the FDT, leaving it to
the users to interpret the preserved data. Jason noted that this made
sense with the simplest example of just using a u64.
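As a rough illustration of the pointer-only idea, here is a minimal
userspace sketch in Python. It is not the KHO API: the dict standing in
for preserved memory, the dict standing in for the FDT, and the names
save_state/restore_state and "example-user" are all hypothetical. The
only external fact it relies on is that FDT property data is stored
big-endian.

```python
import struct

# Hypothetical stand-in for physical memory preserved across kexec:
# maps a physical address to the bytes a KHO user left there.
preserved_memory = {}

def encode_u64(addr):
    """Pack an address as an 8-byte big-endian value, the byte order
    used for FDT property data."""
    if not 0 <= addr < 1 << 64:
        raise ValueError("address does not fit in a u64")
    return struct.pack(">Q", addr)

def decode_u64(value):
    """Unpack an 8-byte big-endian property value into an address."""
    if len(value) != 8:
        raise ValueError("expected exactly 8 bytes")
    return struct.unpack(">Q", value)[0]

def save_state(tree, user, addr, blob):
    """Old kernel side (hypothetical): leave the blob in preserved
    memory and record only its address in the tree."""
    preserved_memory[addr] = blob
    tree[user] = encode_u64(addr)

def restore_state(tree, user):
    """New kernel side (hypothetical): follow the pointer; interpreting
    the returned bytes is left entirely to the user."""
    return preserved_memory[decode_u64(tree[user])]

tree = {}  # stand-in for the FDT: property name -> raw property bytes
save_state(tree, "example-user", 0x1_0000_0000, b"opaque user state")
print(tree["example-user"].hex())
print(restore_state(tree, "example-user"))
```

The tree carries a fixed 8 bytes per user regardless of how much state
is preserved, which is the property the suggestion is after.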
James noted that one very attractive feature of storing everything
directly in the FDT, while acknowledging the size limitation, was that
the state can be dumped for debugging purposes. The ability to dump this
state would still be possible, but with more complex parsing.
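To make the debugging tradeoff concrete, here is a small Python sketch
contrasting the two dump paths. Everything in it is an assumption made
for illustration (the dict-based tree and memory, the parser registry,
and the "user-a" entry); it does not reflect any existing tool.

```python
import struct

def dump_inline(tree):
    """All state lives in the tree, so a generic dump needs no
    knowledge of any user's format."""
    return [f"{name}: {value.hex()}" for name, value in sorted(tree.items())]

def dump_pointers(tree, memory, parsers):
    """Only a u64 address is stored per user, so the dump must follow
    it and hand the bytes to a user-specific parser, if one is known."""
    lines = []
    for name, value in sorted(tree.items()):
        addr = struct.unpack(">Q", value)[0]
        parse = parsers.get(name)
        detail = parse(memory[addr]) if parse else "<no parser registered>"
        lines.append(f"{name}: @{addr:#x} -> {detail}")
    return lines

# Hypothetical example data: the same two bytes of state, stored inline
# in one tree and behind a pointer in the other.
inline_tree = {"user-a": b"\x00\x2a"}
pointer_tree = {"user-a": struct.pack(">Q", 0x2000)}
memory = {0x2000: b"\x00\x2a"}
parsers = {"user-a": lambda blob: f"count={int.from_bytes(blob, 'big')}"}

print(dump_inline(inline_tree))
print(dump_pointers(pointer_tree, memory, parsers))
```

The pointer-based dump only stays readable for users that supply a
parser, which is the extra parsing complexity mentioned above.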
There was not full alignment, so James suggested following up with Mike
and Alex Graf on this topic on the mailing list. Jason suggested
separating this topic entirely from KHO.
----->o-----
Jason suggested that if VFIO or iommufd were users of LUO, then the case
for upstreaming, as well as how to address many of the questions raised
in the discussions about it, would be much clearer.
----->o-----
The next meeting will be on Monday, April 21 at 8am PDT (UTC-7); everybody
is welcome: https://meet.google.com/rjn-dmzu-hgq
Topics I think we should cover in the next meeting:
 - finalize decision on everything being preserved by fds (complex
   solution) or recreating state on the other side of kexec
   + discuss Live Update Orchestrator (LUO) based on RFC patches to
     define the state machine
 - update on next steps for fdbox
   + is this going to be pursued separately or as part of LUO
     * does this support obsolete the need for guest_memfd in the long
       term
   + allocating swiotlb in low memory and any other device requirements
 - finalize decision on storing u64 in the KHO FDT to point to memory
   without storing all state directly in the FDT itself
 - alignment on memblock as the first use case for KHO to justify
   upstreaming, including ftrace use cases
   + update on Mike's patch series for memory reservation
 - discuss how KSTATE plays into KHO upstreaming and complementary or
   overlapping goals
 - decoupling 1GB pages for hugetlb, guest_memfd, and memfds and how fds
   can be added to an fdbox
 - iommufd patch series (as well as qemu) from James
 - establishing an API for callbacks into drivers to serialize state
   during brownout
 - reducing blackout window during live update
 - testing methodology for these components, including selftests
Please let me know if you'd like to propose additional topics for
discussion, thank you!