[Hypervisor Live Update] Notes from May 5, 2025
David Rientjes
rientjes at google.com
Sat May 17 21:07:52 PDT 2025
Hi everybody,
Here are the notes from the last Hypervisor Live Update call that happened
on Monday, May 5. Thanks to everybody who was involved!
These notes are intended to bring people up to speed who could not attend
the call as well as keep the conversation going in between meetings.
----->o-----
We discussed reviews of the latest series of KHO changes. Pasha noted
that Dave Hansen's feedback should now be fully addressed. Small changes
were requested that could be handled incrementally on top. Changyuan
echoed that an incremental x86 patch could be posted on top, since KHO
v7 is currently in mm-unstable.
I asked if there were any major blockers to the series and there did not
appear to be. Jason suggested that we just leave the current status of
the series as-is so that it can graduate to mm-stable and then eventually
to upstream.
It was noted that compound folios may not be handled correctly, based on
feedback from Mike Rapoport upstream. Changyuan said there is currently
an issue where KHO has difficulty preserving a folio as many times as it
is referenced; the same issue would likely apply to compound folios.
Jason suggested that if a memfd needs to be preserved by multiple
processes, they should still share a single fd so that it is not
preserved twice.
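To make the refcounting concern concrete, here is a minimal sketch in
plain C of deduplicating preservation requests so that a folio
referenced multiple times is serialized exactly once. This is not the
actual KHO interface; kho_preserve_folio() and struct preserve_entry
are hypothetical names used only for illustration.

/*
 * Minimal sketch, assuming a deduplicating registry: a folio referenced
 * N times is serialized exactly once. Hypothetical names, not the
 * actual KHO interface.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct preserve_entry {
	unsigned long pfn;	/* folio identified by its base pfn */
	unsigned int refs;	/* how many holders asked to preserve it */
};

#define MAX_ENTRIES 64
static struct preserve_entry table[MAX_ENTRIES];
static size_t nr_entries;

/* Returns true if the caller should serialize the folio now; false if
 * it was already preserved (only the refcount is bumped). */
static bool kho_preserve_folio(unsigned long pfn)
{
	for (size_t i = 0; i < nr_entries; i++) {
		if (table[i].pfn == pfn) {
			table[i].refs++;
			return false;
		}
	}
	if (nr_entries == MAX_ENTRIES)
		return false;
	table[nr_entries].pfn = pfn;
	table[nr_entries].refs = 1;
	nr_entries++;
	return true;
}

int main(void)
{
	/* Two holders referencing the same memfd folio at pfn 0x1000. */
	printf("first preserve:  %d\n", kho_preserve_folio(0x1000));
	printf("second preserve: %d\n", kho_preserve_folio(0x1000));
	return 0;
}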
----->o-----
KHO is now at a point where it should be possible to build iommufd and
memfd support on top of it. Pasha said that once the compound folio
concern is addressed, and now that all of Dave's feedback has been
handled, everything else should be able to land incrementally on top of
the framework.
I started discussing the next phase of development for the KHO framework.
Pasha said that LUO is the next thing that should land upstream as it
controls the lifecycle. For KHO itself, additional support for other
architectures, bug fixes, and scalability improvements are always
possible.
----->o-----
Pasha said that he would be sending out an RFC v2 for LUO. The first
user of LUO would be the memfd support that Changyuan is working on; he
is porting fdbox support on top of LUO. Pratyush asked for the patches
to be sent out early so he could fix some issues in his own work ahead
of time. Pasha noted that he would present the LUO design in the next
biweekly meeting to provide more visibility for the rest of the group.
We briefly chatted about guestmemfs and whether it still has a future
after LUO. Until we hear more about additional requirements that
necessitate a guestmemfs, we'll drop the topic from subsequent meetings.
----->o-----
Andrey discussed with Chris Li the next steps for KSTATE on top of KHO.
Chris said that he would work with Andrey on state saving in the PCI
code; he felt that KSTATE was a solid direction for PCI. FDT is designed
for saving information, not for serializing state. In LUO there are many
recursive FDT objects, which conflicts with the need to store everything
in one big save-state structure. Chris suggested that objects would need
to support pointers.
Pasha asked about recursive FDT with LUO and said this was not the plan;
LUO itself does not necessarily care. He supported using KSTATE for
this. LUO supports an 8-byte pointer for preservation, and Chris said
they are shifting to a tree-like structure branched off of this 8-byte
pointer.
Chris said that for every object we store, there will be a description
of each member's ID and type. Chris said this needed to be stored as
part of the binary format so that the new kernel can understand what the
old kernel was using, including for rollback. He also noted that FDT
does not support descriptions of the acceptable ranges that members can
take on very well (and a version number may be inappropriate for
describing this).
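As a rough illustration of that idea, here is a hedged sketch of a
self-describing member descriptor. The names and layout below are
assumptions for illustration only, not the posted KSTATE format.

/*
 * Sketch of a self-describing member descriptor, loosely in the spirit
 * of the KSTATE discussion; hypothetical names and layout.
 */
#include <stdint.h>

enum kstate_member_type {
	KSTATE_U32,		/* plain 32-bit scalar */
	KSTATE_U64,		/* plain 64-bit scalar */
	KSTATE_PTR,		/* 8-byte pointer to another saved object */
	KSTATE_ARRAY,		/* fixed-length inline array */
};

/* One entry per member, stored alongside the data so the new kernel can
 * interpret (or reject) what the old kernel saved, including on
 * rollback. */
struct kstate_member_desc {
	uint32_t id;		/* stable member ID, never reused */
	uint16_t type;		/* enum kstate_member_type */
	uint16_t flags;
	uint32_t offset;	/* offset of the member in the saved blob */
	uint32_t size;		/* size in bytes (element size for arrays) */
	uint64_t min, max;	/* acceptable value range, which FDT cannot
				 * express well per the discussion */
};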
Jason said that per-member schemas would probably be very complex. Chris
said they would be needed. Jason suggested that per-field versioning
would be far too granular, and he was skeptical that rollback is
something the upstream kernel should support; if a CSP wants to deploy
v2 -> v3, that is entirely deferred to them. Jason suggested instead
having very coarse versioning that captures everything.
Chris said it would be important for a vendor to be able to add their own
versions.
Jason noted that it would not be possible to enable a new feature in the
fleet until the CSP is no longer willing to roll back. There may be some
minor exceptions to this, but most features will need to remain off
until they are fully deployed and rollback is no longer possible. Amit
Shah agreed that a feature must first be available but not enabled; once
it is available everywhere, it can be enabled throughout the fleet.
David Matlack observed the similarities with how KVM features are
handled today and agreed.
The amount of state to preserve across kexec would have to be minimal and
updating versions should be rare. Andrey agreed with this.
Pasha asked what happens when enabling a new feature runs into issues in
the fleet and whether reboot would be the only way to recover. Jason
said this should go through the VMM so that the only way to enable a new
feature is when a VM restarts; once that is committed, the feature stays
in place until the VM reboots.
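A tiny sketch of that "available but not enabled until rollback is off
the table" rule might look like the following; the struct and function
names are hypothetical, not anything from the discussion's code.

/*
 * Sketch, assuming hypothetical names: a feature may only be turned on
 * when a VM (re)starts, and only once the whole fleet can serve it;
 * after commit it is pinned until the VM reboots.
 */
#include <stdbool.h>
#include <stdio.h>

struct vm_features {
	bool feature_x_committed;	/* chosen at VM (re)start, then fixed */
};

static bool vm_start_enable_feature_x(struct vm_features *f,
				      bool fleet_fully_deployed)
{
	if (!fleet_fully_deployed)
		return false;	/* rollback still possible: keep it off */
	f->feature_x_committed = true;	/* sticky until the VM reboots */
	return true;
}

int main(void)
{
	struct vm_features f = { 0 };

	/* Off while the fleet can still roll back... */
	printf("%d\n", vm_start_enable_feature_x(&f, false));
	/* ...on once the feature is deployed everywhere. */
	printf("%d\n", vm_start_enable_feature_x(&f, true));
	return 0;
}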
----->o-----
Jason noted that he posted his first patch series to make the iommu page
tables common[1], which could become part of the KHO work. Pasha said
this was exciting to see and that it would be possible to add page table
checks for this.
This consolidated iommu page table series would benefit from community
review, so people are strongly encouraged to take a look.
----->o-----
Pasha noted that support would be sent upstream soon for managing dev
dax as 1GB shards, similar to how hugetlb is managed. Currently under
review is support for sharding pmem regions into arbitrary lengths.
There is also support to provision fsdax by default, with dev dax
optionally defined on the kernel command line. This would avoid the need
for a separate tool.
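As a purely illustrative example of command-line-driven provisioning,
here is a sketch of parsing a hypothetical parameter such as
devdax_shards=4; the parameter name and its semantics are assumptions,
not the posted interface.

/*
 * Illustrative only: parse a hypothetical "devdax_shards=<n>" boot
 * parameter requesting n 1GB dev dax shards, with the remainder of the
 * region provisioned as fsdax by default.
 */
#include <stdio.h>

static unsigned int devdax_shards;	/* 0 means all fsdax (default) */

static int parse_devdax_shards(const char *arg)
{
	if (sscanf(arg, "devdax_shards=%u", &devdax_shards) != 1)
		return -1;
	return 0;
}

int main(void)
{
	/* The kernel would do this while parsing the command line. */
	if (parse_devdax_shards("devdax_shards=4") == 0)
		printf("provision %u x 1GB dev dax shards, rest fsdax\n",
		       devdax_shards);
	return 0;
}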
----->o-----
Frank van der Linden discussed his physical pool allocator. His concept
is a common layer that is separate from hugetlb; the topic has come up a
number of times. For example, if a memfd needs to be backed by 1GB
pages, a physical pool allocator could provide them, as it could for
guest_memfd. This would decouple 1GB pages from hugetlb entirely.
His series provides a common allocator for physical memory, and he will
be sending an RFC prototype soon. The 1GB pages come from a static pool
or from a dynamic pool and can even be removed from the kernel direct
map. The goal is to send this out in June.
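For illustration, here is a minimal sketch of a static physical
1GB-page pool in that spirit; every name below is hypothetical and none
of it reflects Frank's actual series.

/*
 * Minimal sketch of a static physical 1GB-page pool; hypothetical names.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GB		(1024ULL * 1024 * 1024)
#define POOL_PAGES	8	/* static pool of 8 x 1GB pages */

struct phys_pool {
	uint64_t base;			/* physical base of the pool */
	bool in_use[POOL_PAGES];	/* allocation bitmap */
};

/* Hand out one 1GB-aligned physical page, or 0 if the pool is empty.
 * Consumers (hugetlb, guest_memfd, memfd) would map it themselves and
 * could additionally drop it from the kernel direct map. */
static uint64_t phys_pool_alloc_1g(struct phys_pool *p)
{
	for (size_t i = 0; i < POOL_PAGES; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = true;
			return p->base + i * GB;
		}
	}
	return 0;
}

static void phys_pool_free_1g(struct phys_pool *p, uint64_t pa)
{
	size_t i = (pa - p->base) / GB;

	if (i < POOL_PAGES)
		p->in_use[i] = false;
}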
Jason noted that we should avoid having to store information about the
vmemmap across kexec if at all possible.
----->o-----
Next meeting will be on Monday, May 19 at 8am PDT (UTC-7), everybody is
welcome: https://meet.google.com/rjn-dmzu-hgq
Topics for the next meeting:
- 20 min: presentation of LUO v2 design
- check back on the latest status of the KHO series in the mm staging
trees and any pending concerns
+ including possible refcount issues for compound folios across KHO
- possibility of a Live Update Microconference for LPC this year
- discuss support for sharding of dax devices into arbitrary lengths
- discuss support for defaulting to fsdax with optional devdax as
needed, provisioned by the kernel via the command line without
additional tooling (ndctl)
- update on physical pool allocator that can be used to provide pages
for hugetlb, guest_memfd, and memfds
- SEV-SNP support for preserving guest memory and what foundational
components AMD can depend on, building on top of KHO v6 or KSTATE
- later: testing methodology to allow downstream consumers to qualify
that live update works from one version to another
- later: reducing blackout window during live update
Please let me know if you'd like to propose additional topics for
discussion, thank you!
[1] https://marc.info/?l=linux-doc&m=174645437711873&q=mbox