[Hypervisor Live Update] Notes from June 2, 2025

David Rientjes rientjes at google.com
Sat Jun 14 20:18:42 PDT 2025


Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened 
on Monday, June 2.  Thanks to everybody who was involved!

These notes are intended to bring people up to speed who could not attend 
the call as well as keep the conversation going in between meetings.

----->o-----
We chatted about LUO v2 and the feedback on the upstream mailing list.
There were also some functionality changes proposed for KHO.  After the
comments for LUO v2 are addressed, Pasha noted he will send the series to
Pratyush, who will add the memfd preservation on top for LUO v3.  This
will then be sent as a complete package.  I asked if Jason had had the
opportunity to take a look at it; he had set it aside but not looked at
it in detail yet.

Pratyush planned on doing the review of LUO this week.  He also noted
that there were some improvements made for memfd preservation since the
last biweekly and that it is in good shape.  He felt that it was ready
for integration into the LUO series itself.  A lot of testing was also
added for libluo, which is suggested for inclusion in the kernel tree
itself under tools/.

David Matlack noted he was happy to see the test binaries moving into the
kernel and into selftests.  David's VFIO tests would also need some of
this functionality.

----->o-----
I asked about the entire scope of libluo and forward-looking thoughts
for it.  Pratyush noted it is currently only a very thin wrapper on top
of ioctls.  This could be expanded for more orchestration in the future
if necessary, but there are no immediate plans.  Right now, it's just a
set of ioctls, tests, and a command-line tool.

Pasha noted one extension could be asynchronous fd preservation that
might be done in the future with new ioctls, which could also be a libluo
extension.

----->o-----
Jason asked if the group was comfortable with the current LUO v2 design
or if there were any major concerns: we have the ioctls and the file
descriptors, it eliminates the fdbox work, it reduces the need for
guestmemfs, it changes KHO, etc.  David Matlack asked about the
requirement of CAP_SYS_ADMIN to preserve fds; Jason suggested using fd
permissions instead.

Pratyush suggested there were some complexities with fd permissions in
the next kernel; Pasha echoed that it was likely safer to require
CAP_SYS_ADMIN, at least for now.  David expressed concern about a
malicious process potentially being able to take over an fd unless root
permissions were required.  Jason offered a somewhat different angle:
for the character device itself, ioctl permissions are typically
protected by the fd itself, so we may not need a capability check on top
of that.

Jason suggested a broker agent that would make it possible to enforce
policy in userspace via fd passing: it would issue an ioctl to the
kernel and then pass the resulting fd to the less privileged entities.
Filesystem permissions are much more flexible than capabilities.

I asked about future use cases where we may want capability checks;
Pasha noted we want to be able to change the global state for the
prepare stage and asked if that should be capability protected.  David
suggested it could potentially be a separate character device.  If this
was needed later, it could be an incremental add-on.

----->o-----
Pasha brought up the lifecycle of a file descriptor if a process dies
or quits.  While the VMM is running, it can add an fd to LUO, but what
happens if the VMM exits?  Do we explicitly remove the fd from
preservation before going into the finish state?  He suggested that it
should be automatically removed: we should never preserve an fd for a
process that has exited.  Jason suggested the kernel should not be doing
that, and that this is one reason we may want a security domain: the
broker agent could cancel all state associated with that process.

Pasha asked what would happen if the agent itself dies; Jason suggested
the kernel should fully clean everything up.  Pasha acknowledged this was
the current plan.

Praveen Kumar asked how we would maintain state for a graceful shutdown
when the application is in the preservation state.  Pasha said that a
graceful shutdown and a non-graceful shutdown are identical from a
kernel perspective; the only difference is whether the shutdown happened
before the prepared state.  If before, the fd is not preserved; if
after, serialization has already been done and it is preserved, so if
the resource remains unclaimed it is cleaned up in the new kernel.

Pasha noted that once we have passed prepare, we are on the critical
path to a live update; we're not going to continue running in this
state.  It should be valid for the agent to exit in this state because
we cannot add new entries to the preservation list (the agent has
nothing to do after this).  Pratyush said that with LUO, when you
preserve an fd you get a token and the token must be saved; the agent
would grab this token so the handle is not lost.  Jason said that once
we terminate, we lose the ability to undo because the session is lost;
Pratyush said that we can undo with the preservation tokens.

Jason suggested that if you close the fd, the kernel should clean up
everything associated with the live update.  Pasha asked how we would
make sure the agent is not killed before the reboot; systemd may make
this more complicated.  He suggested that if the agent is killed or
exits during the prepared phase, we cannot undo and have to reboot.

----->o-----
David Matlack expressed a concern that the flow as described would fall
apart for KVM, since KVM fds cannot be transferred across processes:
they have a lot of state associated with the mm struct of the owning
process.  We'd have to dig into why this isn't allowed as a potential
extension.  Jason suggested an alternative would be an additional fd
with a container property that can only do fd save and restore within
its own container.

KVM would have to do its serialization outside of its original process
and then recreate its context after the kexec.  Jason said the KVM fd
would need to be preserved because it is threaded through all the VFIO
and IOMMU subsystems.  Pasha said we would only preserve the amount of
information needed to recreate the VMs.

----->o-----
Next meeting will be on Monday, June 16 at 8am PDT (UTC-7), everybody is
welcome: https://meet.google.com/rjn-dmzu-hgq

Topics for the next meeting:

 - discuss current status of LUO with memfd preservation and any blockers
   for upstream merge
 - discuss userspace broker agent that is responsible for the fd's, the
   ioctls, and the state machine that interacts with LUO, and any
   potential open sourcing opportunities
 - determine timelines for selftest framework for live updates, which
   could be a significant amount of work
 - check on status of VFIO selftests that will be useful for automated
   testing of device preservation
 - discuss forking off a discussion on iommu and live update that is
   separate from Hypervisor Live Update (due to scheduling constraints)
   but to include Jason and interested parties
 - June 30: update on physical pool allocator that can be used to provide
   pages for hugetlb, guest_memfd, and memfds
 - later: testing methodology to allow downstream consumers to qualify
   that live update works from one version to another
 - later: reducing blackout window during live update

Please let me know if you'd like to propose additional topics for
discussion, thank you!


