[RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update

Paolo Bonzini pbonzini at redhat.com
Thu Apr 30 06:28:51 PDT 2026


I have some very similar observations to Alex and some very similar 
observations to David.  This has to imply that everyone will agree with 
me. :)

Seriously, the main contention point, from reading the thread, is the 
placement and lifecycle of the caretaker.  More on this later...

On 4/29/26 00:29, Pasha Tatashin wrote:
> While this proposal focuses on its critical role in minimally disruptive
> Live Update, the Caretaker is fundamentally designed as an extensible
> primitive. Its architecture allows it to be leveraged for a variety of
> other advanced virtualization use cases, such as running custom
> lightweight hypervisors or completely offloading virtualization duties
> to an accelerator card.

One step at a time please---and as an initial step, just place it inside 
the kernel, a la Arm nVHE.

Since your design would anyway have the ability to update the caretaker,
you can embed that part into the reattachment process, so that the new
kernel can use its own caretaker.

This greatly reduces the need to establish a stable-ish ABI.  Only the
handover (kexec/LUO) needs to be stable, so that the new kernel can 
populate its kvm and kvm_vcpu structs.  And for that we mostly have a 
solution already: a stream of serialized ioctls.
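
To make this concrete, a record in that stream could look something
like the sketch below.  The layout and names are made up here (none of
this is from the proposal); it is only meant to show that the stable
surface can be as small as "ioctl number plus argument blob":

#include <linux/types.h>

/*
 * Hypothetical handover record: the old kernel emits one of these per
 * state-setting ioctl (KVM_SET_REGS, KVM_SET_MSRS, ...); the new
 * kernel walks the stream and replays each record against its freshly
 * created struct kvm / struct kvm_vcpu.
 */
struct kvm_lu_ioctl_record {
	__u32 ioctl;		/* ioctl number to replay */
	__s32 vcpu_id;		/* target vCPU, or -1 for VM-scope ioctls */
	__u64 payload_size;	/* length of the argument blob below */
	__u8  payload[];	/* ioctl argument, copied verbatim */
} __packed;

Only this stream (plus the KHO handover of guest memory) has to stay
stable across kernel versions; everything else, caretaker included, can
change freely.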

> During the execution of the KVM_SET_CARETAKER ioctl, instead of
> pointing the hardware's return path to standard KVM entry points (e.g.,
> vmx_vmexit or svm_vcpu_run), KVM reprograms the host-state return area
> of the CPU's hardware virtualization control structures (e.g., Intel
> VMCS, AMD VMCB, or ARM equivalent) to point directly into the
> bare-metal Caretaker environment.

This can be done unconditionally for all VMs based on a module 
parameter, again as in Arm nVHE.
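
For instance (the parameter name and the caretaker entry point are made
up; the HOST_RIP write is roughly what vmx_set_constant_host_state()
does today):

static bool __read_mostly enable_caretaker;
module_param(enable_caretaker, bool, 0444);

	/* sketch: while setting up the constant host state of a VMCS */
	if (enable_caretaker)
		vmcs_writel(HOST_RIP, (unsigned long)caretaker_entry);
	else
		vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit);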

> Note on Optimization vs. Security: Constantly switching the page table
> (CR3) on every VM Exit can be expensive due to TLB flushing. To
> optimize performance, the Caretaker can share the host kernel's page
> tables while the kernel is still around, and dynamically replace
> HOST_CR3 with the dedicated, isolated page tables only when the vCPU is
> orphaned (during the detachment phase). On the other hand, maintaining
> a permanently isolated CR3 for the Caretaker adds a strong security
> boundary, achieving hardware-enforced separation similar to KVM Address
> Space Isolation (ASI).

Agreed on this.

> The Caretaker requires a defined ABI to communicate with the host KVM
> subsystem. This ABI is implemented via the shared, identity-mapped .ccb
> section of the ELF payload, acting as the Caretaker Control Block
> (CCB).
> 
> The CCB acts as the source of truth for the Caretaker's execution loop
> and contains three primary elements:
> 
>    * Attachment State Flag: An atomic variable indicating the current
>      relationship with the host KVM subsystem (e.g., KVM_ATTACHED or
>      KVM_DETACHED).

This must be done atomically at the time Linux offlines/onlines a pCPU. 
The interface from Linux to the caretaker must use some kind of IPI so 
that the new kernel can force a VMEXIT (if needed) in the caretaker, ask 
it to serialize the VM state, and pass it down to the new kernel's
caretaker.
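
Something along these lines, perhaps (all names invented, just to show
the shape of the handshake):

enum ccb_state { CCB_KVM_ATTACHED, CCB_KVM_DETACHED };

struct caretaker_ccb {
	atomic_t	state;		/* CCB_KVM_ATTACHED / CCB_KVM_DETACHED */
	atomic_t	serialize_req;	/* set by the new kernel before the IPI */
	void		*state_area;	/* KHO-preserved page for the vCPU dump */
};

/*
 * Caretaker side: the IPI from the new kernel shows up as an
 * external-interrupt VMEXIT, which lands in the caretaker because the
 * host-state return area points there.  Dump the vCPU state and park
 * until the new kernel flips the flag back.
 */
static void caretaker_check_serialize(struct caretaker_ccb *ccb)
{
	if (!atomic_read(&ccb->serialize_req))
		return;

	caretaker_dump_vcpu_state(ccb->state_area);	/* hypothetical */
	while (atomic_read(&ccb->state) != CCB_KVM_ATTACHED)
		cpu_relax();
}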

>    * KVM Routing Pointers: The physical function pointers that the
>      Caretaker uses to safely jump into the host KVM's standard VM Exit
>      handlers when operating in normal mode.
>    * Shared Configuration Metadata: A physical pointer to dedicated
>      memory pages used by the kernel to share dynamic vCPU configuration
>      data with the Caretaker. Because every guest is configured
>      differently, KVM populates these pages with the specific parameters
>      negotiated during VM initialization (such as CPUID feature masks,
>      APIC routing, and timer states). These pages also include a
>      pre-allocated Telemetry Buffer for the Caretaker to log VM Exits
>      and spin-wait durations. These dedicated pages are explicitly
>      preserved across the host reboot via KHO, ensuring the Caretaker
>      maintains continuous access to the exact context required to
>      accurately emulate trivial exits during the gap.

All this is mostly unnecessary if the caretaker is provided by the 
kernel.  The recently introduced remote ring buffers can be used for 
tracing too.

> The Caretaker first evaluates the VM Exit reason. If the exit belongs to
> a category that the Caretaker is programmed to resolve natively, it
> handles it internally. For example, profiling of guests has identified
> the following exit categories for potential local resolution:
> 
>    * Guest Idle Exits (e.g., HLT): When the guest OS goes idle, it
>      triggers idle exits. The Caretaker intercepts these and halts the
>      physical core until the next guest-bound interrupt fires, preserving
>      host power.

I don't think HLT can be handled entirely here.  Either you skip the 
exit completely or you have to go out to the scheduler.  The HLT exit 
could be skipped unconditionally for an orphaned VM, but while there is 
a running kernel the caretaker has to run entirely with interrupts off,
and that limits what you can do.

In fact there is already a blueprint of what can be handled easily in 
the caretaker, namely 
vmx_exit_handlers_fastpath()/svm_exit_handlers_fastpath().  Stick to 
what exists already.
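
For reference, the shape of that blueprint is roughly this (simplified
from arch/x86/kvm/vmx/vmx.c, not a verbatim copy):

static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
{
	switch (to_vmx(vcpu)->exit_reason.basic) {
	case EXIT_REASON_MSR_WRITE:
		/* TSC-deadline and x2APIC ICR writes, with IRQs still off */
		return handle_fastpath_set_msr_irqoff(vcpu);
	case EXIT_REASON_PREEMPTION_TIMER:
		return handle_fastpath_preemption_timer(vcpu);
	default:
		return EXIT_FASTPATH_NONE;	/* everything else: full exit */
	}
}

Whatever the caretaker handles natively should start as a subset of
this list, not a superset.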

>    * Timer and APIC Exits: Even an idle guest frequently writes to
>      interrupt controllers and system registers to configure internal
>      timers. The Caretaker handles these trivial writes directly,
>      acknowledging the timer updates.

This depends heavily on the implementation of the hypervisor: for
example, it can be done on Intel via the preemption timer, but not on
AMD, where an actual hrtimer is needed.
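
Very roughly, the two paths look like this today (sketch; compare
vmx_set_hv_timer() and the hrtimer-based code around apic_timer_fn()
in arch/x86/kvm/lapic.c):

	/* Intel: the deadline is armed in the VMCS itself, so expiry
	 * comes back as a preemption-timer VMEXIT that a caretaker (or
	 * the existing fastpath) can consume without the host kernel: */
	vmcs_write32(VMX_PREEMPTION_TIMER_VALUE,
		     delta_tsc >> cpu_preemption_timer_multi);

	/* AMD: no such counter, so KVM arms a host hrtimer, which
	 * presumes a live host kernel to fire it: */
	hrtimer_start(&apic->lapic_timer.timer, expire, HRTIMER_MODE_ABS_HARD);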

[...]

> When the new VMM process spawns, it retrieves the
>     preserved session and issues LIVEUPDATE_SESSION_RETRIEVE_FD using
>     its token. LUO invokes KVM's .retrieve() callback to map the
>     preserved vcpufd back into the new VMM's file descriptor table. As
>     part of this retrieval process, the host formally brings the
>     isolated pCPU back online, and the new VMM userspace thread is
>     attached back to the active VM thread running on the vCPU. Finally,
>     KVM populates the new KVM Routing Pointers in the CCB and
>     atomically flips the Host State Flag back to KVM_ATTACHED. This
>     breaks the Caretaker's spin-wait loop (if it is in this state),
>     allowing standard KVM operation to resume.

This would also include some kind of serialization of the old VM into 
the new kernel's struct kvm_vcpu.

Also, some kind of feature negotiation is needed (if that fails, the
VMs are terminated unceremoniously), so I believe that the transition
into and out of the gap must be synchronous.  For example, with
INIT/SIPI for the entry and an IPI for the exit?
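
A hypothetical sketch of that negotiation, done at retrieve time before
the flag is flipped back to KVM_ATTACHED (names and fields invented):

struct kvm_lu_compat {
	__u32 abi_version;	/* bumped on incompatible handover changes */
	__u64 features;		/* bitmap of what the old kernel relied on */
};

static int kvm_lu_check_compat(const struct kvm_lu_compat *old)
{
	if (old->abi_version != KVM_LU_ABI_VERSION)		/* hypothetical */
		return -EINVAL;
	if (old->features & ~KVM_LU_SUPPORTED_FEATURES)		/* hypothetical */
		return -EINVAL;		/* VM gets terminated unceremoniously */
	return 0;
}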

> Guest-to-Guest IPIs
> -------------------
> 
>    * The Problem: If the guest OS attempts to wake up a sleeping thread,
>      one orphaned vCPU will send an Inter-Processor Interrupt (IPI) to
>      another orphaned vCPU. In standard virtualization without hardware
>      assistance, writing to the APIC ICR (or sending an ARM SGI) causes
>      a VM Exit so the host KVM can emulate the message delivery. During
>      the gap, KVM is unavailable to route this message.
> 
>    * Proposed Solution: The architecture may leverage hardware virtualized
>      interrupts (Intel APICv, AMD AVIC, or ARM GICv4.1 virtual SGIs).
>      This allows the hardware silicon to handle IPI delivery between the
>      isolated pCPUs natively, eliminating the VM Exit. Alternatively,
>      the Caretaker can be programmed to emulate the IPI delivery. By
>      utilizing the shared memory metadata, the Caretaker can determine
>      the target vCPU and directly update its pending interrupt state.

Yeah, I think APIC emulation to some extent must be moved into the 
VMX/SVM fastpaths.  The good news is that this can already be done as a
PoC without needing the whole caretaker and LUO infrastructure.
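
A PoC could start from the existing fastpath for x2APIC ICR writes and
grow it towards what the gap needs.  Very roughly (simplified from
handle_fastpath_set_x2apic_icr_irqoff() in arch/x86/kvm/x86.c; only
no-shorthand, fixed-delivery, physical-destination IPIs are taken on
the fast path, everything else still goes the slow way):

static int fastpath_x2apic_icr_write(struct kvm_vcpu *vcpu, u64 data)
{
	if (((data & APIC_SHORT_MASK) == APIC_DEST_NOSHORT) &&
	    ((data & APIC_MODE_MASK)  == APIC_DM_FIXED) &&
	    ((data & APIC_DEST_MASK)  == APIC_DEST_PHYSICAL)) {
		kvm_apic_send_ipi(vcpu->arch.apic, (u32)data, (u32)(data >> 32));
		return 0;
	}

	return 1;	/* not handled, take the full exit */
}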

>    * The Problem: What happens if a Non-Maskable Interrupt (NMI), a
>      hardware timer tick, or a Machine Check Exception / System Error
>      (MCE / ARM SError) arrives while the CPU is actively executing
>      Caretaker code in KVM_DETACHED mode?
> 
>    * Proposed Solution: To safely handle these asynchronous events, [...]
>      on x86, when transitioning into the gap, KVM explicitly programs
>      HOST_IDTR and HOST_GDTR to [the caretaker's] tables.

Agreed, and this also shows that the transition must be synchronous.

>    * The Problem: As the guest executes, it may attempt to access memory
>      that has not yet been mapped by the hypervisor, or it may interact
>      with MMIO regions. Normally, this triggers an EPT Violation (Intel)
>      or NPT Page Fault (AMD), prompting KVM to allocate host pages and
>      update the secondary page tables. How are these updates handled
>      when the host KVM subsystem is offline during the gap?
> 
>    * Proposed Solution: During the "Management Gap," there are absolutely
>      no updates made to the NPT/EPT. The existing secondary page tables
>      are fully preserved in memory via LUO kvmfd preservation prior to
>      detachment, allowing the guest to seamlessly access all previously
>      mapped memory. If the guest triggers a new page fault (requiring an
>      NPT/EPT update) during the gap, the Caretaker simply categorizes it
>      as a Blocking Exit.

Yes, by default everything is a blocking exit.  In particular, unless 
one day we do x86/pKVM, page tables can be handled entirely by Linux 
rather than by the caretaker, with no change to the existing MMU
notifier architecture.

As a consequence, the caretaker is absolutely not going to be a TCB---at 
least not in the beginning.

> Compromised Caretaker
> ---------------------
> 
>    * The Problem: The Caretaker runs in Host Mode. If left unprotected,
>      this could allow a lightly privileged userspace process (e.g., QEMU
>      or crosvm) to inject arbitrary executable code directly into the
>      CPU's most privileged hardware state (VMX Root / Ring 0 / EL2).
> 
>    * Proposed Solution: To mitigate this risk, the KVM_SET_CARETAKER
>      ioctl may adopt the security model used by the kexec_file_load()
>      syscall. Rather than trusting userspace to pass physical addresses,
>      the kernel must take full ownership of payload validation:

-EOVERENGINEERED.  Just shove it into the kernel.

> Caretaker Update
> ----------------
> 
>    * The Problem: Given that the Caretaker is permanently installed
>      during VM setup, how does it get updated on long-running VMs?

Via kexec. :)  I understand you have bigger plans, but we need to crawl 
before walk^Wattempting a marathon.

I even wonder if, for long-term simplicity, the host->caretaker
interface should just be for the caretaker to swallow the host
into non-root mode, again as in Arm nVHE.  That would make it much 
harder to implement some kind of live update, but my answer to that 
*really* is just to use kexec.

Paolo