[RFC] proposal: KVM: Orphaned VMs: The Caretaker approach for Live Update
David Woodhouse
dwmw2 at infradead.org
Fri May 1 01:56:25 PDT 2026
On Fri, 2026-05-01 at 05:32 +0200, Paolo Bonzini wrote:
> On 4/30/26 17:27, David Woodhouse wrote:
> > On Thu, 2026-04-30 at 15:28 +0200, Paolo Bonzini wrote:
> > > I even wonder if, for long term simplicity, the interface for
> > > host->caretaker should be just for the caretaker to swallow the host
> > > into non-root mode, again as in Arm nVHE.
> >
> > There's a lot of merit in that approach.
> >
> > I talked about wanting to use this 'caretaker' for secret hiding. But
> > why have *voluntary* secret hiding, with the kernel hiding things from
> > its own address space, when you can have *mandatory* secret hiding
> > with something running in EL2, like pKVM.
>
> Well, other than because it's a lot of work? :)
If we avoided those things then we'd never have any fun!
And in a week where there seems to be a new user-to-root exploit posted
every day, the 'deprivilege the VMM and assume the guest has owned it'
security model is looking rather scary. So the additional defence in
depth of knowing that even *root* can't get the kernel to access other
guests' memory might be the only thing that lets you sleep at night :)
Yes, it's a lot of work. But I think we've reached the point where
mandatory secret hiding is... well... mandatory.
> > The *userspace* ABI considerations are all about how you make a vCPU
> > that runs asynchronously (should it conceptually just be an async
> > KVM_RUN call, which allows the vCPU to run in a kernel thread up to the
> > point of kexec? Why is it fundamentally tied to kexec at all?).
>
> It's not tied to kexec. kexec is just forcing a handoff + forcing an
> update.
>
> The big difference is that:
>
> 1) if you don't tie it to kexec, a detached vCPU thread is a struct
> vhost_task and a blocking vmexit schedules out the thread; while during
> kexec you have s/kthread/pCPU/ and halting the CPU instead of scheduling
> it out.
For now maybe. But "how does the caretaker do scheduling" is definitely
on the list of future problems, for any environment where a physical
host with N pCPUs is hosting >= N vCPUs.
(In the case of a true mandatory-secret-hiding caretaker at EL2, the
scheduling part *could* be done by the residual purgatory-caretaker-
thing at EL1 that all the secondary CPUs go to instead of being turned
off. It would just be calling into EL2 to run the actual vCPUs. Thus
leaving the EL2 code just to do its *one* job, which has the added
benefit that the automated reasoning people put the knives down and no
longer have that look in their eyes that they got when they thought you
wanted to put a scheduler in their formally-proven EL2 code...)
> 2) if you don't tie it to kexec, address space isolation is the only
> real reason for the complication of treating the caretaker as a separate
> bare metal program. OTOH maybe that's a feature - you could do:
>
> - ioctl(KVM_RUN_ASYNC)
>
> - then vmfd/vcpufd handoff to a new mm on top
This much gives you a seamless upgrade of the userspace VMM without
having to play fd-handover tricks. The old VMM detaches, the new one
attaches. If you're quick, and the guests aren't doing much "admin"
work but only passing traffic through passthrough PCI devices, the
guests might not experience any noticeable steal time at all.
> - then address space isolation on top
Even voluntary secret hiding lets you sleep at night when the next
Retbleed happens.
> - then kexec (de)serialization on top
... and this one is the holy grail.
So yes, that's exactly the kind of thing I was thinking, rather than
trying to boil the ocean. There are sensible milestones along the way
which give practical benefits.
But my point was *also* about understanding the actual userspace
interface for this, even if we were to just focus on the live update
and do it all in one amphetamine-and-tokens-fueled epic. What does it
even look like, from the VMM point of view? How does the new VMM under
the new kernel 'reattach' to the existing vCPUs?
I think we need the userspace API concepts for 'detach' and 'attach',
including the permissions model for reattach, and we might as well
implement and test them without the kexec in the middle to start with.
> > I'd love to start without kexec in the picture at all. Just show me the
> > KVM API for starting a *confidential* guest (pKVM, SEV-SNP, whatever),
> > leaving it running, completely stopping the VMM and then starting a new
> > VMM to pick up from where it left off.
>
> Why confidential?
Mostly so that confidential VMs aren't an *afterthought*, and the
design of the detach/attach userspace ABI gets them right from the
start.