[RFC 00/18] Pkernfs: Support persistence for live update

Gowans, James jgowans at amazon.com
Wed Feb 7 06:45:42 PST 2024


Hi Jason,

Thanks for this great feedback on the approach - it's exactly the sort
of thing we were looking for.

On Mon, 2024-02-05 at 13:42 -0400, Jason Gunthorpe wrote:
> 
> On Mon, Feb 05, 2024 at 12:01:45PM +0000, James Gowans wrote:
> 
> > The main aspect we’re looking for feedback/opinions on here is the concept of
> > putting all persistent state in a single filesystem: combining guest RAM and
> > IOMMU pgtables in one store. Also, the question of a hard separation between
> > persistent memory and ephemeral memory, compared to allowing arbitrary pages to
> > be persisted. Pkernfs does it via a hard separation defined at boot time, other
> > approaches could make the carving out of persistent pages dynamic.
> 
> I think if you are going to attempt something like this then the end
> result must bring things back to having the same data structures fully
> restored.
> 
> It is fine that the pkernfs holds some persistant memory that
> guarentees the IOMMU can remain programmed and the VM pages can become
> fixed across the kexec
> 
> But once the VMM starts to restore it self we need to get back to the
> original configuration:
>  - A mmap that points to the VM's physical pages
>  - An iommufd IOAS that points to the above mmap
>  - An iommufd HWPT that represents that same mapping
>  - An iommu_domain programmed into HW that the HWPT

(A quick note on iommufd vs VFIO: I'll still keep referring to VFIO for
now because that's what I know, but will explore iommufd more and reply
in more detail about iommufd in the other email thread.)

How much of this do you think should be done automatically, vs how much
should userspace need to drive? With this RFC userspace basically re-
drives everything, including re-injecting the file containing the
persistent page tables into the IOMMU domain via VFIO.

Part of the reason is simplicity, to avoid having auto-deserialise code
paths in the drivers and modules. Another part of the reason so that
userspace can get FD handles on the resources. Typically FDs are
returned by doing actions like creating VFIO containers. If we make all
that automatic then there needs to be some other mechanism for auto-
restored resources to present themselves to userspace so that userspace
can discover and pick them up again.

One possible way to do this would be to populate a bunch of files in
procfs for each persisted IOMMU domain that allows userspace to discover
and pick it up.

Can you expand on what you mean by "A mmap that points to the VM's
physical pages?" Are you suggesting that the QEMU process automatically
gets something appearing in it's address space? Part of the live update
process involves potentially changing the userspace binaries: doing
kexec and booting a new system is an opportunity to boot new versions of
the userspace binary. So we shouldn't try to preserve too much of
userspace state; it's better to let it re-create internal data
structures do fresh mmaps.

What I'm really asking is: do you have a specific suggestion about how
these persistent resources should present themselves to userspace and
how userspace can discover them and pick them up?

> 
> Ie you can't just reboot and leave the IOMMU hanging out in some
> undefined land - especially in latest kernels!

Not too sure what you mean by "undefined land" - the idea is that the
IOMMU keeps doing what it was going until userspace comes along re-
creates the handles to the IOMMU at which point it can do modifications
like change mappings or tear the domain down. This is what deferred
attached gives us, I believe, and why I had to change it to be enabled.
Just leave the IOMMU domain alone until userspace re-creates it with the
original tables.
Perhaps I'm missing your point. :-)

> 
> For vt-d you need to retain the entire root table and all the required
> context entries too, The restarting iommu needs to understand that it
> has to "restore" a temporary iommu_domain from the pkernfs.
> You can later reconstitute a proper iommu_domain from the VMM and
> atomic switch.

Why does it need to go via a temporary domain? The current approach is
just to leave the IOMMU domain running as-is via deferred attached, and
later when userspace starts up it will create the iommu_domain backed by
the same persistent page tables.
> 
> So, I'm surprised to see this approach where things just live forever
> in the kernfs, I don't see how "restore" is going to work very well
> like this.

Can you expand on why the suggested restore path will be problematic? In
summary the idea is to re-create all of the "ephemeral" data structures
by re-doing ioctls like MAP_DMA, but keeping the persistent IOMMU
root/context tables pointed at the original persistent page tables. The
ephemeral data structures are re-created in userspace but the persistent
page tables left alone. This is of course dependent on userspace re-
creating things *correctly* - it can easily do the wrong thing. Perhaps
this is the issue? Or is there a problem even if userspace is sane.

> I would think that a save/restore mentalitity would make more
> sense. For instance you could make a special iommu_domain that is fixed
> and lives in the pkernfs. The operation would be to copy from the live
> iommu_domain to the fixed one and then replace the iommu HW to the
> fixed one.
> 
> In the post-kexec world the iommu would recreate that special domain
> and point the iommu at it. (copying the root and context descriptions
> out of the pkernfs). Then somehow that would get into iommufd and VFIO
> so that it could take over that special mapping during its startup.

The save and restore model is super interesting - I'm keen to discuss
this as an alternative. You're suggesting that IOMMU driver have a
serialise phase just before kexec where it dumps everything into
persistent memory and then after kexec pulls it back into ephemeral
memory. That's probably do-able, but it may increase the critical
section latency of live update (every millisecond counts!) and I'm also
not too sure what that buys compared to always working with persistent
memory and just always being in a state where persistent data is always
being used and can be picked up as-is.

However, the idea of a serialise and deserialise operation is relevant
to a possible alternative to this RFC. My colleague Alex Graf is working
on a framework called Kexec Hand Over (KHO):
https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com/#r
That allows drivers/modules to mark arbitrary memory pages as persistent
(ie: not allocatable by next kernel) and to pass over some serialised
state across kexec.
An alternative to IOMMU domain persistence could be to use KHO to mark
the IOMMU root, context and domain page table pages as persistent via
KHO.

> 
> Then you'd build the normal operating ioas and hwpt (with all the
> right page refcounts/etc) then switch to it and free the pkernfs
> memory.
> 
> It seems alot less invasive to me. The special case is clearly a
> special case and doesn't mess up the normal operation of the drivers.

Yes, a serialise/deserialise approach does have this distinct advantage
of not needing to change the alloc/free code paths. Pkernfs requires a
shim in the allocator to use persistent memory.


> 
> > * Needing to drive and re-hydrate the IOMMU page tables by defining an IOMMU file.
> > Really we should move the abstraction one level up and make the whole VFIO
> > container persistent via a pkernfs file. That way you’d "just" re-open the VFIO
> > container file and all of the DMA mappings inside VFIO would already be set up.
> 
> I doubt this.. It probably needs to be much finer grained actually,
> otherwise you are going to be serializing everything. Somehow I think
> you are better to serialize a minimum and try to reconstruct
> everything else in userspace. Like conserving iommufd IDs would be a
> huge PITA.
> 
> There are also going to be lots of security questions here, like we
> can't just let userspace feed in any garbage and violate vfio and
> iommu invariants.

Right! This is definitely one of the big gaps at the moment: this
approach requires that VFIO has the same state re-driven into it from
userspace so that the persistent and ephemeral data match. If userspace
does something dodgy, well, it may cause problems. :-)
That's exactly why I thought we should move the abstraction up to a
level that doesn't depend on userspace re-driving data. It sounds like
you were suggesting similar in the first part of your comment, but I
didn't fully understand how you'd like to see it presented to userspace.

JG


More information about the kexec mailing list