[RFC PATCH v2 0/7] Introduce persistent memory pool

Stanislav Kinsburskii skinsburskii at linux.microsoft.com
Wed Sep 27 17:02:30 PDT 2023


On Thu, Sep 28, 2023 at 10:29:32AM -0700, Dave Hansen wrote:
> On 9/27/23 16:25, Stanislav Kinsburskii wrote:
> > On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
> >> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
> >>> Once deposited, these pages can't be accessed by Linux anymore and thus
> >>> must be preserved in "used" state across kexec, as hypervisor state is
> >>> unaware of kexec.
> >>
> >> If Linux can't access them, they're not RAM any more.  I'd much rather
> >> remove them from the memory map and move on with life rather than
> >> implement a bunch of new ABI that's got to be handed across kernels.
> > 
> > Could you elaborate more on the new ABIs? FDT is handled by x86 already,
> > and passing it over kexec looks like a natural extension.
> > Also, adding more state to it doesn't look like a new ABI.
> > Or does it?
> 
> FDT makes it easier to pass arbitrary data around, but you're still
> creating a new "default_pmpool" device tree node on one end and
> consuming it on the other.  That's a new ABI in my book.
> 

Well, then yes, it's a new ABI.
I guess it can still be named as "linux,cma", but then another
compatibility needs to be introduced, and that's again a new ABI, isn't
it?

> > Let me also comment on removing these regions from the memory map. The
> > major peculiarity here is that the hypervisor distinguishes between the
> > pages deposited for guests to run and the pages deposited for the Linux
> > root partition to keep the guest-related portion of hypervisor state in
> > the root partition. And the latter is the matter in question.
> > 
> > We can indeed isolate and deposit an excessive amount of memory upfront
> > in the hope that the hypervisor will never get into a situation where it
> > needs more memory.
> > However, it's not reliable, as the amount of memory will always be an
> > estimate, depending on the number of expected guests, guest-attached
> > devices, etc. And this becomes an even bigger problem when most of the
> > memory is already removed from the memory map to host guest partitions.
> > It's also not efficient, as the amount of memory required by the
> > hypervisor can grow or shrink depending on the use case or host
> > configuration, and depositing an excessive amount of memory will be a
> > waste.
> > 
> > But, actually, the idea of removing the pages from memory map was
> > reflected to some extent in the first version of this proposal,
> > so let me elaborate on it a bit.
> > 
> > Effectively, instead of reserving and depositing a lot of memory to
> > hypervisor upfront, the memory can be allocated from kernel memory when
> > needed and then returned back when unused.
> > This would still require removing pages from the memory map upon kexec,
> > but that's another problem.
> 
> Let's distill this down a bit.
> 
> I agree that it's a waste to reserve an obscene amount of memory up
> front for all guests for rare cases.  Having the amount of consumed
> memory grow is a nice feature.
> 
> You can also quite easily *shrink* the amount of memory on a given
> kernel without new code.  Right?
> 
> The problem comes when you've grown the footprint of hypervisor-donated
> memory, kexec, and *THEN* want to shrink it.  That's what needs new
> metadata to be communicated over to the new kernel.
> 
> 1. Boot some kernel
> 2. Grow the deposited memory a bunch
> 3. Kexec
> 4. Shrink the deposited memory
> 
> Right?
> 

Well, not exactly. That's something I'd like to have indeed, but from my
POV this goal is out of scope of discussion at the moment.
Let me try to express it the same way you did above:

1. Boot some kernel
2. Grow the deposited memory a bunch
3. Kexec
4. Kernel panic due to GPF upon accessing the memory deposited to
hypervisor.
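
The sequence above can be sketched with a toy user-space model. This is
purely illustrative and is not kernel or hypervisor code: the Hypervisor
and Kernel classes, the kexec() helper and its "preserve" flag are all
hypothetical names invented for this sketch, and a Python exception stands
in for the GPF.

```python
# Toy model of the failure sequence; all names here are hypothetical.

class Hypervisor:
    """Tracks the page frame numbers (PFNs) deposited by the root partition."""
    def __init__(self):
        self.deposited = set()

    def deposit(self, pfns):
        self.deposited |= set(pfns)


class Kernel:
    """A kernel that treats every PFN as usable unless told it is reserved."""
    def __init__(self, hv, total_pages, reserved=()):
        self.hv = hv
        self.free = set(range(total_pages)) - set(reserved)

    def grow_deposit(self, count):
        # Take pages out of the kernel's free set and hand them over.
        pfns = sorted(self.free)[:count]
        self.free -= set(pfns)
        self.hv.deposit(pfns)
        return pfns

    def touch(self, pfn):
        # Stand-in for the fault taken when accessing a deposited page.
        if pfn in self.hv.deposited:
            raise MemoryError(f"fault: PFN {pfn} is owned by the hypervisor")
        return True


def kexec(old, total_pages, preserve):
    """Start a new kernel; hypervisor state survives, kernel state does not."""
    reserved = old.hv.deposited if preserve else ()
    return Kernel(old.hv, total_pages, reserved)


hv = Hypervisor()
k1 = Kernel(hv, total_pages=16)        # 1. Boot some kernel
deposited = k1.grow_deposit(4)         # 2. Grow the deposited memory
k2 = kexec(k1, 16, preserve=False)     # 3. Kexec without passing metadata
try:
    k2.touch(deposited[0])             # 4. New kernel touches a deposited page
except MemoryError as err:
    print(err)
```

With preserve=True the new kernel starts with the deposited PFNs excluded
from its free set, which is the role the preserved metadata plays in this
model.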

> That's where you lose me.
> 
> Can't the deposited memory just be shrunk before kexec?  Surely there
> aren't a bunch of pathological things consuming that memory right before
> kexec, which is basically a reboot.

In general it can. But for this to happen the hypervisor needs to release
this memory. And it can release the memory iff the guests are stopped.
And stopping the guests during kexec isn't something we want to have in the
long run.
Also, even if we stop the guests before kexec, we need to restart them
after boot, meaning we have to deposit the pages once again.
All this (stopping the guests, withdrawing the pages upon kexec,
allocating after boot and depositing them again) significantly affects
guest downtime.

Thanks,
Stanislav


