[RFC PATCH v2 0/7] Introduce persistent memory pool
Dave Hansen
dave.hansen at intel.com
Thu Sep 28 10:29:32 PDT 2023
On 9/27/23 16:25, Stanislav Kinsburskii wrote:
> On Thu, Sep 28, 2023 at 06:22:54AM -0700, Dave Hansen wrote:
>> On 9/27/23 09:13, Stanislav Kinsburskii wrote:
>>> Once deposited, these pages can't be accessed by Linux anymore and thus
>>> must be preserved in a "used" state across kexec, as the hypervisor
>>> is unaware of kexec.
>>
>> If Linux can't access them, they're not RAM any more. I'd much rather
>> remove them from the memory map and move on with life rather than
>> implement a bunch of new ABI that's got to be handed across kernels.
>
> Could you elaborate more on the new ABIs? FDT is handled by x86 already,
> and passing it over kexec looks like a natural extension.
> Also, adding more state to it doesn't look like a new ABI.
> Or does it?
FDT makes it easier to pass arbitrary data around, but you're still
creating a new "default_pmpool" device tree node on one end and
consuming it on the other. That's a new ABI in my book.
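Just to make that concrete (illustration only, not code from the series):
the consuming side would be something along these lines, here sketched with
libfdt. The node name is the one from this thread; the "reg" base/size
layout is purely an assumption made for the sketch.

#include <libfdt.h>
#include <stdint.h>

/*
 * Hypothetical consumer: the kexec'd kernel (or its early boot code)
 * looks up the node the previous kernel created and recovers the
 * deposited range so it can keep it away from the page allocator.
 */
static int pmpool_find(const void *fdt, uint64_t *base, uint64_t *size)
{
        const fdt64_t *reg;
        int node, len;

        node = fdt_path_offset(fdt, "/default_pmpool");
        if (node < 0)
                return node;            /* nothing was handed over */

        /* Assumed layout: <base size> as two big-endian u64 cells. */
        reg = fdt_getprop(fdt, node, "reg", &len);
        if (!reg || len < (int)(2 * sizeof(*reg)))
                return -FDT_ERR_NOTFOUND;

        *base = fdt64_to_cpu(reg[0]);
        *size = fdt64_to_cpu(reg[1]);
        return 0;
}

Both ends have to agree on that node name and property layout forever,
which is what makes it ABI.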
> Let me also comment on removing these regions from the memory map. The
> major peculiarity here is that the hypervisor distinguishes between the
> pages deposited for guests to run and the pages deposited for the Linux
> root partition to keep the guest-related portion of hypervisor state in
> the root partition. The latter is what's in question here.
>
> We can indeed isolate and deposit an excessive amount of memory upfront
> in the hope that the hypervisor never gets into a situation where it
> needs more memory.
> However, that's not reliable, as the amount of memory will always be an
> estimate, depending on the number of expected guests, guest-attached
> devices, etc. And this becomes an even bigger problem when most of the
> memory has already been removed from the memory map to host guest
> partitions.
> It's also not efficient, as the amount of memory required by the
> hypervisor can grow or shrink depending on the use case or host
> configuration, and depositing an excessive amount of memory would be a
> waste.
>
> But, actually, the idea of removing the pages from the memory map was
> reflected to some extent in the first version of this proposal,
> so let me elaborate on it a bit.
>
> Effectively, instead of reserving and depositing a lot of memory to the
> hypervisor upfront, the memory can be allocated from kernel memory when
> needed and then returned when no longer used.
> This would still require removing the pages from the memory map upon
> kexec, but that's another problem.
Let's distill this down a bit.
I agree that it's a waste to reserve an obscene amount of memory up
front for all guests for rare cases. Having the amount of consumed
memory grow is a nice feature.
You can also quite easily *shrink* the amount of memory on a given
kernel without new code. Right?
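For the sake of shared terminology, a rough sketch of that on-demand
grow/shrink model (the deposit/withdraw helpers below are hypothetical
stand-ins for whatever hypercall wrappers the driver really uses; they are
not existing kernel APIs):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical hypercall wrappers, declared here only for illustration. */
int deposit_page_to_hypervisor(unsigned long pfn);
int withdraw_page_from_hypervisor(unsigned long *pfn);

/* Grow: hand freshly allocated pages to the hypervisor on demand. */
static int pmpool_grow(unsigned int nr_pages)
{
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
                struct page *page = alloc_page(GFP_KERNEL);
                int ret;

                if (!page)
                        return -ENOMEM;

                /* Once deposited, the page is off limits to Linux. */
                ret = deposit_page_to_hypervisor(page_to_pfn(page));
                if (ret) {
                        __free_page(page);
                        return ret;
                }
        }
        return 0;
}

/* Shrink: take pages back from the hypervisor and free them again. */
static void pmpool_shrink(unsigned int nr_pages)
{
        unsigned int i;

        for (i = 0; i < nr_pages; i++) {
                unsigned long pfn;

                if (withdraw_page_from_hypervisor(&pfn))
                        break;
                __free_page(pfn_to_page(pfn));
        }
}

Within a single kernel's lifetime both directions work without anything
needing to outlive that kernel.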
The problem comes when you've grown the footprint of hypervisor-donated
memory, kexec'd, and *THEN* want to shrink it. That's what needs new
metadata to be communicated over to the new kernel.
1. Boot some kernel
2. Grow the deposited memory a bunch
3. Kexec
4. Shrink the deposited memory
Right?
That's where you lose me.
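To spell out what step 3 would have to carry across: presumably something
like the node below, built into the FDT that gets handed to the next
kernel. Again a hypothetical sketch with an assumed "reg" layout matching
the consumer sketch earlier, not code from the series.

#include <libfdt.h>
#include <stdint.h>

/* Record the still-deposited range in the FDT passed to the next kernel. */
static int pmpool_publish(void *fdt, uint64_t base, uint64_t size)
{
        fdt64_t reg[2] = { cpu_to_fdt64(base), cpu_to_fdt64(size) };
        int node;

        node = fdt_add_subnode(fdt, 0 /* root node */, "default_pmpool");
        if (node < 0)
                return node;

        return fdt_setprop(fdt, node, "reg", reg, sizeof(reg));
}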
Can't the deposited memory just be shrunk before kexec? Surely there
aren't a bunch of pathological things consuming that memory right before
kexec, which is basically a reboot.
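If shrinking before kexec is acceptable, nothing needs to be described to
the next kernel at all. A reboot notifier (which also runs on the normal,
non-preserve-context kexec path via kernel_restart_prepare()) could drain
the pool. Hypothetical sketch; pmpool_shrink_all() is a made-up helper,
not an existing API:

#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/reboot.h>

/* Hypothetical helper: withdraw and free every page still deposited. */
int pmpool_shrink_all(void);

static int pmpool_reboot_notify(struct notifier_block *nb,
                                unsigned long action, void *data)
{
        /* Called from kernel_restart_prepare(), including before kexec. */
        pmpool_shrink_all();
        return NOTIFY_DONE;
}

static struct notifier_block pmpool_reboot_nb = {
        .notifier_call = pmpool_reboot_notify,
};

static int __init pmpool_register_reboot(void)
{
        return register_reboot_notifier(&pmpool_reboot_nb);
}
late_initcall(pmpool_register_reboot);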