[RFC] memory pressure detection in VMs using PSI mechanism for dynamically inflating/deflating VM memory

Mon Jan 23 15:47:14 PST 2023

On 1/23/2023 1:26 PM, T.J. Alumbaugh wrote:
> Hi Sudarshan,
>
> I had questions about the setup and another about the use of PSI.
Thanks for your comments, Alumbaugh.
>> 1. This will be a native userspace daemon that will be running only in the Linux VM which will use virtio-mem driver that uses memory hotplug to add/remove memory. The VM (aka Secondary VM, SVM) will request for memory from the host which is Primary VM, PVM via the backend hypervisor which takes care of cross-VM communication.
>>
> In regards to the "PVM/SVM" nomenclature, is the implied setup one of
> fault tolerance (i.e. the secondary is there to take over in case of
> failure of the primary VM)? Generally speaking, are the PVM and SVM
> part of a defined system running some workload? The context seems to
> be that the situation is more intricate than "two virtual machines
> running on a host", but I'm not clear how it is different from that
> general notion.

Here the Primary VM (PVM) is actually the host, and we launch a VM from
this host. We simply call this newly launched VM the Secondary VM (SVM).
Sorry for the confusion. The secondary VM runs in a secure environment.

>
>> 5. Detecting decrease in memory pressure – the reverse part where we give back memory to PVM when memory is no longer needed is bit tricky. We look for pressure decay and see if PSI averages (avg10, avg60, avg300) go down, and along with other memory stats (such as free memory etc) we make an educated guess that usecase has ended and memory has been free’ed by the usecase, and this memory can be given back to PVM when its no longer needed.
>>
> This is also very interesting to me. Detecting a decrease in pressure
> using PSI seems difficult. IIUC correctly, the approach taken in
> OOMD/senpai from Meta seems to be continually applying
> pressure/backing off, and then seeing the outcome of that decision on
> the pressure metric to feedback to the next decision (see links
> below). Is your approach similar? Do you check the metric periodically
> or only when receiving PSI memory events in userspace?
>
> https://github.com/facebookincubator/senpai/blob/main/senpai.py#L117-L148
> https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L529-L538

We have implemented logic that uses the PSI averages to check the rate
of decay. If there are no new pressure events, these averages decay
exponentially, and we wait until the {avg10, avg60, avg300} values fall
below a certain threshold. The logic is as follows:

1. The usecase ends.
2. Wait until no new pressure events occur (this usually happens once
   all usecases have ended).
3. Run the pressure-decay check, which simply tests whether the
   exponentially decaying averages have gone below a certain threshold.
4. Once they have, make an educated decision that the usecase has
   actually ended.
5. Check memory stats such as MemFree. We take a memory snapshot when
   pressure builds up and new memory gets plugged in, and compare it
   with a snapshot taken when the pressure decay ends. That way we know
   how much memory was plugged in earlier and can check whether MemFree
   is in that range, which tells us the previously added memory is no
   longer needed.
6. Release the remaining free memory back to the Primary VM (host).
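
A rough Python sketch of the release decision described above. Only the
line format of /proc/pressure/memory comes from the kernel; the
threshold values, the helper names (parse_psi, pressure_decayed,
releasable_kb) and the reserve margin are illustrative assumptions, not
what our daemon actually uses.

```python
# Hypothetical thresholds (percent); the real values are a tuning choice.
THRESHOLDS = {"avg10": 0.10, "avg60": 0.50, "avg300": 1.00}


def parse_psi(text, kind="some"):
    """Parse /proc/pressure/memory content; return the averages of one line."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == kind:
            pairs = dict(f.split("=", 1) for f in fields[1:])
            return {k: float(v) for k, v in pairs.items() if k.startswith("avg")}
    raise ValueError(f"no '{kind}' line in PSI data")


def pressure_decayed(avgs, thresholds=THRESHOLDS):
    """True once every average is at or below its threshold."""
    return all(avgs[k] <= limit for k, limit in thresholds.items())


def releasable_kb(plugged_kb, memfree_kb, reserve_kb):
    """Memory to hand back: at most what was plugged in, and only what is
    free beyond a safety reserve (the reserve is an assumption here)."""
    return max(0, min(plugged_kb, memfree_kb - reserve_kb))


# Sample text in the kernel's format; a daemon would read the real file.
sample = (
    "some avg10=0.00 avg60=0.12 avg300=0.85 total=1234567\n"
    "full avg10=0.00 avg60=0.05 avg300=0.40 total=234567\n"
)
avgs = parse_psi(sample)
if pressure_decayed(avgs):
    # 512 MB plugged earlier, 700 MB free now, keep 128 MB in reserve.
    print(releasable_kb(plugged_kb=512 * 1024,
                        memfree_kb=700 * 1024,
                        reserve_kb=128 * 1024))  # prints 524288
```

Capping the release at the amount plugged in earlier keeps the VM from
shrinking below its size before the pressure event.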

The reason we check for exponential decay of the averages is that it
gives a clear picture that memory pressure is indeed going down: any new
sudden spike in pressure is factored into these averages as an increase
and can be observed. If we instead sampled the pressure at every tick,
we might miss sudden spikes whenever the sampling interval is too wide.
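
To illustrate the point with a toy model (not our daemon code): the
sketch below treats an average as a plain exponentially weighted moving
average updated every 2 seconds. The update period, the stall trace and
the function names are assumptions made for the illustration. A single
short burst at t=10s is invisible to a poller that looks at the
instantaneous stall every 30s, but it is still visible in the average.

```python
import math

PERIOD_S = 2  # assumed update interval of the averages


def ewma_step(avg, sample, window_s, period_s=PERIOD_S):
    """One update of an exponentially weighted moving average."""
    decay = math.exp(-period_s / window_s)
    return avg * decay + sample * (1.0 - decay)


def simulate(stall_pct_by_time, window_s, end_s):
    """Return {time: average} for a stall trace given as {time: percent}."""
    avg, trace = 0.0, {}
    for t in range(0, end_s + 1, PERIOD_S):
        avg = ewma_step(avg, stall_pct_by_time.get(t, 0.0), window_s)
        trace[t] = avg
    return trace


spike = {10: 100.0}  # one 2-second burst of full stall at t=10s
avg60 = simulate(spike, window_s=60, end_s=60)
for t in (0, 30, 60):  # a coarse poller that wakes up every 30s
    # columns: time, instantaneous stall at that moment, modelled avg60
    print(t, spike.get(t, 0.0), round(avg60[t], 2))
```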

Another nice property of the averages is that you can calculate how long
it will take for pressure to decay from {avg10XX, avg60XX} to {avg10TT,
avg60TT}, where avg10TT, ... are the set threshold values. So you can
sleep for that long, then wake up and check whether the averages have
reached the threshold values. If they have not, a new pressure event
must have come in and interrupted the decay. This way we don't have to
sample the pressure at every tick, which saves CPU cycles.
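
A minimal sketch of that sleep-time calculation, assuming an idealized
decay avg(t) = avg(0) * exp(-t / W) with W = 10, 60 or 300 seconds and
no new stalls. Solving for the threshold T gives t = W * ln(avg(0) / T).
The function names and the sample numbers are hypothetical.

```python
import math

WINDOWS_S = {"avg10": 10.0, "avg60": 60.0, "avg300": 300.0}


def decay_time_s(current, threshold, window_s):
    """Seconds for one average to fall to its threshold with no new stalls."""
    if threshold <= 0:
        raise ValueError("threshold must be positive; decay never reaches zero")
    if current <= threshold:
        return 0.0
    return window_s * math.log(current / threshold)


def sleep_time_s(current, thresholds):
    """Longest decay time over all windows, i.e. the earliest moment at
    which every average can be at or below its threshold."""
    return max(decay_time_s(current[k], thresholds[k], WINDOWS_S[k])
               for k in thresholds)


now = {"avg10": 8.0, "avg60": 4.0, "avg300": 1.5}
target = {"avg10": 0.1, "avg60": 0.5, "avg300": 1.0}
print(round(sleep_time_s(now, target), 1))  # prints 124.8 (avg60 dominates)
```

If the averages are still above the thresholds after sleeping that long,
new pressure arrived in the meantime and the wait is restarted from the
current values.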

> Very interesting proposal. Thanks for sending,
>
> -T.J.