[PATCH v12 05/12] KVM: guest_memfd: Enforce NUMA mempolicy using shared policy

Ackerley Tng ackerleytng at google.com
Fri Oct 10 14:57:19 PDT 2025


Sean Christopherson <seanjc at google.com> writes:

> On Fri, Oct 10, 2025, Shivank Garg wrote:
>> >> @@ -112,6 +114,19 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>> >>  	return r;
>> >>  }
>> >>  
>> >> +static struct mempolicy *kvm_gmem_get_folio_policy(struct gmem_inode *gi,
>> >> +						   pgoff_t index)
>> > 
>> > How about kvm_gmem_get_index_policy() instead, since the policy is keyed
>> > by index?
>
> But isn't the policy tied to the folio?  I assume/hope that something will split
> folios if they have different policies for their indices when a folio contains
> more than one page.  In other words, how will this work when hugepage support
> comes along?
>
> So yeah, I agree that the lookup is keyed on the index, but conceptually aren't
> we getting the policy for the folio?  The index is a means to an end.
>

I think the policy is tied to the index.

When we mmap(), there may not be a folio at this index yet, so any folio
that is later allocated for this index is taken from the right NUMA
node based on the policy.

If the folio is later truncated, the folio just goes back to the NUMA
node, but the memory policy remains for the next folio to be allocated
at this index.
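
To make that concrete, here is a rough userspace model (not kernel code,
every model_* name is made up for illustration). It only shows the
lifetime argument: the policy is stored per index, exists before any
folio does, and survives truncation. Falling back to a task node when no
policy is set is a simplification of the real allocator behavior.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_INDICES 8
#define MODEL_NO_NODE    (-1)

/*
 * Hypothetical userspace model, not kernel code: the policy and the
 * backing folio are tracked separately for every index.
 */
struct model_inode {
	int policy_node[MODEL_NR_INDICES]; /* node requested for the index */
	int folio_node[MODEL_NR_INDICES];  /* node of the current folio, if any */
};

static void model_init(struct model_inode *mi)
{
	for (size_t i = 0; i < MODEL_NR_INDICES; i++) {
		mi->policy_node[i] = MODEL_NO_NODE;
		mi->folio_node[i] = MODEL_NO_NODE;
	}
}

/* Record a policy for an index; no folio has to exist yet. */
static void model_set_policy(struct model_inode *mi, size_t index, int node)
{
	assert(index < MODEL_NR_INDICES);
	mi->policy_node[index] = node;
}

/* Allocate a folio for an index, or return the node of the existing one. */
static int model_alloc_folio(struct model_inode *mi, size_t index, int task_node)
{
	assert(index < MODEL_NR_INDICES);
	if (mi->folio_node[index] == MODEL_NO_NODE) {
		int want = mi->policy_node[index];

		/* Assumed simplification: no policy means "use the task's node". */
		mi->folio_node[index] = (want != MODEL_NO_NODE) ? want : task_node;
	}
	return mi->folio_node[index];
}

/* Truncation frees the folio but leaves the policy in place. */
static void model_truncate(struct model_inode *mi, size_t index)
{
	assert(index < MODEL_NR_INDICES);
	mi->folio_node[index] = MODEL_NO_NODE;
}
```

Setting a policy on index 3, allocating, truncating, and allocating
again lands the second folio on the same node, because nothing in
model_truncate() touches policy_node[].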

>> >> +{
>> >> +#ifdef CONFIG_NUMA
>> >> +	struct mempolicy *mpol;
>> >> +
>> >> +	mpol = mpol_shared_policy_lookup(&gi->policy, index);
>> >> +	return mpol ? mpol : get_task_policy(current);
>> > 
>> > Should we be returning NULL if no shared policy was defined?
>> > 
>> > By returning NULL, __filemap_get_folio_mpol() can handle the case where
>> > cpuset_do_page_mem_spread() is enabled.
>> > 
>> > If we always return current's task policy, what if the user wants to use
>> > cpuset_do_page_mem_spread()?
>> > 
>> 
>> I initially followed shmem's approach here.
>> I agree that returning NULL maintains consistency with the current default
>> behavior of cpuset_do_page_mem_spread(), regardless of CONFIG_NUMA.
>> 
>> I'm curious what the practical implications would be of using
>> cpuset_do_page_mem_spread() vs. get_task_policy() as the fallback?
>
> Userspace could enable page spreading on the task that triggers guest_memfd
> allocation.  I can't conjure up a reason to do that, but I've been surprised
> more than once by KVM setups.
>
>> Which is more appropriate for guest_memfd when no policy is explicitly set
>> via mbind()?
>
> I don't think we need to answer that question?  Userspace _has_ set a policy,
> just through cpuset, not via mbind().  So while I can't imagine there's a sane
> use case for cpuset_do_page_mem_spread() with guest_memfd, I also don't see a
> reason why KVM should effectively disallow it.
>
> And unless I'm missing something, allocation will eventually fall back to
> get_task_policy() (in alloc_frozen_pages_noprof()), so by explicitly getting the
> task policy in guest_memfd, KVM is doing _more_ work than necessary _and_ is
> unnecessarily restricting userspace.
>
> Add in that returning NULL would align this code with the ->get_policy hook (and
> could be shared again, I assume), and my vote is definitely to return NULL and
> not get in the way.

... although if we are going to return NULL then we can directly use
mpol_shared_policy_lookup(), so the first discussion is moot.
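
As a sanity check on the fallback order discussed above, here is a small
userspace sketch (again not kernel code, all model_* names and the node
numbers are hypothetical). It assumes the order described in this
thread: an explicit shared policy wins, then cpuset page spreading if it
is enabled, then the task policy. It compares a lookup that returns NULL
with one that eagerly substitutes the task policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory policy: just a preferred node. */
struct model_policy {
	int node;
};

/* Lookup that returns NULL when no shared policy covers the index. */
static const struct model_policy *
model_lookup(const struct model_policy *const *table, size_t n, size_t index)
{
	return index < n ? table[index] : NULL;
}

/* Lookup that eagerly falls back to the task policy instead of NULL. */
static const struct model_policy *
model_lookup_eager(const struct model_policy *const *table, size_t n,
		   size_t index, const struct model_policy *task_pol)
{
	const struct model_policy *pol = model_lookup(table, n, index);

	return pol ? pol : task_pol;
}

/*
 * Assumed allocation order, as described in the thread: explicit policy,
 * then page spreading if enabled, then the task policy.
 */
static int model_alloc_node(const struct model_policy *pol, bool spread_enabled,
			    int spread_node, const struct model_policy *task_pol)
{
	assert(task_pol != NULL);
	if (pol)
		return pol->node;
	if (spread_enabled)
		return spread_node;
	return task_pol->node;
}
```

With spreading enabled and no shared policy at an index, model_lookup()
lets model_alloc_node() reach the spread branch, while
model_lookup_eager() hands it a non-NULL policy and the spread branch is
never taken.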


Though looking slightly into the future, shareability (aka memory
attributes or shared/private state within guest_memfd inodes) is also
keyed by index, and is a property of the index and not the folio (since
shared/private state is defined even before folios are allocated for a
given index).


