[PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
Sean Christopherson
seanjc at google.com
Tue Sep 5 17:45:02 PDT 2023
On Tue, Sep 05, 2023, David Stevens wrote:
> On Wed, Jul 12, 2023 at 7:00 AM Sean Christopherson <seanjc at google.com> wrote:
> >
> > On Tue, Jul 11, 2023, Zhi Wang wrote:
> > > On Thu, 6 Jul 2023 15:49:39 +0900
> > > David Stevens <stevensd at chromium.org> wrote:
> > >
> > > > On Wed, Jul 5, 2023 at 10:19___PM Zhi Wang <zhi.wang.linux at gmail.com> wrote:
> > > > >
> > > > > On Tue, 4 Jul 2023 16:50:48 +0900
> > > > > David Stevens <stevensd at chromium.org> wrote:
> > > > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > > > temporary solution. I don't think it is a good idea to play tricks with
> > > > > a temporary solution, more like we are abusing the toleration.
> > > >
> > > > I'm not sure I understand what you're getting at. This series never
> > > > calls gup without FOLL_GET.
> > > >
> > > > This series aims to provide kvm_follow_pfn as a unified API on top of
> > > > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > > > notifier, it makes sense to support returning a pfn without taking a
> > > > reference. And we indeed need to do that for certain types of memory.
> > > >
> > >
> > > I am not having prob with taking a pfn without taking a ref. I am
> > > questioning if using !FOLL_GET in struct kvm_follow_pfn to indicate taking
> > > a pfn without a ref is a good idea or not, while there is another flag
> > > actually showing it.
> > >
> > > I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> > > translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> > > flags. However FOLL_XXX is for GUP. Using FOLL_XXX for reflecting the
> > > requirements of GUP in the code path that going to call GUP is reasonable.
> > >
> > > But using FOLL_XXX with purposes that are not related to GUP call really
> > > feels off.
> >
> > I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
> > that handles non-refcounted pages, i.e. this
> >
> > if (get_page_unless_zero(page)) {
> > foll->is_refcounted_page = true;
> > if (!(foll->flags & FOLL_GET))
> > put_page(page);
> > } else if (foll->flags & FOLL_GET) {
> > r = -EFAULT;
> > }
> >
> > should be
> >
> > if (get_page_unless_zero(page)) {
> > foll->is_refcounted_page = true;
> > if (!(foll->flags & FOLL_GET))
> > put_page(page);
> > else if (!foll->guarded_by_mmu_notifier)
> > r = -EFAULT;
> >
> > because it's not the desire to grab a reference that makes getting non-refcounted
> > pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.
> >
> > Though that highlights that checking guarded_by_mmu_notifier should be done for
> > *all* non-refcounted pfns, not just non-refcounted struct page memory.
>
> I think things are getting confused here because there are multiple
> things which "safe" refers to. There are three different definitions
> that I think are relevant here:
>
> 1) "safe" in the sense that KVM doesn't corrupt page reference counts
> 2) "safe" in the sense that KVM doesn't access pfns after they have been freed
> 3) "safe" in the sense that KVM doesn't use stale hva -> pfn translations
>
> For property 1, FOLL_GET is important. If the caller passes FOLL_GET,
> then they expect to be able to pass the returned pfn to
> kvm_release_pfn. This means that when FOLL_GET is set, if
> kvm_pfn_to_refcounted_page returns a page, then hva_to_pfn_remapped
> must take a reference count to avoid eventually corrupting the page
> ref count. I guess replacing the FOLL_GET check with
> !guarded_by_mmu_notifier is logically equivalent because
> __kvm_follow_pfn requires that at least one of guarded_by_mmu_notifier
> and FOLL_GET is set. But since we're concerned about a property of the
> refcount, I think that checking FOLL_GET is clearer.
>
> For property 2, FOLL_GET is also important. If guarded_by_mmu_notifier
> is set, then we're all good here. If guarded_by_mmu_notifier is not
> set, then the check in __kvm_follow_pfn guarantees that FOLL_GET is
> set. For struct page memory, we're safe because KVM will hold a
> reference as long as it's still using the page. For non struct page
> memory, we're not safe - this is where the breaking change of
> allow_unsafe_mappings would go. Note that for non-refcounted struct
> page, we can't use the allow_unsafe_mappings escape hatch. Since
> FOLL_GET was requested, if we returned such a page, then the caller
> would eventually corrupt the page refcount via kvm_release_pfn.
Yes we can. The caller simply needs to be made aware of is_refcounted_page. I
didn't include that in the snippet below because I didn't want to write the entire
patch. The whole point of adding is_refcounted_page is so that callers can
identify exactly what type of page was at the end of the trail that was followed.
> Property 3 would be nice, but we've already concluded that guarding
> all translations with mmu notifiers is infeasible. So maintaining
> property 2 is the best we can hope for.
No, #3 is just a variant of #2. Unless you're talking about not making guarantees
about guest accesses being ordered with respect to VMA/memslot updates, but I
don't think that's the case.
> > As for the other usage of FOLL_GET in this series (using it to conditionally do
> > put_page()), IMO that's very much related to the GUP call. Invoking put_page()
> > is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
> > without grabbing a reference to the page. In an ideal world, KVM would NOT pass
> > FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
> > wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.
> >
> > I do think it's worth providing a helper to consolidate and document that hacky
> > code, e.g. add a kvm_follow_refcounted_pfn() helper.
> >
> > All in all, I think the below (completely untested) is what we want?
> >
> > David (and others), I am planning on doing a full review of this series "soon",
> > but it will likely be a few weeks until that happens. I jumped in on this
> > specific thread because this caught my eye and I really don't want to throw out
> > *all* of the FOLL_GET usage.
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 5b5afd70f239..90d424990e0a 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > return rc == -EHWPOISON;
> > }
> >
> > +static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
> > + struct page *page)
> > +{
> > + kvm_pfn_t pfn = page_to_pfn(page);
> > +
> > + foll->is_refcounted_page = true;
> > +
> > + /*
> > + * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
> > + * doesn't want to grab a reference, but gup() doesn't support getting
> > + * just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
> > + * changes, drop this and simply don't pass FOLL_GET to gup().
> > + */
> > + if (!(foll->flags & FOLL_GET))
> > + put_page(page);
> > +
> > + return pfn;
> > +}
> > +
> > /*
> > * The fast path to get the writable pfn which will be stored in @pfn,
> > * true indicates success, otherwise false is returned. It's also the
> > @@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > return false;
> >
> > if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > - *pfn = page_to_pfn(page[0]);
> > foll->writable = foll->allow_write_mapping;
> > - foll->is_refcounted_page = true;
> > - if (!(foll->flags & FOLL_GET))
> > - put_page(page[0]);
> > +
> > + *pfn = kvm_follow_refcounted_pfn(foll, page[0]);
> > return true;
> > }
> >
> > @@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > return npages;
> >
> > foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > - foll->is_refcounted_page = true;
> >
> > /* map read fault as writable if possible */
> > if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > @@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > page = wpage;
> > }
> > }
> > - *pfn = page_to_pfn(page);
> > - if (!(foll->flags & FOLL_GET))
> > - put_page(page);
> > +
> > + *pfn = kvm_follow_refcounted_pfn(foll, page);
> > return npages;
> > }
> >
> > @@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> > if (!page)
> > goto out;
> >
> > - if (get_page_unless_zero(page)) {
> > - foll->is_refcounted_page = true;
> > - if (!(foll->flags & FOLL_GET))
> > - put_page(page);
> > - } else if (foll->flags & FOLL_GET) {
> > - r = -EFAULT;
> > - }
> > -
> > + if (get_page_unless_zero(page))
> > + WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> > out:
> > pte_unmap_unlock(ptep, ptl);
> > - *p_pfn = pfn;
> > +
> > + if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
> > + !allow_unsafe_mappings)
> > + r = -EFAULT;
> > + else
> > + *p_pfn = pfn;
> >
> > return r;
> > }
> >
>
> As I pointed out above, this suggestion is broken because a FOLL_GET
> && !guarded_by_mmu_notifier request (e.g. kvm_vcpu_map) for a
> non-refcounted page will result in the refcount eventually being
> corrupted.
I don't think so, unless I'm misunderstanding the concern. It just wasn't a
complete patch, and wasn't intended to be.
> What do you think of this implementation? If it makes sense, I can
> send out an updated patch series.
>
> /*
> * If FOLL_GET is set, then the caller wants us to take a reference to
> * keep the pfn alive. If FOLL_GET isn't set, then __kvm_follow_pfn
> * guarantees that guarded_by_mmu_notifier is set, so there aren't any
> * use-after-free concerns.
> */
> page = kvm_pfn_to_refcounted_page(pfn);
> if (page) {
> if (get_page_unless_zero(page)) {
> WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
> } else if (foll->flags & FOLL_GET) {
> /*
> * Certain IO or PFNMAP mappings can be backed with
> * valid struct pages but be allocated without
> * refcounting e.g., tail pages of non-compound higher
> * order allocations. The caller asked for a ref, but
> * we can't take one, since releasing such a ref would
> * free the page.
> */
> r = -EFAULT;
> }
> } else if (foll->flags & FOLL_GET) {
> /*
> * When there's no struct page to refcount and no MMU notifier,
> * then KVM can't be guarantee to avoid use-after-free. However,
> * there are valid reasons to set up such mappings. If userspace
> * is trusted and willing to forego kernel safety guarantees,
> * allow this check to be bypassed.
> */
> if (foll->guarded_by_mmu_notifier && !allow_unsafe_mappings)
I assume you mean:
if (!foll->guarded_by_mmu_notifier && !allow_unsafe_mappings)
> r = -EFAULT;
> }
Please no. I don't want to overload FOLL_GET or have dependencies between
FOLL_GET and guarded_by_mmu_notifier. The guest page fault path should be able
to omit FOLL_GET, and it obviously should set guarded_by_mmu_notifier. And
kvm_vcpu_map() should be able to set FOLL_GET, omit guarded_by_mmu_notifier, _and_
play nice with "unsafe", non-refcounted memory when allow_unsafe_mappings is true.
The above fits your use case because nothing in KVM needs to do kvm_vcpu_map()
on the GPU's buffer, but as a general solution the semantics are very odd. E.g.
in long form, they are:
Get get a reference if a there's a refcounted page, fail with -EFAULT if there's
a struct page but it's not refcounted, and fail with -EFAULT if there's not struct
page unless the caller is protected by mmu_notifiers or unsafe mappings are allowed.
That's rather bonkers. What I instead want is FOLL_GET to be:
Get a reference if there's a refcounted page.
And then allow_unsafe_mappings is simply:
Allow mapping non-refcounted memory into the guest even if the mapping isn't
guarded by mmu_notifier events.
More information about the linux-arm-kernel
mailing list