[PATCH] mm: introduce reference pages

Thu Aug 13 18:03:41 EDT 2020

Hi John,

Thanks for the review and suggestions.

On Sun, Aug 2, 2020 at 8:28 PM John Hubbard <jhubbard at nvidia.com> wrote:
>
> On 7/31/20 1:32 PM, Peter Collingbourne wrote:
> ...
>
> Hi,
>
> I can see why you want to do this. A few points to consider, below.
>
> btw, the patch would *not* apply for me, via `git am`. I finally used
> patch(1) and that worked. Probably good to mention which tree and branch
> this applies to, as a first step to avoiding that, but I'm not quite sure
> what else went wrong. Maybe use stock git, instead of
> 2.28.0.163.g6104cc2f0b6-goog? Just guessing.

Sorry about that. It might have been because I had another patch
applied underneath this one when I created the patch. In the v2 that
I'm about to send I'm based directly on master.

> > @@ -1684,9 +1695,33 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
> >       return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
> >   }
> >
> > +static vm_fault_t refpage_fault(struct vm_fault *vmf)
> > +{
> > +     struct page *page;
> > +
> > +     if (get_user_pages((unsigned long)vmf->vma->vm_private_data, 1, 0,
> > +                        &page, 0) != 1)
> > +             return VM_FAULT_SIGSEGV;
> > +
>
> This will end up overflowing the page->_refcount in some situations.
>
> Some thoughts:
>
> In order to implement this feature, the reference pages need to be made
> at least a little bit more special, and probably little bit more like
> zero pages. At one extreme, for example, zero pages could be a special
> case of reference pages, although I'm not sure of a clean way to
> implement that.
>
>
> The reason that more special-ness is required, is that things such as
> reference counting and locking can be special-cased with zero pages.
> Doing so allows avoiding page->_refcount overflows, for example. Your
> patch here, however, allows normal pages to be treated *almost* like a
> zero page, in that it's a page full of constant value data. But because
> a refpage can be any page, not just a special one that is defined at a
> single location, that leads to problems with refcounts.

You're right, there is a potential reference count issue here. But it
looks like the issue is not with _refcount but with _mapcount. For
example, a program could create a reference page mapping with 2^32
pages, fault every page in the mapping and thereby overflow _mapcount.

It looks like we can avoid this issue by aligning the handling of
reference pages with that of the zero page, as you suggested. Like the
zero page, _mapcount is now not modified on reference pages to track
PTEs (this is done by causing vm_normal_page() to return null for
these pages, as we do for the zero page). Ownership is moved to the
struct file created by the new refpage_create (bikeshed colors
welcome) syscall which returns a file descriptor, per Kirill's
suggestion. A struct file's reference count is an atomic_long_t, which
I assume cannot realistically overflow. A pointer to the reference
page is stored in the VMA's vm_private_data, but this is mostly for
convenience because the page is kept alive by the VMA's struct file
reference. The VMA's vm_ops is now set to null, which causes us to
follow the code path for anonymous pages, which has been modified to
handle reference pages. That's all implemented in the v2 that I'm
about to send.

I considered having reference page mappings continue to provide a
custom vm_ops, but this would require changes to the interface to
preserve the specialness of the reference page. For example,
vm_normal_page() would need to know to return null for the reference
page in order to prevent _mapcount from overflowing, which could
probably be done by adding a new interface to vm_ops, but that seemed
more complicated than changing the anonymous page code path.

> > +     vmf->page = page;
> > +     return VM_FAULT_LOCKED;
>
> Is the page really locked, or is this a case of "the page is special and
> we can safely claim it is locked"? Maybe I'm just confused about the use
> of VM_FAULT_LOCKED: I thought you only should set it after locking the
> page.

You're right, it isn't locked at this point. I had confused locking
the page with incrementing its _refcount via get_user_pages(). But
with the new implementation we no longer need this fault handler.

> > +}
> > +
> > +static void refpage_close(struct vm_area_struct *vma)
> > +{
> > +     /* This function exists only to prevent is_mergeable_vma from allowing a
> > +      * reference page mapping to be merged with an anonymous mapping.
> > +      */
>
> While it is true that implementing a vma's .close() method will prevent
> vma merging, this is an abuse of that function: it depends on how that
> function is implemented. And given that refpages represent significant
> new capability, I think they deserve their own "if" clause (and perhaps
> a VMA flag) in is_mergeable_vma(), instead of this kind of minor hack.

It turns out that with the change to use a file descriptor we do not
need a change to is_mergeable_vma() because the function bails out if
the struct file pointers in the VMAs are different.

Thanks,
Peter