[PATCH v3] mm: introduce reference pages

Sat Jun 19 02:21:27 PDT 2021

[Apologies for the delay in getting back to you; other work ended up
taking priority and now I'm back to looking at this.]

On Mon, Aug 17, 2020 at 7:31 PM John Hubbard <jhubbard at nvidia.com> wrote:
>
> On 8/14/20 2:33 PM, Peter Collingbourne wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
>
> Hi,
>
> For new syscalls, I think we need to put linux-api on CC, at the very
> least. Adding them now. This would likely need man page support as well.
> I'll put linux-doc on Cc, too.

Thanks.

> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
> >
> > Reference pages are useful in circumstances where anonymous mappings
> > combined with manual stores to memory would impose undesirable costs,
> > either in terms of performance or RSS. Use cases are focused on heap
> > allocators and include:
> >
> > - Pattern initialization for the heap. This is where malloc(3) gives
> >    you memory whose contents are filled with a non-zero pattern
> >    byte, in order to help detect and mitigate bugs involving use
> >    of uninitialized memory. Typically this is implemented by having
> >    the allocator memset the allocation with the pattern byte before
> >    returning it to the user, but for large allocations this can result
> >    in a significant increase in RSS, especially for allocations that
> >    are used sparsely. Even for dense allocations there is a needless
> >    impact to startup performance when it may be better to amortize it
> >    throughout the program. By creating allocations using a reference
> >    page filled with the pattern byte, we can avoid these costs.
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> >    feature which allows for memory to be tagged in order to detect
> >    certain kinds of memory errors with low overhead. In order to set
> >    up an allocation to allow memory errors to be detected, the entire
> >    allocation needs to have the same tag. The issue here is similar to
> >    pattern initialization in the sense that large tagged allocations
> >    will be expensive if the tagging is done up front. The idea is that
> >    the allocator would create reference pages with each of the possible
> >    memory tags, and use those reference pages for the large allocations.
>
> That is good information, and it belongs in a man page, and/or Documentation/.

I plan to write a man page for refpage_create(2) once this is closer to landing.

> >
> > In order to measure the performance and RSS impact of reference pages,
> > a version of this patch backported to kernel version 4.14 was tested on
> > a Pixel 4 together with a modified [2] version of the Scudo allocator
> > that uses reference pages to implement pattern initialization. A
> > PDFium test program was used to collect the measurements like so:
> >
> > $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> > $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
> >
> > and the median of 100 runs measurement was taken with three variants
> > of the allocator:
> >
> > - "anon" is the baseline (no pattern init)
> > - "memset" is with pattern init of allocator pages implemented by
> >    initializing anonymous pages with memset
> > - "refpage" is with pattern init of allocator pages implemented
> >    by creating reference pages
> >
> > All three variants are measured using the patch that I linked. "anon"
> > is without the patch, "refpage" is with the patch and "memset" is
> > with a previous version of the patch [3] with "#if 0" in place of
> > "#if 1" in linux.cpp. The measurements are as follows:
> >
> >            Real time (s)    Max RSS (KiB)
> > anon        2.237081         107088
> > memset      2.252241         112180
> > refpage     2.243786         107128
> >
> > We can see that RSS for refpage is almost the same as anon, and real
> > time overhead is 44% that of memset.
> >
>
> Are some of the numbers stale, maybe? Try as I might, I cannot combine
> anything above to come up with 44%. :)
>
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator. Furthermore, even if
> > the interface issues are resolved, I suspect (but have not measured)
> > that the cost of the multiple context switches between kernel and
> > userspace would be too high to be used in an allocator anyway.
>
>
> That whole blurb is good for a cover letter, and perhaps an "alternatives
> considered" section in Documentation/. However, it should be omitted from
> the patch commit description, IMHO.

Okay, I moved it to the notes section of the commit message.

> ...
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 467302056e17..a1dc07ff914a 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> >
> >       if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> >               return false;
> > +
> > +     /*
> > +      * Transparent hugepages not currently supported for anonymous VMAs with
> > +      * reference pages
> > +      */
> > +     if (unlikely(vma->vm_private_data))
>
>
> This should use a helper function, such as is_reference_page_vma(). Because the
> assumption that "vma->vm_private_data means a reference page vma" is much too
> fragile. More below.

That makes sense. In v4 I've introduced a helper function.

> > +             return false;
> >       return true;
> >   }
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e7602a3bcef1..ac375e398690 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3122,5 +3122,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
> >
> >   extern int sysctl_nr_trim_pages;
> >
> > +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> > +                                      unsigned long pfn)
> > +{
> > +     if (is_zero_pfn(pfn))
> > +             return true;
> > +     if (unlikely(!vma->vm_ops && vma->vm_private_data))
> > +             return pfn == page_to_pfn((struct page *)vma->vm_private_data);
>
> As foreshadowed above, this needs a helper function. And the criteria for
> deciding that it's a reference page needs to be more robust than just "no vm_ops,
> vm_private_data is set, and it matches my page". Needs some more decisive
> information.
>
> Maybe setting vm_ops to some new "refpage" ops would be the way to go, for that.

As I mentioned in my reply to Jann, we can't set vm_ops without
introducing some unwanted behavior as a result of following the
non-anonymous VMA code path. What I ended up doing instead was to
check whether vm_file->f_op refers to the refpage file_operations
struct.

It might be nice to introduce a VM_REFPAGE flag to make this check
more efficient, but this would first require extending vm_flags to 64
bits on 32-bit platforms since we're out of bits in vm_flags. From
looking around it looks like many people have attempted this over the
years; it looks like the most recent attempt is from this month:
https://www.spinics.net/lists/kernel/msg3961408.html

Let's see if it actually happens this time.

> ...
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 5053439be6ab..6e9246d09e95 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> >       pmd_t *pmdp;
> >       pte_t *ptep;
> >
> > -     /* Only allow populating anonymous memory */
> > -     if (!vma_is_anonymous(vma))
> > +     /* Only allow populating anonymous memory without a reference page */
> > +     if (!vma_is_anonymous(vma) || vma->private_data)
>
> Same thing here: helper function, instead of open-coding the assumption about
> what makes a refpage vma.

Done.

> ...
>
> > +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> > +             flags)
> > +{
> > +     unsigned long content_addr = (unsigned long)content;
> > +     struct page *userpage, *refpage;
> > +     int fd;
> > +
> > +     if (flags != 0)
> > +             return -EINVAL;
> > +
> > +     refpage = alloc_page(GFP_KERNEL);
> > +     if (!refpage)
> > +             return -ENOMEM;
> > +
> > +     if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> > +         get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> > +             put_page(refpage);
> > +             return -EFAULT;
> > +     }
> > +
> > +     copy_highpage(refpage, userpage);
> > +     put_page(userpage);
> > +
> > +     fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> > +                           O_RDONLY | O_CLOEXEC);
>
> Seems like the flags argument should have an influence on these flags, rather
> than hard-coding O_CLOEXEC, right?

I couldn't see a use case for having one of these FDs without
O_CLOEXEC. If someone really wants a non-CLOEXEC refpage FD, they can
use fcntl to clear the CLOEXEC bit.

I only added the flags argument to support future extension as described in:
https://www.kernel.org/doc/html/v5.12/process/adding-syscalls.html#designing-the-api-planning-for-extension

Peter