[PATCH v3] mm: introduce reference pages
John Hubbard
jhubbard at nvidia.com
Mon Aug 17 22:31:39 EDT 2020
On 8/14/20 2:33 PM, Peter Collingbourne wrote:
> Introduce a new syscall, refpage_create, which returns a file
> descriptor which may be mapped using mmap. Such a mapping is similar
Hi,
For new syscalls, I think we need to put linux-api on Cc, at the very
least. Adding them now. This would likely need man page support as well.
I'll put linux-doc on Cc, too.
> to an anonymous mapping, but instead of clean pages being backed by the
> zero page, they are instead backed by a so-called reference page, whose
> contents are specified using an argument to refpage_create. Loads from
> the mapping will load directly from the reference page, and initial
> stores to the mapping will copy-on-write from the reference page.
>
> Reference pages are useful in circumstances where anonymous mappings
> combined with manual stores to memory would impose undesirable costs,
> either in terms of performance or RSS. Use cases are focused on heap
> allocators and include:
>
> - Pattern initialization for the heap. This is where malloc(3) gives
> you memory whose contents are filled with a non-zero pattern
> byte, in order to help detect and mitigate bugs involving use
> of uninitialized memory. Typically this is implemented by having
> the allocator memset the allocation with the pattern byte before
> returning it to the user, but for large allocations this can result
> in a significant increase in RSS, especially for allocations that
> are used sparsely. Even for dense allocations there is a needless
> impact to startup performance when it may be better to amortize it
> throughout the program. By creating allocations using a reference
> page filled with the pattern byte, we can avoid these costs.
>
> - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> feature which allows for memory to be tagged in order to detect
> certain kinds of memory errors with low overhead. In order to set
> up an allocation to allow memory errors to be detected, the entire
> allocation needs to have the same tag. The issue here is similar to
> pattern initialization in the sense that large tagged allocations
> will be expensive if the tagging is done up front. The idea is that
> the allocator would create reference pages with each of the possible
> memory tags, and use those reference pages for the large allocations.
That is good information, and it belongs in a man page, and/or Documentation/.
>
> In order to measure the performance and RSS impact of reference pages,
> a version of this patch backported to kernel version 4.14 was tested on
> a Pixel 4 together with a modified [2] version of the Scudo allocator
> that uses reference pages to implement pattern initialization. A
> PDFium test program was used to collect the measurements like so:
>
> $ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
> $ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf
>
> and the median of 100 runs measurement was taken with three variants
> of the allocator:
>
> - "anon" is the baseline (no pattern init)
> - "memset" is with pattern init of allocator pages implemented by
> initializing anonymous pages with memset
> - "refpage" is with pattern init of allocator pages implemented
> by creating reference pages
>
> All three variants are measured using the patch that I linked. "anon"
> is without the patch, "refpage" is with the patch and "memset" is
> with a previous version of the patch [3] with "#if 0" in place of
> "#if 1" in linux.cpp. The measurements are as follows:
>
> Real time (s) Max RSS (KiB)
> anon 2.237081 107088
> memset 2.252241 112180
> refpage 2.243786 107128
>
> We can see that RSS for refpage is almost the same as anon, and real
> time overhead is 44% that of memset.
>
Are some of the numbers stale, maybe? Try as I might, I cannot combine
anything above to come up with 44%. :)
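One reading that does seem to land on that figure: take each variant's
real-time overhead relative to "anon", and divide refpage's overhead by
memset's. A minimal sketch of that arithmetic follows, using only the numbers
from the table above. The function names are made up for illustration, and
this is a guess at how the 44% was derived, not something the patch states.

```c
#include <stdio.h>

/*
 * Overhead of `variant` over `baseline`, expressed as a fraction of the
 * overhead of `reference` over the same baseline. Illustrative helper only.
 */
double relative_overhead(double baseline, double variant, double reference)
{
	return (variant - baseline) / (reference - baseline);
}

/* Print the refpage/memset overhead ratio for both columns of the table. */
void print_refpage_ratios(void)
{
	/* Real time (s): anon, refpage, memset */
	printf("real time: %.2f\n",
	       relative_overhead(2.237081, 2.243786, 2.252241));
	/* Max RSS (KiB): anon, refpage, memset */
	printf("max RSS:   %.3f\n",
	       relative_overhead(107088.0, 107128.0, 112180.0));
}
```

With the table's values the real-time ratio comes out to roughly 0.44, which
would match the quoted 44% if that is indeed the intended calculation. If so,
spelling it out in the description would help.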
> As an alternative to introducing this syscall, I considered using
> userfaultfd to implement reference pages. However, after having taken
> a detailed look at the interface, it does not seem suitable to be
> used in the context of a general purpose allocator. For example,
> UFFD_FEATURE_FORK support would be required in order to correctly
> support fork(2) in a process that uses the allocator (although POSIX
> does not guarantee support for allocating after fork, many allocators
> including Scudo support it, and nothing stops the forked process from
> page faulting pre-existing allocations after forking anyway), but
> UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> making it unsuitable for use in an allocator. Furthermore, even if
> the interface issues are resolved, I suspect (but have not measured)
> that the cost of the multiple context switches between kernel and
> userspace would be too high to be used in an allocator anyway.
That whole blurb is good for a cover letter, and perhaps an "alternatives
considered" section in Documentation/. However, it should be omitted from
the patch commit description, IMHO.
...
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 467302056e17..a1dc07ff914a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
>
> if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> return false;
> +
> + /*
> + * Transparent hugepages not currently supported for anonymous VMAs with
> + * reference pages
> + */
> + if (unlikely(vma->vm_private_data))
This should use a helper function, such as is_reference_page_vma(), because the
assumption that "vma->vm_private_data means a reference page vma" is much too
fragile. More below.
> + return false;
> return true;
> }
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e7602a3bcef1..ac375e398690 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3122,5 +3122,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
>
> extern int sysctl_nr_trim_pages;
>
> +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> + unsigned long pfn)
> +{
> + if (is_zero_pfn(pfn))
> + return true;
> + if (unlikely(!vma->vm_ops && vma->vm_private_data))
> + return pfn == page_to_pfn((struct page *)vma->vm_private_data);
As foreshadowed above, this needs a helper function. And the criteria for
deciding that it's a reference page need to be more robust than just "no vm_ops,
vm_private_data is set, and it matches my page". This needs some more decisive
information.
Maybe setting vm_ops to some new "refpage" ops would be the way to go, for that.
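To make that concrete, here is a rough sketch of the shape I have in mind. It
is written as standalone C with stand-in struct definitions so that it compiles
outside the kernel. refpage_vm_ops and vma_reference_page() are hypothetical
names, and the stub types only mimic the two vm_area_struct fields used here.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-ins for the kernel types, so the sketch is self-contained.
 * In the kernel these would come from <linux/mm_types.h> and <linux/mm.h>.
 */
struct page { int unused; };
struct vm_operations_struct { const char *illustrative_name; };
struct vm_area_struct {
	const struct vm_operations_struct *vm_ops;
	void *vm_private_data;
};

/* Hypothetical: a dedicated ops table whose address identifies refpage VMAs. */
static const struct vm_operations_struct refpage_vm_ops = { "refpage" };

/* One place that decides what makes a VMA a reference page VMA. */
bool is_reference_page_vma(const struct vm_area_struct *vma)
{
	return vma->vm_ops == &refpage_vm_ops;
}

/* Returns the reference page for a refpage VMA, or NULL for anything else. */
struct page *vma_reference_page(const struct vm_area_struct *vma)
{
	return is_reference_page_vma(vma) ? vma->vm_private_data : NULL;
}
```

Callers would then test is_reference_page_vma(vma) instead of peeking at
vm_private_data directly. One thing to watch: vma_is_anonymous() is defined as
!vma->vm_ops, so a VMA with non-NULL vm_ops stops looking anonymous, and the
anonymous-memory paths that the patch relies on would need to account for that.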
...
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 5053439be6ab..6e9246d09e95 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> pmd_t *pmdp;
> pte_t *ptep;
>
> - /* Only allow populating anonymous memory */
> - if (!vma_is_anonymous(vma))
> + /* Only allow populating anonymous memory without a reference page */
> + if (!vma_is_anonymous(vma) || vma->vm_private_data)
Same thing here: helper function, instead of open-coding the assumption about
what makes a refpage vma.
...
> +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> + flags)
> +{
> + unsigned long content_addr = (unsigned long)content;
> + struct page *userpage, *refpage;
> + int fd;
> +
> + if (flags != 0)
> + return -EINVAL;
> +
> + refpage = alloc_page(GFP_KERNEL);
> + if (!refpage)
> + return -ENOMEM;
> +
> + if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> + get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
> + put_page(refpage);
> + return -EFAULT;
> + }
> +
> + copy_highpage(refpage, userpage);
> + put_page(userpage);
> +
> + fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
> + O_RDONLY | O_CLOEXEC);
Seems like the flags argument should have an influence on these flags, rather
than hard-coding O_CLOEXEC, right?
thanks,
--
John Hubbard
NVIDIA