[PATCH v4] mm: introduce reference pages
Peter Collingbourne
pcc at google.com
Fri Jul 16 19:59:00 PDT 2021
On Tue, Jun 29, 2021 at 12:19 AM John Hubbard <jhubbard at nvidia.com> wrote:
>
> On 6/19/21 2:20 AM, Peter Collingbourne wrote:
> > Introduce a new syscall, refpage_create, which returns a file
> > descriptor which may be mapped using mmap. Such a mapping is similar
> > to an anonymous mapping, but instead of clean pages being backed by the
> > zero page, they are instead backed by a so-called reference page, whose
> > contents are specified using an argument to refpage_create. Loads from
> > the mapping will load directly from the reference page, and initial
> > stores to the mapping will copy-on-write from the reference page.
>
> Hi Peter,
>
> Now that you have shown that this seems to have some performance
> justification, I've taken a closer look at the patch, and have a handfull
> of small suggestions, most of them very easy to deal with.
>
> First of all: documentation of the new syscall. At the very least,
> refpage.c could use a bunch of the wording that is in this patch's
> commit description, at the top. I'm sure there are other places for new
> syscall documentation (someone else probably knows where), but that would
> be a good start.
Okay, I copied some of the text from the commit message into a comment
at the top of refpage.c. I also wrote a man page for the new syscall,
which I'm sending out concurrently.
> >
> > Reference pages are useful in circumstances where anonymous mappings
> > combined with manual stores to memory would impose undesirable costs,
> > either in terms of performance or RSS. Use cases are focused on heap
> > allocators and include:
> >
> > - Pattern initialization for the heap. This is where malloc(3) gives
> > you memory whose contents are filled with a non-zero pattern
> > byte, in order to help detect and mitigate bugs involving use
> > of uninitialized memory. Typically this is implemented by having
> > the allocator memset the allocation with the pattern byte before
> > returning it to the user, but for large allocations this can result
> > in a significant increase in RSS, especially for allocations that
> > are used sparsely. Even for dense allocations there is a needless
> > impact to startup performance when it may be better to amortize it
> > throughout the program. By creating allocations using a reference
> > page filled with the pattern byte, we can avoid these costs.
>
> As Kirill and Matthew mentioned in the other thread, it would be good
> to pass in the pattern as part of the syscall, instead of deducing it
> in prep_refpage_private_data(). I'll cover that more in the diffs area.
>
> >
> > - Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
> > feature which allows for memory to be tagged in order to detect
> > certain kinds of memory errors with low overhead. In order to set
> > up an allocation to allow memory errors to be detected, the entire
> > allocation needs to have the same tag. The issue here is similar to
> > pattern initialization in the sense that large tagged allocations
> > will be expensive if the tagging is done up front. The idea is that
> > the allocator would create reference pages with each of the possible
> > memory tags, and use those reference pages for the large allocations.
> >
> > This patch includes specific optimizations for these use cases in
> > order to reduce memory traffic. If the reference page consists of a
> > single repeating byte, the page is initialized using memset, and on
> > arm64 if the reference page consists of a uniformly tagged zero page,
> > the DC GZVA instruction is used to initialize the page.
> >
> > In order to measure the performance and RSS impact of reference pages,
> > I used the following microbenchmark program, which is intended to
> > compare an implementation of heap pattern initialization that uses
> > memset to initialize the pages against an implementation that uses
> > reference pages:
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #include <unistd.h>
> >
> > constexpr unsigned char pattern_byte = 0xaa;
> >
> > #define PAGE_SIZE 4096
> >
> > _Alignas(PAGE_SIZE) static unsigned char pattern[PAGE_SIZE];
> >
> > int main(int argc, char **argv) {
> > if (argc < 3)
> > return 1;
> > bool use_refpage = argc > 3;
> > size_t mmap_size = atoi(argv[1]);
> > size_t touch_size = atoi(argv[2]);
> >
> > int refpage_fd;
> > if (use_refpage) {
> > memset(pattern, pattern_byte, PAGE_SIZE);
> > refpage_fd = syscall(448, pattern, 0);
> > }
> > for (unsigned i = 0; i != 1000; ++i) {
> > char *p;
> > if (use_refpage) {
> > p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
> > refpage_fd, 0);
> > } else {
> > p = (char *)mmap(0, mmap_size, PROT_READ | PROT_WRITE,
> > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > memset(p, pattern_byte, mmap_size);
> > }
> > for (unsigned j = 0; j < touch_size; j += PAGE_SIZE)
> > p[j] = 0;
> > munmap(p, mmap_size);
> > }
> > }
> >
>
>
> That sample code would be very nice to include in a documentation
> section for documentation too, once we figure out the best place to put
> it. If no one else recommends anything, then I'd start with
> Documentation/mm/reference_pages.rst.
I would propose the man page to be the canonical source of
documentation for this syscall, since I would expect it to be the
first place that users will look when trying to understand code that
uses it, as opposed to the kernel's internal documentation.
I added some sample code to the man page, but not exactly the code
above since that code is more of a benchmark than a demonstration of
the feature, and I would expect the latter to be more useful to
readers.
> > On a DragonBoard 845c with the powersave governor, and taking the
> > median of 10 runs for each measurement, I measured the following
> > results for real time (s):
> >
> > touch_size/mmap_size memset refpages improvement (95% CI)
> > 4096/4096000 3.962194 0.026726 98.8015% +/- 1.14684%
> > 2048000/4096000 3.925309 1.48081 61.8271% +/- 1.11911%
> > 4096000/4096000 3.986275 3.385003 15.1205% +/- 0.227235%
> >
> > And the following for max RSS (KiB):
> >
> > touch_size/mmap_size memset refpages improvement (95% CI)
> > 4096/4096000 6656 3448 49.3815% +/- 1.30339%
> > 2048000/4096000 6696 4580 31.7053% +/- 1.16411%
> > 4096000/4096000 6716 6684 none
> >
> > So we see a large improvement for sparsely used allocations, and even
> > a modest perf improvement for fully utilized allocations as a result
> > of touching the pages one fewer time (with memset: once in the kernel
> > and once in userspace; with refpages: just once in the kernel).
> >
> > Signed-off-by: Peter Collingbourne <pcc at google.com>
> > Link: [1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
> > ---
> > v4:
> > - rebased to linux-next
> > - added arch hooks to support MTE tagged reference pages
> > - added optimizations for pages with pattern byte as well as uniformly MTE-tagged pages
> > - added helper functions to avoid open-coding the reference page detection
> > - wrote a microbenchmark program and got new perf results for the commit message
> >
> > As an alternative to introducing this syscall, I considered using
> > userfaultfd to implement reference pages. However, after having taken
> > a detailed look at the interface, it does not seem suitable to be
> > used in the context of a general purpose allocator. For example,
> > UFFD_FEATURE_FORK support would be required in order to correctly
> > support fork(2) in a process that uses the allocator (although POSIX
> > does not guarantee support for allocating after fork, many allocators
> > including Scudo support it, and nothing stops the forked process from
> > page faulting pre-existing allocations after forking anyway), but
> > UFFD_FEATURE_FORK has been restricted to root by commit 3c1c24d91ffd
> > ("userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"),
> > making it unsuitable for use in an allocator. Furthermore, even if
> > the interface issues are resolved, I suspect (but have not measured)
> > that the cost of the multiple context switches between kernel and
> > userspace would be too high to be used in an allocator anyway.
> >
> > arch/alpha/kernel/syscalls/syscall.tbl | 1 +
> > arch/arm/tools/syscall.tbl | 1 +
> > arch/arm64/include/asm/mman.h | 15 ++++
> > arch/arm64/include/asm/mte.h | 9 +-
> > arch/arm64/include/asm/page.h | 2 +-
> > arch/arm64/include/asm/unistd.h | 2 +-
> > arch/arm64/include/asm/unistd32.h | 2 +
> > arch/arm64/kernel/mte.c | 24 +++++
> > arch/arm64/lib/mte.S | 7 +-
> > arch/arm64/mm/fault.c | 41 ++++++++-
> > arch/ia64/kernel/syscalls/syscall.tbl | 1 +
> > arch/m68k/kernel/syscalls/syscall.tbl | 1 +
> > arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
> > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
> > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
> > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
> > arch/parisc/kernel/syscalls/syscall.tbl | 1 +
> > arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
> > arch/s390/kernel/syscalls/syscall.tbl | 1 +
> > arch/sh/kernel/syscalls/syscall.tbl | 1 +
> > arch/sparc/kernel/syscalls/syscall.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> > arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> > arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
> > include/linux/gfp.h | 11 ++-
> > include/linux/highmem.h | 2 +-
> > include/linux/huge_mm.h | 7 ++
> > include/linux/mm.h | 39 ++++++++
> > include/linux/mman.h | 19 ++++
> > include/linux/syscalls.h | 3 +
> > include/uapi/asm-generic/unistd.h | 5 +-
> > kernel/sys_ni.c | 1 +
> > mm/Makefile | 4 +-
> > mm/gup.c | 2 +-
> > mm/kasan/hw_tags.c | 2 +-
> > mm/memory.c | 47 +++++++---
> > mm/migrate.c | 4 +-
> > mm/page_alloc.c | 2 +-
> > mm/refpage.c | 98 +++++++++++++++++++++
> > 39 files changed, 330 insertions(+), 34 deletions(-)
> > create mode 100644 mm/refpage.c
> >
> > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> > index a17687ed4b51..494edc5ca61c 100644
> > --- a/arch/alpha/kernel/syscalls/syscall.tbl
> > +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> > @@ -486,3 +486,4 @@
> > 554 common landlock_create_ruleset sys_landlock_create_ruleset
> > 555 common landlock_add_rule sys_landlock_add_rule
> > 556 common landlock_restrict_self sys_landlock_restrict_self
> > +558 common refpage_create sys_refpage_create
> > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> > index c5df1179fc5d..8fd7045f46b9 100644
> > --- a/arch/arm/tools/syscall.tbl
> > +++ b/arch/arm/tools/syscall.tbl
> > @@ -460,3 +460,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> > index e3e28f7daf62..5c0da3f76ec7 100644
> > --- a/arch/arm64/include/asm/mman.h
> > +++ b/arch/arm64/include/asm/mman.h
> > @@ -84,4 +84,19 @@ static inline bool arch_validate_flags(unsigned long vm_flags)
> > }
> > #define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
> >
> > +struct refpage_private_data;
> > +
> > +void arch_prep_refpage_private_data(struct refpage_private_data *priv);
> > +#define arch_prep_refpage_private_data arch_prep_refpage_private_data
> > +
> > +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> > +{
> > + vma->vm_flags |= VM_MTE_ALLOWED;
> > +}
> > +#define arch_prep_refpage_vma arch_prep_refpage_vma
> > +
> > +void arch_copy_refpage(struct page *page, unsigned long addr,
> > + struct vm_area_struct *vma);
> > +#define arch_copy_refpage arch_copy_refpage
> > +
> > #endif /* ! __ASM_MMAN_H__ */
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 67bf259ae768..b513f83010c7 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -37,7 +37,7 @@ void mte_free_tag_storage(char *storage);
> > /* track which pages have valid allocation tags */
> > #define PG_mte_tagged PG_arch_2
> >
> > -void mte_zero_clear_page_tags(void *addr);
> > +void mte_zero_set_page_tags(void *addr);
>
>
> We should preserve the existing mte_zero_clear_page_tags(), and just
> implement it in terms of the new, more general mte_zero_set_page_tags().
> This is because: a) it will remove some diffs from this patch, and more
> importantly, b) the concept of zeroing is still a distinct and useful
> thing to have here.
With this patch there is only a single caller of
mte_zero_set_page_tags(), and that caller may pass an arbitrarily
tagged address. Which would mean that there would be no callers of the
mte_zero_clear_page_tags() function.
> > void mte_sync_tags(pte_t *ptep, pte_t pte);
> > void mte_copy_page_tags(void *kto, const void *kfrom);
> > void mte_thread_init_user(void);
> > @@ -48,13 +48,14 @@ long set_mte_ctrl(struct task_struct *task, unsigned long arg);
> > long get_mte_ctrl(struct task_struct *task);
> > int mte_ptrace_copy_tags(struct task_struct *child, long request,
> > unsigned long addr, unsigned long data);
> > +u8 mte_check_tag_zero_page(struct page *userpage);
> >
> > #else /* CONFIG_ARM64_MTE */
> >
> > /* unused if !CONFIG_ARM64_MTE, silence the compiler */
> > #define PG_mte_tagged 0
> >
> > -static inline void mte_zero_clear_page_tags(void *addr)
> > +static inline void mte_zero_set_page_tags(void *addr)
> > {
> > }
> > static inline void mte_sync_tags(pte_t *ptep, pte_t pte)
> > @@ -89,6 +90,10 @@ static inline int mte_ptrace_copy_tags(struct task_struct *child,
> > {
> > return -EIO;
> > }
> > +static inline u8 mte_check_tag_zero_page(struct page *userpage)
> > +{
> > + return 0;
> > +}
> >
> > #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index 993a27ea6f54..234f48688b1a 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -33,7 +33,7 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
> > unsigned long vaddr);
> > #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE_MOVABLE
> >
> > -void tag_clear_highpage(struct page *to);
> > +void tag_set_highpage(struct page *to, unsigned long tag);
>
>
> Same reasoning here: let's preserve tag_clear_highpage(), as well.
Makes sense, done.
> > #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
> >
> > #define clear_user_page(page, vaddr, pg) clear_page(page)
> > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> > index 727bfc3be99b..3cb206aea3db 100644
> > --- a/arch/arm64/include/asm/unistd.h
> > +++ b/arch/arm64/include/asm/unistd.h
> > @@ -38,7 +38,7 @@
> > #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
> > #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
> >
> > -#define __NR_compat_syscalls 447
> > +#define __NR_compat_syscalls 449
> > #endif
> >
> > #define __ARCH_WANT_SYS_CLONE
> > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> > index 99ffcafc736c..2a116aa17fe7 100644
> > --- a/arch/arm64/include/asm/unistd32.h
> > +++ b/arch/arm64/include/asm/unistd32.h
> > @@ -901,6 +901,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset)
> > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
> > #define __NR_landlock_restrict_self 446
> > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > +#define __NR_refpage_create 448
> > +__SYSCALL(__NR_refpage_create, sys_refpage_create)
> >
> > /*
> > * Please add new compat syscalls above this comment and update
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index 125a10e413e9..6a79240d5a77 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -453,3 +453,27 @@ int mte_ptrace_copy_tags(struct task_struct *child, long request,
> >
> > return ret;
> > }
> > +
> > +u8 mte_check_tag_zero_page(struct page *userpage)
> > +{
> > + char *userpage_addr = page_address(userpage);
> > + u8 tag;
> > + int i;
> > +
> > + if (!test_bit(PG_mte_tagged, &userpage->flags))
> > + return 0;
> > +
> > + tag = mte_get_mem_tag(userpage_addr) & 0xF;
> > + if (tag == 0)
> > + return 0;
> > +
> > + for (i = 0; i != PAGE_SIZE; ++i)
> > + if (userpage_addr[i] != 0)
> > + return 0;
> > +
> > + for (i = MTE_GRANULE_SIZE; i != PAGE_SIZE; i += MTE_GRANULE_SIZE)
> > + if ((mte_get_mem_tag(userpage_addr + i) & 0xF) != tag)
> > + return 0;
> > +
> > + return tag;
> > +}
> > diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
> > index e83643b3995f..45be436c97af 100644
> > --- a/arch/arm64/lib/mte.S
> > +++ b/arch/arm64/lib/mte.S
> > @@ -37,24 +37,23 @@ SYM_FUNC_START(mte_clear_page_tags)
> > SYM_FUNC_END(mte_clear_page_tags)
> >
> > /*
> > - * Zero the page and tags at the same time
> > + * Zero the page and set tags at the same time
> > *
> > * Parameters:
> > * x0 - address to the beginning of the page
> > */
> > -SYM_FUNC_START(mte_zero_clear_page_tags)
> > +SYM_FUNC_START(mte_zero_set_page_tags)
> > mrs x1, dczid_el0
> > and w1, w1, #0xf
> > mov x2, #4
> > lsl x1, x2, x1
> > - and x0, x0, #(1 << MTE_TAG_SHIFT) - 1 // clear the tag
> >
> > 1: dc gzva, x0
> > add x0, x0, x1
> > tst x0, #(PAGE_SIZE - 1)
> > b.ne 1b
> > ret
> > -SYM_FUNC_END(mte_zero_clear_page_tags)
> > +SYM_FUNC_END(mte_zero_set_page_tags)
> >
> > /*
> > * Copy the tags from the source page to the destination one
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 349c488765ca..36355758ffc7 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -25,6 +25,7 @@
> > #include <linux/perf_event.h>
> > #include <linux/preempt.h>
> > #include <linux/hugetlb.h>
> > +#include <linux/mman.h>
> >
> > #include <asm/acpi.h>
> > #include <asm/bug.h>
> > @@ -939,9 +940,45 @@ struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
> > return alloc_page_vma(flags, vma, vaddr);
> > }
> >
> > -void tag_clear_highpage(struct page *page)
> > +void tag_set_highpage(struct page *page, unsigned long tag)
> > {
> > - mte_zero_clear_page_tags(page_address(page));
> > + unsigned long addr = (unsigned long)page_address(page);
> > +
> > + addr &= ~MTE_TAG_MASK;
> > + addr |= tag << MTE_TAG_SHIFT;
> > + mte_zero_set_page_tags((void *)addr);
> > page_kasan_tag_reset(page);
> > set_bit(PG_mte_tagged, &page->flags);
> > }
> > +
> > +#define REFPAGE_OPTZN_MTE_TAGGED REFPAGE_OPTZN_ARCH
>
> I see what you're doing with the arch layer here, but there's no need to
> accept the minor drawbacks (of having this #define hidden away near the
> bottom of a .c file). Instead, let's just put this into the list in
> mm.h, and call it what it is, rather than "arch".
Done.
> > +
> > +void arch_prep_refpage_private_data(struct refpage_private_data *priv)
> > +{
> > + if (system_supports_mte()) {
> > + u8 tag;
> > +
> > + if (!test_and_set_bit(PG_mte_tagged, &priv->refpage->flags))
> > + mte_clear_page_tags(page_address(priv->refpage));
> > +
> > + tag = mte_check_tag_zero_page(priv->refpage);
> > + if (tag) {
> > + priv->optzn_kind = REFPAGE_OPTZN_MTE_TAGGED;
> > + priv->optzn_info = tag;
> > + return;
> > + }
> > + }
> > +
> > + prep_refpage_private_data(priv);
> > +}
> > +
> > +void arch_copy_refpage(struct page *page, unsigned long addr,
> > + struct vm_area_struct *vma)
> > +{
> > + struct refpage_private_data *priv = vma->vm_private_data;
> > +
> > + if (priv->optzn_kind == REFPAGE_OPTZN_MTE_TAGGED)
> > + tag_set_highpage(page, priv->optzn_info);
> > + else
> > + copy_refpage(page, addr, vma);
> > +}
> > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> > index 6d07742c57b8..c2209d83f3c3 100644
> > --- a/arch/ia64/kernel/syscalls/syscall.tbl
> > +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> > @@ -367,3 +367,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> > index 541bc1b3a8f9..0360cf474a49 100644
> > --- a/arch/m68k/kernel/syscalls/syscall.tbl
> > +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> > @@ -446,3 +446,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> > index a176faca2927..de85d758e564 100644
> > --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> > @@ -452,3 +452,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > index c2d2e19abea8..b07c7293d2a3 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> > @@ -385,3 +385,4 @@
> > 444 n32 landlock_create_ruleset sys_landlock_create_ruleset
> > 445 n32 landlock_add_rule sys_landlock_add_rule
> > 446 n32 landlock_restrict_self sys_landlock_restrict_self
> > +448 n32 refpage_create sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > index ac653d08b1ea..7ebabb99dd06 100644
> > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> > @@ -361,3 +361,4 @@
> > 444 n64 landlock_create_ruleset sys_landlock_create_ruleset
> > 445 n64 landlock_add_rule sys_landlock_add_rule
> > 446 n64 landlock_restrict_self sys_landlock_restrict_self
> > +448 n64 refpage_create sys_refpage_create
> > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > index 253f2cd70b6b..a51149ac101c 100644
> > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> > @@ -434,3 +434,4 @@
> > 444 o32 landlock_create_ruleset sys_landlock_create_ruleset
> > 445 o32 landlock_add_rule sys_landlock_add_rule
> > 446 o32 landlock_restrict_self sys_landlock_restrict_self
> > +448 o32 refpage_create sys_refpage_create
> > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> > index e26187b9ab87..385565864861 100644
> > --- a/arch/parisc/kernel/syscalls/syscall.tbl
> > +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> > @@ -444,3 +444,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> > index aef2a290e71a..95cdd9f7dc06 100644
> > --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> > @@ -526,3 +526,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> > index 64d51ab5a8b4..92ed1260ffd9 100644
> > --- a/arch/s390/kernel/syscalls/syscall.tbl
> > +++ b/arch/s390/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create sys_refpage_create
> > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> > index e0a70be77d84..f9d198cc2541 100644
> > --- a/arch/sh/kernel/syscalls/syscall.tbl
> > +++ b/arch/sh/kernel/syscalls/syscall.tbl
> > @@ -449,3 +449,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> > index 603f5a821502..83533aa49340 100644
> > --- a/arch/sparc/kernel/syscalls/syscall.tbl
> > +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> > @@ -492,3 +492,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index ce763a12311c..054c69e395b5 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -452,3 +452,4 @@
> > 445 i386 landlock_add_rule sys_landlock_add_rule
> > 446 i386 landlock_restrict_self sys_landlock_restrict_self
> > 447 i386 memfd_secret sys_memfd_secret
> > +448 i386 refpage_create sys_refpage_create
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index f6b57799c1ea..1f24f0b66cbd 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -369,6 +369,7 @@
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > 447 common memfd_secret sys_memfd_secret
> > +448 common refpage_create sys_refpage_create
> >
> > #
> > # Due to a historical design error, certain syscalls are numbered differently
> > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> > index 235d67d6ceb4..96c27fb404ca 100644
> > --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> > @@ -417,3 +417,4 @@
> > 444 common landlock_create_ruleset sys_landlock_create_ruleset
> > 445 common landlock_add_rule sys_landlock_add_rule
> > 446 common landlock_restrict_self sys_landlock_restrict_self
> > +448 common refpage_create sys_refpage_create
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 55b2ec1f965a..ae3c763eb9e9 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -55,8 +55,9 @@ struct vm_area_struct;
> > #define ___GFP_ACCOUNT 0x400000u
> > #define ___GFP_ZEROTAGS 0x800000u
> > #define ___GFP_SKIP_KASAN_POISON 0x1000000u
> > +#define ___GFP_NOZERO 0x2000000u
> > #ifdef CONFIG_LOCKDEP
> > -#define ___GFP_NOLOCKDEP 0x2000000u
> > +#define ___GFP_NOLOCKDEP 0x4000000u
> > #else
> > #define ___GFP_NOLOCKDEP 0
> > #endif
> > @@ -238,18 +239,24 @@ struct vm_area_struct;
> > * %__GFP_SKIP_KASAN_POISON returns a page which does not need to be poisoned
> > * on deallocation. Typically used for userspace pages. Currently only has an
> > * effect in HW tags mode.
> > + *
> > + * %__GFP_NOZERO disables any implicit zeroing of the page (e.g. as a result
> > + * of init_on_alloc=on). This flag should only be used to address specific
> > + * performance bottlenecks and only if the page is clearly being fully
> > + * initialized following the allocation.
> > */
> > #define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN)
> > #define __GFP_COMP ((__force gfp_t)___GFP_COMP)
> > #define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)
> > #define __GFP_ZEROTAGS ((__force gfp_t)___GFP_ZEROTAGS)
> > #define __GFP_SKIP_KASAN_POISON ((__force gfp_t)___GFP_SKIP_KASAN_POISON)
> > +#define __GFP_NOZERO ((__force gfp_t)___GFP_NOZERO)
> >
> > /* Disable lockdep for GFP context tracking */
> > #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
> >
> > /* Room for N __GFP_FOO bits */
> > -#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
> > +#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
> > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
> >
> > /**
> > diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> > index 93ba33c09d12..6c5076dd1e9b 100644
> > --- a/include/linux/highmem.h
> > +++ b/include/linux/highmem.h
> > @@ -187,7 +187,7 @@ static inline void clear_highpage(struct page *page)
> >
> > #ifndef __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
> >
> > -static inline void tag_clear_highpage(struct page *page)
> > +static inline void tag_set_highpage(struct page *page, unsigned long tag)
> > {
> > }
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index f123e15d966e..36ecfc391b46 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -127,6 +127,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
> >
> > if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> > return false;
> > +
> > + /*
> > + * Transparent hugepages not currently supported for anonymous VMAs with
> > + * reference pages
> > + */
> > + if (unlikely(is_refpage_vma(vma)))
> > + return false;
> > return true;
> > }
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index a127d93612fa..8cff9e0463b5 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -32,6 +32,7 @@
> > #include <linux/sched.h>
> > #include <linux/pgtable.h>
> > #include <linux/kasan.h>
> > +#include <linux/fs.h>
> >
> > struct mempolicy;
> > struct anon_vma;
> > @@ -722,6 +723,42 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
> > /* flush_tlb_range() takes a vma, not a mm, and can care about flags */
> > #define TLB_FLUSH_VMA(mm,flags) { .vm_mm = (mm), .vm_flags = (flags) }
> >
> > +extern const struct file_operations refpage_file_operations;
> > +
> > +struct refpage_private_data {
> > + struct page *refpage;
> > + u8 optzn_kind;
>
> How about:
> u8 content_type;
>
> > + u8 optzn_info;
>
> and:
> u8 pattern[16]; // or whatever size the enums go up to, see below
>
> > +};
> > +
> > +#define REFPAGE_OPTZN_NONE 0
>
> For this next set, how about How about REFPAGE_CONTENT_TYPE_ for a prefix?
> The spelling of OPTZN is tough, and there's no particular need internally
> to call these out as optimizations.
>
> So then this one becomes:
>
> #define REFPAGE_CONTENT_TYPE_USER_SET 0
>
> > +#define REFPAGE_OPTZN_PATTERN 1
> > +#define REFPAGE_OPTZN_ARCH 2
>
> And for the last one, let's avoid the arch hiding and just call it what it
> is, no reason not to:
>
> #define REFPAGE_CONTENT_TYPE_MTE_TAGGED 2
Done. But I think that MTE_TAGGED's usage of the field formerly known
as "optzn_info" is sufficiently different from PATTERN that "pattern"
is probably not a great name. So let's give that field a more opaque
name -- I chose "content_info".
> > +
> > +static inline bool is_refpage_vma(struct vm_area_struct *vma)
> > +{
> > + return vma->vm_file && vma->vm_file->f_op == &refpage_file_operations;
> > +}
> > +
> > +static inline struct page *get_vma_refpage(struct vm_area_struct *vma)
> > +{
> > + struct refpage_private_data *priv = vma->vm_private_data;
> > +
> > + BUG_ON(!is_refpage_vma(vma));
> > + return priv->refpage;
> > +}
> > +
> > +static inline int is_refpage_pfn(struct vm_area_struct *vma, unsigned long pfn)
> > +{
> > + return is_refpage_vma(vma) && pfn == page_to_pfn(get_vma_refpage(vma));
> > +}
> > +
> > +static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
> > + unsigned long pfn)
> > +{
> > + return is_zero_pfn(pfn) || is_refpage_pfn(vma, pfn);
> > +}
> > +
>
>
> I don't think this helper function is helping enough to justify itself,
> seeing as how it is quite clear when the implementation is used instead. No
> big deal either way, though.
Fair. That ends up making the code a bit larger, but perhaps clarity
at the call site is more important. I removed it.
> > struct mmu_gather;
> > struct inode;
> >
> > @@ -2977,6 +3014,8 @@ static inline void kernel_unpoison_pages(struct page *page, int numpages) { }
> > DECLARE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
> > static inline bool want_init_on_alloc(gfp_t flags)
> > {
> > + if (flags & __GFP_NOZERO)
> > + return false;
> > if (static_branch_maybe(CONFIG_INIT_ON_ALLOC_DEFAULT_ON,
> > &init_on_alloc))
> > return true;
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index ebb09a964272..cdf8f8245c78 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -2,6 +2,7 @@
> > #ifndef _LINUX_MMAN_H
> > #define _LINUX_MMAN_H
> >
> > +#include <linux/fs.h>
> > #include <linux/mm.h>
> > #include <linux/percpu_counter.h>
> >
> > @@ -123,6 +124,24 @@ static inline bool arch_validate_flags(unsigned long flags)
> > #define arch_validate_flags arch_validate_flags
> > #endif
> >
> > +void prep_refpage_private_data(struct refpage_private_data *priv);
> > +#ifndef arch_prep_refpage_private_data
> > +#define arch_prep_refpage_private_data prep_refpage_private_data
> > +#endif
> > +
> > +#ifndef arch_prep_refpage_vma
> > +static inline void arch_prep_refpage_vma(struct vm_area_struct *vma)
> > +{
> > +}
> > +#define arch_prep_refpage_vma arch_prep_refpage_vma
> > +#endif
> > +
> > +void copy_refpage(struct page *page, unsigned long addr,
> > + struct vm_area_struct *vma);
> > +#ifndef arch_copy_refpage
> > +#define arch_copy_refpage copy_refpage
> > +#endif
> > +
> > /*
> > * Optimisation macro. It is equivalent to:
> > * (x & bit1) ? bit2 : 0
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 69c9a7010081..303a28a86500 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -864,6 +864,9 @@ asmlinkage long sys_mremap(unsigned long addr,
> > unsigned long old_len, unsigned long new_len,
> > unsigned long flags, unsigned long new_addr);
> >
> > +/* mm/refpage.c */
> > +asmlinkage long sys_refpage_create(const void __user *content, unsigned long flags);
> > +
> > /* security/keys/keyctl.c */
> > asmlinkage long sys_add_key(const char __user *_type,
> > const char __user *_description,
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index a9d6fcd95f42..54cede7db5f0 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -878,8 +878,11 @@ __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
> > __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
> > #endif
> >
> > +#define __NR_refpage_create 448
> > +__SYSCALL(__NR_refpage_create, sys_refpage_create)
> > +
> > #undef __NR_syscalls
> > -#define __NR_syscalls 448
> > +#define __NR_syscalls 449
> >
> > /*
> > * 32 bit systems traditionally used different
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 30971b1dd4a9..bc65a54eb2a4 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -300,6 +300,7 @@ COND_SYSCALL(migrate_pages);
> > COND_SYSCALL_COMPAT(migrate_pages);
> > COND_SYSCALL(move_pages);
> > COND_SYSCALL_COMPAT(move_pages);
> > +COND_SYSCALL(refpage_create);
> >
> > COND_SYSCALL(perf_event_open);
> > COND_SYSCALL(accept4);
> > diff --git a/mm/Makefile b/mm/Makefile
> > index e3436741d539..137adc22bf50 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -35,10 +35,10 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
> > CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
> >
> > mmu-y := nommu.o
> > -mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \
> > +mmu-$(CONFIG_MMU) := highmem.o ioremap.o memory.o mincore.o \
> > mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
> > msync.o page_vma_mapped.o pagewalk.o \
> > - pgtable-generic.o rmap.o vmalloc.o ioremap.o
> > + pgtable-generic.o refpage.o rmap.o vmalloc.o
> >
> >
> > ifdef CONFIG_CROSS_MEMORY_ATTACH
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 42b8b1fa6521..ba1b7bd7a0a0 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -548,7 +548,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> > goto out;
> > }
> >
> > - if (is_zero_pfn(pte_pfn(pte))) {
> > + if (is_zero_or_refpage_pfn(vma, pte_pfn(pte))) {
> > page = pte_page(pte);
> > } else {
> > ret = follow_pfn_pte(vma, address, ptep, flags);
> > diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
> > index ed5e5b833d61..3c433e430c80 100644
> > --- a/mm/kasan/hw_tags.c
> > +++ b/mm/kasan/hw_tags.c
> > @@ -253,7 +253,7 @@ void kasan_alloc_pages(struct page *page, unsigned int order, gfp_t flags)
> > int i;
> >
> > for (i = 0; i != 1 << order; ++i)
> > - tag_clear_highpage(page + i);
> > + tag_set_highpage(page + i, 0);
>
>
> Here, we could avoid this diff, by preserving tag_clear_highpage(). And
> that's good, because the current diff is making the code just ever so
> slightly worse. :)
Done.
> > } else {
> > kasan_unpoison_pages(page, order, init);
> > }
> > diff --git a/mm/memory.c b/mm/memory.c
> > index db86558791f1..8b32bdd215b7 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -614,7 +614,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > return vma->vm_ops->find_special_page(vma, addr);
> > if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > return NULL;
> > - if (is_zero_pfn(pfn))
> > + if (is_zero_or_refpage_pfn(vma, pfn))
> > return NULL;
> > if (pte_devmap(pte))
> > return NULL;
> > @@ -640,7 +640,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > }
> > }
> >
> > - if (is_zero_pfn(pfn))
> > + if (is_zero_or_refpage_pfn(vma, pfn))
> > return NULL;
> >
> > check_pfn:
> > @@ -2166,7 +2166,7 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
> > return true;
> > if (pfn_t_special(pfn))
> > return true;
> > - if (is_zero_pfn(pfn_t_to_pfn(pfn)))
> > + if (is_zero_or_refpage_pfn(vma, pfn_t_to_pfn(pfn)))
> > return true;
> > return false;
> > }
> > @@ -2990,22 +2990,29 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> > pte_t entry;
> > int page_copied = 0;
> > struct mmu_notifier_range range;
> > + unsigned long pfn;
> >
> > if (unlikely(anon_vma_prepare(vma)))
> > goto oom;
> >
> > - if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> > + pfn = pte_pfn(vmf->orig_pte);
> > + if (is_zero_pfn(pfn)) {
> > new_page = alloc_zeroed_user_highpage_movable(vma,
> > vmf->address);
> > if (!new_page)
> > goto oom;
> > } else {
> > - new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> > - vmf->address);
> > + bool refpage = is_refpage_pfn(vma, pfn);
> > +
> > + new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE |
> > + (refpage ? __GFP_NOZERO : 0),
> > + vma, vmf->address);
> > if (!new_page)
> > goto oom;
> >
> > - if (!cow_user_page(new_page, old_page, vmf)) {
> > + if (refpage) {
> > + arch_copy_refpage(new_page, vmf->address, vma);
> > + } else if (!cow_user_page(new_page, old_page, vmf)) {
> > /*
> > * COW failed, if the fault was solved by other,
> > * it's fine. If not, userspace would re-fault on
> > @@ -3739,11 +3746,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > if (unlikely(pmd_trans_unstable(vmf->pmd)))
> > return 0;
> >
> > - /* Use the zero-page for reads */
> > + /* Use the zero-page, or reference page if set, for reads */
> > if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> > !mm_forbids_zeropage(vma->vm_mm)) {
> > - entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> > - vma->vm_page_prot));
> > + unsigned long pfn;
> > +
> > + if (unlikely(is_refpage_vma(vma)))
> > + pfn = page_to_pfn(get_vma_refpage(vma));
> > + else
> > + pfn = my_zero_pfn(vmf->address);
> > + entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > vmf->address, &vmf->ptl);
> > if (!pte_none(*vmf->pte)) {
> > @@ -3764,9 +3776,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> > /* Allocate our own private page. */
> > if (unlikely(anon_vma_prepare(vma)))
> > goto oom;
> > - page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
> > - if (!page)
> > - goto oom;
> > +
> > + if (unlikely(is_refpage_vma(vma))) {
> > + page = alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_NOZERO, vma,
> > + vmf->address);
> > + if (!page)
> > + goto oom;
> > + arch_copy_refpage(page, vmf->address, vma);
> > + } else {
> > + page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
> > + if (!page)
> > + goto oom;
> > + }
> >
> > if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
> > goto oom_free_page;
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 23cbd9de030b..9a897676ff95 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2774,8 +2774,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> > pmd_t *pmdp;
> > pte_t *ptep;
> >
> > - /* Only allow populating anonymous memory */
> > - if (!vma_is_anonymous(vma))
> > + /* Only allow populating anonymous memory without a reference page */
> > + if (!vma_is_anonymous(vma) || is_refpage_vma(vma))
> > goto abort;
> >
> > pgdp = pgd_offset(mm, addr);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8836e54721ae..6ca831c1821f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1283,7 +1283,7 @@ static void kernel_init_free_pages(struct page *page, int numpages, bool zero_ta
> >
> > if (zero_tags) {
> > for (i = 0; i < numpages; i++)
> > - tag_clear_highpage(page + i);
> > + tag_set_highpage(page + i, 0);
> > return;
> > }
> >
> > diff --git a/mm/refpage.c b/mm/refpage.c
> > new file mode 100644
> > index 000000000000..ee95e281d2d4
> > --- /dev/null
> > +++ b/mm/refpage.c
> > @@ -0,0 +1,98 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +
> > +#include <linux/anon_inodes.h>
> > +#include <linux/fs_context.h>
> > +#include <linux/highmem.h>
> > +#include <linux/mman.h>
> > +#include <linux/mount.h>
> > +#include <linux/syscalls.h>
> > +
> > +void prep_refpage_private_data(struct refpage_private_data *priv)
> > +{
> > + u8 *addr = page_address(priv->refpage);
> > + u8 pattern = addr[0];
> > + int i;
> > +
> > + for (i = 1; i != PAGE_SIZE; ++i)
> > + if (addr[i] != pattern)
> > + return;
> > +
> > + priv->optzn_kind = REFPAGE_OPTZN_PATTERN;
> > + priv->optzn_info = pattern;
> > +}
> > +
>
> I am hoping that this doesn't remain in its current form, because of
> the API discussions. Probably we'll end up with setting a pattern instead
> of deducing it.
That's right -- now the code will set up the pattern content type only
if the size is 1, so we don't need to explicitly check every byte.
> > +void copy_refpage(struct page *page, unsigned long addr,
> > + struct vm_area_struct *vma)
> > +{
> > + struct refpage_private_data *priv = vma->vm_private_data;
> > +
> > + if (priv->optzn_kind == REFPAGE_OPTZN_PATTERN)
> > + memset(page_address(page), priv->optzn_info, PAGE_SIZE);
> > + else
> > + copy_user_highpage(page, priv->refpage, addr, vma);
> > +}
> > +
> > +static void put_refpage_private_data(struct refpage_private_data *priv)
>
> Can you please rename this to free_refpage_private_data()? It's a little more
> accurate.
Yes, I think that free would be a better name. (I never understood the
distinction between free and put in the kernel. Although now that I
think about it, maybe it's to do with whether it's a refcounted object
or not? In that case, free seems like the right term.)
But with the error handling refactoring that you requested below,
there ends up being only a single caller of this function, so I
decided to move the body into the caller, making the naming here moot.
> > +{
> > + put_page(priv->refpage);
> > + kfree(priv);
> > +}
> > +
> > +static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > + vma_set_anonymous(vma);
> > + vma->vm_private_data = vma->vm_file->private_data;
> > + arch_prep_refpage_vma(vma);
> > + return 0;
> > +}
> > +
> > +static int refpage_release(struct inode *inode, struct file *file)
> > +{
> > + put_refpage_private_data(file->private_data);
> > + return 0;
> > +}
> > +
> > +const struct file_operations refpage_file_operations = {
> > + .mmap = refpage_mmap,
> > + .release = refpage_release,
> > +};
> > +
> > +SYSCALL_DEFINE2(refpage_create, const void *__user, content, unsigned long,
> > + flags)
>
> From the API discussion (and using a simpler syntax to illustrate this), it
> seems like the following would be close:
>
> enum content_type {
> BYTE_PATTERN,
> FOUR_BYTE_PATTERN,
> ...
> FULL_4KB_PAGE
> };
>
> int refpage_create(const void *__user content, enum content_type, unsigned long flags);
>
> ...and if content_type == BYTE_PATTERN, then content is a pointer to just one byte of
> data, and so forth for the other enum values.
As we discussed later on, let's use Matthew's proposed API instead of
making the content type explicit.
> > +{
> > + unsigned long content_addr = (unsigned long)content;
> > + struct page *userpage;
> > + struct refpage_private_data *private_data;
> > + int fd;
> > +
> > + if (flags != 0)
> > + return -EINVAL;
> > +
> > + if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
> > + get_user_pages(content_addr, 1, 0, &userpage, 0) != 1)
> > + return -EFAULT;
> > +
> > + private_data = kzalloc(sizeof(struct refpage_private_data), GFP_KERNEL);
> > + if (!private_data) {
> > + put_page(userpage);
> > + return -ENOMEM;
> > + }
> > +
> > + private_data->refpage = alloc_page(GFP_KERNEL);
> > + if (!private_data->refpage) {
> > + kfree(private_data);
> > + put_page(userpage);
> > + return -ENOMEM;
> > + }
> > +
> > + copy_highpage(private_data->refpage, userpage);
> > + arch_prep_refpage_private_data(private_data);
> > + put_page(userpage);
> > +
> > + fd = anon_inode_getfd("[refpage]", &refpage_file_operations,
> > + private_data, O_RDONLY | O_CLOEXEC)
> > + if (fd < 0)
> > + put_refpage_private_data(private_data);
>
> And here, a couple of things:
>
> 1) I think there's a bug in the fd < 0 case, because you're only freeing
> one of the two pages (there's an alloc_page() call, and a gup call above).
(FWIW, there was no bug here. The page allocated by alloc_page() is
freed by put_refpage_private_data(), and the userpage is freed by the
put_page(userpage).)
> 2) It's jarring to have part the error handling in three different ways:
> returning -EFAULT directly, coding each error case to undo the growing
> set of operations, and finally, jumping out to another routine here for
> fd < 0.
>
> Even for a small routine, that's too error-prone. Instead, one of the
> following will be cleaner and safer too:
>
> a) use goto and labels to unwind, or
>
> b) use a no-fail cleanup routine to unwind
>
> and either way, do it for all cases (or at least all of them after the first
> trivial -EFAULT return.
Done.
Peter
More information about the linux-arm-kernel
mailing list