[PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs
Andrew Morton
akpm at linux-foundation.org
Mon Jun 8 14:22:58 PDT 2026
On Mon, 8 Jun 2026 19:57:58 +0400 Andrey Smirnov <andrey.smirnov at siderolabs.com> wrote:
> The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> mapping and its pages are installed into userspace with vmf_insert_pfn(),
> which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> exclude special PTEs, so page_table_check accounts these PFN mappings in
> the per-page anon/file map counters even though they are not rmap-managed
> pages (vm_normal_page() returns NULL for them).
>
> Most of these data pages live in the kernel image and are never freed, so
> the stray accounting is invisible. The time-namespace VVAR page is the
> exception: it is a real alloc_page() page that is released with
> __free_page() in free_time_ns() when the last task of a time namespace
> exits. Across the map / unmap / vdso_join_timens() zap transitions the
> special-PTE accounting is not balanced for this page, so a non-zero
> file_map_count survives to the free path and trips:
>
> kernel BUG at mm/page_table_check.c:143!
> __page_table_check_zero+0xfb/0x130
> __free_frozen_pages+0x52f/0x650
> free_time_ns+0x85/0xc0
> free_nsproxy+0x7f/0x130
> do_exit+0x313/0xa60
> do_group_exit+0x77/0x90
>
> This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> runc / docker-init / tini), and was independently reported by syzbot on
> riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
>
> Special PTEs have no struct-page rmap semantics and should never have been
> tracked by page_table_check. Skip them in both the set and clear paths so
> the counters stay balanced (always zero) for PFN-mapped pages, regardless
> of how the architecture defines pte_user_accessible_page(). pte_special()
> is available generically (it is a no-op returning false on architectures
> without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
>
> Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> with balanced struct-page accounting. This patch fixes the still-affected
> VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> page_table_check robust against any future PFN-mapped user pages.

Thanks.

The patch doesn't apply to current -linus mainline.  I reworked it as
below, then deleted it.  It would be better if this rework came from
you (tested), please.  And a patch which applies will get checked by
Sashiko AI review.

--- a/mm/page_table_check.c~mm-page_table_check-do-not-track-special-pfn-mapped-ptes
+++ a/mm/page_table_check.c
@@ -151,7 +151,15 @@ void __page_table_check_pte_clear(struct
 	if (&init_mm == mm)
 		return;
 
-	if (pte_user_accessible_page(mm, addr, pte))
+	/*
+	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
+	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
+	 * must not be tracked here.  Tracking them can leave a non-zero map
+	 * count on a struct page that is later freed (the time namespace VVAR
+	 * page in free_time_ns()), tripping the BUG_ON() in
+	 * __page_table_check_zero().
+	 */
+	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
 		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(__page_table_check_pte_clear);
@@ -208,7 +216,7 @@ void __page_table_check_ptes_set(struct
 
 	for (i = 0; i < nr; i++)
 		__page_table_check_pte_clear(mm, addr + PAGE_SIZE * i, ptep_get(ptep + i));
-	if (pte_user_accessible_page(mm, addr, pte))
+	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
 		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
 }
 EXPORT_SYMBOL(__page_table_check_ptes_set);
_
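
For readers who want to see why skipping special PTEs in both paths keeps
the counter at zero, here is a minimal userspace sketch.  It is a toy model,
not kernel code: every toy_* name is hypothetical, and the "clear not
observed" case is an assumed imbalance chosen only to illustrate a set
without a matching clear.  It does not claim to reproduce the actual
map / unmap / vdso_join_timens() sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; not the kernel's types. */
struct toy_pte {
	bool present;
	bool user;
	bool special;	/* models pte_special(): PFN-mapped, no struct-page rmap */
};

struct toy_page {
	int file_map_count;	/* models the counter checked when the page is freed */
};

/* Models an arch helper that tests only the PRESENT/USER bits. */
static bool toy_user_accessible(struct toy_pte pte)
{
	return pte.present && pte.user;
}

/* With skip_special set, special PTEs are ignored, as in the patch. */
static bool toy_tracked(struct toy_pte pte, bool skip_special)
{
	return toy_user_accessible(pte) && !(skip_special && pte.special);
}

static void toy_pte_set(struct toy_page *page, struct toy_pte pte,
			bool skip_special)
{
	if (toy_tracked(pte, skip_special))
		page->file_map_count++;
}

static void toy_pte_clear(struct toy_page *page, struct toy_pte pte,
			  bool skip_special)
{
	if (toy_tracked(pte, skip_special)) {
		/* A tracked clear must match an earlier tracked set. */
		assert(page->file_map_count > 0);
		page->file_map_count--;
	}
}

/*
 * Map one special PTE and optionally let the checker observe the unmap.
 * clear_observed == false is an ASSUMED imbalance, used only to model
 * "set without a matching clear"; it is not the real kernel sequence.
 * Returns the counter value the free path would see.
 */
int toy_count_at_free(bool skip_special, bool clear_observed)
{
	struct toy_page page = { .file_map_count = 0 };
	struct toy_pte vvar = { .present = true, .user = true, .special = true };

	toy_pte_set(&page, vvar, skip_special);
	if (clear_observed)
		toy_pte_clear(&page, vvar, skip_special);

	return page.file_map_count;
}
```

In this model, toy_count_at_free(false, false) leaves a count of 1 at free
time, which corresponds to the condition the BUG_ON() catches.  With
skip_special set, the special PTE is never counted, so the result is 0
whether or not the clear is observed.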