[PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs
Pasha Tatashin
pasha.tatashin at soleen.com
Mon Jun 8 19:23:28 PDT 2026
On 06-08 14:22, Andrew Morton wrote:
> On Mon, 8 Jun 2026 19:57:58 +0400 Andrey Smirnov <andrey.smirnov at siderolabs.com> wrote:
>
> > The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> > mapping and its pages are installed into userspace with vmf_insert_pfn(),
> > which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> > pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> > exclude special PTEs, so page_table_check accounts these PFN mappings in
> > the per-page anon/file map counters even though they are not rmap-managed
> > pages (vm_normal_page() returns NULL for them).
> >
> > Most of these data pages live in the kernel image and are never freed, so
> > the stray accounting is invisible. The time-namespace VVAR page is the
> > exception: it is a real alloc_page() page that is released with
> > __free_page() in free_time_ns() when the last task of a time namespace
> > exits. Across the map / unmap / vdso_join_timens() zap transitions the
> > special-PTE accounting is not balanced for this page, so a non-zero
> > file_map_count survives to the free path and trips:
> >
> > kernel BUG at mm/page_table_check.c:143!
> > __page_table_check_zero+0xfb/0x130
> > __free_frozen_pages+0x52f/0x650
> > free_time_ns+0x85/0xc0
> > free_nsproxy+0x7f/0x130
> > do_exit+0x313/0xa60
> > do_group_exit+0x77/0x90
> >
> > This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> > churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> > runc / docker-init / tini), and was independently reported by syzbot on
> > riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
> >
> > Special PTEs have no struct-page rmap semantics and should never have been
> > tracked by page table check. Skip them in both the set and clear paths so
> > the counters stay balanced (always zero) for PFN-mapped pages, regardless
> > of how the architecture defines pte_user_accessible_page(). pte_special()
> > is available generically (it is a no-op returning false on architectures
> > without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
> >
> > Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> > ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> > the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> > with balanced struct-page accounting. This patch fixes the still-affected
> > VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> > page_table_check robust against any future PFN-mapped user pages.
Thank you for the detailed explanation of the bug; it makes sense to me.
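
For illustration only, here is a minimal userspace sketch of the accounting
described above. It is not kernel code: every toy_* name is a hypothetical
stand-in, and the teardown that bypasses the clear hook is an assumption made
purely to produce an imbalance, since the exact unbalanced transition is not
spelled out in this thread. The skip_special flag models the proposed
"&& !pte_special(pte)" condition.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins only; none of these are the kernel's real types. */
struct toy_pte {
	bool present;
	bool user;
	bool special;	/* as set by a vmf_insert_pfn()-style insert */
};

struct toy_page {
	int file_map_count;	/* models the per-page map counter */
};

/* Models an arch check that looks only at the PRESENT/USER bits. */
static bool toy_user_accessible(struct toy_pte pte)
{
	return pte.present && pte.user;
}

/* With skip_special set, special PTEs are never accounted. */
static bool toy_tracked(struct toy_pte pte, bool skip_special)
{
	return toy_user_accessible(pte) && !(skip_special && pte.special);
}

static void toy_pte_set(struct toy_page *page, struct toy_pte pte,
			bool skip_special)
{
	if (toy_tracked(pte, skip_special))
		page->file_map_count++;
}

static void toy_pte_clear(struct toy_page *page, struct toy_pte pte,
			  bool skip_special)
{
	if (toy_tracked(pte, skip_special)) {
		assert(page->file_map_count > 0);
		page->file_map_count--;
	}
}

/* Returns the count that a zero-check at free time would see. */
int toy_lifecycle(bool skip_special)
{
	struct toy_page page = { 0 };
	struct toy_pte vvar = { .present = true, .user = true, .special = true };

	/* Map and unmap through both hooks: balanced either way. */
	toy_pte_set(&page, vvar, skip_special);
	toy_pte_clear(&page, vvar, skip_special);

	/*
	 * Map again. ASSUMPTION: this mapping is torn down without the
	 * clear hook running, so no matching decrement happens.
	 */
	toy_pte_set(&page, vvar, skip_special);

	return page.file_map_count;
}
```

In this model toy_lifecycle(false) leaves a non-zero count at free time, while
toy_lifecycle(true) leaves zero, because a page that is never accounted cannot
become unbalanced whichever path tears the mapping down.
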
> Thanks.
>
> The patch isn't applicable to current -linus mainline. I reworked it
> as below, then deleted it. It would be better if this rework came from
> yourself (tested), please. And a patch which applies will get checked
> by Sashiko AI review.
+1.
Pasha
> --- a/mm/page_table_check.c~mm-page_table_check-do-not-track-special-pfn-mapped-ptes
> +++ a/mm/page_table_check.c
> @@ -151,7 +151,15 @@ void __page_table_check_pte_clear(struct
>  	if (&init_mm == mm)
>  		return;
>
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	/*
> +	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
> +	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
> +	 * must not be tracked here. Tracking them can leave a non-zero map
> +	 * count on a struct page that is later freed (the time namespace VVAR
> +	 * page in free_time_ns()), tripping the BUG_ON() in
> +	 * __page_table_check_zero().
> +	 */
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(__page_table_check_pte_clear);
> @@ -208,7 +216,7 @@ void __page_table_check_ptes_set(struct
>
>  	for (i = 0; i < nr; i++)
>  		__page_table_check_pte_clear(mm, addr + PAGE_SIZE * i, ptep_get(ptep + i));
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
>  }
>  EXPORT_SYMBOL(__page_table_check_ptes_set);
> _
>