[PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts

Suren Baghdasaryan surenb at google.com
Wed Jun 4 07:39:41 PDT 2025


On Wed, Jun 4, 2025 at 7:21 AM Lorenzo Stoakes
<lorenzo.stoakes at oracle.com> wrote:
>
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
>
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use it for.
>
> The second mode is an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory, even for memory
> which is not described by a VMA.
>
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
>
> As a result, this mode is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use this function for kernel page table walking.
>
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
>
> Doing this also provides self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
>
> We therefore establish two separate functions - walk_page_range_debug() for
> this single usage, and walk_kernel_page_table_range() for general kernel
> page table walking.
>
> We additionally make walk_page_range_debug() internal to mm.
>
> Note that ptdump uses the very same function for kernel walking as a
> convenience, so we permit this but make it very explicit by having
> walk_page_range_debug() invoke walk_kernel_page_table_range() in this case.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes at oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt at kernel.org>

Reviewed-by: Suren Baghdasaryan <surenb at google.com>
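
A minimal before/after sketch of the caller-side conversion as I read it
(not part of the patch; start/end/pageattr_ops/masks stand in for a
caller's own range, walk ops and private data, modelled on the loongarch
hunk below):

	/* Before: callers always passed &init_mm explicitly. */
	mmap_write_lock(&init_mm);
	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops,
				    NULL, &masks);
	mmap_write_unlock(&init_mm);

	/*
	 * After: the kernel-only helper drops the mm argument and asserts
	 * that the init_mm mmap lock is already held.
	 */
	mmap_write_lock(&init_mm);
	ret = walk_kernel_page_table_range(start, end, &pageattr_ops,
					   NULL, &masks);
	mmap_write_unlock(&init_mm);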

> ---
> v2:
> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>   per David.
>
> v1 resend:
> * Actually cc'd lists...
> * Fixed a mistake in walk_page_range_novma() not handling kernel mappings and
>   updated the commit message to reference this.
> * Added Mike's off-list Acked-by.
> * Fixed up comments as per Mike.
> * Add some historic flavour to the commit message as per Mike.
> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
>
> v1:
> (accidentally sent off-list due to error in scripting)
>
>  arch/loongarch/mm/pageattr.c |  2 +-
>  arch/openrisc/kernel/dma.c   |  4 +-
>  arch/riscv/mm/pageattr.c     |  8 +--
>  include/linux/pagewalk.h     |  7 ++-
>  mm/hugetlb_vmemmap.c         |  2 +-
>  mm/internal.h                |  4 ++
>  mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
>  mm/ptdump.c                  |  3 +-
>  8 files changed, 82 insertions(+), 46 deletions(-)
>
> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> index 99165903908a..f5e910b68229 100644
> --- a/arch/loongarch/mm/pageattr.c
> +++ b/arch/loongarch/mm/pageattr.c
> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>                 return 0;
>
>         mmap_write_lock(&init_mm);
> -       ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
> +       ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>         mmap_write_unlock(&init_mm);
>
>         flush_tlb_kernel_range(start, end);
> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> index 3a7b5baaa450..af932a4ad306 100644
> --- a/arch/openrisc/kernel/dma.c
> +++ b/arch/openrisc/kernel/dma.c
> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>          * them and setting the cache-inhibit bit.
>          */
>         mmap_write_lock(&init_mm);
> -       error = walk_page_range_novma(&init_mm, va, va + size,
> +       error = walk_kernel_page_table_range(va, va + size,
>                         &set_nocache_walk_ops, NULL, NULL);
>         mmap_write_unlock(&init_mm);
>
> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>
>         mmap_write_lock(&init_mm);
>         /* walk_page_range shouldn't be able to fail here */
> -       WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> +       WARN_ON(walk_kernel_page_table_range(va, va + size,
>                         &clear_nocache_walk_ops, NULL, NULL));
>         mmap_write_unlock(&init_mm);
>  }
> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> index d815448758a1..3f76db3d2769 100644
> --- a/arch/riscv/mm/pageattr.c
> +++ b/arch/riscv/mm/pageattr.c
> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>                         if (ret)
>                                 goto unlock;
>
> -                       ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> +                       ret = walk_kernel_page_table_range(lm_start, lm_end,
>                                                     &pageattr_ops, NULL, &masks);
>                         if (ret)
>                                 goto unlock;
> @@ -317,13 +317,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>                 if (ret)
>                         goto unlock;
>
> -               ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> +               ret = walk_kernel_page_table_range(lm_start, lm_end,
>                                             &pageattr_ops, NULL, &masks);
>                 if (ret)
>                         goto unlock;
>         }
>
> -       ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
> +       ret =  walk_kernel_page_table_range(start, end, &pageattr_ops, NULL,
>                                      &masks);
>
>  unlock:
> @@ -335,7 +335,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>          */
>         flush_tlb_all();
>  #else
> -       ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
> +       ret =  walk_kernel_page_table_range(start, end, &pageattr_ops, NULL,
>                                      &masks);
>
>         mmap_write_unlock(&init_mm);
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 9700a29f8afb..8ac2f6d6d2a3 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -129,10 +129,9 @@ struct mm_walk {
>  int walk_page_range(struct mm_struct *mm, unsigned long start,
>                 unsigned long end, const struct mm_walk_ops *ops,
>                 void *private);
> -int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> -                         unsigned long end, const struct mm_walk_ops *ops,
> -                         pgd_t *pgd,
> -                         void *private);
> +int walk_kernel_page_table_range(unsigned long start,
> +               unsigned long end, const struct mm_walk_ops *ops,
> +               pgd_t *pgd, void *private);
>  int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>                         unsigned long end, const struct mm_walk_ops *ops,
>                         void *private);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 27245e86df25..ba0fb1b6a5a8 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -166,7 +166,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>         VM_BUG_ON(!PAGE_ALIGNED(start | end));
>
>         mmap_read_lock(&init_mm);
> -       ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
> +       ret = walk_kernel_page_table_range(start, end, &vmemmap_remap_ops,
>                                     NULL, walk);
>         mmap_read_unlock(&init_mm);
>         if (ret)
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b8ed2017743..43788d0de6e3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1605,6 +1605,10 @@ static inline void accept_page(struct page *page)
>  int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>                 unsigned long end, const struct mm_walk_ops *ops,
>                 void *private);
> +int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> +                         unsigned long end, const struct mm_walk_ops *ops,
> +                         pgd_t *pgd,
> +                         void *private);
>
>  /* pt_reclaim.c */
>  bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index e478777c86e1..057a125c3bc0 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -584,9 +584,28 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>         return walk_page_range_mm(mm, start, end, ops, private);
>  }
>
> +static int __walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> +               unsigned long end, const struct mm_walk_ops *ops,
> +               pgd_t *pgd, void *private)
> +{
> +       struct mm_walk walk = {
> +               .ops            = ops,
> +               .mm             = mm,
> +               .pgd            = pgd,
> +               .private        = private,
> +               .no_vma         = true
> +       };
> +
> +       if (start >= end || !walk.mm)
> +               return -EINVAL;
> +       if (!check_ops_valid(ops))
> +               return -EINVAL;
> +
> +       return walk_pgd_range(start, end, &walk);
> +}
> +
>  /**
> - * walk_page_range_novma - walk a range of pagetables not backed by a vma
> - * @mm:                mm_struct representing the target process of page table walk
> + * walk_kernel_page_table_range - walk a range of kernel pagetables.
>   * @start:     start address of the virtual address range
>   * @end:       end address of the virtual address range
>   * @ops:       operation to call during the walk
> @@ -596,56 +615,69 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   * Similar to walk_page_range() but can walk any page tables even if they are
>   * not backed by VMAs. Because 'unusual' entries may be walked this function
>   * will also not lock the PTEs for the pte_entry() callback. This is useful for
> - * walking the kernel pages tables or page tables for firmware.
> + * walking kernel pages tables or page tables for firmware.
>   *
>   * Note: Be careful to walk the kernel pages tables, the caller may be need to
>   * take other effective approaches (mmap lock may be insufficient) to prevent
>   * the intermediate kernel page tables belonging to the specified address range
>   * from being freed (e.g. memory hot-remove).
>   */
> -int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> +int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> +               const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
> +{
> +       struct mm_struct *mm = &init_mm;
> +
> +       /*
> +        * Kernel intermediate page tables are usually not freed, so the mmap
> +        * read lock is sufficient. But there are some exceptions.
> +        * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> +        * to prevent the intermediate kernel pages tables belonging to the
> +        * specified address range from being freed. The caller should take
> +        * other actions to prevent this race.
> +        */
> +       mmap_assert_locked(mm);
> +
> +       return __walk_page_range_novma(mm, start, end, ops, pgd, private);
> +}
> +
> +/**
> + * walk_page_range_debug - walk a range of pagetables not backed by a vma
> + * @mm:                mm_struct representing the target process of page table walk
> + * @start:     start address of the virtual address range
> + * @end:       end address of the virtual address range
> + * @ops:       operation to call during the walk
> + * @pgd:       pgd to walk if different from mm->pgd
> + * @private:   private data for callbacks' usage
> + *
> + * Similar to walk_page_range() but can walk any page tables even if they are
> + * not backed by VMAs. Because 'unusual' entries may be walked this function
> + * will also not lock the PTEs for the pte_entry() callback.
> + *
> + * This is for debugging purposes ONLY.
> + */
> +int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
>                           unsigned long end, const struct mm_walk_ops *ops,
>                           pgd_t *pgd,
>                           void *private)
>  {
> -       struct mm_walk walk = {
> -               .ops            = ops,
> -               .mm             = mm,
> -               .pgd            = pgd,
> -               .private        = private,
> -               .no_vma         = true
> -       };
> -
> -       if (start >= end || !walk.mm)
> -               return -EINVAL;
> -       if (!check_ops_valid(ops))
> -               return -EINVAL;
> +       /*
> +        * For convenience, we allow this function to also traverse kernel
> +        * mappings.
> +        */
> +       if (mm == &init_mm)
> +               return walk_kernel_page_table_range(start, end, ops, pgd, private);
>
>         /*
> -        * 1) For walking the user virtual address space:
> -        *
>          * The mmap lock protects the page walker from changes to the page
>          * tables during the walk.  However a read lock is insufficient to
>          * protect those areas which don't have a VMA as munmap() detaches
>          * the VMAs before downgrading to a read lock and actually tearing
>          * down PTEs/page tables. In which case, the mmap write lock should
> -        * be hold.
> -        *
> -        * 2) For walking the kernel virtual address space:
> -        *
> -        * The kernel intermediate page tables usually do not be freed, so
> -        * the mmap map read lock is sufficient. But there are some exceptions.
> -        * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> -        * to prevent the intermediate kernel pages tables belonging to the
> -        * specified address range from being freed. The caller should take
> -        * other actions to prevent this race.
> +        * be held.
>          */
> -       if (mm == &init_mm)
> -               mmap_assert_locked(walk.mm);
> -       else
> -               mmap_assert_write_locked(walk.mm);
> +       mmap_assert_write_locked(mm);
>
> -       return walk_pgd_range(start, end, &walk);
> +       return __walk_page_range_novma(mm, start, end, ops, pgd, private);
>  }
>
>  int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
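
Just to spell out the locking contract after the split, a sketch of my
understanding (ops/priv below are placeholder walk ops and private data):

	/*
	 * Kernel walk: any mmap lock on init_mm satisfies the assert;
	 * memory hot-remove still needs extra protection from the caller.
	 */
	mmap_read_lock(&init_mm);
	ret = walk_kernel_page_table_range(start, end, &ops, NULL, priv);
	mmap_read_unlock(&init_mm);

	/*
	 * Debug walk of a user mm with no VMAs: the write lock is required
	 * (unless mm == &init_mm, which falls through to the kernel walk).
	 */
	mmap_write_lock(mm);
	ret = walk_page_range_debug(mm, start, end, &ops, pgd, priv);
	mmap_write_unlock(mm);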
> diff --git a/mm/ptdump.c b/mm/ptdump.c
> index 9374f29cdc6f..61a352aa12ed 100644
> --- a/mm/ptdump.c
> +++ b/mm/ptdump.c
> @@ -4,6 +4,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/ptdump.h>
>  #include <linux/kasan.h>
> +#include "internal.h"
>
>  #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
>  /*
> @@ -177,7 +178,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
>
>         mmap_write_lock(mm);
>         while (range->start != range->end) {
> -               walk_page_range_novma(mm, range->start, range->end,
> +               walk_page_range_debug(mm, range->start, range->end,
>                                       &ptdump_ops, pgd, st);
>                 range++;
>         }
> --
> 2.49.0


