[PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
Lorenzo Stoakes
lorenzo.stoakes at oracle.com
Thu Jun 5 02:56:23 PDT 2025
On Thu, Jun 05, 2025 at 05:42:16PM +0800, Muchun Song wrote:
>
>
> > On Jun 5, 2025, at 17:24, Lorenzo Stoakes <lorenzo.stoakes at oracle.com> wrote:
> >
> > On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
> >> On 6/4/25 16:19, Lorenzo Stoakes wrote:
> >>> The walk_page_range_novma() function is rather confusing - it supports two
> >>> modes, one used often, the other used only for debugging.
> >>>
> >>> The first mode is the common case of traversal of kernel page tables, which
> >>> is what nearly all callers use this for.
> >>>
> >>> Secondly it provides an unusual debugging interface that allows for the
> >>> traversal of page tables in a userland range of memory even for that memory
> >>> which is not described by a VMA.
> >>>
> >>> It is far from certain that such page tables should even exist, but perhaps
> >>> this is precisely why it is useful as a debugging mechanism.
> >>>
> >>> As a result, this is utilised by ptdump only. Historically, things were
> >>> reversed - ptdump was the only user, and other parts of the kernel evolved
> >>> to use the kernel page table walking here.
> >>>
> >>> Since we have some complicated and confusing locking rules for the novma
> >>> case, it makes sense to separate the two usages into their own functions.
> >>>
> >>> Doing this also provides self-documentation as to the intent of the caller -
> >>> are they doing something rather unusual or are they simply doing a standard
> >>> kernel page table walk?
> >>>
> >>> We therefore establish two separate functions - walk_page_range_debug() for
> >>> this single usage, and walk_kernel_page_table_range() for general kernel
> >>> page table walking.
> >>>
> >>> We additionally make walk_page_range_debug() internal to mm.
> >>>
> >>> Note that ptdump uses the precise same function for kernel walking as a
> >>
> >> IMHO it's not clear at this point what "the precise same function" means.
> >>
> >>> convenience, so we permit this but make it very explicit by having
> >>> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
> >>
> >> ^ walk_page_range_debug()
> >
> > Oops will fix.
> >
> >>
> >> Maybe this could be reworded in the sense (AFAIU) that
> >> walk_page_range_debug() can be used for both user space page table walking
> >> or kernel depending on what mm is passed, so in the case of init_mm it
> >> invokes walk_kernel_page_table_range() internally.
> >
> > Sure.
> >
> >>
> >>>
> >>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes at oracle.com>
> >>> Acked-by: Mike Rapoport (Microsoft) <rppt at kernel.org>
> >>> ---
> >>> v2:
> >>> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> >>> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> >>> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
> >>> per David.
> >>>
> >>> v1 resend:
> >>> * Actually cc'd lists...
> >>> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
> >>>   updated commit message to reference this.
> >>> * Added Mike's off-list Acked-by.
> >>> * Fixed up comments as per Mike.
> >>> * Add some historic flavour to the commit message as per Mike.
> >>> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
> >>>
> >>> v1:
> >>> (accidentally sent off-list due to error in scripting)
> >>>
> >>> arch/loongarch/mm/pageattr.c | 2 +-
> >>> arch/openrisc/kernel/dma.c | 4 +-
> >>> arch/riscv/mm/pageattr.c | 8 +--
> >>> include/linux/pagewalk.h | 7 ++-
> >>> mm/hugetlb_vmemmap.c | 2 +-
> >>> mm/internal.h | 4 ++
> >>> mm/pagewalk.c | 98 ++++++++++++++++++++++++------------
> >>> mm/ptdump.c | 3 +-
> >>> 8 files changed, 82 insertions(+), 46 deletions(-)
> >>>
> >>> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> >>> index 99165903908a..f5e910b68229 100644
> >>> --- a/arch/loongarch/mm/pageattr.c
> >>> +++ b/arch/loongarch/mm/pageattr.c
> >>> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
> >>> return 0;
> >>>
> >>> mmap_write_lock(&init_mm);
> >>> - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
> >>> + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
> >>> mmap_write_unlock(&init_mm);
> >>
> >> You've removed init_mm from walk_page_range_novma() but I see most callers
> >> do the locking of init_mm immediately around it. This suggests a version
> >> handling that automatically? A bit complicated by the read/write
> >> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)
> >
> > Interestingly, most callers write-lock, but then one read-locks, so we
> > can't just assume and would need to pass a boolean, which would kind of
> > suck.
>
> Hi Lorenzo,
>
> Actually, the write lock introduced in commit 8782fb61cc848 to fix the
> race condition when walking user page tables can be replaced with a read
> lock. As explained in commit b123d09304d86, it is safe to walk kernel
> page tables while holding the mmap read lock. The function name
> `walk_kernel_page_table_range` clearly indicates its purpose: walking
> kernel page tables. Thus, using a read lock internally is appropriate
> and safe. Please correct me if I am wrong.
>
> To further enhance robustness, it is better to add a WARN_ON check to
> ensure that the address range passed to walk_kernel_page_table_range
> is indeed within the kernel address space. This will help prevent any
> accidental misuse and catch issues early.
>
Hi Muchun,
Thanks for the information. I'll chase up this locking stuff in a
follow-up, if that's ok? I want to get this refactoring landed first with
the existing behaviour.
I did wonder about doing a WARN_ON(), but I am concerned that perhaps there
are some very odd architectures that place things in unusual locations... I
guess this would be an addr >= TASK_SIZE check?
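For concreteness, something like this is what I have in mind - note the
placement and the exact condition here are just a sketch, not settled:

	int walk_kernel_page_table_range(unsigned long start, unsigned long end,
			const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
	{
		/* Hypothetical sanity check per your suggestion: the range
		 * must lie entirely within the kernel address space. */
		if (WARN_ON_ONCE(start < TASK_SIZE || end < start))
			return -EINVAL;
		...
	}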
On the 'weirdness' front, Jann's opinion is that the debug code shouldn't
even be walking page tables without VMAs in userspace anyway (and to me -
unless there's some weird arch-specific stuff out there - I am also
perplexed by this).
So we should in theory just drop the ability to do this kind of traversal
in general.
This is something I'll investigate...
There are also oddities like:
	if (walk->mm == &init_mm || addr >= TASK_SIZE)
		pte = pte_offset_kernel(pmd, addr);
	else
		pte = pte_offset_map(pmd, addr);
This is in the walk_pte_range() function, and it feels wrong to me - you
should pass init_mm if you want to explore kernel page tables, so why are
we doing this here?
I wonder if it's for stuff like vsyscall which are ostensibly kernel
addresses but accessible in userspace... but then would we really want to
walk the page tables for this?
This is stuff I want to dig into a little bit.
But overall, would you mind if I defer the WARN_ON() and other such stuff
to a follow up?
I'm just keen for us to first separate out the actual usages of these
functions very clearly and, thanks to Mike's great suggestion, also make
the debug usage entirely mm-internal.
This should lay a foundation for introducing further sanity to the equation
here :)
To be clear, I will properly look into the commits you mention in
preparation for a follow-up (your input is much appreciated! :)
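For reference, the resulting split looks roughly like this (paraphrased
from the patch, so don't treat it as exact):

	/* The common case: walking kernel page tables. */
	int walk_kernel_page_table_range(unsigned long start, unsigned long end,
			const struct mm_walk_ops *ops, pgd_t *pgd, void *private);

	/* Debug-only and now mm-internal (mm/internal.h) - walks user page
	 * tables even where no VMA describes the memory, and invokes
	 * walk_kernel_page_table_range() when handed &init_mm. */
	int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
			unsigned long end, const struct mm_walk_ops *ops,
			pgd_t *pgd, void *private);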
> Thanks,
> Muchun
Cheers, Lorenzo
>
> >
> > Also other walkers assume the caller has the lock so it's consistent to
> > keep it this way.
> >
> >>
> >>>
> >>> flush_tlb_kernel_range(start, end);
> >>> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> >>> index 3a7b5baaa450..af932a4ad306 100644
> >>> --- a/arch/openrisc/kernel/dma.c
> >>> +++ b/arch/openrisc/kernel/dma.c
> >>> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
> >>> * them and setting the cache-inhibit bit.
> >>> */
> >>> mmap_write_lock(&init_mm);
> >>> - error = walk_page_range_novma(&init_mm, va, va + size,
> >>> + error = walk_kernel_page_table_range(va, va + size,
> >>> &set_nocache_walk_ops, NULL, NULL);
> >>> mmap_write_unlock(&init_mm);
> >>>
> >>> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
> >>>
> >>> mmap_write_lock(&init_mm);
> >>> /* walk_page_range shouldn't be able to fail here */
> >>> - WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> >>> + WARN_ON(walk_kernel_page_table_range(va, va + size,
> >>> &clear_nocache_walk_ops, NULL, NULL));
> >>> mmap_write_unlock(&init_mm);
> >>> }
> >>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> >>> index d815448758a1..3f76db3d2769 100644
> >>> --- a/arch/riscv/mm/pageattr.c
> >>> +++ b/arch/riscv/mm/pageattr.c
> >>> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
> >>> if (ret)
> >>> goto unlock;
> >>>
> >>> - ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> >>> + ret = walk_kernel_page_table_range(lm_start, lm_end,
> >>> &pageattr_ops, NULL, &masks);
> >>
> >> Note this and other places break the second line's arguments alignment on
> >> the opening bracket. Maybe it just shows it's a bit fragile style...
> >>
> >>
> >
> > Yeah I know :) I know you won't believe this coming from me, but I was
> > trying to minimise the churn :P
>
>