[PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
Muchun Song
muchun.song at linux.dev
Thu Jun 5 05:11:56 PDT 2025
> On Jun 5, 2025, at 17:56, Lorenzo Stoakes <lorenzo.stoakes at oracle.com> wrote:
>
> On Thu, Jun 05, 2025 at 05:42:16PM +0800, Muchun Song wrote:
>>
>>
>>> On Jun 5, 2025, at 17:24, Lorenzo Stoakes <lorenzo.stoakes at oracle.com> wrote:
>>>
>>> On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
>>>> On 6/4/25 16:19, Lorenzo Stoakes wrote:
>>>>> The walk_page_range_novma() function is rather confusing - it supports two
>>>>> modes, one used often, the other used only for debugging.
>>>>>
>>>>> The first mode is the common case of traversal of kernel page tables, which
>>>>> is what nearly all callers use this for.
>>>>>
>>>>> Secondly it provides an unusual debugging interface that allows for the
>>>>> traversal of page tables in a userland range of memory even for that memory
>>>>> which is not described by a VMA.
>>>>>
>>>>> It is far from certain that such page tables should even exist, but perhaps
>>>>> this is precisely why it is useful as a debugging mechanism.
>>>>>
>>>>> As a result, this is utilised by ptdump only. Historically, things were
>>>>> reversed - ptdump was the only user, and other parts of the kernel evolved
>>>>> to use the kernel page table walking here.
>>>>>
>>>>> Since we have some complicated and confusing locking rules for the novma
>>>>> case, it makes sense to separate the two usages into their own functions.
>>>>>
>>>>> Doing this also provides self-documentation as to the intent of the caller -
>>>>> are they doing something rather unusual or are they simply doing a standard
>>>>> kernel page table walk?
>>>>>
>>>>> We therefore establish two separate functions - walk_page_range_debug() for
>>>>> this single usage, and walk_kernel_page_table_range() for general kernel
>>>>> page table walking.
>>>>>
>>>>> We additionally make walk_page_range_debug() internal to mm.
>>>>>
>>>>> Note that ptdump uses the precise same function for kernel walking as a
>>>>
>>>> IMHO it's not clear at this point what "the precise same function" means.
>>>>
>>>>> convenience, so we permit this but make it very explicit by having
>>>>> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>>>>
>>>> ^ walk_page_range_debug()
>>>
>>> Oops will fix.
>>>
>>>>
>>>> Maybe this could be reworded in the sense (AFAIU) that
>>>> walk_page_range_debug() can be used for either user space or kernel page table
>>>> walking depending on what mm is passed, so in the case of init_mm it
>>>> invokes walk_kernel_page_table_range() internally.
>>>
>>> Sure.
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes at oracle.com>
>>>>> Acked-by: Mike Rapoport (Microsoft) <rppt at kernel.org>
>>>>> ---
>>>>> v2:
>>>>> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
>>>>> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
>>>>> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>>>>> per David.
>>>>>
>>>>> v1 resend:
>>>>> * Actually cc'd lists...
>>>>> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>>>>> updated the commit message to reference this.
>>>>> * Added Mike's off-list Acked-by.
>>>>> * Fixed up comments as per Mike.
>>>>> * Add some historic flavour to the commit message as per Mike.
>>>>> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
>>>>>
>>>>> v1:
>>>>> (accidentally sent off-list due to error in scripting)
>>>>>
>>>>> arch/loongarch/mm/pageattr.c | 2 +-
>>>>> arch/openrisc/kernel/dma.c | 4 +-
>>>>> arch/riscv/mm/pageattr.c | 8 +--
>>>>> include/linux/pagewalk.h | 7 ++-
>>>>> mm/hugetlb_vmemmap.c | 2 +-
>>>>> mm/internal.h | 4 ++
>>>>> mm/pagewalk.c | 98 ++++++++++++++++++++++++------------
>>>>> mm/ptdump.c | 3 +-
>>>>> 8 files changed, 82 insertions(+), 46 deletions(-)
>>>>>
>>>>> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
>>>>> index 99165903908a..f5e910b68229 100644
>>>>> --- a/arch/loongarch/mm/pageattr.c
>>>>> +++ b/arch/loongarch/mm/pageattr.c
>>>>> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>>>>> return 0;
>>>>>
>>>>> mmap_write_lock(&init_mm);
>>>>> - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
>>>>> + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>>>>> mmap_write_unlock(&init_mm);
>>>>
>>>> You've removed init_mm from walk_page_range_novma() but I see most callers
>>>> do the locking of init_mm immediately around it. This suggests a version
>>>> handling that automatically? A bit complicated by the read/write
>>>> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)
>>>
>>> Most callers write lock, interestingly, but then one read locks, so we can't
>>> just assume and would need to pass a boolean, which would kind of suck.
>>
>> Hi Lorenzo,
>>
>> Actually, the write lock introduced in commit 8782fb61cc848 to fix the
>> race condition when walking user page tables can be replaced with a read
>> lock. As explained in commit b123d09304d86, it is safe to walk kernel
>> page tables while holding the mmap read lock. The function name
>> `walk_kernel_page_table_range` clearly indicates its purpose: walking
>> kernel page tables. Thus, using a read lock internally is appropriate
>> and safe. Please correct me if I am wrong.
>>
>> To further enhance robustness, it is better to add a WARN_ON check to
>> ensure that the address range passed to walk_kernel_page_table_range
>> is indeed within the kernel address space. This will help prevent any
>> accidental misuse and catch issues early.
>>
>
> Hi Muchun,
>
> Thanks for the information. I'll chase up this locking stuff in a
> follow-up, if that's ok? As I want to get this refactoring landed first
> with the existing behaviour.
OK. Makes sense.
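
For the locking follow-up, what I had in mind is roughly the below (just a
sketch, the helper name is only illustrative): since walking kernel page
tables is safe under the mmap read lock (commit b123d09304d86), a thin
wrapper could take the init_mm lock internally instead of every caller
open-coding it:

	static int walk_kernel_page_table_range_locked(unsigned long start,
			unsigned long end, const struct mm_walk_ops *ops,
			pgd_t *pgd, void *private)
	{
		int ret;

		/* Read lock suffices for kernel page tables, see b123d09304d86. */
		mmap_read_lock(&init_mm);
		ret = walk_kernel_page_table_range(start, end, ops, pgd, private);
		mmap_read_unlock(&init_mm);

		return ret;
	}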
>
> I did wonder about doing a WARN_ON(), but I am concerned that perhaps there
> are some very odd architectures that place things in unusual locations... I
> guess this would be an addr >= TASK_SIZE check?
I think so.
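Something along these lines at the top of walk_kernel_page_table_range()
is what I meant (sketch only; as you say, the exact bound may need care on
unusual architectures):

	/* Catch accidental misuse: this helper is for kernel space only. */
	if (WARN_ON_ONCE(start >= end || start < TASK_SIZE))
		return -EINVAL;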
>
> On the 'weirdness' front, Jann's opinion is that the debug code shouldn't
> even be walking page tables without VMAs in userspace anyway (and to me -
> unless there's some weird arch-specific stuff out there - I am also
> perplexed by this).
>
> So we should in theory just drop the ability to do this kind of traversal
> in general.
>
> This is something I'll investigate...
>
> There's also oddities like:
>
> if (walk->mm == &init_mm || addr >= TASK_SIZE)
> pte = pte_offset_kernel(pmd, addr);
> else
> pte = pte_offset_map(pmd, addr);
>
> In the walk_pte_range() function, which I feel is wrong - you should pass
> init_mm if you want to explore kernel page tables, so why are we doing this
> here?
I guess it is because some users want to walk kernel address space that is
not mapped in init_mm (like efi_mm, see ptdump_efi_show()).
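Spelling out my reading of that branch (the comments below are mine):

	if (walk->mm == &init_mm || addr >= TASK_SIZE)
		/*
		 * Kernel-range addresses (init_mm, or another mm such as
		 * efi_mm holding kernel-style mappings): the PTE pages are
		 * not freed under us, so a direct lookup is fine.
		 */
		pte = pte_offset_kernel(pmd, addr);
	else
		/*
		 * Userspace addresses in a user mm: the PTE page may be
		 * unmapped or freed concurrently, so it has to be mapped
		 * first.
		 */
		pte = pte_offset_map(pmd, addr);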
>
> I wonder if it's for stuff like vsyscall which are ostensibly kernel
> addresses but accessible in userspace... but then would we really want to
> walk the page tables for this?
>
> This is stuff I want to dig into a little bit.
>
> But overall, would you mind if I defer the WARN_ON() and other such stuff
> to a follow up?
No problem. It is up to you.
>
> Just keen for us to firstly separate out the actual usages of these
> functions very clearly, and thanks to Mike's great suggestion also making
> the debug usage entirely mm-internal.
>
> This should lay a foundation for introducing further sanity to the equation
> here :)
>
> I will properly look into the commits you mention in preparation for a
> follow up though to be clear (your input is much appreciated! :)
Thank you! I look forward to your follow-up and any further insights you
might have.
Thanks.
>
>> Muchun,
>> Thanks.
>
> Cheers, Lorenzo
>
>>
>>>
>>> Also other walkers assume the caller has the lock so it's consistent to
>>> keep it this way.
>>>
>>>>
>>>>>
>>>>> flush_tlb_kernel_range(start, end);
>>>>> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
>>>>> index 3a7b5baaa450..af932a4ad306 100644
>>>>> --- a/arch/openrisc/kernel/dma.c
>>>>> +++ b/arch/openrisc/kernel/dma.c
>>>>> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>>>>> * them and setting the cache-inhibit bit.
>>>>> */
>>>>> mmap_write_lock(&init_mm);
>>>>> - error = walk_page_range_novma(&init_mm, va, va + size,
>>>>> + error = walk_kernel_page_table_range(va, va + size,
>>>>> &set_nocache_walk_ops, NULL, NULL);
>>>>> mmap_write_unlock(&init_mm);
>>>>>
>>>>> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>>>>>
>>>>> mmap_write_lock(&init_mm);
>>>>> /* walk_page_range shouldn't be able to fail here */
>>>>> - WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
>>>>> + WARN_ON(walk_kernel_page_table_range(va, va + size,
>>>>> &clear_nocache_walk_ops, NULL, NULL));
>>>>> mmap_write_unlock(&init_mm);
>>>>> }
>>>>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
>>>>> index d815448758a1..3f76db3d2769 100644
>>>>> --- a/arch/riscv/mm/pageattr.c
>>>>> +++ b/arch/riscv/mm/pageattr.c
>>>>> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>>>>> if (ret)
>>>>> goto unlock;
>>>>>
>>>>> - ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
>>>>> + ret = walk_kernel_page_table_range(lm_start, lm_end,
>>>>> &pageattr_ops, NULL, &masks);
>>>>
>>>> Note this and other places break the second line's arguments alignment on
>>>> the opening bracket. Maybe it just shows it's a bit fragile style...
>>>>
>>>>
>>>
>>> Yeah I know :) I know you won't believe this coming from me, but I was
>>> trying to minimise the churn :P