[PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
Muchun Song
muchun.song at linux.dev
Thu Jun 5 05:11:56 PDT 2025
> On Jun 5, 2025, at 17:56, Lorenzo Stoakes <lorenzo.stoakes at oracle.com> wrote:
>
> On Thu, Jun 05, 2025 at 05:42:16PM +0800, Muchun Song wrote:
>>
>>
>>> On Jun 5, 2025, at 17:24, Lorenzo Stoakes <lorenzo.stoakes at oracle.com> wrote:
>>>
>>> On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
>>>> On 6/4/25 16:19, Lorenzo Stoakes wrote:
>>>>> The walk_page_range_novma() function is rather confusing - it supports two
>>>>> modes, one used often, the other used only for debugging.
>>>>>
>>>>> The first mode is the common case of traversal of kernel page tables, which
>>>>> is what nearly all callers use this for.
>>>>>
>>>>> Secondly it provides an unusual debugging interface that allows for the
>>>>> traversal of page tables in a userland range of memory even for that memory
>>>>> which is not described by a VMA.
>>>>>
>>>>> It is far from certain that such page tables should even exist, but perhaps
>>>>> this is precisely why it is useful as a debugging mechanism.
>>>>>
>>>>> As a result, this is utilised by ptdump only. Historically, things were
>>>>> reversed - ptdump was the only user, and other parts of the kernel evolved
>>>>> to use the kernel page table walking here.
>>>>>
>>>>> Since we have some complicated and confusing locking rules for the novma
>>>>> case, it makes sense to separate the two usages into their own functions.
>>>>>
>>>>> Doing this also provides self-documentation as to the intent of the caller -
>>>>> are they doing something rather unusual or are they simply doing a standard
>>>>> kernel page table walk?
>>>>>
>>>>> We therefore establish two separate functions - walk_page_range_debug() for
>>>>> this single usage, and walk_kernel_page_table_range() for general kernel
>>>>> page table walking.
>>>>>
>>>>> We additionally make walk_page_range_debug() internal to mm.
>>>>>
>>>>> Note that ptdump uses the precise same function for kernel walking as a
>>>>
>>>> IMHO it's not clear at this point what "the precise same function" means.
>>>>
>>>>> convenience, so we permit this but make it very explicit by having
>>>>> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>>>>
>>>> ^ walk_page_range_debug()
>>>
>>> Oops will fix.
>>>
>>>>
>>>> Maybe this could be reworded in the sense (AFAIU) that
>>>> walk_page_range_debug() can be used for either user space or kernel page table
>>>> walking depending on what mm is passed, so in the case of init_mm it
>>>> invokes walk_kernel_page_table_range() internally.
>>>
>>> Sure.
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes at oracle.com>
>>>>> Acked-by: Mike Rapoport (Microsoft) <rppt at kernel.org>
>>>>> ---
>>>>> v2:
>>>>> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
>>>>> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
>>>>> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>>>>> per David.
>>>>>
>>>>> v1 resend:
>>>>> * Actually cc'd lists...
>>>>> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>>>>> updated the commit message to reference this.
>>>>> * Added Mike's off-list Acked-by.
>>>>> * Fixed up comments as per Mike.
>>>>> * Add some historic flavour to the commit message as per Mike.
>>>>> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
>>>>>
>>>>> v1:
>>>>> (accidentally sent off-list due to error in scripting)
>>>>>
>>>>> arch/loongarch/mm/pageattr.c | 2 +-
>>>>> arch/openrisc/kernel/dma.c | 4 +-
>>>>> arch/riscv/mm/pageattr.c | 8 +--
>>>>> include/linux/pagewalk.h | 7 ++-
>>>>> mm/hugetlb_vmemmap.c | 2 +-
>>>>> mm/internal.h | 4 ++
>>>>> mm/pagewalk.c | 98 ++++++++++++++++++++++++------------
>>>>> mm/ptdump.c | 3 +-
>>>>> 8 files changed, 82 insertions(+), 46 deletions(-)
>>>>>
>>>>> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
>>>>> index 99165903908a..f5e910b68229 100644
>>>>> --- a/arch/loongarch/mm/pageattr.c
>>>>> +++ b/arch/loongarch/mm/pageattr.c
>>>>> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>>>>> return 0;
>>>>>
>>>>> mmap_write_lock(&init_mm);
>>>>> - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
>>>>> + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>>>>> mmap_write_unlock(&init_mm);
>>>>
>>>> You've removed init_mm from walk_page_range_novma() but I see most callers
>>>> do the locking of init_mm immediately around it. This suggests a version
>>>> handling that automatically? A bit complicated by the read/write
>>>> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)
>>>
>>> Most callers write lock, interestingly, but then one read locks, so we can't
>>> just assume and would need to pass a boolean, which would kind of suck.
>>
>> Hi Lorenzo,
>>
>> Actually, the write lock introduced in commit 8782fb61cc848 to fix the
>> race condition when walking user page tables can be replaced with a read
>> lock. As explained in commit b123d09304d86, it is safe to walk kernel
>> page tables while holding the mmap read lock. The function name
>> `walk_kernel_page_table_range` clearly indicates its purpose: walking
>> kernel page tables. Thus, using a read lock internally is appropriate
>> and safe. Please correct me if I am wrong.
>>
>> To further enhance robustness, it is better to add a WARN_ON check to
>> ensure that the address range passed to walk_kernel_page_table_range
>> is indeed within the kernel address space. This will help prevent any
>> accidental misuse and catch issues early.
>>
>
> Hi Muchun,
>
> Thanks for the information. I'll chase up this locking stuff in a
> follow-up, if that's ok? As I want to get this refactoring landed first
> with the existing behaviour.
OK. Makes sense.
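
For the locking follow-up, what I had in mind is roughly the below (just a
sketch, the helper name is only illustrative): since walking kernel page
tables is safe under the mmap read lock (commit b123d09304d86), a thin
wrapper could take the init_mm lock internally instead of every caller
open-coding it:

	static int walk_kernel_page_table_range_locked(unsigned long start,
			unsigned long end, const struct mm_walk_ops *ops,
			pgd_t *pgd, void *private)
	{
		int ret;

		/* Read lock suffices for kernel page tables, see b123d09304d86. */
		mmap_read_lock(&init_mm);
		ret = walk_kernel_page_table_range(start, end, ops, pgd, private);
		mmap_read_unlock(&init_mm);

		return ret;
	}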
>
> I did wonder about doing a WARN_ON(), but I am concerned that perhaps there
> are some very odd architectures that place things in unusual locations... I
> guess this would be an addr >= TASK_SIZE check?
I think so.
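Something along these lines at the top of walk_kernel_page_table_range()
is what I meant (sketch only; as you say, the exact bound may need care on
unusual architectures):

	/* Catch accidental misuse: this helper is for kernel space only. */
	if (WARN_ON_ONCE(start >= end || start < TASK_SIZE))
		return -EINVAL;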
>
> On the 'weirdness' front, Jann's opinion is that the debug code shouldn't
> even be walking page tables without VMAs in userspace anyway (and to me -
> unless there's some weird arch-specific stuff out there - I am also
> perplexed by this).
>
> So we should in theory just drop the ability to do this kind of traversal
> in general.
>
> This is something I'll investigate...
>
> There's also oddities like:
>
> if (walk->mm == &init_mm || addr >= TASK_SIZE)
> pte = pte_offset_kernel(pmd, addr);
> else
> pte = pte_offset_map(pmd, addr);
>
> In the walk_pte_range() function, which I feel is wrong - you should pass
> init_mm if you want to explore kernel page tables, so why are we doing this
> here?
I guess it is because some users want to walk kernel address space that is
not mapped in init_mm (like efi_mm, see ptdump_efi_show()).
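Spelling out my reading of that branch (the comments below are mine):

	if (walk->mm == &init_mm || addr >= TASK_SIZE)
		/*
		 * Kernel-range addresses (init_mm, or another mm such as
		 * efi_mm holding kernel-style mappings): the PTE pages are
		 * not freed under us, so a direct lookup is fine.
		 */
		pte = pte_offset_kernel(pmd, addr);
	else
		/*
		 * Userspace addresses in a user mm: the PTE page may be
		 * unmapped or freed concurrently, so it has to be mapped
		 * first.
		 */
		pte = pte_offset_map(pmd, addr);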
>
> I wonder if it's for stuff like vsyscall which are ostensibly kernel
> addresses but accessible in userspace... but then would we really want to
> walk the page tables for this?
>
> This is stuff I want to dig into a little bit.
>
> But overall, would you mind if I defer the WARN_ON() and other such stuff
> to a follow up?
No problem. It is up to you.
>
> Just keen for us to firstly separate out the actual usages of these
> functions very clearly, and thanks to Mike's great suggestion also making
> the debug usage entirely mm-internal.
>
> This should lay a foundation for introducing further sanity to the equation
> here :)
>
> I will properly look into the commits you mention in preparation for a
> follow up though to be clear (your input is much appreciated! :)
Thank you! I look forward to your follow-up and any further insights you
might have.
Thanks.
>
>> Muchun,
>> Thanks.
>
> Cheers, Lorenzo
>
>>
>>>
>>> Also other walkers assume the caller has the lock so it's consistent to
>>> keep it this way.
>>>
>>>>
>>>>>
>>>>> flush_tlb_kernel_range(start, end);
>>>>> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
>>>>> index 3a7b5baaa450..af932a4ad306 100644
>>>>> --- a/arch/openrisc/kernel/dma.c
>>>>> +++ b/arch/openrisc/kernel/dma.c
>>>>> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>>>>> * them and setting the cache-inhibit bit.
>>>>> */
>>>>> mmap_write_lock(&init_mm);
>>>>> - error = walk_page_range_novma(&init_mm, va, va + size,
>>>>> + error = walk_kernel_page_table_range(va, va + size,
>>>>> &set_nocache_walk_ops, NULL, NULL);
>>>>> mmap_write_unlock(&init_mm);
>>>>>
>>>>> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>>>>>
>>>>> mmap_write_lock(&init_mm);
>>>>> /* walk_page_range shouldn't be able to fail here */
>>>>> - WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
>>>>> + WARN_ON(walk_kernel_page_table_range(va, va + size,
>>>>> &clear_nocache_walk_ops, NULL, NULL));
>>>>> mmap_write_unlock(&init_mm);
>>>>> }
>>>>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
>>>>> index d815448758a1..3f76db3d2769 100644
>>>>> --- a/arch/riscv/mm/pageattr.c
>>>>> +++ b/arch/riscv/mm/pageattr.c
>>>>> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>>>>> if (ret)
>>>>> goto unlock;
>>>>>
>>>>> - ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
>>>>> + ret = walk_kernel_page_table_range(lm_start, lm_end,
>>>>> &pageattr_ops, NULL, &masks);
>>>>
>>>> Note this and other places break the second line's arguments alignment on
>>>> the opening bracket. Maybe it just shows it's a bit fragile style...
>>>>
>>>>
>>>
>>> Yeah I know :) I know you won't believe this coming from me, but I was
>>> trying to minimise the churn :P