[PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
Lorenzo Stoakes
lorenzo.stoakes at oracle.com
Fri Jun 6 06:41:49 PDT 2025
On Fri, Jun 06, 2025 at 12:59:20PM +0200, Jann Horn wrote:
> On Thu, Jun 5, 2025 at 10:23 PM David Hildenbrand <david at redhat.com> wrote:
> > On 05.06.25 21:19, Jann Horn wrote:
> > > On Wed, Jun 4, 2025 at 4:21 PM Lorenzo Stoakes
> > > <lorenzo.stoakes at oracle.com> wrote:
> > >> The walk_page_range_novma() function is rather confusing - it supports two
> > >> modes, one used often, the other used only for debugging.
> > >>
> > >> The first mode is the common case of traversal of kernel page tables, which
> > >> is what nearly all callers use this for.
> > >>
> > >> Secondly it provides an unusual debugging interface that allows for the
> > >> traversal of page tables in a userland range of memory even for that memory
> > >> which is not described by a VMA.
> > >>
> > >> It is far from certain that such page tables should even exist, but perhaps
> > >> this is precisely why it is useful as a debugging mechanism.
> > >>
> > >> As a result, this is utilised by ptdump only. Historically, things were
> > >> reversed - ptdump was the only user, and other parts of the kernel evolved
> > >> to use the kernel page table walking here.
> > >
> > > Just for the record, copy-pasting my comment on v1 that was
> > > accidentally sent off-list:
> > > ```
> > > Sort of a tangential comment: I wonder if it would make sense to give
> > > ptdump a different page table walker that uses roughly the same safety
> > > contract as gup_fast() - turn off IRQs and then walk the page tables
> > > locklessly. We'd need basically no locking and no special cases
> > > (regarding userspace mappings at least), at the cost of having to
> > > write the walker code such that we periodically restart the walk from
> > > scratch and not being able to inspect referenced pages. (That might
> > > also be nicer for debugging, since it wouldn't block on locks...)
> > > ```
> >
> > I assume we don't have to dump more than pte values etc? So
> > pte_special() and friends are not relevant to get it right.
> >
> > GUP-fast depends on CONFIG_HAVE_GUP_FAST, not sure if that would be a
> > concern for now.
>
> Ah, good point, that's annoying... maaaybe we should just gate this
> entire feature on CONFIG_HAVE_GUP_FAST to make sure the userspace
> mappings are designed to be walkable in this way? It's in debugfs,
> which _theoretically_
> (https://docs.kernel.org/filesystems/debugfs.html) means there are no
> stability guarantees, and I think it is normally used on architectures
> that define CONFIG_HAVE_GUP_FAST...
Hm, it's a nice idea, but I wonder if it's worthwhile just for ptdump?

I really hate how we're just arbitrarily using init_mm.mmap_lock as a mutex
here though.

Could we use GUP-fast-style walkers here in general, I wonder...? Or
optionally, maybe, for more general page table walking?

I mean, of course, gated on availability.

We sorely need a truly generalised page walker :) though of course it's a
matter of people having time :P