[PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
Lorenzo Stoakes
lorenzo.stoakes at oracle.com
Tue Jun 10 05:57:31 PDT 2025
On Tue, Jun 10, 2025 at 06:10:03PM +0530, Dev Jain wrote:
>
> On 10/06/25 5:37 pm, Lorenzo Stoakes wrote:
> > OK so I think the best solution here is to just update check_ops_valid(), which
> > was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
> > enforce the install pte thing).
> >
> > Let's do something like:
> >
> > #define OPS_MAY_INSTALL_PTE (1<<0)
> > #define OPS_MAY_AVOID_LOCK (1<<1)
> >
> > and update check_ops_valid() to take a flags or maybe 'capabilities' field.
> >
> > Then check based on this e.g.:
> >
> > if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
> > return false;
> >
> > if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
> > return false;
> >
> > return true;
> >
> > Then update callers, most of which can have '0' passed for this field, with
> > walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
> > walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
> >
> > That way we check it all in one place, it's consistent, we no longer
> > 'implicitly' skip the check for walk_page_range_mm(), and all is neater.
> >
> > We do end up calling this predicate twice for walk_page_range(), so we could
> > (should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
> > and have a separate walk_page_range_mm() that does this check.
> >
> > I think this will solve the interface issues I've raised below.
>
> Makes sense, thank you for your suggestions.
Thanks :)
Sorry to be a pain but I think this will fit more nicely.
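
To be concrete, here's a rough standalone sketch of what I have in mind - the
stub enum/struct are just so it compiles on its own (the real ones obviously
live in pagewalk.h), and the capability names are only suggestions:

```c
#include <stdbool.h>
#include <stddef.h>

/* Capability bits each pagewalk entry point passes to the check. */
#define OPS_MAY_INSTALL_PTE	(1 << 0)
#define OPS_MAY_AVOID_LOCK	(1 << 1)

/* Stand-ins for the kernel types, purely so this sketch is self-contained. */
enum page_walk_lock {
	PGWALK_RDLOCK = 0,
	PGWALK_WRLOCK = 1,
	PGWALK_WRLOCK_VERIFY = 2,
	PGWALK_NOLOCK = 3,
};

struct mm_walk_ops {
	void *install_pte;	/* non-NULL if the ops want to install PTEs */
	enum page_walk_lock walk_lock;
};

static bool check_ops_valid(const struct mm_walk_ops *ops,
			    unsigned int capabilities)
{
	/*
	 * Installing PTEs is only permitted for entry points that
	 * explicitly opt in, i.e. walk_page_range_mm().
	 */
	if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
		return false;

	/*
	 * Lockless walks are only permitted for entry points that
	 * explicitly opt in, i.e. walk_kernel_page_table_range().
	 */
	if (ops->walk_lock == PGWALK_NOLOCK &&
	    !(capabilities & OPS_MAY_AVOID_LOCK))
		return false;

	return true;
}
```

Then walk_page_range_mm() passes OPS_MAY_INSTALL_PTE,
walk_kernel_page_table_range() passes OPS_MAY_AVOID_LOCK, and everybody else
just passes 0.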
>
> >
> > Thanks!
> >
> > On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
> > > arm64 currently changes permissions on vmalloc objects locklessly, via
> > > apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
> > > since a limitation of the former is that it cannot change permissions on
> > > block mappings. However, the API currently requires init_mm.mmap_lock to
> > > be held. To avoid the unnecessary bottleneck of the mmap_lock for our use
> > > case, this patch extends this generic API to be usable locklessly, so as
> > > to retain the existing behaviour for changing permissions. Additionally,
> > > it is noted at [1] that KFENCE can manipulate kernel pgtable entries
> > > during softirqs, by calling set_memory_valid() -> __change_memory_common().
> > > This being a non-sleepable context, we cannot take the init_mm mmap lock.
> > >
> > > Since such an extension can potentially be dangerous for other callers
> > > consuming the pagewalk API, explicitly disallow lockless traversal of
> > > userspace pagetables by returning -EINVAL. Add comments to highlight the
> > > conditions under which the API can be used locklessly: no underlying VMA,
> > > and the user having exclusive control over the range, thus guaranteeing no
> > > concurrent access.
> > >
> > > Signed-off-by: Dev Jain <dev.jain at arm.com>
> > > ---
> > > include/linux/pagewalk.h | 7 +++++++
> > > mm/pagewalk.c | 23 ++++++++++++++++++-----
> > > 2 files changed, 25 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> > > index 8ac2f6d6d2a3..5efd6541239b 100644
> > > --- a/include/linux/pagewalk.h
> > > +++ b/include/linux/pagewalk.h
> > > @@ -14,6 +14,13 @@ enum page_walk_lock {
> > > PGWALK_WRLOCK = 1,
> > > /* vma is expected to be already write-locked during the walk */
> > > PGWALK_WRLOCK_VERIFY = 2,
> > > + /*
> > > + * Walk without any lock. Use of this is only meant for the
> > > + * case where there is no underlying VMA, and the user has
> > > + * exclusive control over the range, guaranteeing no concurrent
> > > + * access. For example, changing permissions of vmalloc objects.
> > > + */
> > > + PGWALK_NOLOCK = 3,
> > Thanks for the comment! This is good.
> >
> > > };
> > >
> > > /**
> > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > > index ff5299eca687..d55d933f84ec 100644
> > > --- a/mm/pagewalk.c
> > > +++ b/mm/pagewalk.c
> > > @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> > > return err;
> > > }
> > >
> > > -static inline void process_mm_walk_lock(struct mm_struct *mm,
> > > +static inline bool process_mm_walk_lock(struct mm_struct *mm,
> > > enum page_walk_lock walk_lock)
> > I don't like this signature at all, you don't describe what it does, and now it
> > returns... whether it was not locked? I think this might lead to confusion :)
> >
> >
> > > {
> > > + if (walk_lock == PGWALK_NOLOCK)
> > > + return 1;
> > It's 2025, return true please :)
> >
> > > +
> > > if (walk_lock == PGWALK_RDLOCK)
> > > mmap_assert_locked(mm);
> > > else
> > > mmap_assert_write_locked(mm);
> > > + return 0;
> > It's 2025, return false please :)
> >
> > > }
> > >
> > > static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > > @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > > case PGWALK_RDLOCK:
> > > /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
> > > break;
> > > + case PGWALK_NOLOCK:
> > > + break;
> > Under what circumstances would we be fine with this function being invoked with
> > no lock being specified?
> >
> > Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
> > won't ever invoke this? Or did I miss a call of this function?
> >
> > If not, we should make this a VM_WARN_ON_ONCE(1);
>
> I was made aware that there is a quest to remove BUG_ON()'s in the kernel, see [1].
> Is there a similar problem with VM_WARN_ON_ONCE()?
No, in fact the proposal is that we replace BUG_ON()'s with [VM_]WARN_ON_ONCE()'s.
So this is all good; BUG_ON() is basically never needed unless you are _certain_
there will be system instability that MUST BE STOPPED NOW.
Which is almost never going to be the case.
See
https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
where I managed to somehow provoke Linus into giving some (very interesting!) input ;)
But if you see the thread you can see further context on all this.
>
> [1]: https://lore.kernel.org/all/053ae9ec-1113-4ed8-9625-adf382070bc5@redhat.com/
>
> >
> > > }
> > > #endif
> > > }
> > > @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> > > if (!walk.mm)
> > > return -EINVAL;
> > >
> > > - process_mm_walk_lock(walk.mm, ops->walk_lock);
> > > + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> > > + return -EINVAL;
> > This is just weird: you're treating the return as if it were an error value
> > (no, it's a boolean), and the name of the function doesn't explain the
> > 'verb' of what's happening. It's just confusing.
> >
> > Obviously I'm belabouring the point a bit, see suggestion at top :)
> >
> > > vma = find_vma(walk.mm, start);
> > > do {
> > > @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> > > * to prevent the intermediate kernel pages tables belonging to the
> > > * specified address range from being freed. The caller should take
> > > * other actions to prevent this race.
> > > + *
> > > + * If the caller can guarantee that it has exclusive access to the
> > > + * specified address range, only then can it use PGWALK_NOLOCK.
> > > */
> > > - mmap_assert_locked(mm);
> > > + if (ops->walk_lock != PGWALK_NOLOCK)
> > > + mmap_assert_locked(mm);
> > >
> > > return walk_pgd_range(start, end, &walk);
> > > }
> > > @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> > > if (!check_ops_valid(ops))
> > > return -EINVAL;
> > >
> > > - process_mm_walk_lock(walk.mm, ops->walk_lock);
> > > + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> > > + return -EINVAL;
> > > process_vma_walk_lock(vma, ops->walk_lock);
> > > return __walk_page_range(start, end, &walk);
> > > }
> > > @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> > > if (!check_ops_valid(ops))
> > > return -EINVAL;
> > >
> > > - process_mm_walk_lock(walk.mm, ops->walk_lock);
> > > + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> > > + return -EINVAL;
> > > process_vma_walk_lock(vma, ops->walk_lock);
> > > return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> > > }
> > > --
> > > 2.30.2
> > >
> > Thanks!