[PATCH 1/3] mm: Allow pagewalk without locks
Dev Jain
dev.jain at arm.com
Fri Jun 6 02:33:18 PDT 2025
On 06/06/25 2:51 pm, Dev Jain wrote:
>
> On 30/05/25 4:27 pm, Lorenzo Stoakes wrote:
>> +cc Jan for page table stuff.
>>
>> On Fri, May 30, 2025 at 02:34:05PM +0530, Dev Jain wrote:
>>> It is noted at [1] that KFENCE can manipulate kernel pgtable entries
>>> during softirqs. It does this by calling set_memory_valid() ->
>>> __change_memory_common(). This being a non-sleepable context, we
>>> cannot take the init_mm mmap lock. Therefore, add PGWALK_NOLOCK to
>>> enable walk_page_range_novma() usage without locks.
>> Hm, this is worrying.
>>
>> You're unconditionally making it possible for dangerous usage here - to
>> walk page tables without a lock.
>>
>> We need to assert this is only being used in a context where this makes
>> sense, e.g. a no VMA range under the right circumstances.
>>
>> At the very least we need asserts that we are in a circumstance where
>> this is permitted.
>>
>> For VMAs, you must keep the VMA stable, which requires a VMA read
>> lock at minimum.
>>
>> See
>> https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-tables
>>
>> for details where these requirements are documented.
>>
>> I also think we should update this documentation to cover off this
>> non-VMA task context stuff. I can perhaps do this so as not to
>> egregiously add workload to this series :)
>>
>> Also, again this commit message is not enough for such a major change
>> to core mm stuff. I think you need to underline that - in non-task
>> context - you are safe to manipulate _kernel_ mappings, having
>> precluded KFENCE as a concern.
>
> Sorry for the late reply; after your comments I had to really go and
> understand kernel pagetable walking properly by reading your
> process_addrs documentation and reading the code, so that I could
> prepare an answer and improve my understanding. Thanks for your review!
>
> How does the below comment above PGWALK_NOLOCK look?
>
> "Walk without any lock. Use of this is only meant for the
> case where there is no underlying VMA, and the user has
> exclusive control over the range, guaranteeing no concurrent
> access. For example, changing permissions of vmalloc objects."
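>
> For concreteness, a sketch of how the entry might then look in
> include/linux/pagewalk.h with that comment in place (naming still open
> to bikeshedding):
>
> 	/*
> 	 * Walk without any lock. Use of this is only meant for the
> 	 * case where there is no underlying VMA, and the user has
> 	 * exclusive control over the range, guaranteeing no concurrent
> 	 * access. For example, changing permissions of vmalloc objects.
> 	 */
> 	PGWALK_NOLOCK = 3,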
>
> and the patch description can be modified as
> "
> It is noted at [1] that KFENCE can manipulate kernel pgtable entries
> during softirqs. It does this by calling set_memory_valid() ->
> __change_memory_common(). This being a non-sleepable context, we cannot
> take the init_mm mmap lock. Therefore, add PGWALK_NOLOCK to enable
> walk_page_range_novma() usage without locks.
>
> Currently, apply_to_page_range() is used by __change_memory_common()
> to change permissions over a range of vmalloc space, without any
> locking. Patch 2 in this series switches to walk_page_range_novma(),
> hence this patch is needed. We do not need any locks because the
> vmalloc object has exclusive access to the range, i.e. two vmalloc
> objects do not share the same physical address.
> "
>
>
>>
>>> [1]
>>> https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
>> Basically expand upon this information.
>>
>> Basically the commit message refers to your usage, but describes a patch
>> that makes it possible to do unlocked page table walks.
>>
>> As I get into below, no pun intended, but this needs to be _locked down_
>> heavily.
>>
>> - Only walk_page_range_novma() should allow it. All other functions
>>   should return -EINVAL if this is set.
>
> Sure.
>
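> Something like this at the top of walk_page_range() and the other
> entry points, I assume (sketch):
>
> 	if (ops->walk_lock == PGWALK_NOLOCK)
> 		return -EINVAL;
>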
>>
>> - walk_page_range_novma() should assert we're in the appropriate context
>> where this is feasible.
>
> There should be two conditions: that the mm is init_mm, and that the
> start address belongs to the vmalloc (or module) space. I am a little
> nervous about the second. On searching throughout the codebase, I could
> find only vmalloc and module addresses getting modified through the
> set_memory_* API, but I couldn't prove that all such usages are being
> done on vmalloc/module addresses.
Sorry, please ignore the bit about the second point; I confused it with
some other issue in the past. Now set_direct_map_invalid_noflush() will
also use __change_memory_common() -> walk_page_range_novma(), and
previously too it was doing this locklessly. So let us assert only that
mm == init_mm.
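
Concretely, something like this in walk_page_range_novma() (a sketch;
whether to just warn, error out, or both is up for discussion):

	if (ops->walk_lock == PGWALK_NOLOCK) {
		/* Lockless walks are only allowed on kernel page tables. */
		if (WARN_ON_ONCE(mm != &init_mm))
			return -EINVAL;
	} else if (mm == &init_mm) {
		mmap_assert_locked(walk.mm);
	} else {
		mmap_assert_write_locked(walk.mm);
	}
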
>
>>
>> - Comments should be updated accordingly.
>>
>> - We should assert (at least CONFIG_DEBUG_VM asserts) in every place
>>   that checks for a VMA that we are not in this lock mode, since this
>>   is disallowed.
>>
>>> Signed-off-by: Dev Jain <dev.jain at arm.com>
>>> ---
>>> include/linux/pagewalk.h | 2 ++
>>> mm/pagewalk.c | 12 ++++++++----
>>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>>> index 9700a29f8afb..9bc8853ed3de 100644
>>> --- a/include/linux/pagewalk.h
>>> +++ b/include/linux/pagewalk.h
>>> @@ -14,6 +14,8 @@ enum page_walk_lock {
>>> 	PGWALK_WRLOCK = 1,
>>> 	/* vma is expected to be already write-locked during the walk */
>>> 	PGWALK_WRLOCK_VERIFY = 2,
>>> +	/* no lock is needed */
>>> +	PGWALK_NOLOCK = 3,
>> I'd prefer something very explicitly documenting that, at the very
>> least, this can only be used for non-VMA cases.
>>
>> It's hard to think of a name here, but the comment should be explicit
>> as to under what circumstances this is allowed.
>>
>>> };
>>>
>>> /**
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index e478777c86e1..9657cf4664b2 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>> 	case PGWALK_RDLOCK:
>>> 		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>> 		break;
>>> +	default:
>>> +		break;
>> Please no 'default' here, we want to be explicit and cover all cases.
>
> Sure.
>
>>
>> And surely, since you're explicitly only allowing this for non-VMA
>> ranges, this should be a WARN_ON_ONCE() or something?
>
> Sounds good, maybe a WARN_ON_ONCE(vma)?
>
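> I.e., keeping the switch exhaustive, something like (sketch):
>
> 	case PGWALK_NOLOCK:
> 		/* A NOLOCK walk never has a VMA, so this must not be hit. */
> 		WARN_ON_ONCE(vma);
> 		break;
>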
>>
>>> 	}
>>> #endif
>>> }
>>> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>> 	 * specified address range from being freed. The caller should take
>>> 	 * other actions to prevent this race.
>>> 	 */
>> All functions other than this should explicitly disallow this locking
>> mode with -EINVAL checks. I do not want to see this locking mode made
>> available in a broken context.
>>
>> The full comment:
>>
>> /*
>>  * 1) For walking the user virtual address space:
>>  *
>>  * The mmap lock protects the page walker from changes to the page
>>  * tables during the walk. However a read lock is insufficient to
>>  * protect those areas which don't have a VMA as munmap() detaches
>>  * the VMAs before downgrading to a read lock and actually tearing
>>  * down PTEs/page tables. In which case, the mmap write lock should
>>  * be hold.
>>  *
>>  * 2) For walking the kernel virtual address space:
>>  *
>>  * The kernel intermediate page tables usually do not be freed, so
>>  * the mmap map read lock is sufficient. But there are some exceptions.
>>  * E.g. memory hot-remove. In which case, the mmap lock is insufficient
>>  * to prevent the intermediate kernel pages tables belonging to the
>>  * specified address range from being freed. The caller should take
>>  * other actions to prevent this race.
>>  */
>>
>> Are you walking kernel memory only? Point 1 above explicitly points
>> out why userland novma memory requires a lock.
>>
>> For point 2 you need to indicate why you don't need to consider
>> hotplugging,
>
> Well, hotunplugging will first offline the physical memory, and since
> the vmalloc object holds a reference to the pages, there is no race.
>
>> etc.
>>
>> But as Ryan points out elsewhere, you should be expanding this
>> comment to explain your case...
>>
>> You should also assert you're in a context where this applies and error
>> out/WARN if not.
>>
>>> -	if (mm == &init_mm)
>>> -		mmap_assert_locked(walk.mm);
>>> -	else
>>> -		mmap_assert_write_locked(walk.mm);
>>> +	if (ops->walk_lock != PGWALK_NOLOCK) {
>> I really don't like the idea that you're allowing no lock for
>> userland mappings.
>>
>> This should at the very least be:
>>
>> 	if (mm == &init_mm) {
>> 		if (ops->walk_lock != PGWALK_NOLOCK)
>> 			mmap_assert_locked(walk.mm);
>> 	} else {
>> 		mmap_assert_write_locked(walk.mm);
>> 	}
>
> Sure.
>
>>
>>> +		if (mm == &init_mm)
>>> +			mmap_assert_locked(walk.mm);
>>> +		else
>>> +			mmap_assert_write_locked(walk.mm);
>>> +	}
>>>
>>> 	return walk_pgd_range(start, end, &walk);
>>> }
>>> --
>>> 2.30.2
>>>
>> We have to be _really_ careful with this stuff. It's very fiddly and
>> brokenness can be a security issue.
>