[PATCH 1/3] mm: Allow pagewalk without locks

Dev Jain dev.jain at arm.com
Fri Jun 6 02:33:18 PDT 2025


On 06/06/25 2:51 pm, Dev Jain wrote:
>
> On 30/05/25 4:27 pm, Lorenzo Stoakes wrote:
>> +cc Jan for page table stuff.
>>
>> On Fri, May 30, 2025 at 02:34:05PM +0530, Dev Jain wrote:
>>> It is noted at [1] that KFENCE can manipulate kernel pgtable entries
>>> during softirqs. It does this by calling set_memory_valid() ->
>>> __change_memory_common(). This being a non-sleepable context, we cannot
>>> take the init_mm mmap lock. Therefore, add PGWALK_NOLOCK to enable
>>> walk_page_range_novma() usage without locks.
>> Hmm, this is worrying.
>>
>> You're unconditionally making it possible for dangerous usage here - to
>> walk page tables without a lock.
>>
>> We need to assert this is only being used in a context where this makes
>> sense, e.g. a no VMA range under the right circumstances.
>>
>> At the very least we need asserts that we are in a circumstance where
>> this is permitted.
>>
>> For VMAs, you must keep the VMA stable, which requires a VMA read lock
>> at minimum.
>>
>> See
>> https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-tables
>> for details, where these requirements are documented.
>>
>> I also think we should update this documentation to cover off this
>> non-VMA task context stuff. I can perhaps do this so as not to
>> egregiously add workload to this series :)
>>
>> Also, again this commit message is not enough for such a major change
>> to core mm stuff. I think you need to underline that - in non-task
>> context - you are safe to manipulate _kernel_ mappings, having
>> precluded KFENCE as a concern.
>
> Sorry for the late reply; after your comments I had to go and properly
> understand kernel pagetable walking by reading your process_addrs
> documentation and the code, so that I could prepare an answer and
> improve my understanding. Thanks for your review!
>
> How does the below comment above PGWALK_NOLOCK look?
>
> "Walk without any lock. Use of this is only meant for the
>  case where there is no underlying VMA, and the user has
>  exclusive control over the range, guaranteeing no concurrent
>  access. For example, changing permissions of vmalloc objects."
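>
> For concreteness, folding that comment into the current definition in
> include/linux/pagewalk.h would give something like the below (a sketch;
> the existing entries are as in the patch hunk further down):
>
> enum page_walk_lock {
>     /* mmap_lock should be locked for read to stabilize the vma tree */
>     PGWALK_RDLOCK = 0,
>     /* vma will be write-locked during the walk */
>     PGWALK_WRLOCK = 1,
>     /* vma is expected to be already write-locked during the walk */
>     PGWALK_WRLOCK_VERIFY = 2,
>     /*
>      * Walk without any lock. Use of this is only meant for the
>      * case where there is no underlying VMA, and the user has
>      * exclusive control over the range, guaranteeing no concurrent
>      * access. For example, changing permissions of vmalloc objects.
>      */
>     PGWALK_NOLOCK = 3,
> };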
>
> and the patch description can be modified as
> "
> It is noted at [1] that KFENCE can manipulate kernel pgtable entries
> during softirqs. It does this by calling set_memory_valid() ->
> __change_memory_common(). This being a non-sleepable context, we cannot
> take the init_mm mmap lock. Therefore, add PGWALK_NOLOCK to enable
> walk_page_range_novma() usage without locks.
>
> Currently, apply_to_page_range() is being used by __change_memory_common()
> to change permissions over a range of vmalloc space, without any locking.
> Patch 2 in this series shifts to the usage of walk_page_range_novma(),
> hence this patch is needed. We do not need any locks because the vmalloc
> object has exclusive access to the range, i.e. two vmalloc objects do not
> share the same physical address.
> "
>
>
>>
>>> [1] 
>>> https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
>> Basically expand upon this information.
>>
>> Basically the commit message refers to your usage, but describes a patch
>> that makes it possible to do unlocked page table walks.
>>
>> As I get into below, no pun intended, but this needs to be _locked down_
>> heavily.
>>
>> - Only walk_page_range_novma() should allow it. All other functions
>>    should return -EINVAL if this is set.
>
> Sure.
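>
> E.g. at the top of walk_page_range() and the other entry points,
> something like (sketch):
>
>     if (ops->walk_lock == PGWALK_NOLOCK)
>         return -EINVAL;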
>
>>
>> - walk_page_range_novma() should assert we're in the appropriate context
>>    where this is feasible.
>
> There should be two conditions: that the mm is init_mm, and that the
> start address belongs to the vmalloc (or module) space. I am a little
> nervous about the second. On searching throughout the codebase, I could
> find only vmalloc and module addresses getting modified through the
> set_memory_* API, but I couldn't prove that all such usages are being
> done on vmalloc/module addresses.

Sorry, please ignore the bit about the second point, I confused it with 
some other

issue in the past. Now set_direct_map_invalid_noflush() will also use 
__change_memory_common ->

walk_page_range_novma, and previously too it was doing it locklessly. So 
let us assert only for

mm == init_mm.
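
Concretely, the check at the end of walk_page_range_novma() could then
become something like this (a sketch, folding in your suggested
restructuring quoted below):

    if (ops->walk_lock == PGWALK_NOLOCK) {
        /* Lockless walks are only allowed on kernel page tables. */
        VM_WARN_ON_ONCE(mm != &init_mm);
    } else if (mm == &init_mm) {
        mmap_assert_locked(walk.mm);
    } else {
        mmap_assert_write_locked(walk.mm);
    }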


>
>>
>> - Comments should be updated accordingly.
>>
>> - We should assert (at least CONFIG_DEBUG_VM asserts) in every place
>>    that checks for a VMA that we are not in this lock mode, since this
>>    is disallowed.
>>
>>> Signed-off-by: Dev Jain <dev.jain at arm.com>
>>> ---
>>>   include/linux/pagewalk.h |  2 ++
>>>   mm/pagewalk.c            | 12 ++++++++----
>>>   2 files changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>>> index 9700a29f8afb..9bc8853ed3de 100644
>>> --- a/include/linux/pagewalk.h
>>> +++ b/include/linux/pagewalk.h
>>> @@ -14,6 +14,8 @@ enum page_walk_lock {
>>>       PGWALK_WRLOCK = 1,
>>>       /* vma is expected to be already write-locked during the walk */
>>>       PGWALK_WRLOCK_VERIFY = 2,
>>> +    /* no lock is needed */
>>> +    PGWALK_NOLOCK = 3,
>> I'd prefer something very explicitly documenting that, at the very
>> least, this can only be used for non-VMA cases.
>>
>> It's hard to think of a name here, but the comment should be explicit
>> as to under what circumstances this is allowed.
>>
>>>   };
>>>
>>>   /**
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index e478777c86e1..9657cf4664b2 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>>       case PGWALK_RDLOCK:
>>>           /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>>           break;
>>> +    default:
>>> +        break;
>> Please no 'default' here, we want to be explicit and cover all cases.
>
> Sure.
>
>>
>> And surely, since you're explicitly only allowing this for non-VMA
>> ranges, this should be a WARN_ON_ONCE() or something?
>
> Sounds good, maybe a WARN_ON_ONCE(vma)?
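>
> I.e. in process_vma_walk_lock() (sketch):
>
>     case PGWALK_NOLOCK:
>         /* This mode must never be used for a walk that has a VMA. */
>         WARN_ON_ONCE(vma);
>         break;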
>
>>
>>>       }
>>>   #endif
>>>   }
>>> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>>        * specified address range from being freed. The caller should take
>>>        * other actions to prevent this race.
>>>        */
>> All functions other than this should explicitly disallow this locking
>> mode with -EINVAL checks. I do not want to see this locking mode made
>> available in a broken context.
>>
>> The full comment:
>>
>>     /*
>>      * 1) For walking the user virtual address space:
>>      *
>>      * The mmap lock protects the page walker from changes to the page
>>      * tables during the walk.  However a read lock is insufficient to
>>      * protect those areas which don't have a VMA as munmap() detaches
>>      * the VMAs before downgrading to a read lock and actually tearing
>>      * down PTEs/page tables. In which case, the mmap write lock should
>>      * be held.
>>      *
>>      * 2) For walking the kernel virtual address space:
>>      *
>>      * The kernel intermediate page tables are usually not freed, so
>>      * the mmap read lock is sufficient. But there are some exceptions,
>>      * e.g. memory hot-remove, in which case the mmap lock is
>>      * insufficient to prevent the intermediate kernel page tables
>>      * belonging to the specified address range from being freed. The
>>      * caller should take other actions to prevent this race.
>>      */
>>
>> Are you walking kernel memory only? Point 1 above explicitly points
>> out why userland novma memory requires a lock.
>>
>> For point 2 you need to indicate why you don't need to consider 
>> hotplugging,
>
> Well, hot-unplugging will first offline the physical memory, and since
> the vmalloc object holds references to the pages, there is no race.
>
>> etc.
>>
>> But as Ryan points out elsewhere, you should be expanding this comment
>> to explain your case...
>>
>> You should also assert you're in a context where this applies and error
>> out/WARN if not.
>>
>>> -    if (mm == &init_mm)
>>> -        mmap_assert_locked(walk.mm);
>>> -    else
>>> -        mmap_assert_write_locked(walk.mm);
>>> +    if (ops->walk_lock != PGWALK_NOLOCK) {
>> I really don't like the idea that you're allowing no lock for 
>> userland mappings.
>>
>> This should at the very least be:
>>
>> if (mm == &init_mm) {
>>     if (ops->walk_lock != PGWALK_NOLOCK)
>>         mmap_assert_locked(walk.mm);
>> } else {
>>     mmap_assert_write_locked(walk.mm);
>> }
>
> Sure.
>
>>
>>> +        if (mm == &init_mm)
>>> +            mmap_assert_locked(walk.mm);
>>> +        else
>>> +            mmap_assert_write_locked(walk.mm);
>>> +    }
>>>
>>>       return walk_pgd_range(start, end, &walk);
>>>   }
>>> -- 
>>> 2.30.2
>>>
>> We have to be _really_ careful with this stuff. It's very fiddly and
>> brokenness can be a security issue.
>