[PATCH] arm64: Enable vmalloc-huge with ptdump
Dev Jain
dev.jain at arm.com
Thu Jun 5 01:16:01 PDT 2025
On 05/06/25 10:18 am, Dev Jain wrote:
>
> On 30/05/25 7:06 pm, Will Deacon wrote:
>> On Fri, May 30, 2025 at 02:11:36PM +0100, Ryan Roberts wrote:
>>> On 30/05/2025 13:35, Will Deacon wrote:
>>>> On Fri, May 30, 2025 at 12:50:40PM +0100, Ryan Roberts wrote:
>>>>> On 30/05/2025 10:14, Dev Jain wrote:
>>>>>> On 30/05/25 2:10 pm, Ryan Roberts wrote:
>>>>>>> On 30/05/2025 09:20, Dev Jain wrote:
>>>>>>>> arm64 disables vmalloc-huge when kernel page table dumping is
>>>>>>>> enabled, because an intermediate table may be removed, potentially
>>>>>>>> causing the ptdump code to dereference an invalid address. We want
>>>>>>>> to be able to analyze block vs page mappings for kernel mappings
>>>>>>>> with ptdump, so to enable vmalloc-huge with ptdump, synchronize
>>>>>>>> between page table removal in pmd_free_pte_page()/pud_free_pmd_page()
>>>>>>>> and ptdump pagetable walking. We use mmap_read_lock and not write
>>>>>>>> lock because we don't need to synchronize between two different
>>>>>>>> vm_structs; two vmalloc objects running this same code path will
>>>>>>>> point to different page tables, hence there is no race.
>>>>>> My "correction" from race -> no problem was incorrect after all :)
>>>>>> There will be no race either, since each vm_struct object has
>>>>>> exclusive access to whatever table it is clearing.
>>>>>>
>>>>>>>> Signed-off-by: Dev Jain <dev.jain at arm.com>
>>>>>>>> ---
>>>>>>>> arch/arm64/include/asm/vmalloc.h | 6 ++----
>>>>>>>> arch/arm64/mm/mmu.c | 7 +++++++
>>>>>>>> 2 files changed, 9 insertions(+), 4 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>>>>>>>> index 38fafffe699f..28b7173d8693 100644
>>>>>>>> --- a/arch/arm64/include/asm/vmalloc.h
>>>>>>>> +++ b/arch/arm64/include/asm/vmalloc.h
>>>>>>>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>>>>>>>  	/*
>>>>>>>>  	 * SW table walks can't handle removal of intermediate entries.
>>>>>>>>  	 */
>>>>>>>> -	return pud_sect_supported() &&
>>>>>>>> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>>> +	return pud_sect_supported();
>>>>>>>>  }
>>>>>>>>  #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>>>>>>>  static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>>>>>>>  {
>>>>>>>> -	/* See arch_vmap_pud_supported() */
>>>>>>>> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>>>>>>>> +	return true;
>>>>>>>>  }
>>>>>>>>  #endif
>>>>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>>>>> index ea6695d53fb9..798cebd9e147 100644
>>>>>>>> --- a/arch/arm64/mm/mmu.c
>>>>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>>>>> @@ -1261,7 +1261,11 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>>>>>>>  	}
>>>>>>>>  	table = pte_offset_kernel(pmdp, addr);
>>>>>>>> +
>>>>>>>> +	/* Synchronize against ptdump_walk_pgd() */
>>>>>>>> +	mmap_read_lock(&init_mm);
>>>>>>>>  	pmd_clear(pmdp);
>>>>>>>> +	mmap_read_unlock(&init_mm);
>>>>>>> So this works because ptdump_walk_pgd() takes the write_lock (which
>>>>>>> is mutually exclusive with any read_lock holders) for the duration
>>>>>>> of the table walk, so it will consistently see the pgtables either
>>>>>>> before or after this removal. The table will never disappear during
>>>>>>> the walk, correct?
>>>>>>>
>>>>>>> I guess there is a risk of this showing up as contention with other
>>>>>>> init_mm write_lock holders. But I expect that
>>>>>>> pmd_free_pte_page()/pud_free_pmd_page() are called sufficiently
>>>>>>> rarely that the risk is very small. Let's fix any perf problem
>>>>>>> if/when we see it.
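
For reference, the write-lock side of this exclusion is ptdump_walk_pgd()
in mm/ptdump.c, which is roughly the following (simplified sketch, not the
exact current source):

	void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm,
			     pgd_t *pgd)
	{
		mmap_write_lock(mm);
		/* walk every level of the kernel page tables in st->range */
		walk_page_range_novma(mm, st->range->start, st->range->end,
				      &ptdump_ops, pgd, st);
		mmap_write_unlock(mm);
	}

So the whole dump runs with the init_mm write lock held, and the read-lock
section added in pmd_free_pte_page()/pud_free_pmd_page() either runs
entirely before the walk starts or entirely after it finishes.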
>>>>>> We can avoid all of that with my initial approach - take the lock
>>>>>> only when CONFIG_PTDUMP_DEBUGFS is enabled.
>>>>>> I don't have a strong opinion, just putting it out there.
>>>>> (I wrote this earlier but failed to send it):
>>>>>
>>>>> It's ugly though. Personally I'd prefer to keep it simple unless we
>>>>> have clear evidence that it's needed. I was just laying out my
>>>>> justification for not needing to do the conditional wrapping in this
>>>>> comment.
>>>> I really don't think we should be adding unconditional locking overhead
>>>> to core mm routines purely to facilitate a rarely used debug option.
>>>>
>>>> Instead, can we either adopt something like the RCU-like walk used by
>>>> fast GUP or stick the locking behind a static key that's only enabled
>>>> when a ptdump walk is in progress (a bit like how
>>>> hugetlb_vmemmap_optimize_folio() manipulates
>>>> hugetlb_optimize_vmemmap_key)?
>>> My sense is that the static key will be less effort and can be contained
>>> fully in arm64. I think we would need to enable the key around the call
>>> to ptdump_walk_pgd() in both ptdump_walk() and ptdump_check_wx(). Then
>>> where Dev is currently taking the read lock, that would be contingent on
>>> the key being enabled, and the unlock would be contingent on having
>>> taken the lock.
>>>
>>> Does that sound like an acceptable approach?
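
Something like the below, then? (Completely untested sketch; the key name
arm64_ptdump_key is made up, ptdump_check_wx() would need the same
enable/disable treatment, and this doesn't yet handle the window Will
mentions below where the key flips while a freer is already past the
check.)

	/* arch/arm64/mm/ptdump.c */
	DEFINE_STATIC_KEY_FALSE(arm64_ptdump_key);

	void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
	{
		...
		static_branch_enable(&arm64_ptdump_key);
		ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
		static_branch_disable(&arm64_ptdump_key);
	}

	/* arch/arm64/mm/mmu.c, in pmd_free_pte_page()/pud_free_pmd_page() */
	bool locked = false;

	if (static_branch_unlikely(&arm64_ptdump_key)) {
		mmap_read_lock(&init_mm);
		locked = true;
	}
	pmd_clear(pmdp);
	if (locked)
		mmap_read_unlock(&init_mm);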
>> Yup, and I think you'll probably need something like a synchronize_rcu()
>> when flipping the key to deal with any pre-existing page-table freers.
>
> IIUC, you mean that we need to ensure that any existing readers holding a
> reference to the table isolated in pmd_free_pte_page() and friends have
> exited.
> But the problem is that we take an mmap_write_lock() around
> ptdump_walk_pgd(), so this is a sleepable code path.
The mmap_write_lock() around ptdump_walk_pgd() is definitely restrictive.
I think it was chosen because ptdump is rarely used; alternatively, we
could have done RCU-freeing of pagetables to synchronize between ptdump
and the lazy freeing of vmalloc pagetables / hot-unplug pagetable freeing.
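
For the record, what I have in mind there is something like the below.
This is purely hypothetical: free_pte_table_rcu() does not exist and the
ptdesc plumbing is hand-waved, it's only meant to show the shape of the
idea.

	/* freer side, e.g. pmd_free_pte_page(): defer the actual free */
	pmd_clear(pmdp);
	__flush_tlb_kernel_pgtable(addr);
	call_rcu(&ptdesc->pt_rcu_head, free_pte_table_rcu);

	/* ptdump side: walk each level under rcu_read_lock() instead of
	 * holding mmap_write_lock(&init_mm) for the whole walk; a table
	 * that is unlinked concurrently then stays valid until the walker
	 * leaves its RCU read-side critical section. */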
>
>>
>> Will
>