[RESEND PATCH v5] arm64: Enable vmalloc-huge with ptdump

Thu Jul 24 00:20:22 PDT 2025

+ Ash Wilding. He may be interested in this patch because it concerns memory
barriers, and I always refer to his youtube video for this stuff : )

On 24/07/25 11:20 am, Anshuman Khandual wrote:
> A very small nit -
>
> arm64/mm: Enable vmalloc-huge with ptdump

Sure.

>
> On 23/07/25 9:48 PM, Dev Jain wrote:
>> Our goal is to move towards enabling vmalloc-huge by default on arm64 so
>> as to reduce TLB pressure. Therefore, we need a way to analyze the portion
>> of block mappings in vmalloc space we can get on a production system; this
>> can be done through ptdump, but currently we disable vmalloc-huge if
>> CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
>> pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
>> may dereference a bogus address.
>>
>> To solve this, we need to synchronize ptdump_walk_pgd() with
>> pud_free_pmd_page() and pmd_free_pte_page().
>>
>> Since this race is very unlikely to happen in practice, we do not want to
>> penalize other code paths taking the init_mm mmap_lock. Therefore, we use
>> static keys. ptdump will enable the static key - upon observing that, the
>> vmalloc table freeing path will get patched in with an
>> mmap_read_lock/unlock sequence. A code comment explains in detail, how
>> a combination of acquire sematics of static_branch_enable() and the
> typo                       ^^^^^^^^ s/sematics/semantics

Thanks.

>
>> barriers in __flush_tlb_kernel_pgtable() ensures that ptdump will never
>> get a hold on the address of a freed PMD or PTE table.
>>
>> For an (almost) rigorous proof of correctness, one may look at:
>> https://lore.kernel.org/all/e1e87f16-1c48-481b-8f7c-9333ac5d13e7@arm.com/
>>
>> Reviewed-by: Ryan Roberts <ryan.roberts at arm.com>
>> Signed-off-by: Dev Jain <dev.jain at arm.com>
>> ---
>> Rebased on 6.16-rc7.
>>
>> mm-selfests pass. No issues were observed while parallelly running
>> test_vmalloc.sh (which stresses the vmalloc subsystem) and dumping the
>> kernel pagetable through sysfs in a loop.
>>
>> v4->v5:
>>   - The arch_vmap_pxd_supported() changes were dropped by mistake in between
>>     the revisions, add them back (Anshuman)
>>   - Rewrite cover letter, drop stray change, add arm64_ptdump_walk_pgd()
>>     helper, rename ptdump_lock_key -> arm64_ptdump_lock_key, add comment
>>     above __pmd_free_pte_page() to explain when the lock won't
>>     be taken (Anshuman)
>>   - Rewrite the comment explaining the synchronization logic (Catalin)
>>
>> v3->v4:
>>   - Lock-unlock immediately
>>   - Simplify includes
>>
>> v2->v3:
>>   - Use static key mechanism
>>
>> v1->v2:
>>   - Take lock only when CONFIG_PTDUMP_DEBUGFS is on
>>   - In case of pud_free_pmd_page(), isolate the PMD table to avoid taking
>>     the lock 512 times again via pmd_free_pte_page()
>>
>> ---
>>   arch/arm64/include/asm/ptdump.h  |  2 +
>>   arch/arm64/include/asm/vmalloc.h |  6 +--
>>   arch/arm64/mm/mmu.c              | 86 ++++++++++++++++++++++++++++++--
>>   arch/arm64/mm/ptdump.c           | 11 +++-
>>   4 files changed, 95 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/ptdump.h b/arch/arm64/include/asm/ptdump.h
>> index fded5358641f..baff24004459 100644
>> --- a/arch/arm64/include/asm/ptdump.h
>> +++ b/arch/arm64/include/asm/ptdump.h
>> @@ -7,6 +7,8 @@
>>   
>>   #include <linux/ptdump.h>
>>   
>> +DECLARE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
>> +
>>   #ifdef CONFIG_PTDUMP
>>   
>>   #include <linux/mm_types.h>
>> diff --git a/arch/arm64/include/asm/vmalloc.h b/arch/arm64/include/asm/vmalloc.h
>> index 12f534e8f3ed..e835fd437ae0 100644
>> --- a/arch/arm64/include/asm/vmalloc.h
>> +++ b/arch/arm64/include/asm/vmalloc.h
>> @@ -12,15 +12,13 @@ static inline bool arch_vmap_pud_supported(pgprot_t prot)
>>   	/*
>>   	 * SW table walks can't handle removal of intermediate entries.
>>   	 */
> This above comment can now be dropped.
>
>> -	return pud_sect_supported() &&
>> -	       !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return pud_sect_supported();
>>   }

I keep missing something or the other :(

>>   
>>   #define arch_vmap_pmd_supported arch_vmap_pmd_supported
>>   static inline bool arch_vmap_pmd_supported(pgprot_t prot)
>>   {
>> -	/* See arch_vmap_pud_supported() */
>> -	return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS);
>> +	return true;
>>   }
>>   
>>   #define arch_vmap_pte_range_map_size arch_vmap_pte_range_map_size
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index 00ab1d648db6..49932c1dd34e 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -46,6 +46,8 @@
>>   #define NO_CONT_MAPPINGS	BIT(1)
>>   #define NO_EXEC_MAPPINGS	BIT(2)	/* assumes FEAT_HPDS is not used */
>>   
>> +DEFINE_STATIC_KEY_FALSE(arm64_ptdump_lock_key);
> A short introduction about the static key's purpose might be helpful.

It may be, but I don't see comments above static key declarations anywhere else
in the codebase. My opinion is that adding a comment here will be over-documenting
stuff.

>
>> +
>>   enum pgtable_type {
>>   	TABLE_PTE,
>>   	TABLE_PMD,
>> @@ -1267,7 +1269,12 @@ int pmd_clear_huge(pmd_t *pmdp)
>>   	return 1;
>>   }
>>   
>> -int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +/*
>> + * If PMD has been isolated via pud_free_pmd_page(), ptdump cannot get
>> + * a hold to it, so no need to serialize with mmap_lock, hence lock
>> + * will be passed as false here. Otherwise, lock will be true.
>> + */
> This comment should be split into two and added near their respective
> call sites with and without the kernel mmap_lock requirement.

That would be better.

>
>> +static int __pmd_free_pte_page(pmd_t *pmdp, unsigned long addr, bool lock)IIUC 'lock' is the need for taking mmap_lock in init_mm. Hence changing
> that as 'mmap_lock' or 'kernel_mmap_lock' might be better which will
> also add some required context.

Makes sense.

>
>>   {
>>   	pte_t *table;
>>   	pmd_t pmd;
>> @@ -1279,13 +1286,24 @@ int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>>   		return 1;
>>   	}
>>   
>> +	/* See comment in pud_free_pmd_page for static key logic */
>>   	table = pte_offset_kernel(pmdp, addr);
>>   	pmd_clear(pmdp);
>>   	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&arm64_ptdump_lock_key) && lock) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
>> +
>>   	pte_free_kernel(NULL, table);
>>   	return 1;
>>   }
>>   
>> +int pmd_free_pte_page(pmd_t *pmdp, unsigned long addr)
>> +{
>> +	return __pmd_free_pte_page(pmdp, addr, true);
>> +}
>> +
>>   int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   {
>>   	pmd_t *table;
>> @@ -1301,16 +1319,76 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
>>   	}
>>   
>>   	table = pmd_offset(pudp, addr);
>> +
>> +	/*
>> +	 * Our objective is to prevent ptdump from reading a PMD table which has
>> +	 * been freed.  Assume that ptdump_walk_pgd() (call this thread T1)
>> +	 * executes completely on CPU1 and pud_free_pmd_page() (call this thread
>> +	 * T2) executes completely on CPU2. Let the region sandwiched by the
>> +	 * mmap_write_lock/unlock in T1 be called CS (the critical section).
>> +	 *
>> +	 * Claim: The CS of T1 will never operate on a freed PMD table.
>> +	 *
>> +	 * Proof:
>> +	 *
>> +	 * Case 1: The static branch is visible to T2.
>> +	 *
>> +	 * Case 1 (a): T1 acquires the lock before T2 can.
>> +	 * T2 will block until T1 drops the lock, so pmd_free() will only be
>> +	 * executed after T1 exits CS.
> T2 blocks here because ptdump_walk_pgd() takes mmap_write_lock(). This needs
> to be mentioned.

I have declared the case as "T1 acquires the lock before T2 can". Then I
write "T2 will block until T1 drops the lock". This exactly means what you
mention above.

>
>> +	 *> +	 * Case 1 (b): T2 acquires the lock before T1 can.
>> +	 * The sequence of barriers issued in __flush_tlb_kernel_pgtable()
>> +	 * ensures that an empty PUD (via pud_clear()) is visible to T1 before
>> +	 * T1 can enter CS, therefore it is impossible for the CS to get hold
>> +	 * of the address of the isolated PMD table.
> Agreed. mmap_write_lock() from ptdump_walk_pgd() will proceed after T2 calls
> mmap_read_unlock(). So protection from race comes from the fact that T1 can
> never get hold of isolated PMD table not because of the mmap_lock.

Thanks for confirmation.

>
>> +	 *
>> +	 * Case 2: The static branch is not visible to T2.
>> +	 *
>> +	 * Since static_branch_enable() (via dmb(ish)) and mmap_write_lock()
>> +	 * have acquire semantics, it is guaranteed that the static branch
>> +	 * will be visible to all CPUs before T1 can enter CS. The static
>> +	 * branch not being visible to T2 therefore guarantees that T1 has
>> +	 * not yet entered CS .... (i)
>> +	 * The sequence of barriers via __flush_tlb_kernel_pgtable() in T2
>> +	 * implies that if the invisibility of the static branch has been
>> +	 * observed by T2 (i.e static_branch_unlikely() is observed as false),
>> +	 * then all CPUs will have observed an empty PUD ... (ii)
>> +	 * Combining (i) and (ii), we conclude that T1 observes an empty PUD
>> +	 * before entering CS => it is impossible for the CS to get hold of
>> +	 * the address of the isolated PMD table. Q.E.D
>> +	 *> +	 * We have proven that the claim is true on the assumption that
>> +	 * there is no context switch for T1 and T2. Note that the reasoning
>> +	 * of the proof uses barriers operating on the inner shareable domain,
>> +	 * which means that they will affect all CPUs, and also a context switch
>> +	 * will insert extra barriers into the code paths => the claim will
>> +	 * stand true even if we drop the assumption.
> Do these all rest on the fact that static_branch_enable() and mmap_write_lock()
> have acquire semantics available via a particular class of barrier instructions

Yes.

> ? In which case should the generation of those instructions via these functions
> be asserted for the above T1/T2 guarantee to hold ? Just curious.

Nice suggestion but how do we do that :) I feel it will be being over-cautious.

Although one can check by tracing the code paths that static_branch_enable()
and mmap_write_lock() will emit dmb(ish), one does not need to bother about
looking at the actual implementations. Let us ask ourselves what these
two functions must guarantee us on any operating system.

In the below sequence, let CSI stand for critical section instruction.

Take lock
CSI-1
CSI-2
Release lock

Since mmap_write_lock() is about locking an rw semaphore for protecting a
critical section, the computer science definition of a lock must ensure that
CSI-1 does not get reordered before taking the lock, CSI-2 does not get reordered
after releasing the lock, otherwise the critical section may be expanded beyond
the locks, thus contradicting the definition of a critical section. Hence any
implementation must have acquire and release semantics for locking and unlocking.
(Note that we do not care about instructions operating on registers since registers
are local to the CPU - we need to synchronize only memory loads and stores since
memory is shared). A similar reasoning can be used when one CPU does static_branch_enable()
and others observe it with static_branch_unlikely(). In the actual Linux implementation,
these converge onto __smp_mb_after_atomic() -> the atomic cmpxchg must be followed by
an smp_mb(), so that no CS instruction is executed before the doing the memory store
corresponding to acquiring the lock.

>
>> +	 *
>> +	 * It is also worth reasoning whether something can go wrong via
>> +	 * pud_free_pmd_page() -> __pmd_free_pte_page(), since the latter
>> +	 * will be called locklessly on this code path.
>> +	 *
>> +	 * For Case 1 (a), T2 will block until CS is finished, so we are safe.
>> +	 * For Case 1 (b) and Case 2, the PMD table will be isolated before
>> +	 * T1 can enter CS, therefore it is safe for T2 to operate on the
>> +	 * PMD table locklessly.
>> +	 */
> Although looks reasonable - the above comment block is going to be difficult to
> process after time has passed :) Just wondering could it some how be simplified ?

IMHO it is easy to process, but I have written that so obviously I am biased : )
I can only comment if you can give a suggestion of your own.

>
>> +	pud_clear(pudp);
>> +	__flush_tlb_kernel_pgtable(addr);
>> +	if (static_branch_unlikely(&arm64_ptdump_lock_key)) {
>> +		mmap_read_lock(&init_mm);
>> +		mmap_read_unlock(&init_mm);
>> +	}
>> +
>>   	pmdp = table;
>>   	next = addr;
>>   	end = addr + PUD_SIZE;
>>   	do {
>>   		if (pmd_present(pmdp_get(pmdp)))
>> -			pmd_free_pte_page(pmdp, next);
>> +			__pmd_free_pte_page(pmdp, next, false);
>>   	} while (pmdp++, next += PMD_SIZE, next != end);
>>   
>> -	pud_clear(pudp);
>> -	__flush_tlb_kernel_pgtable(addr);
>>   	pmd_free(NULL, table);
>>   	return 1;
>>   }
>> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
>> index 421a5de806c6..65335c7ba482 100644
>> --- a/arch/arm64/mm/ptdump.c
>> +++ b/arch/arm64/mm/ptdump.c
>> @@ -283,6 +283,13 @@ void note_page_flush(struct ptdump_state *pt_st)
>>   	note_page(pt_st, 0, -1, pte_val(pte_zero));
>>   }
>>   
>> +static void arm64_ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm)
>> +{
>> +	static_branch_enable(&arm64_ptdump_lock_key);
>> +	ptdump_walk_pgd(st, mm, NULL);
>> +	static_branch_disable(&arm64_ptdump_lock_key);
>> +}
> Encapsulating generic ptdump_walk_pgd() between arm64 platform
> specific static key enable/disable requirement does make sense.
>
>> +
>>   void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>>   {
>>   	unsigned long end = ~0UL;
>> @@ -311,7 +318,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
>>   		}
>>   	};
>>   
>> -	ptdump_walk_pgd(&st.ptdump, info->mm, NULL);
>> +	arm64_ptdump_walk_pgd(&st.ptdump, info->mm);
>>   }
>>   
>>   static void __init ptdump_initialize(void)
>> @@ -353,7 +360,7 @@ bool ptdump_check_wx(void)
>>   		}
>>   	};
>>   
>> -	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
>> +	arm64_ptdump_walk_pgd(&st.ptdump, &init_mm);
>>   
>>   	if (st.wx_pages || st.uxn_pages) {
>>   		pr_warn("Checked W+X mappings: FAILED, %lu W+X pages found, %lu non-UXN pages found\n",