[PATCH v4] arm64: Enable vmalloc-huge with ptdump

Sun Jul 6 17:44:23 PDT 2025

On Fri, Jul 04, 2025 at 05:12:13PM +0530, Dev Jain wrote:
> On 04/07/25 4:52 pm, Catalin Marinas wrote:
> > On Thu, Jun 26, 2025 at 10:55:24AM +0530, Dev Jain wrote:
> > > @@ -1301,16 +1314,39 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
> > >   	}
> > >   	table = pmd_offset(pudp, addr);
> > > +	/*
> > > +	 * Isolate the PMD table; in case of race with ptdump, this helps
> > > +	 * us to avoid taking the lock in __pmd_free_pte_page().
> > > +	 *
> > > +	 * Static key logic:
> > > +	 *
> > > +	 * Case 1: If ptdump does static_branch_enable(), and after that we
> > > +	 * execute the if block, then this patches in the read lock, ptdump has
> > > +	 * the write lock patched in, therefore ptdump will never read from
> > > +	 * a potentially freed PMD table.
> > > +	 *
> > > +	 * Case 2: If the if block starts executing before ptdump's
> > > +	 * static_branch_enable(), then no locking synchronization
> > > +	 * will be done. However, pud_clear() + the dsb() in
> > > +	 * __flush_tlb_kernel_pgtable will ensure that ptdump observes an
[...]
> > I don't get case 2. You want to ensure pud_clear() is observed by the
> > ptdump code before the pmd_free(). The DSB in the TLB flushing code
> > ensures some ordering between the pud_clear() and presumably something
> > that the ptdump code can observe as well. Would that be the mmap
> > semaphore? However, the read_lock would only be attempted if this code
> > is seeing the static branch update, which is not guaranteed. I don't
> > think it even matters since the lock may be released anyway before the
> > write_lock in ptdump.
> > 
> > For example, you do a pud_clear() above, skip the whole static branch.
> > The ptdump comes along on another CPU but does not observe the
> > pud_clear() since there's no synchronisation. It goes ahead and
> > dereferences it while this CPU does a pmd_free().
> 
> The objective is: ptdump must not dereference a freed pagetable.
> So for your example, if the static branch is not observed by
> pud_free_pmd_page, this means that ptdump will take the write lock
> after the execution of flush_tlb_kernel_pagetable completes (for if ptdump takes
> the write lock before execution of flush_tlb_kernel_pagetable completes, we have
> executed static_branch_enable(), contradiction).

I don't see why the write lock matters since pud_free_pmd_page() doesn't
take the read lock in the second scenario. What we need is acquire
semantics after the static branch update on the ptdump path but we get
it before we even attempt the write lock.

For simplicity, ignoring the observability of instruction writes and
considering the static branch a variable, if pud_free_pmd_page() did not
observe the static branch update, is the ptdump guaranteed to see the
cleared pud subsequently?

With initial state pud=1 (non-zero), stb=0 (static branch):

P0 (pud_free_pmd_page)		P1 (ptdump)

    W_pud=0			   W_stb=1
    DSB				   barrier/acq
    R_stb=0			   R_pud=?

The write to the static branch on P1 will be ordered after the read of
the branch on P0, so the pud will be seen as 0. It's not even worth
mentioning the semaphore here as the static branch update has enough
barriers for cache flushing and kick_all_cpus_sync().

The other scenario is P0 (pud_free_pmd_page) observing the write to the
static branch (that's case 1 in your comment). This doesn't say anything
about whether P1 (ptdump) sees a clear or valid pud. What we do know is
that P0 will try to acquire (and release) the lock. If P1 already
acquired the write lock, P0 will wait and the state of the pud is
irrelevant (no freeing). Similarly if P1 already completed by the time
P0 takes the lock.

If P0 takes the lock first, the lock release guarantees that the
pud_clear() is seen by the ptdump code _after_ it acquired the lock.

W.r.t. the visibility of the branch update vs pud access, the
combinations of DSB+ISB (part of the TLB flushing) on P0 and cache
maintenance to PoU together with kick_all_cpus_sync() on P1 should
suffice.

I think the logic works (though I'm heavily jetlagged and writing from a
plane) but the comments need to be improved. As described above, case 1
has two sub-cases depending on when P0 runs in relation to the write
lock (before or during/after). And the write lock doesn't matter for
case 2.

> > And I can't get my head around memory ordering, it doesn't look sound.
> > static_branch_enable() I don't think has acquire semantics, at least not
> > in relation to the actual branch update. The static_branch_unlikely()
> > test, again, not sure what guarantees it has (I don't think it has any
> > in relation to memory loads). Maybe you have worked it all out and is
> > fine but it needs a better explanation and ideally some simple formal
> > model. Show it's correct with a global variable first and then we can
> > optimise with static branches.
> 
> What do you suggest? As in what do you mean by showing its correct with
> a global variable first...and, for the formal model thingy, do you
> want mathematical rigor similar to what you explain in [1] :P, because unfortunately
> (and quite embarrassingly) I didn't have formal methods in college : )

Neither did I ;) (mostly analogue electronics). I was thinking something
like our cat/herd tools where you can write some simple assembly. It's a
good investment if you want to give it a try.

-- 
Catalin