[PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

Thu Jul 11 04:39:52 PDT 2024

On Thu, Jul 11, 2024 at 02:31:25AM -0600, Yu Zhao wrote:
> On Wed, Jul 10, 2024 at 5:07 PM Yu Zhao <yuzhao at google.com> wrote:
> > On Wed, Jul 10, 2024 at 4:29 PM Catalin Marinas <catalin.marinas at arm.com> wrote:
> > > The Arm ARM states that we need a BBM if we change the output address
> > > and: the old or new mappings are RW *or* the content of the page
> > > changes. Ignoring the latter (page content), we can turn the PTEs RO
> > > first without changing the pfn followed by changing the pfn while they
> > > are RO. Once that's done, we make entry 0 RW and, of course, with
> > > additional TLBIs between all these steps.
> >
> > Aha! This is easy to do -- I just made the RO guaranteed, as I
> > mentioned earlier.
> >
> > Just to make sure I fully understand the workflow:
> >
> > 1. Split a RW PMD into 512 RO PTEs, pointing to the same 2MB `struct page` area.

I don't think we can turn all of them RO here since some of those 512
PTEs are not related to the hugetlb page. So you'd need to keep them RW
but preserving the pfn so that there's no actual translation change. I
think that's covered by FEAT_BBM level 2. Basically this step should be
only about breaking up a PMD block entry into a table entry.

> > 2. TLBI once, after pmd_populate_kernel()
> > 3. Remap PTE 1-7 to the 4KB `struct page` area of PTE 0, for every 8
> >    PTEs, while they remain RO.

You may need some intermediate step to turn these PTEs read-only since
step 1 should leave them RW. Also, if we want to free and order-3 page
here, it might be better to allocate an order 0 even for PTE entry 0 (I
had the impression that's what the core code does, I haven't checked).

> > 4. TLBI once, after set_pte_at() on PTE 1-7.
> > 5. Change PTE 0 from RO to RW, pointing to the same 4KB `struct page` area.
> > 6. TLBI once, after set_pte_at() on PTE 0.
> > 
> > No BBM required, regardless of FEAT_BBM level 2.
> 
> I just studied D8.16.1 from the reference manual, and it seems to me:
> 1. We still need either FEAT_BBM or BBM to split PMD.

Yes.

> 2. We still need BBM when we change PTE 1-7, because even if they
> remain RO, the content of the `struct page` page at the new location
> does not match that at the old location.

Yes, in theory, the data at the new pfn should be the same. We could try
to get clarification from the architects on what could go wrong but I
suspect it's some atomicity is not guarantee if you read the data (the
CPU getting confused whether to read from the old or the new page).

Otherwise, since after all these steps PTEs 1-7 point to the same data
as PTE 0, before step 3 we could copy the data in page 0 over to the
other 7 pages while entries 1-7 are still RO. The remapping afterwards
would be fully compliant.

> > > Can we leave entry 0 RO? This would save an additional TLBI.
> >
> > Unfortunately we can't. Otherwise we wouldn't be able to, e.g., grab a
> > refcnt on any hugeTLB pages.

OK, fair enough.

> > > Now, I wonder if all this is worth it. What are the scenarios where the
> > > 8 PTEs will be accessed? The vmemmap range corresponding to a 2MB
> > > hugetlb page for example is pretty well defined - 8 x 4K pages, aligned.
> 
> One of the fundamental assumptions in core MM is that anyone can
> read or try to grab (write) a refcnt from any `struct page`. Those
> speculative PFN walkers include memory compaction, etc.

But how does this work if PTEs 1-7 are RO? Do those walkers detect it's
a tail page and skip it. Actually, if they all point to the same vmemmap
page, how can one distinguish a tail page via PTE 1 from the head page
via PTE 0?

BTW, I'll be on holiday from tomorrow for two weeks and won't be able to
follow up on this thread (and likely to forget all the discussion by the
time I get back ;)).

-- 
Catalin