[PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

Wed May 31 15:18:43 PDT 2023

On Wed, May 31, 2023 at 10:54 PM Peter Xu <peterx at redhat.com> wrote:
> On Wed, May 31, 2023 at 05:34:58PM +0200, Jann Horn wrote:
> > On Mon, May 29, 2023 at 8:25 AM Hugh Dickins <hughd at google.com> wrote:
> > > -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> > > -                              struct mm_struct *target_mm,
> > > -                              unsigned long target_addr, struct page *hpage,
> > > -                              struct collapse_control *cc)
> > > +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > >  {
> > >         struct vm_area_struct *vma;
> > > -       int target_result = SCAN_FAIL;
> > >
> > > -       i_mmap_lock_write(mapping);
> > > +       i_mmap_lock_read(mapping);
> > >         vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > -               int result = SCAN_FAIL;
> > > -               struct mm_struct *mm = NULL;
> > > -               unsigned long addr = 0;
> > > -               pmd_t *pmd;
> > > -               bool is_target = false;
> > > +               struct mm_struct *mm;
> > > +               unsigned long addr;
> > > +               pmd_t *pmd, pgt_pmd;
> > > +               spinlock_t *pml;
> > > +               spinlock_t *ptl;
> > >
> > >                 /*
> > >                  * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > > -                * got written to. These VMAs are likely not worth investing
> > > -                * mmap_write_lock(mm) as PMD-mapping is likely to be split
> > > -                * later.
> > > +                * got written to. These VMAs are likely not worth removing
> > > +                * page tables from, as PMD-mapping is likely to be split later.
> > >                  *
> > > -                * Note that vma->anon_vma check is racy: it can be set up after
> > > -                * the check but before we took mmap_lock by the fault path.
> > > -                * But page lock would prevent establishing any new ptes of the
> > > -                * page, so we are safe.
> > > -                *
> > > -                * An alternative would be drop the check, but check that page
> > > -                * table is clear before calling pmdp_collapse_flush() under
> > > -                * ptl. It has higher chance to recover THP for the VMA, but
> > > -                * has higher cost too. It would also probably require locking
> > > -                * the anon_vma.
> > > +                * Note that vma->anon_vma check is racy: it can be set after
> > > +                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > > +                * prevented establishing new ptes of the page. So we are safe
> > > +                * to remove page table below, without even checking it's empty.
> >
> > This "we are safe to remove page table below, without even checking
> > it's empty" assumes that the only way to create new anonymous PTEs is
> > to use existing file PTEs, right? What about private shmem VMAs that
> > are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
> > the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
> > looking at the mapping and its pages (except for checking that the
> > insertion point is before end-of-file), protected only by mmap_lock
> > (shared) and pte_offset_map_lock().
>
> Hmm, yes.  We probably need to keep that though, and 5b51072e97 explained
> the reason (to still respect file permissions).
>
> Maybe the anon_vma check can also be moved into the pgtable lock section,
> with some comments explaining (but it's getting a bit ugly..)?

Or check that all entries are pte_none() or something like that inside
the pgtable-locked section?

[...]
> > The old code called collapse_and_free_pmd(), which involves MMU
> > notifier invocation...
[...]
> > ... while the new code only does pmdp_collapse_flush(), which clears
> > the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> > notifiers. My understanding is that that's problematic - maybe (?) it
> > is sort of okay with regards to classic MMU notifier users like KVM,
> > but it's probably wrong for IOMMUv2 users, where an IOMMU directly
> > consumes the normal page tables?
>
> The iommuv2 wasn't "consuming" the pgtables?

My wording was confusing, I meant that as "the iommuv2 hardware
directly uses/walks the page tables".

> IIUC it relies on that to
> make sure no secondary (and illegal) tlb exists in the iommu tlbs.
>
> For this case if the pgtable _must_ be empty when reaching here (we'd
> better make sure of it..), maybe we're good?  Because we should have just
> invalidated once when unmap all the pages in the thp range, so no existing
> tlb should generate anyway for either cpu or iommu hardwares.

My headcanon is that there are approximately three reasons why we
normally have to do iommuv2 invalidations and I think one or two of
them might still apply here, though admittedly I haven't actually dug
up documentation on how this stuff actually works for IOMMUv2, so
maybe one of y'all can tell me that my concerns here are unfounded:

1. We have to flush normal TLB entries. This is probably not necessary
if the page table contains no entries.
2. We might have to flush "paging-structure caches" / "intermediate
table walk caches", if the IOMMU caches the physical addresses of page
tables to skip some levels of page table walk. IDK if IOMMUs do that,
but normal MMUs definitely do it, so I wouldn't be surprised if the
IOMMUs did it too (or reserved the right to do it in a future hardware
generation or whatever).
3. We have to *serialize* with page table walks performed by the
IOMMU. We're doing an RCU barrier to synchronize against page table
walks from the MMU, but without an appropriate mmu_notifier call, we
have nothing to ensure that we aren't yanking a page table out from
under an IOMMU page table walker while it's in the middle of its walk.
Sure, this isn't very likely in practice, the IOMMU page table walker
is probably pretty fast, but still we need some kind of explicit
synchronization to make this robust, I think.

> However OTOH, maybe it'll also be safer to just have the mmu notifiers like
> before (e.g., no idea whether anything can cache invalidate tlb
> translations from the empty pgtable)? As that doesn't seems to beat the
> purpose of the patchset as notifiers shouldn't fail.
>
> >
> > (FWIW, last I looked, there also seemed to be some other issues with
> > MMU notifier usage wrt IOMMUv2, see the thread
> > <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/>.)