[PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

Thu Apr 4 14:48:03 PDT 2024

On Tue, Mar 26, 2024 at 11:02:52AM -0300, Jason Gunthorpe wrote:
> The more I look at this the more I think we need to get to Matthew's
> idea of having some kind of generic page table API that is not tightly
> tied to level. Replacing the hugetlb trick of 'everything is a PTE'
> with 5 special cases in every place seems just horrible.
> 
>    struct mm_walk_ops {
>        int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
>    }
> 
> And many cases really want something like:
>    struct mm_walk_state state;
> 
>    if (!mm_walk_seek_leaf(state, mm, address))
>           goto no_present
>    if (mm_walk_is_write(state)) ..
> 
> And detailed walking:
>    for_each_pt_leaf(state, mm, address) {
>        if (mm_walk_is_write(state)) ..
>    }
> 
> Replacing it with a mm_walk_state that retains the level or otherwise
> to allow decoding any entry composes a lot better. Forced Loop
> unrolling can get back to the current code gen in alot of places.
> 
> It also makes the power stuff a bit nicer as the mm_walk_state could
> automatically retain back pointers to the higher levels in the state
> struct too...
> 
> The puzzle is how to do it and still get reasonable efficient codegen,
> many operations are going to end up switching on some state->level to
> know how to decode the entry.

These discussions are definitely constructive, thanks Jason.  Very helpful.

I thought about this last week but got interrupted.  It does make sense to
me; it looks pretty generic and it is flexible enough as a top design.  At
least that's what I thought.

However now when I rethink about it, and look more into the code when I got
the chance, it turns out this will be a major rewrite of mostly every
walkers..  it doesn't mean that this is a bad idea, but then I'll need to
compare the other approach, because there can be a huge difference on when
we can get that code ready, I think. :)

Consider that what we (or.. I) want to teach the pXd layers are two things
right now: (1) hugetlb mappings (2) MMIO (PFN) mappings.  That mostly
shares the generic concept when working on the mm walkers no matter which
way to go, just different treatment on different type of mem.  (2) is on
top of current code and new stuff, while (1) is a refactoring to drop
hugetlb_entry() hook point as the goal.

Taking a simplest mm walker (smaps) as example, I think most codes are
ready thanks to THP's existance, and also like vm_normal_page[_pmd]() which
should even already work for pfnmaps; pud layer is missing but that should
be trivial.  It means we may have chance to drop hugetlb_entry() without an
huge overhaul yet.

Now the important question I'm asking myself is: do we really need huge p4d
or even bigger?  It's 512GB on x86, and we said "No 512 GiB pages yet"
(commit fe1e8c3e963) since 2017 - that is 7 years without chaning this
fact.  While on non-x86 p4d_leaf() never defined.  Then it's also
interesting to see how many codes are "ready" to handle p4d entries (by
looking at p4d_leaf() calls; much easier to see with the removal of the
rest huge apis..) even if none existed.

So, can we over-engineer too much if we go the generic route now?
Considering that we already have most of pmd/pud entries around in the mm
walker ops.  So far it sounds better we leave it for later, until further
justifed to be useful.  And that won't block it if it ever justified to be
needed, I'd say it can also be seen as a step forward if I can make it to
remove hugetlb_entry() first.

Comments welcomed (before I start to work on anything..).

Thanks,

-- 
Peter Xu