LPA2 on non-LPA2 hardware broken with 16K pages

Will Deacon will at kernel.org
Wed Jul 24 04:33:41 PDT 2024


On Tue, Jul 23, 2024 at 06:28:16PM +0200, Ard Biesheuvel wrote:
> On Tue, 23 Jul 2024 at 18:05, Will Deacon <will at kernel.org> wrote:
> >
> > On Tue, Jul 23, 2024 at 05:02:15PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 23 Jul 2024 at 16:52, Will Deacon <will at kernel.org> wrote:
> > > > On Fri, Jul 19, 2024 at 11:02:29AM -0700, Ard Biesheuvel wrote:
> ...
> > > > >
> > > > > We might add
> > > > >
> > > > > if (pgtable_l4_enabled())
> > > > >     pgdp = &pgd;
> > > > >
> > > > > here to preserve the existing 'lockless' behavior when PUDs are not
> > > > > folded.
> > > >
> > > > The code still needs to be 'lockless' for the 5-level case, so I don't
> > > > think this is necessary.
> > >
> > > The 5-level case is never handled here.
> >
> > Urgh, yes, sorry. I've done a fantastically bad job of explaining myself.
> >
> > > There is the 3-level case, where the runtime PUD folding needs the
> > > actual address in order to recalculate the descriptor address using
> > > the correct shift. In this case, we don't dereference the pointer
> > > anyway so the 'lockless' thing doesn't matter (afaict)
> > >
> > > In the 4-level case, we want to preserve the original behavior, where
> > > pgd is not reloaded from pgdp. Setting pgdp to &pgd achieves that.
> >
> > Right. What I'm trying to get at is the case where we have folding. For
> > example, with my patch applied, if we have 3 levels then the lockless
> > GUP walk looks like:
> >
> >
> > pgd_t pgd = READ_ONCE(*pgdp);
> >
> > p4dp = p4d_offset_lockless(pgdp, pgd, addr);
> >         => Returns pgdp
> > p4d_t p4d = READ_ONCE(*p4dp);
> >
> > pudp = pud_offset_lockless(p4dp, p4d, addr);
> >         => Returns &p4d, which is again the pgdp
> > pud_t pud = READ_ONCE(*pudp);
> >
> >
> > So here we're re-reading the same entry through the same pointer
> > multiple times, and my argument is that if we need to add logic to
> > avoid this for the pgtable_l4_enabled() case, then we have bigger
> > problems.
> >
> 
> The 3-level case is not relevant here. My suggestion only affects the
> 4-level case:

That's exactly what I'm uneasy about :/

> if (pgtable_l4_enabled())
>    pgdp = &pgd;
> 
> which prevents us from evaluating *pgdp twice, which seems to me to be
> the reason these routines exist in the first place. Given that the
> 3-level runtime-folded case is the one we are trying to fix here, I'd
> argue that keeping the 4-level case the same as before is important.

I think consistency between 4-level and 3-level is far more important.
Adding code to avoid reloading the entry for one specific case (the
pgtable_l4_enabled() case), whilst requiring other cases (e.g. the
3-level compile-time folded case) to reload from the pointer is
inconsistent. Either they both need it or neither of them needs it, no?
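
To spell out what I mean by "reload" here (purely illustrative, not
code from the patch):

	/* (a) reload from the live pointer: a second read can observe a
	 *     different (possibly cleared) entry */
	p4d_t p4d = READ_ONCE(*p4dp);		/* p4dp == pgdp when folded */

vs.

	/* (b) re-read our stashed copy: the value we already validated
	 *     can't change under our feet */
	p4d_t p4d = READ_ONCE(*(p4d_t *)&pgd);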

> > > > Yes, we'll load the same entry multiple times,
> > > > but it should be fine because they're in the context of a different
> > > > (albeit folded) level.
> > > >
> > >
> > > I don't understand what you are saying here. Why is that fine?
> >
> > I think it's fine because (a) the CPU guarantees same address
> > read-after-read ordering and (b) We only evaluate the most recently read
> > value. It would be a problem if we mixed data from different reads but,
> > because the use is confined to that 'level', we don't end up doing that.
> >
> > Dunno, am I making any sense?
> >
> 
> So what is the point of p?d_offset_lockless()? Is it a performance
> optimization that we don't care about on arm64? Or does this reasoning
> still only apply to the folded case?

Well, naturally it's all undocumented. S390 added the macros, but it
looks like that was because they hit a similar problem to the one
reported by Lina (see d3f7b1bb2040 ("mm/gup: fix gup_fast with dynamic
page table folding")) and, really, that change is just moving the logic
out of the fast GUP code and into some new macros.
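
For reference, the generic fallbacks that commit added look roughly
like this (paraphrased from include/linux/pgtable.h; check the tree
for the exact form):

	#ifndef p4d_offset_lockless
	#define p4d_offset_lockless(pgdp, pgd, address) \
		p4d_offset(&(pgd), address)
	#endif
	#ifndef pud_offset_lockless
	#define pud_offset_lockless(p4dp, p4d, address) \
		pud_offset(&(p4d), address)
	#endif
	#ifndef pmd_offset_lockless
	#define pmd_offset_lockless(pudp, pud, address) \
		pmd_offset(&(pud), address)
	#endif

i.e. the generic versions walk from the value we've already read
rather than from the live pointer, and s390 overrides them so that,
when a level is dynamically folded, it can hand back the real pointer
instead.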

If you think about trying to do fast GUP using the regular pXd_offset()
macros with 5 levels, the problem is a little more obvious. For example,
it would look something like:

	// gup_fast_pgd_range()
	pgd_t pgd = READ_ONCE(*pgdp);

	if (pgd_none(pgd))
		return;

	...

	// gup_fast_p4d_range()
	p4dp = p4d_offset(pgdp, addr);
		=> READ_ONCE(*pgdp);

A concurrent writer could, for example, clear *pgdp between the two
reads, and we'd end up with junk in p4dp: a time-of-check-to-time-of-use
(ToCToU) type bug.
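
Spelling out the interleaving (illustrative only):

	pgd_t pgd = READ_ONCE(*pgdp);	// CPU 0: sees a valid table entry

					// CPU 1: clears *pgdp and frees the
					//        table it pointed to

	p4dp = p4d_offset(pgdp, addr);	// CPU 0: derefs *pgdp again; the
					//        entry is now gone, so p4dp
					//        is computed from junk

	p4d_t p4d = READ_ONCE(*p4dp);	// CPU 0: loads through a bogus pointer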

But I don't think that applies to the case we're discussing, because the
reload of the entry happens in the following READ_ONCE() rather than
in the _offset macro:

	p4d_t p4d = READ_ONCE(*p4dp);

and, as this is in the context of a new level, everything is then
rechecked so that a concurrent modification won't cause us to crash.
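
i.e. each level of the walker does something like this (paraphrasing
mm/gup.c from memory, so don't quote me on the details):

	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
	do {
		p4d_t p4d = READ_ONCE(*p4dp);	/* re-read the entry ... */

		next = p4d_addr_end(addr, end);
		if (p4d_none(p4d))		/* ... and re-validate it */
			return 0;
		if (!gup_fast_pud_range(p4dp, p4d, addr, next, flags,
					pages, nr))
			return 0;
	} while (p4dp++, addr = next, addr != end);

so a concurrently torn-down entry is caught by the p4d_none() check on
the freshly read value before we ever use it to descend.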

Will


