[PATCH v3 08/22] mm: Allow page table accessors to be non-idempotent

Lorenzo Stoakes lorenzo.stoakes at oracle.com
Tue Dec 16 02:29:13 PST 2025


On Thu, Dec 11, 2025 at 01:59:53PM +0000, Ryan Roberts wrote:
> On 11/12/2025 00:33, Samuel Holland wrote:
> > On 2025-11-28 2:47 AM, David Hildenbrand (Red Hat) wrote:
> >> On 11/27/25 17:57, Ryan Roberts wrote:
> >>> On 13/11/2025 01:45, Samuel Holland wrote:
> >>>> Currently, some functions such as pte_offset_map() are passed both
> >>>> pointers to hardware page tables, and pointers to previously-read PMD
> >>>> entries on the stack. To ensure correctness in the first case, these
> >>>> functions must use the page table accessor function (pmdp_get()) to
> >>>> dereference the supplied pointer. However, this means pmdp_get() is
> >>>> called twice in the second case. This double call must be avoided if
> >>>> pmdp_get() applies some non-idempotent transformation to the value.
> >>>>
> >>>> Avoid the double transformation by calling set_pmd() on the stack
> >>>> variables where necessary to keep set_pmd()/pmdp_get() calls balanced.
> >>>
> >>> I don't think this is a good solution.
> >>
> >> Agreed,
> >>
> >>     set_pmd(&pmd, pmd);
> >>
> >> is rather horrible.
> > I agree that this patch is ugly. The only way I see to avoid code like this is
> > to refactor (or duplicate) the functions so no function takes pointers to both
> > hardware page tables and on-stack page table entries. Is that sort of
> > refactoring the right direction to go for v4?
>
> From a quick look at the code, I think that some cases are solvable by
> refactoring to pass the value instead of the pointer, and leave it to the higher
> level decide how to read the value from the pointer - it knows if it is pointing
> to HW pgtable or if it's a (e.g) stack value.
>
> But the more I look at the code, the more instances I find where pointers to
> stack variables are being passed to arch pgtable helpers as if they are HW
> pgtable entry pointers. (Mainly pmd level).

Ugh. We do need to nip this in the bud I think!

>
> I wonder if we need to bite the bullet and explicitly separate the types? At
> each level, we have:
>
>  1. page table entry value
>  2. pointer to page table entry _value_ (e.g. pointer to pXX_t on stack)
>  3. pointer to page table entry in HW pgtable
>
> Today, 1 is represented by pte_t, pmd_t, etc. 2 and 3 are represented by the
> same type; pte_t*, pmd_t*, etc.
>
> If we create a new type for 3, it will both document and enforce when type 2 or
> type 3 is required.
>
> e.g:
>
> // pte_t: defined by arch.
> typedef unsigned long pte_t;
>
> // ptep_t: new opaque type that can't be dereferenced.
> struct __ptep_t;
> typedef struct __ptep_t *ptep_t;
>
> // getter/setter responsible for cast & deref as appropriate.
> pte_t ptep_get(ptep_t ptep)
> {
> 	return READ_ONCE(*(pte_t *)ptep);
> }

I think we've got ourselves in a jumble with pte vs ptep vs pteval
vs. etc. etc. and we aren't always consistent with it.

So I think we ought to put 'hw' somewhere in the name.

And we can also now get away from the overly abbreviated names given we have the
chance :)

So something like 'pte_hw_t' perhaps?

I like the general concept, though I think it's kinda gross to hide the fact
that it's a pointer in the typedef _that_ directly.

So perhaps be less horrid if it was something like:

typedef struct {
	pte_t *ptr;
} pte_hw_t;

Perhaps?

Then could have:

pte_t hw_pte_get(pte_hw_t pte_hw)
{
	return READ_ONCE(*pte_hw.ptr);
}

>
> int do_stuff(void)
> {
> 	// value on stack: ok
> 	pte_t mypte;
>
> 	// pointer to value on stack: ok
> 	pte_t *pmypte = &mypte;
>
> 	// handle to entry on stack: not allowed by compiler!
> 	ptep_t myptep = &mypte;
>
> 	// handle to entry in pgtable: ok
> 	ptep_t myptep = pte_offset_kernel(...);
>
> 	// read value of pgtable entry: ok
> 	pte_t val = ptep_get(myptep);
>
> 	// attempt to pass pointer to stack variable: not allowed by compiler!
> 	pte_t val = ptep_get(&mypte);
>
> 	// attempt to directly dereference ptep: not allowed by compiler!
> 	pte_t val = *myptep;
> }
>
>
> We could do this incrementally by initially typedefing ptep_t to be:
> typedef pte_t *ptep_t;

Hm yeah still hate the idea of typedef'ing this as a ptr so directly.

Obviously this would make an incremental thing a little harder, but could have
this stuff on top of existing logic and make it incremental by changing logic
bit-by-bit?

>
> Then we could flip the switch arch-by-arch to enable the stronger checking. We
> likely wouldn't need to convert arches that don't care.
>
> I think by doing this, it will expose all the current issues and force us to fix
> them properly.

Yeah this is nice.

>
> On the related subject of conversion to pXXp_get(); I've been looking into this
> and personally, I think we should have 2 helper flavours at each level:
>
>  - pXXd_get()      optimizable by compiler; defaults to C dereference
>  - pXXd_get_once() single-copy-atomic and unmovable by compiler

Yes!

There is some _real_ confusion about when and when we don't need to do this.

Though wouldn't having a typdef separating out a hw entry imply that hw entry ->
get once, sw entry -> get?

>
> It simplifies the converstion process, and reduces the risk of bugs
> significantly (go read about the arm32 issues discussed in Anshuman's series if
> you haven't done already).

Though I haven't read this series so can't remember actually if there were cases
where even with hw entries we wanted to sometimes READ_ONCE() and sometimes not?

>
> I appreciate this is all probably a lot more work than you would prefer to sign
> up for, I'd be happy to collaborate if we get concensus that this approach makes
> sense. What do you think?
>
> Thanks,
> Ryan
>

Thanks for this concept, overall very much in agreement!

Cheers, Lorenzo



More information about the linux-riscv mailing list