[PATCH v9 01/14] mm: x86, arm64: add arch_has_hw_pte_young()

Yu Zhao yuzhao at google.com
Fri Mar 11 14:57:57 PST 2022


On Fri, Mar 11, 2022 at 3:55 AM Barry Song <21cnbao at gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 3:47 PM Yu Zhao <yuzhao at google.com> wrote:
> >
> > Some architectures automatically set the accessed bit in PTEs, e.g.,
> > x86 and arm64 v8.2. On architectures that do not have this capability,
> > clearing the accessed bit in a PTE usually triggers a page fault
> > following the TLB miss of this PTE (to emulate the accessed bit).
> >
> > Being aware of this capability can help make better decisions, e.g.,
> > whether to spread the work out over a period of time to reduce bursty
> > page faults when trying to clear the accessed bit in many PTEs.
> >
> > Note that theoretically this capability can be unreliable, e.g.,
> > hotplugged CPUs might be different from builtin ones. Therefore it
> > should not be used in architecture-independent code that involves
> > correctness, e.g., to determine whether TLB flushes are required (in
> > combination with the accessed bit).
> >
> > Signed-off-by: Yu Zhao <yuzhao at google.com>
> > Acked-by: Brian Geffon <bgeffon at google.com>
> > Acked-by: Jan Alexander Steffens (heftig) <heftig at archlinux.org>
> > Acked-by: Oleksandr Natalenko <oleksandr at natalenko.name>
> > Acked-by: Steven Barrett <steven at liquorix.net>
> > Acked-by: Suleiman Souhlal <suleiman at google.com>
> > Acked-by: Will Deacon <will at kernel.org>
> > Tested-by: Daniel Byrne <djbyrne at mtu.edu>
> > Tested-by: Donald Carr <d at chaos-reins.com>
> > Tested-by: Holger Hoffstätte <holger at applied-asynchrony.com>
> > Tested-by: Konstantin Kharlamov <Hi-Angel at yandex.ru>
> > Tested-by: Shuang Zhai <szhai2 at cs.rochester.edu>
> > Tested-by: Sofia Trinh <sofia.trinh at edi.works>
> > Tested-by: Vaibhav Jain <vaibhav at linux.ibm.com>
> > ---
>
> Reviewed-by: Barry Song <baohua at kernel.org>

Thanks.

> I guess arch_has_hw_pte_young() isn't called that often in either
> mm/memory.c or mm/vmscan.c. Otherwise, moving to a static key might
> help. Would it?

MRS shouldn't be slower than either branch of a static key, and with a
static key we can only optimize one of the two cases.

There is a *theoretical* problem with MRS: the ARM specs don't prohibit
a physical CPU from supporting both cases (on different logical CPUs).
