[PATCH v11 06/14] mm: multi-gen LRU: minimal implementation

zhong jiang zhongjiang-ali at linux.alibaba.com
Thu Jun 9 07:46:42 PDT 2022


On 2022/6/9 8:34 PM, zhong jiang wrote:
>
> On 2022/5/18 9:46 AM, Yu Zhao wrote:
>> To avoid confusion, the terms "promotion" and "demotion" will be
>> applied to the multi-gen LRU, as a new convention; the terms
>> "activation" and "deactivation" will be applied to the active/inactive
>> LRU, as usual.
>>
>> The aging produces young generations. Given an lruvec, it increments
>> max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
>> promotes hot pages to the youngest generation when it finds them
>> accessed through page tables; the demotion of cold pages happens
>> consequently when it increments max_seq. The aging has the complexity
>> O(nr_hot_pages), since it is only interested in hot pages. Promotion
>> in the aging path does not involve any LRU list operations, only the
>> updates of the gen counter and lrugen->nr_pages[]; demotion, unless it
>> is the result of the increment of max_seq, requires LRU list operations,
>> e.g., lru_deactivate_fn().
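
[Aside: a minimal user-space sketch of the seq/gen arithmetic described
above. The constants and the seq % MAX_NR_GENS indexing mirror the patch
(lru_gen_from_seq()); need_aging() is a hypothetical helper for
illustration only, not a function from the patch.]

#include <stdbool.h>
#include <stdio.h>

#define MIN_NR_GENS	2UL
#define MAX_NR_GENS	4UL

/* seq grows monotonically; lists are indexed by seq % MAX_NR_GENS */
static unsigned long gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/* the aging should run when max_seq-min_seq+1 approaches MIN_NR_GENS */
static bool need_aging(unsigned long max_seq, unsigned long min_seq)
{
	return max_seq - min_seq + 1 <= MIN_NR_GENS;
}

int main(void)
{
	unsigned long min_seq = 6, max_seq = 7;

	/* prints: seqs [6, 7] -> gens [2, 3], need aging: 1 */
	printf("seqs [%lu, %lu] -> gens [%lu, %lu], need aging: %d\n",
	       min_seq, max_seq, gen_from_seq(min_seq),
	       gen_from_seq(max_seq), need_aging(max_seq, min_seq));
	return 0;
}
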
>>
>> The eviction consumes old generations. Given an lruvec, it increments
>> min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
>> feedback loop modeled after the PID controller monitors refaults over
>> anon and file types and decides which type to evict when both types
>> are available from the same generation.
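
[Aside: a simplified sketch of the comparison that drives this feedback
loop, cross-multiplied to avoid division as in the patch's
positive_ctrl_err(); the gain factors and the MIN_LRU_BATCH floor are
omitted here, and positive_err() is an illustrative stand-in.]

#include <stdbool.h>
#include <stdio.h>

struct pos {
	unsigned long refaulted;	/* refaults observed for a tier */
	unsigned long total;		/* evicted + protected */
};

/*
 * True if the process variable (pv) refaults at a rate no higher than
 * the setpoint (sp): pv->refaulted/pv->total <= sp->refaulted/sp->total,
 * compared by cross-multiplying.
 */
static bool positive_err(const struct pos *sp, const struct pos *pv)
{
	return (unsigned long long)pv->refaulted * sp->total <=
	       (unsigned long long)sp->refaulted * pv->total;
}

int main(void)
{
	struct pos sp = { .refaulted = 10, .total = 1000 };	/* baseline: 1% */
	struct pos pv = { .refaulted = 50, .total = 1000 };	/* 5% */

	/* pv refaults more than the baseline, so it is worth protecting */
	printf("protect pv: %s\n", positive_err(&sp, &pv) ? "no" : "yes");
	return 0;
}
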
>>
>> Each generation is divided into multiple tiers. Tiers represent
>> different ranges of numbers of accesses through file descriptors. A
>> page accessed N times through file descriptors is in tier
>> order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
>> bits in folio->flags. In contrast to moving across generations, which
>> requires the LRU lock, moving across tiers only involves operations on
>> folio->flags. The feedback loop also monitors refaults over all tiers
>> and decides when to protect pages in which tiers (N>1), using the
>> first tier (N=0,1) as a baseline. The first tier contains single-use
>> unmapped clean pages, which are most likely the best choices. The
>> eviction moves a page to the next generation, i.e., min_seq+1, if the
>> feedback loop decides so. This approach has the following advantages:
>> 1. It removes the cost of activation in the buffered access path by
>>     inferring whether pages accessed multiple times through file
>>     descriptors are statistically hot and thus worth protecting in the
>>     eviction path.
>> 2. It takes pages accessed through page tables into account and avoids
>>     overprotecting pages accessed multiple times through file
>>     descriptors. (Pages accessed through page tables are in the first
>>     tier, since N=0.)
>> 3. More tiers provide better protection for pages accessed more than
>>     twice through file descriptors, when under heavy buffered I/O
>>     workloads.
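
[Aside: a toy illustration of the tier mapping described above.
order_base_2() below is a user-space stand-in for the kernel macro; refs
counts accesses beyond PG_referenced (i.e., N-1), and the +1 mirrors
lru_tier_from_refs() in the patch.]

#include <stdio.h>

/* user-space stand-in for the kernel's order_base_2(): ceil(log2(n)) */
static int order_base_2(int n)
{
	int order = 0;

	while ((1 << order) < n)
		order++;
	return order;
}

/* refs counts accesses beyond PG_referenced, i.e., N-1 */
static int tier_from_refs(int refs)
{
	return order_base_2(refs + 1);
}

int main(void)
{
	int refs;

	/* N=1 -> tier 0; N=2 -> tier 1; N=3,4 -> tier 2; N=5..8 -> tier 3 */
	for (refs = 0; refs <= 7; refs++)
		printf("refs=%d (N=%d) -> tier %d\n", refs, refs + 1,
		       tier_from_refs(refs));
	return 0;
}
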
>>
>> Server benchmark results:
>>    Single workload:
>>      fio (buffered I/O): +[40, 42]%
>>                  IOPS         BW
>>        5.18-rc1: 2463k        9621MiB/s
>>        patch1-6: 3484k        13.3GiB/s
>>
>>    Single workload:
>>      memcached (anon): +[44, 46]%
>>                  Ops/sec      KB/sec
>>        5.18-rc1: 771403.27    30004.17
>>        patch1-6: 1120643.70   43588.06
>>
>>    Configurations:
>>      CPU: two Xeon 6154
>>      Mem: total 256G
>>
>>      Node 1 was only used as a ram disk to reduce the variance in the
>>      results.
>>
>>      patch drivers/block/brd.c <<EOF
>>      99,100c99,100
>>      <     gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
>>      <     page = alloc_page(gfp_flags);
>>      ---
>>      >     gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
>>      >     page = alloc_pages_node(1, gfp_flags, 0);
>>      EOF
>>
>>      cat >>/etc/systemd/system.conf <<EOF
>>      CPUAffinity=numa
>>      NUMAPolicy=bind
>>      NUMAMask=0
>>      EOF
>>
>>      cat >>/etc/memcached.conf <<EOF
>>      -m 184320
>>      -s /var/run/memcached/memcached.sock
>>      -a 0766
>>      -t 36
>>      -B binary
>>      EOF
>>
>>      cat fio.sh
>>      modprobe brd rd_nr=1 rd_size=113246208
>>      swapoff -a
>>      mkfs.ext4 /dev/ram0
>>      mount -t ext4 /dev/ram0 /mnt
>>
>>      mkdir /sys/fs/cgroup/user.slice/test
>>      echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
>>      echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
>>      fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
>>        --buffered=1 --ioengine=io_uring --iodepth=128 \
>>        --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
>>        --rw=randread --random_distribution=random --norandommap \
>>        --time_based --ramp_time=10m --runtime=5m --group_reporting
>>
>>      cat memcached.sh
>>      modprobe brd rd_nr=1 rd_size=113246208
>>      swapoff -a
>>      mkswap /dev/ram0
>>      swapon /dev/ram0
>>
>>      memtier_benchmark -S /var/run/memcached/memcached.sock \
>>        -P memcache_binary -n allkeys --key-minimum=1 \
>>        --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
>>        --ratio 1:0 --pipeline 8 -d 2000
>>
>>      memtier_benchmark -S /var/run/memcached/memcached.sock \
>>        -P memcache_binary -n allkeys --key-minimum=1 \
>>        --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
>>        --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
>>
>> Client benchmark results:
>>    kswapd profiles:
>>      5.18-rc1
>>        40.53%  page_vma_mapped_walk
>>        20.37%  lzo1x_1_do_compress (real work)
>>         6.99%  do_raw_spin_lock
>>         3.93%  _raw_spin_unlock_irq
>>         2.08%  vma_interval_tree_subtree_search
>>         2.06%  vma_interval_tree_iter_next
>>         1.95%  folio_referenced_one
>>         1.93%  anon_vma_interval_tree_iter_first
>>         1.51%  ptep_clear_flush
>>         1.35%  __anon_vma_interval_tree_subtree_search
>>
>>      patch1-6
>>        35.99%  lzo1x_1_do_compress (real work)
>>        19.40%  page_vma_mapped_walk
>>         6.31%  _raw_spin_unlock_irq
>>         3.95%  do_raw_spin_lock
>>         2.39%  anon_vma_interval_tree_iter_first
>>         2.25%  ptep_clear_flush
>>         1.92%  __anon_vma_interval_tree_subtree_search
>>         1.70%  folio_referenced_one
>>         1.68%  __zram_bvec_write
>>         1.43%  anon_vma_interval_tree_iter_next
>>
>>    Configurations:
>>      CPU: single Snapdragon 7c
>>      Mem: total 4G
>>
>>      Chrome OS MemoryPressure [1]
>>
>> [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
>>
>> Signed-off-by: Yu Zhao <yuzhao at google.com>
>> Acked-by: Brian Geffon <bgeffon at google.com>
>> Acked-by: Jan Alexander Steffens (heftig) <heftig at archlinux.org>
>> Acked-by: Oleksandr Natalenko <oleksandr at natalenko.name>
>> Acked-by: Steven Barrett <steven at liquorix.net>
>> Acked-by: Suleiman Souhlal <suleiman at google.com>
>> Tested-by: Daniel Byrne <djbyrne at mtu.edu>
>> Tested-by: Donald Carr <d at chaos-reins.com>
>> Tested-by: Holger Hoffstätte <holger at applied-asynchrony.com>
>> Tested-by: Konstantin Kharlamov <Hi-Angel at yandex.ru>
>> Tested-by: Shuang Zhai <szhai2 at cs.rochester.edu>
>> Tested-by: Sofia Trinh <sofia.trinh at edi.works>
>> Tested-by: Vaibhav Jain <vaibhav at linux.ibm.com>
>> ---
>>   include/linux/mm_inline.h         |  36 ++
>>   include/linux/mmzone.h            |  42 ++
>>   include/linux/page-flags-layout.h |   5 +-
>>   kernel/bounds.c                   |   2 +
>>   mm/Kconfig                        |  11 +
>>   mm/swap.c                         |  39 ++
>>   mm/vmscan.c                       | 799 +++++++++++++++++++++++++++++-
>>   mm/workingset.c                   | 110 +++-
>>   8 files changed, 1034 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
>> index 98ae22bfaf12..85fe78832436 100644
>> --- a/include/linux/mm_inline.h
>> +++ b/include/linux/mm_inline.h
>> @@ -119,6 +119,33 @@ static inline int lru_gen_from_seq(unsigned long seq)
>>       return seq % MAX_NR_GENS;
>>   }
>>   +static inline int lru_hist_from_seq(unsigned long seq)
>> +{
>> +    return seq % NR_HIST_GENS;
>> +}
>> +
>> +static inline int lru_tier_from_refs(int refs)
>> +{
>> +    VM_WARN_ON_ONCE(refs > BIT(LRU_REFS_WIDTH));
>> +
>> +    /* see the comment in folio_lru_refs() */
>> +    return order_base_2(refs + 1);
>> +}
>> +
>> +static inline int folio_lru_refs(struct folio *folio)
>> +{
>> +    unsigned long flags = READ_ONCE(folio->flags);
>> +    bool workingset = flags & BIT(PG_workingset);
>> +
>> +    /*
>> +     * Return the number of accesses beyond PG_referenced, i.e., N-1 if the
>> +     * total number of accesses is N>1, since N=0,1 both map to the first
>> +     * tier. lru_tier_from_refs() will account for this off-by-one. Also see
>> +     * the comment on MAX_NR_TIERS.
>> +     */
>> +    return ((flags & LRU_REFS_MASK) >> LRU_REFS_PGOFF) + workingset;
>> +}
>> +
>>   static inline int folio_lru_gen(struct folio *folio)
>>   {
>>       unsigned long flags = READ_ONCE(folio->flags);
>> @@ -171,6 +198,15 @@ static inline void lru_gen_update_size(struct lruvec *lruvec, struct folio *foli
>>           __update_lru_size(lruvec, lru, zone, -delta);
>>           return;
>>       }
>> +
>> +    /* promotion */
>> +    if (!lru_gen_is_active(lruvec, old_gen) && lru_gen_is_active(lruvec, new_gen)) {
>> +        __update_lru_size(lruvec, lru, zone, -delta);
>> +        __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, delta);
>> +    }
>> +
>> +    /* demotion requires isolation, e.g., lru_deactivate_fn() */
>> +    VM_WARN_ON_ONCE(lru_gen_is_active(lruvec, old_gen) && !lru_gen_is_active(lruvec, new_gen));
>>   }
>>     static inline bool lru_gen_add_folio(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 6994acef63cb..2d023d243e73 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -348,6 +348,29 @@ enum lruvec_flags {
>>   #define MIN_NR_GENS        2U
>>   #define MAX_NR_GENS        4U
>>   +/*
>> + * Each generation is divided into multiple tiers. Tiers represent different
>> + * ranges of numbers of accesses through file descriptors. A page accessed N
>> + * times through file descriptors is in tier order_base_2(N). A page in the
>> + * first tier (N=0,1) is marked by PG_referenced unless it was faulted in
>> + * through page tables or read ahead. A page in any other tier (N>1) is marked
>> + * by PG_referenced and PG_workingset. This implies a minimum of two tiers is
>> + * supported without using additional bits in folio->flags.
>> + *
>> + * In contrast to moving across generations, which requires the LRU lock,
>> + * moving across tiers only involves atomic operations on folio->flags and
>> + * therefore has a negligible cost in the buffered access path. In the
>> + * eviction path, comparisons of refaulted/(evicted+protected) from the first
>> + * tier and the rest infer whether pages accessed multiple times through file
>> + * descriptors are statistically hot and thus worth protecting.
>> + *
>> + * MAX_NR_TIERS is set to 4 so that the multi-gen LRU can support twice the
>> + * number of categories of the active/inactive LRU when keeping track of
>> + * accesses through file descriptors. It uses MAX_NR_TIERS-2 spare bits in
>> + * folio->flags (LRU_REFS_MASK).
>> + */
>> +#define MAX_NR_TIERS        4U
>> +
>>   #ifndef __GENERATING_BOUNDS_H
>>     struct lruvec;
>> @@ -362,6 +385,16 @@ enum {
>>       LRU_GEN_FILE,
>>   };
>>   +#define MIN_LRU_BATCH        BITS_PER_LONG
>> +#define MAX_LRU_BATCH        (MIN_LRU_BATCH * 128)
>> +
>> +/* whether to keep historical stats from evicted generations */
>> +#ifdef CONFIG_LRU_GEN_STATS
>> +#define NR_HIST_GENS        MAX_NR_GENS
>> +#else
>> +#define NR_HIST_GENS        1U
>> +#endif
>> +
>>   /*
>>    * The youngest generation number is stored in max_seq for both anon and file
>>    * types as they are aged on an equal footing. The oldest generation numbers are
>> @@ -384,6 +417,15 @@ struct lru_gen_struct {
>>       struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
>>       /* the sizes of the above lists */
>>       long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
>> +    /* the exponential moving average of refaulted */
>> +    unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
>> +    /* the exponential moving average of evicted+protected */
>> +    unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
>> +    /* the first tier doesn't need protection, hence the minus one */
>> +    unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
>> +    /* can be modified without holding the LRU lock */
>> +    atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>> +    atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
>>   };
>>     void lru_gen_init_lruvec(struct lruvec *lruvec);
>> diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
>> index 240905407a18..7d79818dc065 100644
>> --- a/include/linux/page-flags-layout.h
>> +++ b/include/linux/page-flags-layout.h
>> @@ -106,7 +106,10 @@
>>   #error "Not enough bits in page flags"
>>   #endif
>>   -#define LRU_REFS_WIDTH    0
>> +/* see the comment on MAX_NR_TIERS */
>> +#define LRU_REFS_WIDTH    min(__LRU_REFS_WIDTH, BITS_PER_LONG - NR_PAGEFLAGS - \
>> +                ZONES_WIDTH - LRU_GEN_WIDTH - SECTIONS_WIDTH - \
>> +                NODES_WIDTH - KASAN_TAG_WIDTH - LAST_CPUPID_WIDTH)
>>     #endif
>>   #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
>> diff --git a/kernel/bounds.c b/kernel/bounds.c
>> index 5ee60777d8e4..b529182e8b04 100644
>> --- a/kernel/bounds.c
>> +++ b/kernel/bounds.c
>> @@ -24,8 +24,10 @@ int main(void)
>>       DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
>>   #ifdef CONFIG_LRU_GEN
>>       DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
>> +    DEFINE(__LRU_REFS_WIDTH, MAX_NR_TIERS - 2);
>>   #else
>>       DEFINE(LRU_GEN_WIDTH, 0);
>> +    DEFINE(__LRU_REFS_WIDTH, 0);
>>   #endif
>>       /* End of constants */
>>   diff --git a/mm/Kconfig b/mm/Kconfig
>> index e62bd501082b..0aeacbd3361c 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -909,6 +909,7 @@ config ANON_VMA_NAME
>>         area from being merged with adjacent virtual memory areas due to the
>>         difference in their name.
>>   +# multi-gen LRU {
>>   config LRU_GEN
>>       bool "Multi-Gen LRU"
>>       depends on MMU
>> @@ -917,6 +918,16 @@ config LRU_GEN
>>       help
>>         A high performance LRU implementation to overcommit memory.
>>   +config LRU_GEN_STATS
>> +    bool "Full stats for debugging"
>> +    depends on LRU_GEN
>> +    help
>> +      Do not enable this option unless you plan to look at historical stats
>> +      from evicted generations for debugging purposes.
>> +
>> +      This option has a per-memcg and per-node memory overhead.
>> +# }
>> +
>>   source "mm/damon/Kconfig"
>>     endmenu
>> diff --git a/mm/swap.c b/mm/swap.c
>> index a6870ba0bd83..a99d22308f28 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -405,6 +405,40 @@ static void __lru_cache_activate_folio(struct folio *folio)
>>       local_unlock(&lru_pvecs.lock);
>>   }
>>   +#ifdef CONFIG_LRU_GEN
>> +static void folio_inc_refs(struct folio *folio)
>> +{
>> +    unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
>> +
>> +    if (folio_test_unevictable(folio))
>> +        return;
>> +
>> +    if (!folio_test_referenced(folio)) {
>> +        folio_set_referenced(folio);
>> +        return;
>> +    }
>> +
>> +    if (!folio_test_workingset(folio)) {
>> +        folio_set_workingset(folio);
>> +        return;
>> +    }
>> +
>> +    /* see the comment on MAX_NR_TIERS */
>> +    do {
>> +        new_flags = old_flags & LRU_REFS_MASK;
>> +        if (new_flags == LRU_REFS_MASK)
>> +            break;
>> +
>> +        new_flags += BIT(LRU_REFS_PGOFF);
>> +        new_flags |= old_flags & ~LRU_REFS_MASK;
>> +    } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
>> +}
>> +#else
>> +static void folio_inc_refs(struct folio *folio)
>> +{
>> +}
>> +#endif /* CONFIG_LRU_GEN */
>> +
>>   /*
>>    * Mark a page as having seen activity.
>>    *
>> @@ -417,6 +451,11 @@ static void __lru_cache_activate_folio(struct folio *folio)
>>    */
>>   void folio_mark_accessed(struct folio *folio)
>>   {
>> +    if (lru_gen_enabled()) {
>> +        folio_inc_refs(folio);
>> +        return;
>> +    }
>> +
>>       if (!folio_test_referenced(folio)) {
>>           folio_set_referenced(folio);
>>       } else if (folio_test_unevictable(folio)) {
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index b41ff9765cc7..891f0ab69b3a 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1275,9 +1275,11 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
>>         if (folio_test_swapcache(folio)) {
>>           swp_entry_t swap = folio_swap_entry(folio);
>> -        mem_cgroup_swapout(folio, swap);
>> +
>> +        /* get a shadow entry before mem_cgroup_swapout() clears folio_memcg() */
>>           if (reclaimed && !mapping_exiting(mapping))
>>               shadow = workingset_eviction(folio, target_memcg);
>> +        mem_cgroup_swapout(folio, swap);
>>           __delete_from_swap_cache(&folio->page, swap, shadow);
>>           xa_unlock_irq(&mapping->i_pages);
>>           put_swap_page(&folio->page, swap);
>> @@ -2649,6 +2651,9 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
>>       unsigned long file;
>>       struct lruvec *target_lruvec;
>>   +    if (lru_gen_enabled())
>> +        return;
>> +
>>       target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>>         /*
>> @@ -2974,6 +2979,17 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
>>    *                          shorthand helpers
>> ******************************************************************************/
>>   +#define LRU_REFS_FLAGS    (BIT(PG_referenced) | BIT(PG_workingset))
>> +
>> +#define DEFINE_MAX_SEQ(lruvec)                        \
>> +    unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq)
>> +
>> +#define DEFINE_MIN_SEQ(lruvec)                        \
>> +    unsigned long min_seq[ANON_AND_FILE] = {            \
>> +        READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_ANON]),    \
>> +        READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]),    \
>> +    }
>> +
>>   #define for_each_gen_type_zone(gen, type, zone) \
>>       for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++)            \
>>           for ((type) = 0; (type) < ANON_AND_FILE; (type)++)    \
>> @@ -2999,6 +3015,753 @@ static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int ni
>>       return pgdat ? &pgdat->__lruvec : NULL;
>>   }
>>   +static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
>> +{
>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +
>> +    if (!can_demote(pgdat->node_id, sc) &&
>> +        mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
>> +        return 0;
>> +
>> +    return mem_cgroup_swappiness(memcg);
>> +}
>> +
>> +static int get_nr_gens(struct lruvec *lruvec, int type)
>> +{
>> +    return lruvec->lrugen.max_seq - lruvec->lrugen.min_seq[type] + 1;
>> +}
>> +
>> +static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
>> +{
>> +    /* see the comment on lru_gen_struct */
>> +    return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
>> +           get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
>> +           get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
>> +}
>> +
>> +/******************************************************************************
>> + *                          refault feedback loop
>> + ******************************************************************************/
>> +
>> +/*
>> + * A feedback loop based on Proportional-Integral-Derivative (PID) controller.
>> + *
>> + * The P term is refaulted/(evicted+protected) from a tier in the generation
>> + * currently being evicted; the I term is the exponential moving average of the
>> + * P term over the generations previously evicted, using the smoothing factor
>> + * 1/2; the D term isn't supported.
>> + *
>> + * The setpoint (SP) is always the first tier of one type; the process variable
>> + * (PV) is either any tier of the other type or any other tier of the same
>> + * type.
>> + *
>> + * The error is the difference between the SP and the PV; the correction is to
>> + * turn off protection when SP>PV or turn on protection when SP<PV.
>> + *
>> + * For future optimizations:
>> + * 1. The D term may discount the other two terms over time so that long-lived
>> + *    generations can resist stale information.
>> + */
>> +struct ctrl_pos {
>> +    unsigned long refaulted;
>> +    unsigned long total;
>> +    int gain;
>> +};
>> +
>> +static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
>> +              struct ctrl_pos *pos)
>> +{
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +    int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>> +
>> +    pos->refaulted = lrugen->avg_refaulted[type][tier] +
>> +               atomic_long_read(&lrugen->refaulted[hist][type][tier]);
>> +    pos->total = lrugen->avg_total[type][tier] +
>> +               atomic_long_read(&lrugen->evicted[hist][type][tier]);
>> +    if (tier)
>> +        pos->total += lrugen->protected[hist][type][tier - 1];
>> +    pos->gain = gain;
>> +}
>> +
>> +static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
>> +{
>> +    int hist, tier;
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +    bool clear = carryover ? NR_HIST_GENS == 1 : NR_HIST_GENS > 1;
>> +    unsigned long seq = carryover ? lrugen->min_seq[type] : lrugen->max_seq + 1;
>> +
>> +    lockdep_assert_held(&lruvec->lru_lock);
>> +
>> +    if (!carryover && !clear)
>> +        return;
>> +
>> +    hist = lru_hist_from_seq(seq);
>> +
>> +    for (tier = 0; tier < MAX_NR_TIERS; tier++) {
>> +        if (carryover) {
>> +            unsigned long sum;
>> +
>> +            sum = lrugen->avg_refaulted[type][tier] +
>> +                  atomic_long_read(&lrugen->refaulted[hist][type][tier]);
>> +            WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
>> +
>> +            sum = lrugen->avg_total[type][tier] +
>> +                  atomic_long_read(&lrugen->evicted[hist][type][tier]);
>> +            if (tier)
>> +                sum += lrugen->protected[hist][type][tier - 1];
>> +            WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
>> +        }
>> +
>> +        if (clear) {
>> +            atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
>> +            atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
>> +            if (tier)
>> +                WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
>> +        }
>> +    }
>> +}
>> +
>> +static bool positive_ctrl_err(struct ctrl_pos *sp, struct ctrl_pos *pv)
>> +{
>> +    /*
>> +     * Return true if the PV has a limited number of refaults or a lower
>> +     * refaulted/total than the SP.
>> +     */
>> +    return pv->refaulted < MIN_LRU_BATCH ||
>> +           pv->refaulted * (sp->total + MIN_LRU_BATCH) * sp->gain <=
>> +           (sp->refaulted + 1) * pv->total * pv->gain;
>> +}
>> +
>> +/******************************************************************************
>> + *                          the aging
>> + ******************************************************************************/
>> +
>> +static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
>> +{
>> +    int type = folio_is_file_lru(folio);
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +    int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
>> +    unsigned long new_flags, old_flags = READ_ONCE(folio->flags);
>> +
>> +    VM_WARN_ON_ONCE_FOLIO(!(old_flags & LRU_GEN_MASK), folio);
>> +
>> +    do {
>> +        new_gen = (old_gen + 1) % MAX_NR_GENS;
>> +
>> +        new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_MASK | LRU_REFS_FLAGS);
>> +        new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
>> +        /* for folio_end_writeback() */
>> +        if (reclaiming)
>> +            new_flags |= BIT(PG_reclaim);
>> +    } while (!try_cmpxchg(&folio->flags, &old_flags, new_flags));
>> +
>> +    lru_gen_update_size(lruvec, folio, old_gen, new_gen);
>> +
>> +    return new_gen;
>> +}
>> +
>> +static void inc_min_seq(struct lruvec *lruvec, int type)
>> +{
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +
>> +    reset_ctrl_pos(lruvec, type, true);
>> +    WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1);
>> +}
>> +
>> +static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
>> +{
>> +    int gen, type, zone;
>> +    bool success = false;
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +    DEFINE_MIN_SEQ(lruvec);
>> +
>> +    VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
>> +
>> +    /* find the oldest populated generation */
>> +    for (type = !can_swap; type < ANON_AND_FILE; type++) {
>> +        while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
>> +            gen = lru_gen_from_seq(min_seq[type]);
>> +
>> +            for (zone = 0; zone < MAX_NR_ZONES; zone++) {
>> +                if (!list_empty(&lrugen->lists[gen][type][zone]))
>> +                    goto next;
>> +            }
>> +
>> +            min_seq[type]++;
>> +        }
>> +next:
>> +        ;
>> +    }
>> +
>> +    /* see the comment on lru_gen_struct */
>> +    if (can_swap) {
>> +        min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
>> +        min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
>> +    }
>> +
>> +    for (type = !can_swap; type < ANON_AND_FILE; type++) {
>> +        if (min_seq[type] == lrugen->min_seq[type])
>> +            continue;
>> +
>> +        reset_ctrl_pos(lruvec, type, true);
>> +        WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
>> +        success = true;
>> +    }
>> +
>> +    return success;
>> +}
>> +
>> +static void inc_max_seq(struct lruvec *lruvec, unsigned long max_seq, bool can_swap)
>> +{
>> +    int prev, next;
>> +    int type, zone;
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +
>> +    spin_lock_irq(&lruvec->lru_lock);
>> +
>> +    VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
>> +
>> +    if (max_seq != lrugen->max_seq)
>> +        goto unlock;
>> +
>> +    for (type = 0; type < ANON_AND_FILE; type++) {
>> +        if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
>> +            continue;
>> +
>> +        VM_WARN_ON_ONCE(type == LRU_GEN_FILE || can_swap);
>> +
>> +        inc_min_seq(lruvec, type);
>> +    }
>> +
>> +    /*
>> +     * Update the active/inactive LRU sizes for compatibility. Both sides of
>> +     * the current max_seq need to be covered, since max_seq+1 can overlap
>> +     * with min_seq[LRU_GEN_ANON] if swapping is constrained. And if they do
>> +     * overlap, cold/hot inversion happens.
>> +     */
>> +    prev = lru_gen_from_seq(lrugen->max_seq - 1);
>> +    next = lru_gen_from_seq(lrugen->max_seq + 1);
>> +
>> +    for (type = 0; type < ANON_AND_FILE; type++) {
>> +        for (zone = 0; zone < MAX_NR_ZONES; zone++) {
>> +            enum lru_list lru = type * LRU_INACTIVE_FILE;
>> +            long delta = lrugen->nr_pages[prev][type][zone] -
>> +                     lrugen->nr_pages[next][type][zone];
>> +
>> +            if (!delta)
>> +                continue;
>> +
>> +            __update_lru_size(lruvec, lru, zone, delta);
>> +            __update_lru_size(lruvec, lru + LRU_ACTIVE, zone, -delta);
>> +        }
>> +    }
>> +
>> +    for (type = 0; type < ANON_AND_FILE; type++)
>> +        reset_ctrl_pos(lruvec, type, false);
>> +
>> +    /* make sure preceding modifications appear */
>> +    smp_store_release(&lrugen->max_seq, lrugen->max_seq + 1);
>> +unlock:
>> +    spin_unlock_irq(&lruvec->lru_lock);
>> +}
>> +
>> +static long get_nr_evictable(struct lruvec *lruvec, unsigned long max_seq,
>> +                 unsigned long *min_seq, bool can_swap, bool *need_aging)
>> +{
>> +    int gen, type, zone;
>> +    long old = 0;
>> +    long young = 0;
>> +    long total = 0;
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +
>> +    for (type = !can_swap; type < ANON_AND_FILE; type++) {
>> +        unsigned long seq;
>> +
>> +        for (seq = min_seq[type]; seq <= max_seq; seq++) {
>> +            long size = 0;
>> +
>> +            gen = lru_gen_from_seq(seq);
>> +
>> +            for (zone = 0; zone < MAX_NR_ZONES; zone++)
>> +                size += READ_ONCE(lrugen->nr_pages[gen][type][zone]);
>> +
>> +            total += size;
>> +            if (seq == max_seq)
>> +                young += size;
>> +            if (seq + MIN_NR_GENS == max_seq)
>> +                old += size;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * The aging tries to be lazy to reduce the overhead. On the other hand,
>> +     * the eviction stalls when the number of generations reaches
>> +     * MIN_NR_GENS. So ideally, there should be MIN_NR_GENS+1 generations,
>> +     * hence the first two if's.
>> +     *
>> +     * Also it's ideal to spread pages out evenly, meaning 1/(MIN_NR_GENS+1)
>> +     * of the total number of pages for each generation. A reasonable range
>> +     * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The
>> +     * eviction cares about the lower bound of cold pages, whereas the aging
>> +     * cares about the upper bound of hot pages.
>> +     */
>> +    if (min_seq[!can_swap] + MIN_NR_GENS > max_seq)
>> +        *need_aging = true;
>> +    else if (min_seq[!can_swap] + MIN_NR_GENS < max_seq)
>> +        *need_aging = false;
>> +    else if (young * MIN_NR_GENS > total)
>> +        *need_aging = true;
>> +    else if (old * (MIN_NR_GENS + 2) < total)
>> +        *need_aging = true;
>> +    else
>> +        *need_aging = false;
>> +
>> +    return total > 0 ? total : 0;
>> +}
>> +
>> +static void age_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> +{
>> +    bool need_aging;
>> +    long nr_to_scan;
>> +    int swappiness = get_swappiness(lruvec, sc);
>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +    DEFINE_MAX_SEQ(lruvec);
>> +    DEFINE_MIN_SEQ(lruvec);
>> +
>> +    VM_WARN_ON_ONCE(sc->memcg_low_reclaim);
>> +
>> +    mem_cgroup_calculate_protection(NULL, memcg);
>> +
>> +    if (mem_cgroup_below_min(memcg))
>> +        return;
>> +
>> +    nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, swappiness, &need_aging);
>> +    if (!nr_to_scan)
>> +        return;
>> +
>> +    nr_to_scan >>= sc->priority;
>> +
>> +    if (!mem_cgroup_online(memcg))
>> +        nr_to_scan++;
>> +
>> +    if (nr_to_scan && need_aging)
>> +        inc_max_seq(lruvec, max_seq, swappiness);
>> +}
>> +
>> +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>> +{
>> +    struct mem_cgroup *memcg;
>> +
>> +    VM_WARN_ON_ONCE(!current_is_kswapd());
>> +
>> +    memcg = mem_cgroup_iter(NULL, NULL, NULL);
>> +    do {
>> +        struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +
>> +        age_lruvec(lruvec, sc);
>> +
>> +        cond_resched();
>> +    } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
>> +}
>> +
>> +/******************************************************************************
>> + *                          the eviction
>> + ******************************************************************************/
>> +
>> +static bool sort_folio(struct lruvec *lruvec, struct folio *folio, int tier_idx)
>> +{
>> +    bool success;
>> +    int gen = folio_lru_gen(folio);
>> +    int type = folio_is_file_lru(folio);
>> +    int zone = folio_zonenum(folio);
>> +    int delta = folio_nr_pages(folio);
>> +    int refs = folio_lru_refs(folio);
>> +    int tier = lru_tier_from_refs(refs);
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +
>> +    VM_WARN_ON_ONCE_FOLIO(gen >= MAX_NR_GENS, folio);
>> +
>> +    /* unevictable */
>> +    if (!folio_evictable(folio)) {
>> +        success = lru_gen_del_folio(lruvec, folio, true);
>> +        VM_WARN_ON_ONCE_FOLIO(!success, folio);
>> +        folio_set_unevictable(folio);
>> +        lruvec_add_folio(lruvec, folio);
>> +        __count_vm_events(UNEVICTABLE_PGCULLED, delta);
>> +        return true;
>> +    }
>> +
>> +    /* dirtied lazyfree */
>> +    if (type == LRU_GEN_FILE && folio_test_anon(folio) && folio_test_dirty(folio)) {
>> +        success = lru_gen_del_folio(lruvec, folio, true);
>> +        VM_WARN_ON_ONCE_FOLIO(!success, folio);
>> +        folio_set_swapbacked(folio);
>> +        lruvec_add_folio_tail(lruvec, folio);
>> +        return true;
>> +    }
>> +
>> +    /* protected */
>> +    if (tier > tier_idx) {
>> +        int hist = lru_hist_from_seq(lrugen->min_seq[type]);
>> +
>> +        gen = folio_inc_gen(lruvec, folio, false);
>> +        list_move_tail(&folio->lru, &lrugen->lists[gen][type][zone]);
>> +
>> +        WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
>> +               lrugen->protected[hist][type][tier - 1] + delta);
>> +        __mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
>> +        return true;
>> +    }
>> +
>> +    /* waiting for writeback */
>> +    if (folio_test_locked(folio) || folio_test_writeback(folio) ||
>> +        (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
>> +        gen = folio_inc_gen(lruvec, folio, true);
>> +        list_move(&folio->lru, &lrugen->lists[gen][type][zone]);
>> +        return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct scan_control *sc)
>> +{
>> +    bool success;
>> +
>> +    if (!sc->may_unmap && folio_mapped(folio))
>> +        return false;
>> +
>> +    if (!(sc->may_writepage && (sc->gfp_mask & __GFP_IO)) &&
>> +        (folio_test_dirty(folio) ||
>> +         (folio_test_anon(folio) && !folio_test_swapcache(folio))))
>> +        return false;
>> +
>> +    if (!folio_try_get(folio))
>> +        return false;
>> +
>> +    if (!folio_test_clear_lru(folio)) {
>> +        folio_put(folio);
>> +        return false;
>> +    }
>> +
>> +    success = lru_gen_del_folio(lruvec, folio, true);
>> +    VM_WARN_ON_ONCE_FOLIO(!success, folio);
>> +
>> +    return true;
>> +}
>> +
>> +static int scan_folios(struct lruvec *lruvec, struct scan_control *sc,
>> +               int type, int tier, struct list_head *list)
>> +{
>> +    int gen, zone;
>> +    enum vm_event_item item;
>> +    int sorted = 0;
>> +    int scanned = 0;
>> +    int isolated = 0;
>> +    int remaining = MAX_LRU_BATCH;
>> +    struct lru_gen_struct *lrugen = &lruvec->lrugen;
>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +
>> +    VM_WARN_ON_ONCE(!list_empty(list));
>> +
>> +    if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
>> +        return 0;
>> +
>> +    gen = lru_gen_from_seq(lrugen->min_seq[type]);
>> +
>> +    for (zone = sc->reclaim_idx; zone >= 0; zone--) {
>> +        LIST_HEAD(moved);
>> +        int skipped = 0;
>> +        struct list_head *head = &lrugen->lists[gen][type][zone];
>> +
>> +        while (!list_empty(head)) {
>> +            struct folio *folio = lru_to_folio(head);
>> +            int delta = folio_nr_pages(folio);
>> +
>> +            VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
>> +            VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
>> +            VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
>> +            VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
>> +
>> +            scanned += delta;
>> +
>> +            if (sort_folio(lruvec, folio, tier))
>> +                sorted += delta;
>> +            else if (isolate_folio(lruvec, folio, sc)) {
>> +                list_add(&folio->lru, list);
>> +                isolated += delta;
>> +            } else {
>> +                list_move(&folio->lru, &moved);
>> +                skipped += delta;
>> +            }
>> +
>> +            if (!--remaining || max(isolated, skipped) >= MIN_LRU_BATCH)
>> +                break;
>> +        }
>> +
>> +        if (skipped) {
>> +            list_splice(&moved, head);
>> +            __count_zid_vm_events(PGSCAN_SKIP, zone, skipped);
>> +        }
>> +
>> +        if (!remaining || isolated >= MIN_LRU_BATCH)
>> +            break;
>> +    }
>> +
>> +    item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
>> +    if (!cgroup_reclaim(sc)) {
>> +        __count_vm_events(item, isolated);
>> +        __count_vm_events(PGREFILL, sorted);
>> +    }
>> +    __count_memcg_events(memcg, item, isolated);
>> +    __count_memcg_events(memcg, PGREFILL, sorted);
>> +    __count_vm_events(PGSCAN_ANON + type, isolated);
>> +
>> +    /*
>> +     * There might not be eligible pages due to reclaim_idx, may_unmap and
>> +     * may_writepage. Check the remaining to prevent livelock if there is no
>> +     * progress.
>> +     */
>> +    return isolated || !remaining ? scanned : 0;
>> +}
>> +
>> +static int get_tier_idx(struct lruvec *lruvec, int type)
>> +{
>> +    int tier;
>> +    struct ctrl_pos sp, pv;
>> +
>> +    /*
>> +     * To leave a margin for fluctuations, use a larger gain factor (1:2).
>> +     * This value is chosen because any other tier would have at least twice
>> +     * as many refaults as the first tier.
>> +     */
>> +    read_ctrl_pos(lruvec, type, 0, 1, &sp);
>> +    for (tier = 1; tier < MAX_NR_TIERS; tier++) {
>> +        read_ctrl_pos(lruvec, type, tier, 2, &pv);
>> +        if (!positive_ctrl_err(&sp, &pv))
>> +            break;
>> +    }
>> +
>> +    return tier - 1;
>> +}
>> +
>> +static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
>> +{
>> +    int type, tier;
>> +    struct ctrl_pos sp, pv;
>> +    int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
>> +
>> +    /*
>> +     * Compare the first tier of anon with that of file to determine which
>> +     * type to scan. Also need to compare other tiers of the selected type
>> +     * with the first tier of the other type to determine the last tier (of
>> +     * the selected type) to evict.
>> +     */
>> +    read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
>> +    read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
>> +    type = positive_ctrl_err(&sp, &pv);
>> +
>> +    read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
>> +    for (tier = 1; tier < MAX_NR_TIERS; tier++) {
>> +        read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
>> +        if (!positive_ctrl_err(&sp, &pv))
>> +            break;
>> +    }
>> +
>> +    *tier_idx = tier - 1;
>> +
>> +    return type;
>> +}
>> +
>> +static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness,
>> +              int *type_scanned, struct list_head *list)
>> +{
>> +    int i;
>> +    int type;
>> +    int scanned;
>> +    int tier = -1;
>> +    DEFINE_MIN_SEQ(lruvec);
>> +
>> +    /*
>> +     * Try to make the obvious choice first. When anon and file are both
>> +     * available from the same generation, interpret swappiness 1 as file
>> +     * first and 200 as anon first.
>> +     */
>> +    if (!swappiness)
>> +        type = LRU_GEN_FILE;
>> +    else if (min_seq[LRU_GEN_ANON] < min_seq[LRU_GEN_FILE])
>> +        type = LRU_GEN_ANON;
>> +    else if (swappiness == 1)
>> +        type = LRU_GEN_FILE;
>> +    else if (swappiness == 200)
>> +        type = LRU_GEN_ANON;
>> +    else
>> +        type = get_type_to_scan(lruvec, swappiness, &tier);
>> +
>> +    for (i = !swappiness; i < ANON_AND_FILE; i++) {
>> +        if (tier < 0)
>> +            tier = get_tier_idx(lruvec, type);
>> +
>> +        scanned = scan_folios(lruvec, sc, type, tier, list);
>> +        if (scanned)
>> +            break;
>> +
>> +        type = !type;
>> +        tier = -1;
>> +    }
>> +
>> +    *type_scanned = type;
>> +
>> +    return scanned;
>> +}
>> +
>> +static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
>> +{
>> +    int type;
>> +    int scanned;
>> +    int reclaimed;
>> +    LIST_HEAD(list);
>> +    struct folio *folio;
>> +    enum vm_event_item item;
>> +    struct reclaim_stat stat;
>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +
>> +    spin_lock_irq(&lruvec->lru_lock);
>> +
>> +    scanned = isolate_folios(lruvec, sc, swappiness, &type, &list);
>> +
>> +    if (try_to_inc_min_seq(lruvec, swappiness))
>> +        scanned++;
>> +
>> +    if (get_nr_gens(lruvec, !swappiness) == MIN_NR_GENS)
>> +        scanned = 0;
>> +
>> +    spin_unlock_irq(&lruvec->lru_lock);
>> +
>> +    if (list_empty(&list))
>> +        return scanned;
>> +
>> +    reclaimed = shrink_page_list(&list, pgdat, sc, &stat, false);
>> +
>> +    /*
>> +     * To avoid livelock, don't add rejected pages back to the same lists
>> +     * they were isolated from. See lru_gen_add_folio().
>> +     */
>> +    list_for_each_entry(folio, &list, lru) {
>> +        folio_clear_referenced(folio);
>> +        folio_clear_workingset(folio);
>> +
>> +        if (folio_test_reclaim(folio) &&
>> +            (folio_test_dirty(folio) || folio_test_writeback(folio)))
>> +            folio_clear_active(folio);
>> +        else
>> +            folio_set_active(folio);
>> +    }
>> +
>> +    spin_lock_irq(&lruvec->lru_lock);
>> +
>> +    move_pages_to_lru(lruvec, &list);
>> +
>> +    item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
>> +    if (!cgroup_reclaim(sc))
>> +        __count_vm_events(item, reclaimed);
>> +    __count_memcg_events(memcg, item, reclaimed);
>> +    __count_vm_events(PGSTEAL_ANON + type, reclaimed);
>> +
>> +    spin_unlock_irq(&lruvec->lru_lock);
>> +
>> +    mem_cgroup_uncharge_list(&list);
>> +    free_unref_page_list(&list);
>> +
>> +    sc->nr_reclaimed += reclaimed;
>> +
>> +    return scanned;
>> +}
>> +
>> +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, bool can_swap)
>> +{
>> +    bool need_aging;
>> +    long nr_to_scan;
>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +    DEFINE_MAX_SEQ(lruvec);
>> +    DEFINE_MIN_SEQ(lruvec);
>> +
>> +    if (mem_cgroup_below_min(memcg) ||
>> +        (mem_cgroup_below_low(memcg) && !sc->memcg_low_reclaim))
>> +        return 0;
>> +
>> +    nr_to_scan = get_nr_evictable(lruvec, max_seq, min_seq, can_swap, &need_aging);
>> +    if (!nr_to_scan)
>> +        return 0;
>> +
>> +    /* reset the priority if the target has been met */
>> +    nr_to_scan >>= sc->nr_reclaimed < sc->nr_to_reclaim ? sc->priority : DEF_PRIORITY;
>> +
>> +    if (!mem_cgroup_online(memcg))
>> +        nr_to_scan++;
>> +
>> +    if (!nr_to_scan)
>> +        return 0;
>> +
>> +    if (!need_aging)
>> +        return nr_to_scan;
>> +
>> +    /* leave the work to lru_gen_age_node() */
>> +    if (current_is_kswapd())
>> +        return 0;
>> +
>> +    /* try other memcgs before going to the aging path */
>> +    if (!cgroup_reclaim(sc) && !sc->force_deactivate) {
>> +        sc->skipped_deactivate = true;
>> +        return 0;
>> +    }
>> +
>> +    inc_max_seq(lruvec, max_seq, can_swap);
>> +
>> +    return nr_to_scan;
>> +}
>> +
>> +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> +{
>> +    struct blk_plug plug;
>> +    long scanned = 0;
>> +
>> +    lru_add_drain();
>> +
>> +    blk_start_plug(&plug);
>> +
>> +    while (true) {
>> +        int delta;
>> +        int swappiness;
>> +        long nr_to_scan;
>> +
>> +        if (sc->may_swap)
>> +            swappiness = get_swappiness(lruvec, sc);
>> +        else if (!cgroup_reclaim(sc) && get_swappiness(lruvec, sc))
>> +            swappiness = 1;
>> +        else
>> +            swappiness = 0;
>> +
>> +        nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
>> +        if (!nr_to_scan)
>> +            break;
>> +
>> +        delta = evict_folios(lruvec, sc, swappiness);
>> +        if (!delta)
>> +            break;
>> +
>> +        scanned += delta;
>> +        if (scanned >= nr_to_scan)
>> +            break;
>> +
>> +        cond_resched();
>> +    }
>> +
>> +    blk_finish_plug(&plug);
>> +}
>> +
>> /******************************************************************************
>>    *                          initialization
>> ******************************************************************************/
>> @@ -3041,6 +3804,16 @@ static int __init init_lru_gen(void)
>>   };
>>   late_initcall(init_lru_gen);
>>   +#else
>> +
>> +static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
>> +{
>> +}
>> +
>> +static void lru_gen_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> +{
>> +}
>> +
>>   #endif /* CONFIG_LRU_GEN */
>>     static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>> @@ -3054,6 +3827,11 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>       struct blk_plug plug;
>>       bool scan_adjusted;
>>   +    if (lru_gen_enabled()) {
>> +        lru_gen_shrink_lruvec(lruvec, sc);
>> +        return;
>> +    }
>> +
>>       get_scan_count(lruvec, sc, nr);
>>         /* Record the original scan target for proportional adjustments later */
>> @@ -3558,6 +4336,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
>>       struct lruvec *target_lruvec;
>>       unsigned long refaults;
>>   +    if (lru_gen_enabled())
>> +        return;
>> +
>>       target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
>>       refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON);
>>       target_lruvec->refaults[0] = refaults;
>> @@ -3922,12 +4703,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>   }
>>   #endif
>>   -static void age_active_anon(struct pglist_data *pgdat,
>> +static void kswapd_age_node(struct pglist_data *pgdat,
>>                   struct scan_control *sc)
>>   {
>>       struct mem_cgroup *memcg;
>>       struct lruvec *lruvec;
>>   +    if (lru_gen_enabled()) {
>> +        lru_gen_age_node(pgdat, sc);
>> +        return;
>> +    }
>> +
>>       if (!can_age_anon_pages(pgdat, sc))
>>           return;
>>   @@ -4247,12 +5033,11 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>>           sc.may_swap = !nr_boost_reclaim;
>>             /*
>> -         * Do some background aging of the anon list, to give
>> -         * pages a chance to be referenced before reclaiming. All
>> -         * pages are rotated regardless of classzone as this is
>> -         * about consistent aging.
>> +         * Do some background aging, to give pages a chance to be
>> +         * referenced before reclaiming. All pages are rotated
>> +         * regardless of classzone as this is about consistent aging.
>>            */
>> -        age_active_anon(pgdat, &sc);
>> +        kswapd_age_node(pgdat, &sc);
>>             /*
>>            * If we're getting trouble reclaiming, start doing writepage
>> diff --git a/mm/workingset.c b/mm/workingset.c
>> index 592569a8974c..db6f0c8a98c2 100644
>> --- a/mm/workingset.c
>> +++ b/mm/workingset.c
>> @@ -187,7 +187,6 @@ static unsigned int bucket_order __read_mostly;
>>   static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
>>                bool workingset)
>>   {
>> -    eviction >>= bucket_order;
>>       eviction &= EVICTION_MASK;
>>       eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
>>       eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
>> @@ -212,10 +211,107 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
>>         *memcgidp = memcgid;
>>       *pgdat = NODE_DATA(nid);
>> -    *evictionp = entry << bucket_order;
>> +    *evictionp = entry;
>>       *workingsetp = workingset;
>>   }
>>   +#ifdef CONFIG_LRU_GEN
>> +
>> +static void *lru_gen_eviction(struct folio *folio)
>> +{
>> +    int hist;
>> +    unsigned long token;
>> +    unsigned long min_seq;
>> +    struct lruvec *lruvec;
>> +    struct lru_gen_struct *lrugen;
>> +    int type = folio_is_file_lru(folio);
>> +    int delta = folio_nr_pages(folio);
>> +    int refs = folio_lru_refs(folio);
>> +    int tier = lru_tier_from_refs(refs);
>> +    struct mem_cgroup *memcg = folio_memcg(folio);
>> +    struct pglist_data *pgdat = folio_pgdat(folio);
>> +
>> +    BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
>> +
>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +    lrugen = &lruvec->lrugen;
>> +    min_seq = READ_ONCE(lrugen->min_seq[type]);
>> +    token = (min_seq << LRU_REFS_WIDTH) | max(refs - 1, 0);
>> +
>> +    hist = lru_hist_from_seq(min_seq);
>> +    atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
>> +
>> +    return pack_shadow(mem_cgroup_id(memcg), pgdat, token, refs);
>
> pack_shadow passes refs rather than PageWorkingset(page); is this done on
> purpose?
>
> In my opinion, PageWorkingset can be removed for pages that failed to be
> reclaimed.
Looking at this again, it's my fault; the workingset argument here does not
mean the original page status. Thanks,
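
For reference, a toy round-trip of the packing above (a standalone sketch;
LRU_REFS_WIDTH is assumed to be 2, i.e., MAX_NR_TIERS-2 on typical 64-bit
configs): the single workingset bit in the shadow entry absorbs refs != 0,
and the low token bits store max(refs - 1, 0), so refs survives unpacking.

#include <assert.h>
#include <stdbool.h>

#define LRU_REFS_WIDTH	2	/* assumed; MAX_NR_TIERS - 2 in the patch */

static unsigned long pack(unsigned long min_seq, int refs, bool *workingset)
{
	*workingset = refs;	/* nonzero refs folds into the workingset bit */
	return (min_seq << LRU_REFS_WIDTH) | (refs ? refs - 1 : 0);
}

static int unpack_refs(unsigned long token, bool workingset)
{
	/* see the comment in folio_lru_refs() */
	return (token & ((1UL << LRU_REFS_WIDTH) - 1)) + workingset;
}

int main(void)
{
	int refs;

	for (refs = 0; refs <= 4; refs++) {	/* BIT(LRU_REFS_WIDTH) == 4 */
		bool ws;
		unsigned long token = pack(42, refs, &ws);

		assert(unpack_refs(token, ws) == refs);
	}
	return 0;
}
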
>
>> +}
>> +
>> +static void lru_gen_refault(struct folio *folio, void *shadow)
>> +{
>> +    int hist, tier, refs;
>> +    int memcg_id;
>> +    bool workingset;
>> +    unsigned long token;
>> +    unsigned long min_seq;
>> +    struct lruvec *lruvec;
>> +    struct lru_gen_struct *lrugen;
>> +    struct mem_cgroup *memcg;
>> +    struct pglist_data *pgdat;
>> +    int type = folio_is_file_lru(folio);
>> +    int delta = folio_nr_pages(folio);
>> +
>> +    unpack_shadow(shadow, &memcg_id, &pgdat, &token, &workingset);
>> +
>> +    if (folio_pgdat(folio) != pgdat)
>> +        return;
>> +
>> +    /* see the comment in folio_lru_refs() */
>> +    refs = (token & (BIT(LRU_REFS_WIDTH) - 1)) + workingset;
>> +    tier = lru_tier_from_refs(refs);
>> +
>> +    rcu_read_lock();
>> +    memcg = folio_memcg_rcu(folio);
>> +    if (mem_cgroup_id(memcg) != memcg_id)
>> +        goto unlock;
>> +
>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> +    lrugen = &lruvec->lrugen;
>> +    min_seq = READ_ONCE(lrugen->min_seq[type]);
>> +
>> +    token >>= LRU_REFS_WIDTH;
>> +    if (token != (min_seq & (EVICTION_MASK >> LRU_REFS_WIDTH)))
>> +        goto unlock;
>> +
>> +    hist = lru_hist_from_seq(min_seq);
>> +    atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
>> +    mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta);
>> +
>> +    /*
>> +     * Count the following two cases as stalls:
>> +     * 1. For pages accessed through page tables, hotter pages pushed out
>> +     *    hot pages which refaulted immediately.
>> +     * 2. For pages accessed through file descriptors, numbers of accesses
>> +     *    might have been beyond the limit.
>> +     */
>> +    if (lru_gen_in_fault() || refs == BIT(LRU_REFS_WIDTH)) {
>> +        folio_set_workingset(folio);
>> +        mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + type, delta);
>> +    }
>> +unlock:
>> +    rcu_read_unlock();
>> +}
>> +
>> +#else
>> +
>> +static void *lru_gen_eviction(struct folio *folio)
>> +{
>> +    return NULL;
>> +}
>> +
>> +static void lru_gen_refault(struct folio *folio, void *shadow)
>> +{
>> +}
>> +
>> +#endif /* CONFIG_LRU_GEN */
>> +
>>   /**
>>    * workingset_age_nonresident - age non-resident entries as LRU ages
>>    * @lruvec: the lruvec that was aged
>> @@ -264,10 +360,14 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
>>       VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
>>       VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>>   +    if (lru_gen_enabled())
>> +        return lru_gen_eviction(folio);
>> +
>>       lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
>>       /* XXX: target_memcg can be NULL, go through lruvec */
>>       memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
>>       eviction = atomic_long_read(&lruvec->nonresident_age);
>> +    eviction >>= bucket_order;
>>       workingset_age_nonresident(lruvec, folio_nr_pages(folio));
>>       return pack_shadow(memcgid, pgdat, eviction,
>>                   folio_test_workingset(folio));
>> @@ -298,7 +398,13 @@ void workingset_refault(struct folio *folio, void *shadow)
>>       int memcgid;
>>       long nr;
>>   +    if (lru_gen_enabled()) {
>> +        lru_gen_refault(folio, shadow);
>> +        return;
>> +    }
>> +
>>       unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
>> +    eviction <<= bucket_order;
>>         rcu_read_lock();
>>       /*


