[PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

Thu Jul 4 12:45:31 PDT 2024

On Thu, Jul 4, 2024 at 5:47 AM Nanyong Sun <sunnanyong at huawei.com> wrote:
>
> On 2024/6/28 5:03, Yu Zhao wrote:
> > On Thu, Jun 27, 2024 at 8:34 AM Nanyong Sun <sunnanyong at huawei.com> wrote:
> >>
> >> On 2024/6/24 13:39, Yu Zhao wrote:
> >>> On Mon, Mar 25, 2024 at 11:24:34PM +0800, Nanyong Sun wrote:
> >>>> On 2024/3/14 7:32, David Rientjes wrote:
> >>>>
> >>>>> On Thu, 8 Feb 2024, Will Deacon wrote:
> >>>>>
> >>>>>>> How about taking a new lock with irq disabled during BBM, like:
> >>>>>>>
> >>>>>>> +void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
> >>>>>>> +{
> >>>>>>> +    spin_lock_irq(NEW_LOCK);
> >>>>>>> +    pte_clear(&init_mm, addr, ptep);
> >>>>>>> +    flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >>>>>>> +    set_pte_at(&init_mm, addr, ptep, pte);
> >>>>>>> +    spin_unlock_irq(NEW_LOCK);
> >>>>>>> +}
> >>>>>> I really think the only maintainable way to achieve this is to avoid the
> >>>>>> possibility of a fault altogether.
> >>>>>>
> >>>>>> Will
> >>>>>>
> >>>>>>
> >>>>> Nanyong, are you still actively working on making HVO possible on arm64?
> >>>>>
> >>>>> This would yield a substantial memory savings on hosts that are largely
> >>>>> configured with hugetlbfs.  In our case, the size of this hugetlbfs pool
> >>>>> is actually never changed after boot, but it sounds from the thread that
> >>>>> there was an idea to make HVO conditional on FEAT_BBM.  Is this being
> >>>>> pursued?
> >>>>>
> >>>>> If so, any testing help needed?
> >>>> I'm afraid that FEAT_BBM may not solve the problem here
> >>> I think so too -- I came across this while working on TAO [1].
> >>>
> >>> [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
> >>>
> >>>> because from the Arm ARM, I see that FEAT_BBM is only used for
> >>>> changing block size. Therefore, in this HVO feature, it can work in
> >>>> the split PMD stage, that is, BBM can be avoided in
> >>>> vmemmap_split_pmd, but in the subsequent vmemmap_remap_pte, the
> >>>> output address of the PTE still needs to be changed. I'm afraid
> >>>> FEAT_BBM does not cover this stage. Perhaps my understanding of ARM
> >>>> FEAT_BBM is wrong, and I hope someone can correct me.
> >>>> Actually, the solution I first considered was to use the
> >>>> stop_machine method, but we have products that rely on
> >>>> /proc/sys/vm/nr_overcommit_hugepages to dynamically use hugepages,
> >>>> so I have to consider performance issues. If your product does not
> >>>> change the number of huge pages after booting, using stop_machine()
> >>>> may be a feasible way.
> >>>> So far, I still haven't come up with a good solution.
> >>> I do have a patch that's similar to stop_machine() -- it uses NMI IPIs
> >>> to pause/resume remote CPUs while the local one is doing BBM.
> >>>
> >>> Note that the problem of updating vmemmap for struct page[], as I see
> >>> it, is beyond hugeTLB HVO. I think it impacts virtio-mem and memory
> >>> hot removal in general [2]. On arm64, we would need to support BBM on
> >>> vmemmap so that we can fix the problem with offlining memory (or to be
> >>> precise, unmapping offlined struct page[]), by mapping offlined struct
> >>> page[] to a read-only page of dummy struct page[], similar to
> >>> ZERO_PAGE(). (Or we would have to make extremely invasive changes to
> >>> the reader side, i.e., all speculative PFN walkers.)
> >>>
> >>> In case you are interested in testing my approach, you can swap your
> >>> patch 2 with the following:
> >> I don't have an NMI IPI capable ARM machine on hand, so I think this feature
> >> depends on a higher version of the ARM cpu.
> > (Pseudo) NMI does require GICv3 (released in 2015). But that's
> > independent from CPU versions. Just to double check: you don't have
> > GICv3 (rather than not have CONFIG_ARM64_PSEUDO_NMI=y or
> > irqchip.gicv3_pseudo_nmi=1), is that correct?
> >
> > Even without GICv3, IPIs can be masked but still work, with a less
> > bounded latency.
> Oh, I misunderstood. Pseudo NMI is available. We have
> CONFIG_ARM64_PSEUDO_NMI=y but did not set irqchip.gicv3_pseudo_nmi=1
> by default. So I can test this solution after enabling it on the
> kernel command line.
>
> >> What I worried about was that other cores would be interrupted
> >> frequently (8 times for every 2M and 4096 times for every 1G) and would
> >> then wait for the page table update to complete before resuming.
> > Catalin has suggested batching, and to echo what he said [1]: it's
> > possible to make all vmemmap changes from a single HVO/de-HVO
> > operation into *one batch*.
> >
> > [1] https://lore.kernel.org/linux-mm/ZcN7P0CGUOOgki71@arm.com/
> >
> >> If there are workloads
> >> running on other cores, performance may be affected. This implementation
> >> speeds up stopping and resuming other cores, but they still have to wait
> >> for the update to finish.
> > How often does your use case trigger HVO/de-HVO operations?
> >
> > For our VM use case, it's generally correlated to VM lifetimes, i.e.,
> > how often VM bin-packing happens. For our THP use case, it can be more
> > often, but I still don't think we would trigger HVO/de-HVO every
> > minute. So with NMI IPIs, IMO, the performance impact would be
> > acceptable to our use cases.
> >
> We have many use cases, so I'm not thinking about a specific one but
> rather the general case. I will test the performance impact of different
> HVO trigger frequencies, such as triggering HVO while running redis.

Thanks, and if it's not good enough for whatever you are going to
test, we can batch the updates at least at the PTE level, or even at
the PMD level.
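
As a rough illustration of the numbers discussed above, here is a small
userspace sketch (not kernel code) showing where the "8 times every 2M
and 4096 times every 1G" figures come from and how batching would reduce
the number of pause/resume rounds. It assumes 4 KiB base pages and a
64-byte struct page, neither of which is stated in this thread, and the
helper names vmemmap_pages() and pause_rounds() are made up for this
example.

```c
#include <assert.h>

/* Assumptions, not taken from the thread: 4 KiB base pages and a
 * 64-byte struct page. */
#define BASE_PAGE_SIZE   4096UL
#define STRUCT_PAGE_SIZE 64UL

/* Number of vmemmap pages (and therefore vmemmap PTEs) that hold the
 * struct page[] array describing one huge page of the given size. */
static unsigned long vmemmap_pages(unsigned long hugepage_size)
{
	unsigned long nr_struct_pages;

	assert(hugepage_size % BASE_PAGE_SIZE == 0);
	nr_struct_pages = hugepage_size / BASE_PAGE_SIZE;
	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Number of times remote CPUs get paused if they are paused once per
 * `batch` PTE updates (rounded up). batch == 1 models one pause per
 * PTE; batch == nr_ptes models one pause for the whole operation. */
static unsigned long pause_rounds(unsigned long nr_ptes, unsigned long batch)
{
	assert(batch > 0);
	return (nr_ptes + batch - 1) / batch;
}
```

Under those assumptions, vmemmap_pages(2UL << 20) gives 8 and
vmemmap_pages(1UL << 30) gives 4096, matching the counts quoted above.
For a 1G page, pausing once per 512 PTEs (one PTE table) would take the
4096 rounds down to 8, and a single batch for the whole operation would
take it down to 1.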