[PATCH v6 0/9] variable-order, large folios for anonymous memory

Kefeng Wang wangkefeng.wang at huawei.com
Mon Nov 13 06:52:47 PST 2023



On 2023/11/13 20:12, Ryan Roberts wrote:
> On 13/11/2023 11:52, Kefeng Wang wrote:
>>
>>
>> On 2023/11/13 18:19, Ryan Roberts wrote:
>>> On 13/11/2023 05:18, Matthew Wilcox wrote:
>>>> On Sun, Nov 12, 2023 at 10:57:47PM -0500, John Hubbard wrote:
>>>>> I've done some initial performance testing of this patchset on an arm64
>>>>> SBSA server. When these patches are combined with the arm64 arch contpte
>>>>> patches in Ryan's git tree (he has conveniently combined everything
>>>>> here: [1]), we are seeing a remarkable, consistent speedup of 10.5x on
>>>>> some memory-intensive workloads. Many test runs, conducted independently
>>>>> by different engineers and on different machines, have convinced me and
>>>>> my colleagues that this is an accurate result.
>>>>>
>>>>> In order to achieve that result, we used the git tree in [1] with
>>>>> following settings:
>>>>>
>>>>>       echo always >/sys/kernel/mm/transparent_hugepage/enabled
>>>>>       echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders
>>>>>
>>>>> This was on an aarch64 machine configured to use a 64KB base page size.
>>>>> That configuration means that the PMD size is 512MB, which is of course
>>>>> too large for practical use as a pure PMD-THP. However, with these
>>>>> small-size (less than PMD-sized) THPs, we get the improvements in TLB
>>>>> coverage, while still getting pages that are small enough to be
>>>>> effectively usable.
>>>>
>>>> That is quite remarkable!
>>>
>>> Yes, agreed - thanks for sharing these results! A very nice Monday morning boost!
>>>
>>>>
>>>> My hope is to abolish the 64kB page size configuration.  ie instead of
>>>> using the mixture of page sizes that you're currently using -- 64k and
>>>> 1M (right?  Order-0, and order-4)
>>>
>>> Not quite; the contpte-size for a 64K page size is 2M/order-5. (and yes, it is
>>> 64K/order-4 for a 4K page size, and 2M/order-7 for a 16K page size. I agree that
>>> intuitively you would expect the order to remain constant, but it doesn't).
>>>
>>> The "recommend" setting above will actually enable order-3 as well even though
>>> there is no HW benefit to this. So the full set of available memory sizes here
>>> is:
>>>
>>> 64K/order-0, 512K/order-3, 2M/order-5, 512M/order-13
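
(For reference, the sizes Ryan lists are just base-page-size << order.
A trivial userspace sketch, assuming the 64K base page, prints the same
table:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long page_size = 64 * 1024;  /* 64K base page */
        int orders[] = { 0, 3, 5, 13 };       /* orders listed above */
        int i;

        /* folio size = base page size << order */
        for (i = 0; i < 4; i++)
            printf("order-%d -> %lu KB\n", orders[i],
                   (page_size << orders[i]) / 1024);
        return 0;
    }

(order-13 lands on 512M because with 64K pages and 8-byte PTEs,
PTRS_PER_PTE is 64K / 8 = 8192, so one PMD entry spans 8192 * 64K = 512M.)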
>>>
>>>> , that 4k, 64k and 2MB (order-0,
>>>> order-4 and order-9) will provide better performance.
>>>>
>>>> Have you run any experiments with a 4kB page size?
>>>
>>> Agree that would be interesting with 64K small-sized THP enabled. And I'd love
>>> to get to a world where we universally deal in variable sized chunks of memory,
>>> aligned on 4K boundaries.
>>>
>>> In my experience though, there are still some performance benefits to 64K base
>>> page vs 4K+contpte; the page tables are more cache efficient for the former case
>>> - 64K of memory is described by 8 bytes in the former vs 8x16=128 bytes in the
>>> latter. In practice the HW will still only read 8 bytes in the latter but that's
>>> taking up a full cache line vs the former, where a single cache line stores 8
>>> entries, each mapping 64K.
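
(A back-of-envelope check of that arithmetic, assuming 8-byte PTEs and
64-byte cache lines:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long region = 64 * 1024;         /* map 64K of memory */
        unsigned long bases[] = { 4096, 65536 };  /* 4K vs 64K base page */
        int i;

        /* one 8-byte PTE per base page in the mapped range */
        for (i = 0; i < 2; i++)
            printf("%2luK base page: %3lu bytes of PTEs\n",
                   bases[i] / 1024, (region / bases[i]) * 8);
        return 0;
    }

(That prints 128 bytes for 4K pages vs 8 bytes for 64K pages; and one
64-byte cache line of 64K PTEs covers 8 * 64K = 512K of address space.)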
>>
>> We tested some benchmarks, e.g. unixbench, lmbench and sysbench, with v5
>> on an arm64 board (for a better evaluation of anon large folios we used
>> ext4, which doesn't support large folios for now); we will test again and
>> send the results once v7 is out.
> 
> Thanks for the testing and for posting the insights!
> 
>>
>> 1) base page 4k  + without anon large folio
>> 2) base page 64k + without anon large folio
>> 3) base page 4k  + with anon large folio + cont-pte(order = 4,0)
>>
>> Most of the test results from v5 show that 3) is a good improvement
>> over 1), but still lower than 2)
> 
> Do you have any understanding what the shortfall is for these particular
> workloads? Certainly the cache spatial locality benefit of the 64K page tables
> could be a factor. But certainly for the workloads I've been looking at, a
> bigger factor is often the fact that executable file-backed memory (elf
> segments) is not in 64K folios and therefore not contpte-mapped. If the iTLB is
> under pressure this can help a lot. I have a change (hack) to force all
> executable mappings to be read-ahead into 64K folios and this gives an
> improvement. But obviously that only works when the file system supports large
> folios (so not ext4 right now). It would certainly be interesting to see just
> how close to native 64K we can get when employing these extra ideas.

No detailed analysis yet, but with a 64k base page there are:
  fewer page faults (rough numbers in the sketch after this list)
  fewer TLB operations
  less zone-lock contention (pcp)
  less buddy split/merge
  no reclaim/compaction when allocating a 64k page, and no fallback logic
  exec mappings natively using 64k pages (the "execfolio" case)
  faster page table operations?
  ...
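
(Rough numbers for the first item, assuming one fault per base page and
no batching of faults:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long region = 1UL << 30;         /* 1 GiB anon mapping */
        unsigned long bases[] = { 4096, 65536 };  /* 4K vs 64K base page */
        int i;

        /* one minor fault per base page populated */
        for (i = 0; i < 2; i++)
            printf("%2luK pages: %lu faults\n",
                   bases[i] / 1024, region / bases[i]);
        return 0;
    }

So touching the whole region takes 262144 faults with 4K pages but only
16384 with 64K pages (anon large folios narrow that gap by faulting in
a whole folio at a time).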

> 
>> , also for some latency-sensitive
>> benchmarks, 2) and 3) may have worse performance than 1).
>>
>> Note: pcp_allowed_order() only allows order <= PAGE_ALLOC_COSTLY_ORDER = 3;
>> for 3), we could enlarge it for better scalability of page allocation on
>> arm64. We didn't test this on v5, but will try enlarging it on v7.
> 
> Yes interesting! I'm hoping to post v7 this week - just waiting for mm-unstable
> to be rebased on v6.7-rc1. I'd be interested to see your results.
> 
Glad to see it.
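
For reference, the check in question looks roughly like this in recent
kernels (mm/page_alloc.c; paraphrased from memory, not verbatim):

    /* Orders eligible for the per-cpu page (pcp) lists */
    static inline bool pcp_allowed_order(unsigned int order)
    {
        if (order <= PAGE_ALLOC_COSTLY_ORDER)  /* i.e. order <= 3 */
            return true;
    #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        if (order == pageblock_order)          /* PMD-sized THPs */
            return true;
    #endif
        return false;
    }

so mid-size orders like 4 and 5 currently bypass the pcp lists and
always take the zone lock.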
>>>
>>> Thanks,
>>> Ryan


