[PATCH v3 0/2] arm64/mm: Enable color zero pages

Catalin Marinas catalin.marinas at arm.com
Mon Sep 28 11:22:06 EDT 2020


Hi Gavin,

On Mon, Sep 28, 2020 at 05:22:54PM +1000, Gavin Shan wrote:
> Testing
> =======
> [1] The experiment reveals how heavily L1 data cache misses impact
>     the overall application performance. The machine where the test
>     is carried out has the following L1 data cache topology, and the
>     host kernel has the following configuration.
> 
>     The test case allocates contiguous page frames through HugeTLBfs
>     and reads 4 bytes of data from the same offset (0x0) in each of
>     these (N) contiguous page frames, where N is 8 or 9 respectively
>     in the two test cases below. This is repeated one million times.
> 
>     Note that 8 is the number of L1 data cache ways. Since one L1
>     way covers 64 sets x 64 bytes = 4KB, the same offset in page
>     frames that are 4KB apart maps to the same set, so N=9 exceeds
>     the 8 ways and causes L1 cache thrashing on one particular set.
> 
>     Host:      CONFIG_ARM64_PAGE_SHIFT=12
>                DEFAULT_HUGE_PAGE_SIZE=2MB
>     L1 dcache: cache-line-size=64
>                number-of-sets=64
>                number-of-ways=8
> 
>                             N=8           N=9
>     ------------------------------------------------------------------
>     cache-misses:           43,429        9,038,460
>     L1-dcache-load-misses:  43,429        9,038,460
>     seconds time elapsed:   0.299206372   0.722253140   (2.41 times)
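
For reference, a minimal userspace sketch of the kind of test described
in [1], assuming 2MB default huge pages, 4KB base pages and pre-reserved
hugetlb pages; the macro names, the memset warm-up and the way N is
passed on the command line are illustrative guesses, not Gavin's actual
test code:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HUGE_SZ     (2UL << 20)     /* DEFAULT_HUGE_PAGE_SIZE=2MB */
    #define FRAME_SZ    4096UL          /* CONFIG_ARM64_PAGE_SHIFT=12 */
    #define ITERATIONS  1000000UL

    int main(int argc, char **argv)
    {
        unsigned long i, n, nframes = argc > 1 ? strtoul(argv[1], NULL, 0) : 8;
        volatile unsigned int *p;
        unsigned int sum = 0;
        char *buf;

        /* one 2MB huge page gives physically contiguous 4KB frames */
        buf = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(buf, 0x5a, HUGE_SZ);     /* fault the huge page in */

        for (i = 0; i < ITERATIONS; i++) {
            for (n = 0; n < nframes; n++) {
                /*
                 * 4-byte load from offset 0x0 of each frame; frames
                 * 4KB apart hit the same L1 set, so nframes=9 exceeds
                 * the 8 ways and thrashes that set.
                 */
                p = (volatile unsigned int *)(buf + n * FRAME_SZ);
                sum += *p;
            }
        }

        printf("sum=%u\n", sum);
        munmap(buf, HUGE_SZ);
        return 0;
    }

Running it with 8 and then 9 frames under perf stat (e.g. "perf stat
./l1-thrash 8" vs "perf stat ./l1-thrash 9", binary name illustrative)
reproduces the N=8/N=9 comparison in the table above.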
> 
> [2] The experiment should have been carried out on a machine where the
>     L1 data cache capacity of one particular way is larger than 4KB.
>     However, I'm unable to find such a machine, so I evaluate the
>     performance impact caused by L2 data cache thrashing instead. The
>     experiment is carried out on a machine with the following L1/L2
>     data cache topology. The host kernel configuration is the same as
>     in [1].
> 
>     The corresponding test program allocates contiguous page frames
>     through HugeTLBfs and builds a VMA backed by zero pages. The
>     contiguous pages are read sequentially from fixed offset (0) in
>     steps of 32KB, 8 times. After that, the VMA backed by zero pages
>     is read sequentially in steps of 4KB, once. This is repeated
>     eight million times.
> 
>     Note that 32KB is the capacity of one L2 data cache way (512 sets
>     x 64 bytes) and 8 is the number of L2 data cache ways. This
>     experiment is designed to cause L2 data cache thrashing on one
>     particular set.
> 
>     L1 dcache:  <same as [1]>
>     L2 dcache:  cache-line-size=64
>                 number-of-sets=512
>                 number-of-ways=8
> 
>     -----------------------------------------------------------------------
>     cache-references:       1,427,213,737    1,421,394,472
>     cache-misses:              35,804,552       42,636,698
>     L1-dcache-load-misses:     35,804,552       42,636,698
>     seconds time elapsed:   2.602511671      2.098198172      (+19.3%)
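
Along the same lines, a rough sketch of the access pattern described in
[2], assuming the zero-page-backed VMA is mapped read-only and never
written (so its reads are served by the zero page), and with the 32KB
VMA size and the warm-up write as illustrative assumptions rather than
Gavin's actual parameters:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HUGE_SZ     (2UL << 20)
    #define WAY_SZ      (32UL << 10)    /* one L2 way: 512 sets x 64B = 32KB */
    #define PAGE_SZ     4096UL
    #define ITERATIONS  8000000UL

    int main(void)
    {
        volatile unsigned int *p;
        unsigned int sum = 0;
        unsigned long i, n;
        char *huge, *zero;

        huge = mmap(NULL, HUGE_SZ, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        /* never written, so every read is served by the zero page(s) */
        zero = mmap(NULL, WAY_SZ, PROT_READ,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (huge == MAP_FAILED || zero == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(huge, 0x5a, HUGE_SZ);    /* fault the huge page in */

        for (i = 0; i < ITERATIONS; i++) {
            /* 8 reads, 32KB apart: one line in each of the 8 ways of a set */
            for (n = 0; n < 8; n++) {
                p = (volatile unsigned int *)(huge + n * WAY_SZ);
                sum += *p;
            }
            /* one pass over the zero-page VMA in 4KB steps */
            for (n = 0; n < WAY_SZ / PAGE_SZ; n++) {
                p = (volatile unsigned int *)(zero + n * PAGE_SZ);
                sum += *p;
            }
        }

        printf("sum=%u\n", sum);
        return 0;
    }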

No-one is denying a performance improvement in this very specific case,
but what's missing here is an explanation of how these artificial
benchmarks relate to real-world applications.

-- 
Catalin


