[PATCH 2/2] arm64/mm: Enable color zero pages

Robin Murphy robin.murphy at arm.com
Thu Sep 17 06:22:50 EDT 2020


On 2020-09-17 04:35, Gavin Shan wrote:
> Hi Will,
> 
> On 9/16/20 6:28 PM, Will Deacon wrote:
>> On Wed, Sep 16, 2020 at 01:25:23PM +1000, Gavin Shan wrote:
>>> This enables colored zero pages by allocating contiguous page frames
>>> for them. The number of pages is determined by the L1 dCache (or
>>> iCache) size, which is probed from the hardware.
>>>
>>>     * Add cache_total_size() to return the L1 dCache (or iCache) size
>>>
>>>     * Implement setup_zero_pages(), which is called after the page
>>>       allocator begins to work, to allocate the contiguous pages
>>>       needed by the colored zero pages.
>>>
>>>     * Rework ZERO_PAGE() and define __HAVE_COLOR_ZERO_PAGE.
>>>
>>> Signed-off-by: Gavin Shan <gshan at redhat.com>
>>> ---
>>>   arch/arm64/include/asm/cache.h   | 22 ++++++++++++++++++++
>>>   arch/arm64/include/asm/pgtable.h |  9 ++++++--
>>>   arch/arm64/kernel/cacheinfo.c    | 34 +++++++++++++++++++++++++++++++
>>>   arch/arm64/mm/init.c             | 35 ++++++++++++++++++++++++++++++++
>>>   arch/arm64/mm/mmu.c              |  7 -------
>>>   5 files changed, 98 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/cache.h 
>>> b/arch/arm64/include/asm/cache.h
>>> index a4d1b5f771f6..420e9dde2c51 100644
>>> --- a/arch/arm64/include/asm/cache.h
>>> +++ b/arch/arm64/include/asm/cache.h
>>> @@ -39,6 +39,27 @@
>>>   #define CLIDR_LOC(clidr)    (((clidr) >> CLIDR_LOC_SHIFT) & 0x7)
>>>   #define CLIDR_LOUIS(clidr)    (((clidr) >> CLIDR_LOUIS_SHIFT) & 0x7)
>>> +#define CSSELR_TND_SHIFT    4
>>> +#define CSSELR_TND_MASK        (UL(1) << CSSELR_TND_SHIFT)
>>> +#define CSSELR_LEVEL_SHIFT    1
>>> +#define CSSELR_LEVEL_MASK    (UL(7) << CSSELR_LEVEL_SHIFT)
>>> +#define CSSELR_IND_SHIFT    0
>>> +#define CSSELR_IND_MASK        (UL(1) << CSSELR_IND_SHIFT)
>>> +
>>> +#define CCSIDR_64_LS_SHIFT    0
>>> +#define CCSIDR_64_LS_MASK    (UL(7) << CCSIDR_64_LS_SHIFT)
>>> +#define CCSIDR_64_ASSOC_SHIFT    3
>>> +#define CCSIDR_64_ASSOC_MASK    (UL(0x1FFFFF) << CCSIDR_64_ASSOC_SHIFT)
>>> +#define CCSIDR_64_SET_SHIFT    32
>>> +#define CCSIDR_64_SET_MASK    (UL(0xFFFFFF) << CCSIDR_64_SET_SHIFT)
>>> +
>>> +#define CCSIDR_32_LS_SHIFT    0
>>> +#define CCSIDR_32_LS_MASK    (UL(7) << CCSIDR_32_LS_SHIFT)
>>> +#define CCSIDR_32_ASSOC_SHIFT    3
>>> +#define CCSIDR_32_ASSOC_MASK    (UL(0x3FF) << CCSIDR_32_ASSOC_SHIFT)
>>> +#define CCSIDR_32_SET_SHIFT    13
>>> +#define CCSIDR_32_SET_MASK    (UL(0x7FFF) << CCSIDR_32_SET_SHIFT)
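As a point of reference, decoding these fields into a total size would
look roughly like the sketch below, assuming the usual CCSIDR_EL1
encoding (ccsidr_to_size() is a hypothetical helper, not something from
the patch): line size = 16 << LineSize bytes, ways = Associativity + 1,
sets = NumSets + 1.

    #include <linux/bitfield.h>
    #include <linux/types.h>

    /* Hypothetical: turn a raw CCSIDR_EL1 value into a size in bytes. */
    static unsigned long ccsidr_to_size(u64 ccsidr)
    {
            unsigned long line = 16UL << FIELD_GET(CCSIDR_64_LS_MASK, ccsidr);
            unsigned long ways = FIELD_GET(CCSIDR_64_ASSOC_MASK, ccsidr) + 1;
            unsigned long sets = FIELD_GET(CCSIDR_64_SET_MASK, ccsidr) + 1;

            return line * ways * sets;      /* e.g. 64 * 8 * 64 = 32KB */
    }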
>>
>> I don't think we should be inferring cache structure from these register
>> values. The Arm ARM helpfully says:
>>
>>    | You cannot make any inference about the actual sizes of caches based
>>    | on these parameters.
>>
>> so we need to take the topology information from elsewhere.
>>
> 
> Yeah, I also noticed that statement in the spec. However, the L1 cache
> size derived from the above registers matches what "lscpu" reports on
> the machine where I did my tests. Note that "lscpu" depends on sysfs
> entries whose information is retrieved from the ACPI (PPTT) table. The
> number of cache levels is partially retrieved from the system register
> (clidr_el1).
> 
> It's doable to retrieve the L1 cache size from the ACPI (PPTT) table.
> I'll change this accordingly in v2 if the enablement is really needed.
> More clarification is provided below.
> 
>> But before we get into that, can you justify why we need to do this at 
>> all,
>> please? Do you have data to show the benefit of adding this complexity?
>>
> 
> Initially, I found it's a missing feature that has already been enabled
> on mips/s390. Currently, all read-only anonymous VMAs are backed by the
> same zero page, which means all reads from these VMAs are cached in the
> same cache set (though still across multiple ways, where supported). So
> it would be nice to have multiple zero pages backing these read-only
> anonymous VMAs, so that reads from them can be cached in multiple sets
> (and still multiple ways, where supported). Overall, that benefits
> performance.

Is this a concern for true PIPT caches, or is it really just working 
around a pathological case for alias-detecting VIPT caches?

> Unfortunately, I didn't find a machine where the cache way size
> (cache size / number of ways) is larger than the page size. So I ran
> an experiment as an indication of how L1 data cache misses affect the
> overall performance:
> 
>      L1 data cache size:           32KB
>      L1 data cache line size:      64 bytes
>      Number of L1 data cache set:  64
>      Number of L1 data cache ways: 8
>      ----------------------------------------------------------------------
>                    size = (cache_line_size) * (num_of_sets) * (num_of_ways)
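(As a check: 64 bytes/line * 64 sets * 8 ways = 32768 bytes = 32KB,
matching the size above. The way size is 32KB / 8 ways = 4KB, exactly
one page, so reads at a 4KB stride all map to the same cache sets.)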
> 
>      Kernel configuration:
>             VA_BITS:               48
>             PAGE_SIZE:             4KB
>             PMD HugeTLB Page Size: 2MB
> 
>      Experiment:
>             I have a program to do the following things and check the
>             consumed time and L1-data-cache-misses by perf.
> 
>             (1) Allocate (mmap) a PMD HugeTLB page, which is 2MB.
>             (2) Read the mmap'd region in steps of the page size (4KB),
>                 touching 8 or 9 pages. Note 8 is the number of data
>                 cache ways. (A sketch of such a program follows the
>                 results below.)
>             (3) Repeat (2) 1000000 times.
>      Result:
>             (a) when we have 8 for the steps in (2):
>                 37,103      L1-dcache-load-misses
>                 0.217522515 seconds time elapsed
>                 0.217564000 seconds user
>                 0.000000000 seconds sys
>             (b) when we have 9 for the steps in (2):
>                 4,687,932   L1-dcache-load-misses            (126 times)
>                 0.248132105 seconds time elapsed             (+14.2%)
>                 0.248267000 seconds user
>                 0.000000000 seconds sys
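A rough reconstruction of the micro-benchmark described above
(hypothetical: the actual program wasn't posted, and the constants
simply follow the description):

    #include <stdio.h>
    #include <sys/mman.h>

    #define MAP_SZ  (2UL << 20)     /* one 2MB PMD hugetlb page      */
    #define STRIDE  4096UL          /* page-size (4KB) read steps    */
    #define NPAGES  9               /* 8 (= dcache ways) or 9        */
    #define REPEAT  1000000UL

    int main(void)
    {
            volatile char *buf;
            unsigned long i, j, sum = 0;

            buf = mmap(NULL, MAP_SZ, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (buf == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /*
             * With a 4KB way size, every pass of the inner loop hits
             * the same cache sets; 9 lines into 8 ways must evict.
             */
            for (i = 0; i < REPEAT; i++)
                    for (j = 0; j < NPAGES; j++)
                            sum += buf[j * STRIDE];

            printf("%lu\n", sum);
            return 0;
    }

Run it under "perf stat -e L1-dcache-load-misses" with NPAGES set to 8
and then 9 to reproduce the two cases.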

I have a vague feeling this may have come up before, but are there 
real-world applications that have a performance bottleneck on reading 
from uninitialised memory? As far as synthetic benchmarks go, I'm sure 
we could equally come up with one that shows a regression due to real 
data being pushed out of the cache by all those extra zeros ;)

Robin.

> Please let me know if it's worth a v2 that retrieves the cache size
> from the ACPI (PPTT) table. The cost is allocating multiple zero pages,
> and the worst case is falling back to one zero page, as before :)
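In code, that fallback shape could look like the sketch below, modeled
on s390's setup_zero_pages() (assumptions: empty_zero_page becomes a
plain address as on s390, cache_total_size() is the helper this patch
adds, and the order heuristic here is a guess):

    static void __init setup_zero_pages(void)
    {
            unsigned int order = 0;
            struct page *page;

            if (cache_total_size() > PAGE_SIZE)
                    order = min_t(unsigned int,
                                  get_order(cache_total_size()),
                                  MAX_ORDER - 1);

            /* Retry at ever smaller orders; worst case one page. */
            do {
                    empty_zero_page = __get_free_pages(GFP_KERNEL |
                                                       __GFP_ZERO, order);
                    if (empty_zero_page)
                            break;
            } while (order--);

            if (!empty_zero_page)
                    panic("%s: out of memory\n", __func__);

            /* Make each constituent page individually refcountable. */
            page = virt_to_page((void *)empty_zero_page);
            split_page(page, order);

            zero_page_mask = ((PAGE_SIZE << order) - 1) & PAGE_MASK;
    }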
> 
> Cheers,
> Gavin
> 
> 


