[PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations

Gavin Shan gshan at redhat.com
Wed Apr 9 20:06:31 PDT 2025


Hi Xavier,

On 4/10/25 12:53 PM, Xavier wrote:
> At 2025-04-10 08:58:46, "Gavin Shan" <gshan at redhat.com> wrote:
>>
>> On 4/10/25 1:10 AM, Xavier wrote:
>>>
>>> Thank you for carefully reviewing this patch and raising your questions.
>>> I'll try to explain and answer them below.
>>>
>>
>> Not a problem :)
>>
>>>
>>> At 2025-04-09 12:09:48, "Gavin Shan" <gshan at redhat.com> wrote:
>>>> Hi Xavier,
>>>>
>>>> On 4/8/25 6:58 PM, Xavier wrote:
>>>>> This commit optimizes the contpte_ptep_get function by adding early
>>>>>     termination logic. It checks if the dirty and young bits of orig_pte
>>>>>     are already set and skips redundant bit-setting operations during
>>>>>     the loop. This reduces unnecessary iterations and improves performance.
>>>>>
>>>>> Signed-off-by: Xavier <xavier_qy at 163.com>
>>>>> ---
>>>>>     arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
>>>>>     1 file changed, 20 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>> index bcac4f55f9c1..034d153d7d19 100644
>>>>> --- a/arch/arm64/mm/contpte.c
>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>>>>     }
>>>>>     EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>>>>     
>>>>
>>>> I'm wandering how it can work. More details are given below.
>>>>
>>>>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>>>>> +	do { \
>>>>> +		int _start = start; \
>>>>> +		pte_t *_ptep = ptep; \
>>>>> +		while (_start++ < CONT_PTES) { \
>>>>> +			if (pte_##flag(__ptep_get(_ptep++))) { \
>>>>> +				orig_pte = pte_mk##flag(orig_pte); \
>>>>> +				break; \
>>>>> +			} \
>>>>> +		} \
>>>>> +	} while (0)
>>>>> +
>>
>> nit: the variable _start and _ptep can be dropped since the caller is going to
>> bail after CHECK_CONTPTE_FLAG(). However, I'm wandering it's going to introduce
>> burden to contpte_ptep_get() for its readability, much more complex than I
>> thought.
> 
> Not adding these two variables in the current code has no impact. The main purpose
> of adding them is to improve maintainability and prevent the external code from
> continuing operations unknowingly after the macro has modified the variables.
> 
>>
>>>>
>>>> CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
>>>> collects the flag for ptep[start, CONT_PTES].
>>>>
>>>>>     pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>>     {
>>>>>     	/*
>>>>> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>>     	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>>     		pte = __ptep_get(ptep);
>>>>>     
>>>>> -		if (pte_dirty(pte))
>>>>> +		if (pte_dirty(pte)) {
>>>>>     			orig_pte = pte_mkdirty(orig_pte);
>>>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>>>>> +			break;
>>>>> +		}
>>>>>     
>>>>> -		if (pte_young(pte))
>>>>> +		if (pte_young(pte)) {
>>>>>     			orig_pte = pte_mkyoung(orig_pte);
>>>>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>>>>> +			break;
>>>>> +		}
>>>>>     	}
>>>>>     
>>>>>     	return orig_pte;
>>>>
>>>> There are two issues as I can see: (1) The loop stops on any dirty or young flag. Another
>>>> flag can be missed when one flag is seen. For example, when ptep[0] has both dirty/young
>>>> flag, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates ptep[i, CONT_PTES],
>>>> conflicting to the outer loop, which iterates ptep[0, CONT_PTES].
>>>
>>> No flags will be missed. The outer loop is used to check for the first flag,
>>> which could be either the dirty or young flag.
>>> Once this flag (let's assume it's the dirty flag) is found in the i-th PTE,
>>> the dirty flag of orig_pte will be set, and the code will immediately enter
>>> the inner loop, namely CHECK_CONTPTE_FLAG. This inner loop will continue
>>> to check only for the young flag starting from the i-th position, and we needn't
>>> concern about the dirty flag anymore.
>>> If CHECK_CONTPTE_FLAG finds the young flag in the j-th PTE, the young flag
>>> of orig_pte will be set. At this point, both the young and dirty flags of
>>> orig_pte have been set, and there's no need for further loop judgments, so
>>> the both the inner and outer loops can be exited directly. This approach
>>> reduces unnecessary repeated traversals and judgments.
>>>
>>
>> Thanks for the explanation. I missed that the subsequent young bit is collected
>> on pte_dirty(). Similarly, the subsequent dirty bit is collected on pte_young().
>> Now I can see all (dirty | young) bits are collected with a lost.
>>
>>>>
>>>> Besides, I also doubt how much performance can be gained by bailing early on (dirty | young).
>>>> However, it's avoided to cross the L1D cache line boundary if possible. With 4KB base page
>>>> size, 128 bytes are needed for ptep[CONT_PTES], equal to two cache lines. If we can bail
>>>> early with luck, we don't have to step into another cache line. Note that extra checks needs
>>>> more CPU cycles.
>>>
>>> Compared to the previous function, this code doesn't add any extra checks.
>>> Even in the worst-case scenario, where neither a dirty nor a young flag is
>>> found among the 16 PTEs, the number of checks is the same as in the original
>>> function. If any flag is found earlier, the optimized patch will reduce the
>>> number of subsequent checks for that flag compared to the original code.
>>>
>>
>> There are 32 checks in the original code (assuming we have 4kb base page size
>> and CONT_PTES == 16) and the number of checks is fixed. With your patch applied,
>> the number becomes 32 + (number from CHECK_CONTPTE_FLAG) if all PTEs have
>> pte_dirty() and none of them has pte_young(), if I don't miss anything here.
>>
> 
> If all PTEs are dirty and none are young, the code will enter CHECK_CONTPTE_FLAG
> when it encounters the first dirty PTE. Inside CHECK_CONTPTE_FLAG, it only checks
> for the young bit (16 - 0) times and then exits all loops. In this case, the total number
> of checks is 1 + 16, which is less than the original 32.
> 

You're correct. The outer loop is going to stop on pte_dirty() or pte_young(). Well,
you can see the code becomes hard to be understood, at least to me :)

[...]

Thanks,
Gavin




More information about the linux-arm-kernel mailing list