[mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations

From: Xavier <xavier_qy at 163.com>
Date: Wed Apr 16 09:09:36 PDT 2025



At 2025-04-16 20:54:47, "Ryan Roberts" <ryan.roberts at arm.com> wrote:
>On 15/04/2025 09:22, Xavier wrote:
>> This commit optimizes contpte_ptep_get() by adding early-termination
>> logic. Once one of the dirty/young flags has been found, the remaining
>> iterations only need to check for the other flag, and the loop can
>> terminate entirely once both are set. This reduces unnecessary
>> iterations and improves performance.
>> 
>> Signed-off-by: Xavier <xavier_qy at 163.com>
>> ---
>>  arch/arm64/mm/contpte.c | 20 ++++++++++++++++++--
>>  1 file changed, 18 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index bcac4f55f9c1..0acfee604947 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -152,6 +152,16 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>  
>> +/* Note: to improve efficiency, this macro modifies the
>> + * passed-in parameters. */
>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>> +	for (; (start) < CONT_PTES; (start)++, (ptep)++) { \
>> +		if (pte_##flag(__ptep_get(ptep))) { \
>> +			orig_pte = pte_mk##flag(orig_pte); \
>> +			break; \
>> +		} \
>> +	}
>
>I'm really not a fan of this macro; it just obfuscates what is going on. I'd
>personally prefer to see the 2 extra loops open-coded below.
>
>Or even better, could you provide results comparing this 3-loop version to the
>simpler approach I suggested previously? If the performance is similar (which I
>expect it will be, especially given Barry's point that your test always ensures
>the first PTE is both young and dirty) then I'd prefer to go with the simpler code.

I will test the performance differences among these implementations. If the
differences are not significant, I will go with the simpler method; my reading
of that variant is sketched below.
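
A sketch of that single-loop version (my reconstruction, not the exact code
from the earlier mail): accumulate both flags in one pass and break as soon
as both have been seen:

	for (i = 0; i < CONT_PTES; i++, ptep++) {
		pte = __ptep_get(ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);

		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);

		/* Stop once both flags have been collected. */
		if (pte_dirty(orig_pte) && pte_young(orig_pte))
			break;
	}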

At the same time, following Barry's suggestion, I will add some more general
test cases, where the dirty/young bits are not always on the first PTE, to
verify whether the optimization is effective; a rough sketch follows below.
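
Something along these lines (a hypothetical user-space sketch; the mTHP setup
and the timing harness are assumptions, and PG_SIZE/NR_CONT are local names,
not kernel macros):

	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	#define PG_SIZE	4096UL	/* assumed 4K pages */
	#define NR_CONT	16UL	/* assumed CONT_PTES on arm64 with 4K pages */

	int main(void)
	{
		size_t len = NR_CONT * PG_SIZE;
		char *buf;

		if (posix_memalign((void **)&buf, len, len))
			return 1;

		/* Hint for a large folio, so the PTEs may actually be
		 * folded into a contpte block (depends on mTHP config). */
		madvise(buf, len, MADV_HUGEPAGE);

		/* Dirty one page at a random offset, so the dirty/young
		 * bits do not always land on the first PTE of the block. */
		memset(buf + (rand() % NR_CONT) * PG_SIZE, 0x5a, PG_SIZE);

		/* A real test would then exercise and time the kernel
		 * path, e.g. by reading /proc/self/pagemap. */
		return 0;
	}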

Since I'm a bit busy at the moment, it may take me a few days to complete this.

>> +
>>  pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>  {
>>  	/*
>> @@ -169,11 +179,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>  		pte = __ptep_get(ptep);
>>  
>> -		if (pte_dirty(pte))
>> +		if (pte_dirty(pte)) {
>>  			orig_pte = pte_mkdirty(orig_pte);
>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>> +			break;
>> +		}
>>  
>> -		if (pte_young(pte))
>> +		if (pte_young(pte)) {
>>  			orig_pte = pte_mkyoung(orig_pte);
>> +			CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>> +			break;
>> +		}
>>  	}
>>  
>>  	return orig_pte;
>
>If we decide this is all worth the trouble, then I think we can (and *should*,
>in order to be consistent) pull a similar stunt in contpte_ptep_get_lockless().

OK. That function is a bit more complicated; I'll look into how it could be
optimized. My first impression is sketched below.
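
A rough sketch, under the assumption that the loop in
contpte_ptep_get_lockless() looks roughly like the current upstream one and
has to re-check pfn/prot consistency for every entry, so the loop itself
cannot terminate early and only the flag collection can be skipped:

	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
		pte = __ptep_get(ptep);
		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));

		/* The consistency checks must still visit all entries. */
		if (!pte_valid_cont(pte) || pte_pfn(pte) != pfn ||
		    pgprot_val(prot) != pgprot_val(orig_prot))
			goto retry;

		/* Only the flag collection could be short-circuited,
		 * e.g. by switching to a check-only tail loop once
		 * both bits have been seen. */
		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}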

--
Thanks,
Xavier


