[PATCH v2 1/1] mm/contpte: Optimize loop to reduce redundant operations
Gavin Shan
gshan at redhat.com
Wed Apr 9 17:58:46 PDT 2025
Hi Xavier,
On 4/10/25 1:10 AM, Xavier wrote:
>
> Thank you for carefully reviewing this patch and raising your questions.
> I'll try to explain and answer them below.
>
Not a problem :)
>
> At 2025-04-09 12:09:48, "Gavin Shan" <gshan at redhat.com> wrote:
>> Hi Xavier,
>>
>> On 4/8/25 6:58 PM, Xavier wrote:
>>> This commit optimizes the contpte_ptep_get function by adding early
>>> termination logic. It checks if the dirty and young bits of orig_pte
>>> are already set and skips redundant bit-setting operations during
>>> the loop. This reduces unnecessary iterations and improves performance.
>>>
>>> Signed-off-by: Xavier <xavier_qy at 163.com>
>>> ---
>>> arch/arm64/mm/contpte.c | 22 ++++++++++++++++++++--
>>> 1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index bcac4f55f9c1..034d153d7d19 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -152,6 +152,18 @@ void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> }
>>> EXPORT_SYMBOL_GPL(__contpte_try_unfold);
>>>
>>
>> I'm wondering how it can work. More details are given below.
>>
>>> +#define CHECK_CONTPTE_FLAG(start, ptep, orig_pte, flag) \
>>> + do { \
>>> + int _start = start; \
>>> + pte_t *_ptep = ptep; \
>>> + while (_start++ < CONT_PTES) { \
>>> + if (pte_##flag(__ptep_get(_ptep++))) { \
>>> + orig_pte = pte_mk##flag(orig_pte); \
>>> + break; \
>>> + } \
>>> + } \
>>> + } while (0)
>>> +
nit: the variables _start and _ptep can be dropped since the caller is going to
bail right after CHECK_CONTPTE_FLAG(). However, I'm wondering whether it introduces
a readability burden for contpte_ptep_get(), which is already much more complex
than I thought.
>>
>> CONT_PTES is 16 with the assumption of 4KB base page size. CHECK_CONTPTE_FLAG()
>> collects the flag over ptep[start..CONT_PTES - 1].
>>
>>> pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> {
>>> /*
>>> @@ -169,11 +181,17 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> for (i = 0; i < CONT_PTES; i++, ptep++) {
>>> pte = __ptep_get(ptep);
>>>
>>> - if (pte_dirty(pte))
>>> + if (pte_dirty(pte)) {
>>> orig_pte = pte_mkdirty(orig_pte);
>>> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, young);
>>> + break;
>>> + }
>>>
>>> - if (pte_young(pte))
>>> + if (pte_young(pte)) {
>>> orig_pte = pte_mkyoung(orig_pte);
>>> + CHECK_CONTPTE_FLAG(i, ptep, orig_pte, dirty);
>>> + break;
>>> + }
>>> }
>>>
>>> return orig_pte;
>>
>> There are two issues as far as I can see: (1) The loop stops on any dirty or young flag,
>> so one flag can be missed when the other is seen. For example, when ptep[0] has both the
>> dirty and young flags, only the dirty flag is collected. (2) CHECK_CONTPTE_FLAG() iterates
>> ptep[i..CONT_PTES - 1], conflicting with the outer loop, which iterates ptep[0..CONT_PTES - 1].
>
> No flags will be missed. The outer loop is used to find the first flag,
> which could be either the dirty or the young flag.
> Once this flag (let's assume it's the dirty flag) is found in the i-th PTE,
> the dirty flag of orig_pte is set and the code immediately enters the inner
> loop, namely CHECK_CONTPTE_FLAG. This inner loop continues to check only for
> the young flag, starting from the i-th position; the dirty flag no longer
> needs to be considered.
> If CHECK_CONTPTE_FLAG finds the young flag in the j-th PTE, the young flag
> of orig_pte is set. At this point, both the young and dirty flags of
> orig_pte have been set and no further checks are needed, so both the
> inner and outer loops can be exited directly. This approach avoids
> unnecessary repeated traversals and checks.
>
Thanks for the explanation. I missed that the subsequent young bit is collected
on pte_dirty(). Similarly, the subsequent dirty bit is collected on pte_young().
Now I can see all (dirty | young) bits are collected without loss.
>>
>> Besides, I also doubt how much performance can be gained by bailing out early on
>> (dirty | young). However, it does avoid crossing the L1D cache line boundary where
>> possible. With 4KB base page size, ptep[CONT_PTES] needs 128 bytes, equal to two
>> cache lines. If we can bail out early with luck, we don't have to step into the
>> second cache line. Note that extra checks need more CPU cycles.
>
> Compared to the original function, this code doesn't add any extra checks.
> Even in the worst-case scenario, where neither a dirty nor a young flag is
> found among the 16 PTEs, the number of checks is the same as in the original
> function. If either flag is found early, the optimized patch reduces the
> total number of subsequent checks compared to the original code.
>
There are 32 checks in the original code (assuming we have 4KB base page size
and CONT_PTES == 16) and the number of checks is fixed. With your patch applied,
the worst case is still 32 checks: either no flag is found at all, or the first
flag only shows up on the last entry. When a flag is found at index k, the inner
loop replaces the remaining outer iterations, so the total never exceeds the
original's 32, if I don't miss anything here.
> I'm not sure if my explanation is clear. If you have any further questions,
> feel free to communicate with me again.
>
Thanks for the explanation, it helped me understand how all (dirty | young) bits
are collected without loss.
Thanks,
Gavin