[PATCH v2] arm64/mm: pmd_mkinvalid() must handle swap pmds

Zi Yan ziy at nvidia.com
Wed May 1 05:07:55 PDT 2024


On 1 May 2024, at 7:38, Ryan Roberts wrote:

> Pulling in David, who may be able to advise...
>
>
> On 01/05/2024 12:35, Ryan Roberts wrote:
>> Zi Yan, I'm hoping you might have some input on the below...
>>
>>
>> On 30/04/2024 14:31, Ryan Roberts wrote:
>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>> (non-present) migration entry. It calls pmdp_invalidate()
>>> unconditionally on the pmdp and only determines if it is present or not
>>> based on the returned old pmd.
>>>
>>> But arm64's pmd_mkinvalid(), called by pmdp_invalidate(),
>>> unconditionally sets the PMD_PRESENT_INVALID flag, which causes future
>>> pmd_present() calls to return true - even for a swap pmd. Therefore any
>>> lockless pgtable walker could see the migration entry pmd in this state
>>> and start interpretting the fields (e.g. pmd_pfn()) as if it were
>>> present, leading to BadThings (TM). GUP-fast appears to be one such
>>> lockless pgtable walker.
>>>
>>> While the obvious fix is for core-mm to avoid such calls for non-present
>>> pmds (pmdp_invalidate() will also issue TLBI which is not necessary for
>>> this case either), all other arches that implement pmd_mkinvalid() do it
>>> in such a way that it is robust to being called with a non-present pmd.
>>
>> OK the plot thickens; The tests I wrote to check that pmd_mkinvalid() is safe for swap entries fails on x86_64. See below...
>>
>>> So it is simpler and safer to make arm64 robust too. This approach means
>>> we can even add tests to debug_vm_pgtable.c to validate the required
>>> behaviour.
>>>
>>> This is a theoretical bug found during code review. I don't have any
>>> test case to trigger it in practice.
>>>
>>> Cc: stable at vger.kernel.org
>>> Fixes: 53fa117bb33c ("arm64/mm: Enable THP migration")
>>> Signed-off-by: Ryan Roberts <ryan.roberts at arm.com>
>>> ---
>>>
>>> Hi all,
>>>
>>> v1 of this fix [1] took the approach of fixing core-mm to never call
>>> pmdp_invalidate() on a non-present pmd. But Zi Yan highlighted that only arm64
>>> suffers this problem; all other arches are robust. So his suggestion was to
>>> instead make arm64 robust in the same way and add tests to validate it. Despite
>>> my stated reservations in the context of the v1 discussion, having thought on it
>>> for a bit, I now agree with Zi Yan. Hence this post.
>>>
>>> Andrew has v1 in mm-unstable at the moment, so probably the best thing to do is
>>> remove it from there and have this go in through the arm64 tree? Assuming there
>>> is agreement that this approach is right one.
>>>
>>> This applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240425170704.3379492-1-ryan.roberts@arm.com/
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>  arch/arm64/include/asm/pgtable.h | 12 +++++--
>>>  mm/debug_vm_pgtable.c            | 61 ++++++++++++++++++++++++++++++++
>>>  2 files changed, 71 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index afdd56d26ad7..7d580271a46d 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -511,8 +511,16 @@ static inline int pmd_trans_huge(pmd_t pmd)
>>>
>>>  static inline pmd_t pmd_mkinvalid(pmd_t pmd)
>>>  {
>>> -	pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>>> -	pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>>> +	/*
>>> +	 * If not valid then either we are already present-invalid or we are
>>> +	 * not-present (i.e. none or swap entry). We must not convert
>>> +	 * not-present to present-invalid. Unbelievably, the core-mm may call
>>> +	 * pmd_mkinvalid() for a swap entry and all other arches can handle it.
>>> +	 */
>>> +	if (pmd_valid(pmd)) {
>>> +		pmd = set_pmd_bit(pmd, __pgprot(PMD_PRESENT_INVALID));
>>> +		pmd = clear_pmd_bit(pmd, __pgprot(PMD_SECT_VALID));
>>> +	}
>>>
>>>  	return pmd;
>>>  }
>>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>>> index 65c19025da3d..7e9c387d06b0 100644
>>> --- a/mm/debug_vm_pgtable.c
>>> +++ b/mm/debug_vm_pgtable.c
>>> @@ -956,6 +956,65 @@ static void __init hugetlb_basic_tests(struct pgtable_debug_args *args) { }
>>>  #endif /* CONFIG_HUGETLB_PAGE */
>>>
>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +#if !defined(__HAVE_ARCH_PMDP_INVALIDATE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
>>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args)
>>> +{
>>
>> Printing various values at different locations in this function for debug:
>>
>>> +	unsigned long max_swap_offset;
>>> +	swp_entry_t swp_set, swp_clear, swp_convert;
>>> +	pmd_t pmd_set, pmd_clear;
>>> +
>>> +	/*
>>> +	 * See generic_max_swapfile_size(): probe the maximum offset, then
>>> +	 * create swap entry will all possible bits set and a swap entry will
>>> +	 * all bits clear.
>>> +	 */
>>> +	max_swap_offset = swp_offset(pmd_to_swp_entry(swp_entry_to_pmd(swp_entry(0, ~0UL))));
>>> +	swp_set = swp_entry((1 << MAX_SWAPFILES_SHIFT) - 1, max_swap_offset);
>>> +	swp_clear = swp_entry(0, 0);
>>> +
>>> +	/* Convert to pmd. */
>>> +	pmd_set = swp_entry_to_pmd(swp_set);
>>> +	pmd_clear = swp_entry_to_pmd(swp_clear);
>>
>> [    0.702163] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: valid: pmd_set=f800000000000000, pmd_clear=7fffffffffffe00
>>
>>> +
>>> +	/*
>>> +	 * Sanity check that the pmds are not-present, not-huge and swap entry
>>> +	 * is recoverable without corruption.
>>> +	 */
>>> +	WARN_ON(pmd_present(pmd_set));
>>> +	WARN_ON(pmd_trans_huge(pmd_set));
>>> +	swp_convert = pmd_to_swp_entry(pmd_set);
>>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>>> +	WARN_ON(pmd_present(pmd_clear));
>>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
>>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
>>> +
>>> +	/* Now invalidate the pmd. */
>>> +	pmd_set = pmd_mkinvalid(pmd_set);
>>> +	pmd_clear = pmd_mkinvalid(pmd_clear);
>>
>> [    0.704452] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: invalid: pmd_set=f800000000000000, pmd_clear=7ffffffffe00e00
>>
>>> +
>>> +	/*
>>> +	 * Since its a swap pmd, invalidation should effectively be a noop and
>>> +	 * the checks we already did should give the same answer. Check the
>>> +	 * invalidation didn't corrupt any fields.
>>> +	 */
>>> +	WARN_ON(pmd_present(pmd_set));
>>> +	WARN_ON(pmd_trans_huge(pmd_set));
>>> +	swp_convert = pmd_to_swp_entry(pmd_set);
>>
>> [    0.706461] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: set: swp=7c03ffffffffffff (1f, 3ffffffffffff), convert=7c03ffffffffffff (1f, 3ffffffffffff)
>>
>>> +	WARN_ON(swp_type(swp_set) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_set) != swp_offset(swp_convert));
>>> +	WARN_ON(pmd_present(pmd_clear));
>>> +	WARN_ON(pmd_trans_huge(pmd_clear));
>>> +	swp_convert = pmd_to_swp_entry(pmd_clear);
>>
>> [    0.708841] debug_vm_pgtable: [swp_pmd_mkinvalid_tests  ]: clear: swp=0 (0, 0), convert=ff8 (0, ff8)
>>
>>> +	WARN_ON(swp_type(swp_clear) != swp_type(swp_convert));
>>> +	WARN_ON(swp_offset(swp_clear) != swp_offset(swp_convert));
>>
>> This line fails on x86_64.
>>
>> The logs show that the offset is indeed being corrupted by pmd_mkinvalid(); 0 -> 0xff8.
>>
>> I think this is due to x86's pmd_mkinvalid() assuming the pmd is present; pmd_flags() and pmd_pfn() do all sorts of weird and wonderful things.
>>
>> So does this take us full circle? Are we now back to modifying the core-mm to never call pmd_mkinvalid() on a non-present entry? If so, then I guess we should remove the arm64 fix from for-next/fixes.

If x86_64's pmd_mkinvalid() also corrupts swap entries, yes, your original fix
is better. I will dig into the x86 code more to figure out what goes wrong.
Last time, I only checked PAGE_* bits in these pmd|pte_* operations.
Sorry for the misinformation.

>>
>>> +}
>>> +#else
>>> +static void __init swp_pmd_mkinvalid_tests(struct pgtable_debug_args *args) { }
>>> +#endif /* !__HAVE_ARCH_PMDP_INVALIDATE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>> +
>>>  static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>>  {
>>>  	pmd_t pmd;
>>> @@ -982,6 +1041,8 @@ static void __init pmd_thp_tests(struct pgtable_debug_args *args)
>>>  	WARN_ON(!pmd_trans_huge(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>>  	WARN_ON(!pmd_present(pmd_mkinvalid(pmd_mkhuge(pmd))));
>>>  #endif /* __HAVE_ARCH_PMDP_INVALIDATE */
>>> +
>>> +	swp_pmd_mkinvalid_tests(args);
>>>  }
>>>
>>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> --
>>> 2.25.1
>>>
>>


--
Best Regards,
Yan, Zi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 854 bytes
Desc: OpenPGP digital signature
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20240501/1f9af1c6/attachment-0001.sig>


More information about the linux-arm-kernel mailing list