[RFC PATCH 1/3] KVM: arm64: Fix possible memory leak in kvm stage2
wangyanan (Y)
wangyanan55 at huawei.com
Tue Dec 1 15:08:42 EST 2020
On 2020/12/2 2:15, Will Deacon wrote:
> On Wed, Dec 02, 2020 at 01:19:35AM +0800, wangyanan (Y) wrote:
>> On 2020/12/1 22:16, Will Deacon wrote:
>>> On Tue, Dec 01, 2020 at 03:21:23PM +0800, wangyanan (Y) wrote:
>>>> On 2020/11/30 21:21, Will Deacon wrote:
>>>>> On Mon, Nov 30, 2020 at 08:18:45PM +0800, Yanan Wang wrote:
>>>>>> @@ -476,6 +477,7 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>>>>> /* There's an existing valid leaf entry, so perform break-before-make */
>>>>>> kvm_set_invalid_pte(ptep);
>>>>>> kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
>>>>>> + put_page(virt_to_page(ptep));
>>>>>> kvm_set_valid_leaf_pte(ptep, phys, data->attr, level);
>>>>>> out:
>>>>>> data->phys += granule;
>>>>> Isn't this hunk alone sufficient to solve the problem?
>>>>>
>>>> Not quite sufficient. When the old ptep is valid and the old pte
>>>> equals the new pte, kvm_set_valid_leaf_pte() still returns "true",
>>>> and get_page() will still be called.
>>> I had a go at fixing this without introducing refcounting to the hyp stage-1
>>> case, and ended up with the diff below. What do you think?
>> Functionally this diff looks fine to me. A small comment inline, please see
>> below.
>>
>> I had made an alternative fix (after sending v1) and it looks much more
>> concise.
>>
>> If you're ok with it, I can send it as v2 (together with patch#2 and #3)
>> after some tests.
> Thanks.
>
>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>> index 0271b4a3b9fe..b232bdd142a6 100644
>> --- a/arch/arm64/kvm/hyp/pgtable.c
>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>> @@ -470,6 +470,9 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>> if (!kvm_block_mapping_supported(addr, end, phys, level))
>> return false;
>>
>> + if (kvm_pte_valid(*ptep))
>> + put_page(virt_to_page(ptep));
>> +
>> if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
>> goto out;
> This is certainly smaller, but see below.
>
>>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
>>> index 0271b4a3b9fe..78e2c0dc47ae 100644
>>> --- a/arch/arm64/kvm/hyp/pgtable.c
>>> +++ b/arch/arm64/kvm/hyp/pgtable.c
>>> @@ -170,23 +170,16 @@ static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
>>> smp_store_release(ptep, pte);
>>> }
>>> -static bool kvm_set_valid_leaf_pte(kvm_pte_t *ptep, u64 pa, kvm_pte_t attr,
>>> - u32 level)
>>> +static kvm_pte_t kvm_init_valid_leaf_pte(u64 pa, kvm_pte_t attr, u32 level)
>>> {
>>> - kvm_pte_t old = *ptep, pte = kvm_phys_to_pte(pa);
>>> + kvm_pte_t pte = kvm_phys_to_pte(pa);
>>> u64 type = (level == KVM_PGTABLE_MAX_LEVELS - 1) ? KVM_PTE_TYPE_PAGE :
>>> KVM_PTE_TYPE_BLOCK;
>>> pte |= attr & (KVM_PTE_LEAF_ATTR_LO | KVM_PTE_LEAF_ATTR_HI);
>>> pte |= FIELD_PREP(KVM_PTE_TYPE, type);
>>> pte |= KVM_PTE_VALID;
>>> -
>>> - /* Tolerate KVM recreating the exact same mapping. */
>>> - if (kvm_pte_valid(old))
>>> - return old == pte;
>>> -
>>> - smp_store_release(ptep, pte);
>>> - return true;
>>> + return pte;
>>> }
>>> static int kvm_pgtable_visitor_cb(struct kvm_pgtable_walk_data *data, u64 addr,
>>> @@ -341,12 +334,17 @@ static int hyp_map_set_prot_attr(enum kvm_pgtable_prot prot,
>>> static bool hyp_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>> kvm_pte_t *ptep, struct hyp_map_data *data)
>>> {
>>> + kvm_pte_t new, old = *ptep;
>>> u64 granule = kvm_granule_size(level), phys = data->phys;
>>> if (!kvm_block_mapping_supported(addr, end, phys, level))
>>> return false;
>>> - WARN_ON(!kvm_set_valid_leaf_pte(ptep, phys, data->attr, level));
>>> + /* Tolerate KVM recreating the exact same mapping. */
>>> + new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>> + if (old != new && !WARN_ON(kvm_pte_valid(old)))
>>> + smp_store_release(ptep, new);
>>> +
>>> data->phys += granule;
>>> return true;
>>> }
>>> @@ -465,19 +463,24 @@ static bool stage2_map_walker_try_leaf(u64 addr, u64 end, u32 level,
>>> kvm_pte_t *ptep,
>>> struct stage2_map_data *data)
>>> {
>>> + kvm_pte_t new, old = *ptep;
>>> u64 granule = kvm_granule_size(level), phys = data->phys;
>>> if (!kvm_block_mapping_supported(addr, end, phys, level))
>>> return false;
>>> - if (kvm_set_valid_leaf_pte(ptep, phys, data->attr, level))
>>> - goto out;
>>> + new = kvm_init_valid_leaf_pte(phys, data->attr, level);
>>> + if (kvm_pte_valid(old)) {
>>> + /*
>>> + * There's an existing valid leaf entry, so perform
>>> + * break-before-make.
>>> + */
>>> + kvm_set_invalid_pte(ptep);
>>> + kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, data->mmu, addr, level);
>>> + put_page(virt_to_page(ptep));
>> When the old PTE is valid and equals the new one, we still perform
>> break-before-make and install the new PTE, but neither is actually
>> necessary in this case.
> Agreed, but I don't think that case should happen for stage-2 mappings.
> That's why I prefer my diff here, as it makes that 'old == new' logic
> specific to the hyp mappings.
>
> But I'll leave it all up to you (these diffs are only intended to be
> helpful).
>
> Will
Hi Will,
In my opinion, the 'old == new' case can happen in stage-2 translation
too, especially under concurrent access by multiple vCPUs.
For example, when merging tables into a block, we perform
break-before-make on a valid old PTE in function stage2_map_walk_pre().
If other vCPUs are accessing the same GPA range at that point, their
MMUs will walk to the same PTE as above, and they may catch the window
in which the target PTE has been invalidated for BBM by the vCPU
holding the mmu_lock, before that PTE is made valid again.
As a result, a translation fault is triggered on those vCPUs, and they
will then wait for the mmu_lock to install exactly the *same* mapping
(old == new).
Thanks,
Yanan