[PATCH v3 06/21] KVM: arm64: Add support for stage-2 map()/unmap() in generic page-table

Gavin Shan gshan at redhat.com
Thu Sep 3 20:47:40 EDT 2020


Hi Will,

On 9/4/20 2:15 AM, Will Deacon wrote:
> On Thu, Sep 03, 2020 at 01:30:32PM +0100, Will Deacon wrote:
>> On Thu, Sep 03, 2020 at 09:18:27PM +1000, Gavin Shan wrote:
>>> On 8/25/20 7:39 PM, Will Deacon wrote:
>>>> +static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
>>>> +				      kvm_pte_t *ptep,
>>>> +				      struct stage2_map_data *data)
>>>> +{
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!data->anchor)
>>>> +		return 0;
>>>> +
>>>> +	free_page((unsigned long)kvm_pte_follow(*ptep));
>>>> +	put_page(virt_to_page(ptep));
>>>> +
>>>> +	if (data->anchor == ptep) {
>>>> +		data->anchor = NULL;
>>>> +		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
>>>> +	}
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +
>>>
>>> As discussed in another thread, *ptep has already been invalidated in stage2_map_walk_table_pre(),
>>> which means *ptep now has a value of zero. The following call to free_page() will therefore
>>> release the page frame corresponding to physical address 0x0, which is not correct. We might
>>> cache the original value of this page table entry so that it can be used here.
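
(An aside, purely for illustration: a minimal sketch of the caching idea suggested
above. The anchor_pte field and its handling are hypothetical, and this is not the
approach v4 ended up taking; Will's alternative of keeping the pte and clearing only
its valid bit is shown in the diff further down.)

/*
 * Hypothetical sketch only: assumes stage2_map_walk_table_pre() stashes
 * the old entry with "data->anchor_pte = *ptep;" right before calling
 * kvm_set_invalid_pte(), and that a kvm_pte_t anchor_pte field has been
 * added to struct stage2_map_data.
 */
static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
				      kvm_pte_t *ptep,
				      struct stage2_map_data *data)
{
	kvm_pte_t pte;
	int ret = 0;

	if (!data->anchor)
		return 0;

	/*
	 * Only the anchor entry was invalidated by the pre-order walker,
	 * so use the cached value for it and the live entry for the
	 * table ptes below it.
	 */
	pte = (ptep == data->anchor) ? data->anchor_pte : *ptep;
	free_page((unsigned long)kvm_pte_follow(pte));
	put_page(virt_to_page(ptep));

	if (data->anchor == ptep) {
		data->anchor = NULL;
		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
	}

	return ret;
}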
>>
>> Ah, yes, I see what you mean. But it's odd that I haven't run into this
>> myself, so let me try to reproduce the issue first. Another solution is
>> to invalidate the table entry only by clearing the valid bit of the pte,
>> rather than zapping the entire thing to 0, which can be done later when we
>> clear the anchor.
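
(Again for illustration only: the reason clearing just the valid bit is enough is that
the child table's physical address lives in the pte's output-address bits, independent
of the valid bit. The helpers earlier in this patch look roughly like the simplified
sketch below, which ignores the 52-bit PA case, so treat the exact masking as an
assumption rather than the patch's literal code.)

static u64 kvm_pte_to_phys(kvm_pte_t pte)
{
	/* KVM_PTE_ADDR_MASK covers the output-address bits only */
	return pte & KVM_PTE_ADDR_MASK;
}

static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte)
{
	return __va(kvm_pte_to_phys(pte));
}

A table pte zapped to 0 therefore "follows" to physical address 0x0, whereas a pte
with only KVM_PTE_VALID cleared still resolves to the child table, so the post-order
walker above can free the right page once the anchor is cleared.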
> 
> Ok! There are a couple of issues here:
> 
>    1. As you point out, the kvm_pte_follow() above ends up chasing a zeroed
>       pte.
> 
>    2. The reason I'm not seeing this in testing is because the dirty logging
>       code isn't hitting the table -> block case as it should. This is
>       because I'm not handling permission faults properly when a write
>       hits a read-only block entry. In this case, we need to collapse the
>       entry if logging is active.
> 
> Diff below seems to clear all of this up. I'll fold it in for v4.
> 
> Thanks for reporting the problem and helping to debug it.
> 

I saw these changes have been folded into v4, so I tried v4 directly and hugetlbfs
works fine with the changes. Thanks for the fixes.

> --->8
> 
> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
> index dc76fdf31be3..9328830e9464 100644
> --- a/arch/arm64/kvm/hyp/pgtable.c
> +++ b/arch/arm64/kvm/hyp/pgtable.c
> @@ -155,8 +155,8 @@ static kvm_pte_t *kvm_pte_follow(kvm_pte_t pte)
>   
>   static void kvm_set_invalid_pte(kvm_pte_t *ptep)
>   {
> -       kvm_pte_t pte = 0;
> -       WRITE_ONCE(*ptep, pte);
> +       kvm_pte_t pte = *ptep;
> +       WRITE_ONCE(*ptep, pte & ~KVM_PTE_VALID);
>   }
>   
>   static void kvm_set_table_pte(kvm_pte_t *ptep, kvm_pte_t *childp)
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index f28e03dcb897..10b73da6abb2 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -737,11 +737,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>          bool exec_fault;
>          bool device = false;
>          unsigned long mmu_seq;
> -       gfn_t gfn = fault_ipa >> PAGE_SHIFT;
>          struct kvm *kvm = vcpu->kvm;
>          struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>          struct vm_area_struct *vma;
>          short vma_shift;
> +       gfn_t gfn;
>          kvm_pfn_t pfn;
>          bool logging_active = memslot_is_logging(memslot);
>          unsigned long vma_pagesize;
> @@ -780,10 +780,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>          }
>   
>          if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
> -               gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
> +               fault_ipa &= huge_page_mask(hstate_vma(vma));
> +
> +       gfn = fault_ipa >> PAGE_SHIFT;
>          mmap_read_unlock(current->mm);
>   
> -       if (fault_status != FSC_PERM) {
> +       /*
> +        * Permission faults just need to update the existing leaf entry,
> +        * and so normally don't require allocations from the memcache. The
> +        * only exception to this is when dirty logging is enabled at runtime
> +        * and a write fault needs to collapse a block entry into a table.
> +        */
> +       if (fault_status != FSC_PERM || (logging_active && write_fault)) {
>                  ret = kvm_mmu_topup_memory_cache(memcache,
>                                                   kvm_mmu_cache_min_pages(kvm));
>                  if (ret)
> @@ -854,7 +862,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>          else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>                  prot |= KVM_PGTABLE_PROT_X;
>   
> -       if (fault_status == FSC_PERM) {
> +       if (fault_status == FSC_PERM && !(logging_active && writable)) {
>                  ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
>          } else {
>                  ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
> 

Thanks,
Gavin



