[PATCH v4 17/21] KVM: arm64: Convert user_mem_abort() to generic page-table API

Alexandru Elisei alexandru.elisei at arm.com
Thu Sep 10 09:20:18 EDT 2020


Hi Marc,

On 9/9/20 6:12 PM, Marc Zyngier wrote:
> Hi Alex,
>
> On 2020-09-09 15:20, Alexandru Elisei wrote:
>> Hi Will,
>>
>> On 9/7/20 4:23 PM, Will Deacon wrote:
>>> Convert user_mem_abort() to call kvm_pgtable_stage2_relax_perms() when
>>> handling a stage-2 permission fault and kvm_pgtable_stage2_map() when
>>> handling a stage-2 translation fault, rather than walking the page-table
>>> manually.
>>>
>>> Cc: Marc Zyngier <maz at kernel.org>
>>> Cc: Quentin Perret <qperret at google.com>
>>> Reviewed-by: Gavin Shan <gshan at redhat.com>
>>> Signed-off-by: Will Deacon <will at kernel.org>
>>> ---
>>>  arch/arm64/kvm/mmu.c | 124 +++++++++++++++----------------------------
>>>  1 file changed, 44 insertions(+), 80 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 0af48f35c8dd..dc923e873dad 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -1496,18 +1496,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>  {
>>>      int ret;
>>>      bool write_fault, writable, force_pte = false;
>>> -    bool exec_fault, needs_exec;
>>> +    bool exec_fault;
>>> +    bool device = false;
>>>      unsigned long mmu_seq;
>>> -    gfn_t gfn = fault_ipa >> PAGE_SHIFT;
>>>      struct kvm *kvm = vcpu->kvm;
>>>      struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>>      struct vm_area_struct *vma;
>>>      short vma_shift;
>>> +    gfn_t gfn;
>>>      kvm_pfn_t pfn;
>>> -    pgprot_t mem_type = PAGE_S2;
>>>      bool logging_active = memslot_is_logging(memslot);
>>> -    unsigned long vma_pagesize, flags = 0;
>>> -    struct kvm_s2_mmu *mmu = vcpu->arch.hw_mmu;
>>> +    unsigned long vma_pagesize;
>>> +    enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>> +    struct kvm_pgtable *pgt;
>>>
>>>      write_fault = kvm_is_write_fault(vcpu);
>>>      exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>>> @@ -1540,22 +1541,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>>          vma_pagesize = PAGE_SIZE;
>>>      }
>>>
>>> -    /*
>>> -     * The stage2 has a minimum of 2 level table (For arm64 see
>>> -     * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
>>> -     * use PMD_SIZE huge mappings (even when the PMD is folded into PGD).
>>> -     * As for PUD huge maps, we must make sure that we have at least
>>> -     * 3 levels, i.e, PMD is not folded.
>>> -     */
>>> -    if (vma_pagesize == PMD_SIZE ||
>>> -        (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
>>> -        gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
>>> +    if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>>> +        fault_ipa &= huge_page_mask(hstate_vma(vma));
>>
>> This looks correct to me - if !kvm_stage2_has_pmd(), then PMD is folded into
>> PUD and PGD, and PMD_SIZE == PUD_SIZE. I also like the fact that we update both
>> gfn **and** fault_ipa; the previous version updated only gfn, which made
>> gfn != (fault_ipa >> PAGE_SHIFT).
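
(A quick sanity check with made-up numbers, assuming 4K pages and a PMD_SIZE (2MB)
hugepage backing the fault:

	fault_ipa = 0x40123456;
	fault_ipa &= huge_page_mask(...);	/* -> 0x40000000 */
	gfn = fault_ipa >> PAGE_SHIFT;		/* -> 0x40000    */

so gfn == (fault_ipa >> PAGE_SHIFT) still holds after both are updated.)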
>>
>>> +
>>> +    gfn = fault_ipa >> PAGE_SHIFT;
>>>      mmap_read_unlock(current->mm);
>>>
>>> -    /* We need minimum second+third level pages */
>>> -    ret = kvm_mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm));
>>> -    if (ret)
>>> -        return ret;
>>> +    /*
>>> +     * Permission faults just need to update the existing leaf entry,
>>> +     * and so normally don't require allocations from the memcache. The
>>> +     * only exception to this is when dirty logging is enabled at runtime
>>> +     * and a write fault needs to collapse a block entry into a table.
>>> +     */
>>> +    if (fault_status != FSC_PERM || (logging_active && write_fault)) {
>>> +        ret = kvm_mmu_topup_memory_cache(memcache,
>>> +                         kvm_mmu_cache_min_pages(kvm));
>>> +        if (ret)
>>> +            return ret;
>>> +    }
>>
>> I'm not 100% sure about this.
>>
>> I don't think we gain much over the previous code - if we allocated cache
>> objects that we didn't end up using, they would be used the next time
>> user_mem_abort() is called (kvm_mmu_topup_memory_cache() checks if we already
>> have the required number of objects in the cache and returns early).
>>
>> I'm not sure the condition is entirely correct either - if stage 2 already has a
>> mapping for the IPA and we only need to set write permissions, according to the
>> condition above we still try to top up the cache, even though we don't strictly
>> need to.
>
> That's because if you are logging, you may have to split an existing block
> mapping and map a single page instead. This requires (at least) an extra
> level, and that's why you need to top up the cache in this case.
>
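
Got it. Just to restate the top-up decision for myself, as a standalone model of
the condition in the hunk above (the function name and bool parameters are my own,
not kernel code; they just mirror the variables used in the hunk):

	#include <stdbool.h>

	/*
	 * Translation faults always install a new leaf (and possibly new
	 * table levels), so they always need the memcache. Permission
	 * faults normally just rewrite an existing leaf, except when dirty
	 * logging means a write may have to split a block into a table.
	 */
	static bool needs_memcache_topup(bool perm_fault, bool logging_active,
					 bool write_fault)
	{
		return !perm_fault || (logging_active && write_fault);
	}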
>>
>>> [..]
>>>
>>> -
>>> -        if (needs_exec)
>>> -            new_pmd = kvm_s2pmd_mkexec(new_pmd);
>>> +    if (device)
>>> +        prot |= KVM_PGTABLE_PROT_DEVICE;
>>> +    else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>>> +        prot |= KVM_PGTABLE_PROT_X;
>>>
>>> -        ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
>>> +    if (fault_status == FSC_PERM && !(logging_active && writable)) {
>>
>> I don't understand the second part of the condition (!(logging_active &&
>> writable)). With logging active, when we get a fault because of a
>> missing stage 2
>> entry, we map the IPA as read-only at stage 2. If I understand this code
>> correctly, when the guest then tries to write to the same IPA, writable == true
>> and we map the IPA again instead of relaxing the permissions. Why is that?
>
> See my reply above: logging means potentially adding a new level, so we
> treat it as a new mapping altogether (break the block mapping, TLBI, install
> the new mapping one level down).
>
> All the other cases are happily handled by just relaxing the permissions.

I think I understand now; I didn't realize that we never check whether the IPA is
already mapped when we get a write fault on a memslot with dirty page logging
enabled. We unconditionally map a new page marked writable, regardless of whether
the IPA was already mapped, and regardless of whether it was mapped with a block
mapping or a page.
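
If I'm reading the rest of the patch right (the hunk is elided in the quote
above), the tail of user_mem_abort() ends up dispatching roughly like this - the
exact argument lists are from memory, so treat it as a sketch rather than the
literal diff:

	if (fault_status == FSC_PERM && !(logging_active && writable)) {
		/* Existing leaf, same level: just relax the permissions. */
		ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
	} else {
		/*
		 * Everything else, including dirty-logging write faults
		 * (which may need to split a block into a table), goes
		 * through the full map path and can consume the memcache.
		 */
		ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
					     __pfn_to_phys(pfn), prot,
					     memcache);
	}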

The code makes sense with that in mind. For what it's worth:

Reviewed-by: Alexandru Elisei <alexandru.elisei at arm.com>

Thanks,
Alex


