[PATCH v4 17/21] KVM: arm64: Convert user_mem_abort() to generic page-table API
Alexandru Elisei
alexandru.elisei at arm.com
Thu Sep 10 09:20:18 EDT 2020
Hi Marc,
On 9/9/20 6:12 PM, Marc Zyngier wrote:
> Hi Alex,
>
> On 2020-09-09 15:20, Alexandru Elisei wrote:
>> Hi Will,
>>
>> On 9/7/20 4:23 PM, Will Deacon wrote:
>>> Convert user_mem_abort() to call kvm_pgtable_stage2_relax_perms() when
>>> handling a stage-2 permission fault and kvm_pgtable_stage2_map() when
>>> handling a stage-2 translation fault, rather than walking the page-table
>>> manually.
>>>
>>> Cc: Marc Zyngier <maz at kernel.org>
>>> Cc: Quentin Perret <qperret at google.com>
>>> Reviewed-by: Gavin Shan <gshan at redhat.com>
>>> Signed-off-by: Will Deacon <will at kernel.org>
>>> ---
>>> arch/arm64/kvm/mmu.c | 124 +++++++++++++++----------------------------
>>> 1 file changed, 44 insertions(+), 80 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 0af48f35c8dd..dc923e873dad 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -1496,18 +1496,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>> {
>>> int ret;
>>> bool write_fault, writable, force_pte = false;
>>> - bool exec_fault, needs_exec;
>>> + bool exec_fault;
>>> + bool device = false;
>>> unsigned long mmu_seq;
>>> - gfn_t gfn = fault_ipa >> PAGE_SHIFT;
>>> struct kvm *kvm = vcpu->kvm;
>>> struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
>>> struct vm_area_struct *vma;
>>> short vma_shift;
>>> + gfn_t gfn;
>>> kvm_pfn_t pfn;
>>> - pgprot_t mem_type = PAGE_S2;
>>> bool logging_active = memslot_is_logging(memslot);
>>> - unsigned long vma_pagesize, flags = 0;
>>> - struct kvm_s2_mmu *mmu = vcpu->arch.hw_mmu;
>>> + unsigned long vma_pagesize;
>>> + enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>>> + struct kvm_pgtable *pgt;
>>>
>>> write_fault = kvm_is_write_fault(vcpu);
>>> exec_fault = kvm_vcpu_trap_is_iabt(vcpu);
>>> @@ -1540,22 +1541,24 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>> vma_pagesize = PAGE_SIZE;
>>> }
>>>
>>> - /*
>>> - * The stage2 has a minimum of 2 level table (For arm64 see
>>> - * kvm_arm_setup_stage2()). Hence, we are guaranteed that we can
>>> - * use PMD_SIZE huge mappings (even when the PMD is folded into PGD).
>>> - * As for PUD huge maps, we must make sure that we have at least
>>> - * 3 levels, i.e, PMD is not folded.
>>> - */
>>> - if (vma_pagesize == PMD_SIZE ||
>>> - (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
>>> - gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
>>> + if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
>>> + fault_ipa &= huge_page_mask(hstate_vma(vma));
>>
>> This looks correct to me - if !kvm_stage2_has_pmd(), then the PMD is folded
>> into the PUD and PGD, and PMD_SIZE == PUD_SIZE. I also like the fact that we
>> update both gfn **and** fault_ipa; the previous version updated only gfn,
>> which made gfn != (fault_ipa >> PAGE_SHIFT).
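
To put concrete numbers on my earlier point (my own sketch, not from the
patch; for a 2MiB hugetlb VMA, huge_page_mask() is simply PMD_MASK):

        fault_ipa = 0x40203000;                 /* faulting address */
        fault_ipa &= PMD_MASK;                  /* -> 0x40200000 */
        gfn = fault_ipa >> PAGE_SHIFT;          /* -> 0x40200 */

Both values are now derived from the masked address, so
gfn == (fault_ipa >> PAGE_SHIFT) holds.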
>>
>>> +
>>> + gfn = fault_ipa >> PAGE_SHIFT;
>>> mmap_read_unlock(current->mm);
>>>
>>> - /* We need minimum second+third level pages */
>>> - ret = kvm_mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm));
>>> - if (ret)
>>> - return ret;
>>> + /*
>>> + * Permission faults just need to update the existing leaf entry,
>>> + * and so normally don't require allocations from the memcache. The
>>> + * only exception to this is when dirty logging is enabled at runtime
>>> + * and a write fault needs to collapse a block entry into a table.
>>> + */
>>> + if (fault_status != FSC_PERM || (logging_active && write_fault)) {
>>> + ret = kvm_mmu_topup_memory_cache(memcache,
>>> + kvm_mmu_cache_min_pages(kvm));
>>> + if (ret)
>>> + return ret;
>>> + }
>>
>> I'm not 100% sure about this.
>>
>> I don't think we gain much over the previous code - if we allocated cache
>> objects that went unused, they would be used the next time user_mem_abort()
>> is called (kvm_mmu_topup_memory_cache() checks if we have the required
>> number of objects in the cache and returns early).
>>
>> I'm not sure the condition is entirely correct either - if stage 2 already
>> has a mapping for the IPA and we only need to set write permissions,
>> according to the condition above we still try to top up the cache, even
>> though we don't strictly need to.
>
> That's because if you are logging, you may have to split an existing block
> mapping and map a single page instead. This requires (at least) an extra
> level, and that's why you need to top up the cache in this case.
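
Right, that makes sense - with logging active, a write fault may have to split
a block mapping, which adds a table level. Restating the condition as a helper
just to check my understanding (a sketch; the function name is mine, not from
the patch):

static bool stage2_fault_needs_memcache(unsigned long fault_status,
                                        bool logging_active,
                                        bool write_fault)
{
        /* Translation faults may have to build missing table levels. */
        if (fault_status != FSC_PERM)
                return true;
        /*
         * Permission faults normally only update an existing leaf; the
         * exception is a logged write that splits a block into a table,
         * which consumes one extra table page from the memcache.
         */
        return logging_active && write_fault;
}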
>
>>
>>> [..]
>>>
>>> -
>>> - if (needs_exec)
>>> - new_pmd = kvm_s2pmd_mkexec(new_pmd);
>>> + if (device)
>>> + prot |= KVM_PGTABLE_PROT_DEVICE;
>>> + else if (cpus_have_const_cap(ARM64_HAS_CACHE_DIC))
>>> + prot |= KVM_PGTABLE_PROT_X;
>>>
>>> - ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
>>> + if (fault_status == FSC_PERM && !(logging_active && writable)) {
>>
>> I don't understand the second part of the condition (!(logging_active &&
>> writable)). With logging active, when we get a fault because of a missing
>> stage 2 entry, we map the IPA as read-only at stage 2. If I understand this
>> code correctly, when the guest then tries to write to the same IPA,
>> writable == true and we map the IPA again instead of relaxing the
>> permissions. Why is that?
>
> See my reply above: logging means potentially adding a new level, so we
> treat it as a new mapping altogether (break the block mapping, TLBI, install
> the new mapping one level down).
>
> All the other cases are happily handled by just relaxing the permissions.
I think I understand now; I didn't realize that we never check whether the IPA
is already mapped when we get a write fault on a memslot with dirty page
logging enabled. We unconditionally map a new page marked writable, regardless
of whether the IPA was mapped before, or whether it was mapped with a block
mapping or a page - something like the sketch below.
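
My mental model of the resulting dispatch (the hunk itself is elided above, so
the argument lists below are my assumption based on the function names in the
commit message, not a quote from the patch):

        if (fault_status == FSC_PERM && !(logging_active && writable)) {
                /* Same leaf, wider permissions: no break-before-make. */
                ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
        } else {
                /* New or replacement mapping: break, TLBI, then map. */
                ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
                                             __pfn_to_phys(pfn), prot,
                                             memcache);
        }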
The code makes sense with that in mind. For what it's worth:
Reviewed-by: Alexandru Elisei <alexandru.elisei at arm.com>
Thanks,
Alex