[PATCH v3 02/21] KVM: arm64: Add stand-alone page-table walker infrastructure

Gavin Shan gshan at redhat.com
Wed Sep 2 21:11:37 EDT 2020


Hi Will,

On 9/2/20 9:02 PM, Will Deacon wrote:
> On Wed, Sep 02, 2020 at 04:31:32PM +1000, Gavin Shan wrote:
>> On 8/25/20 7:39 PM, Will Deacon wrote:
>>> The KVM page-table code is intricately tied into the kernel page-table
>>> code and re-uses the pte/pmd/pud/p4d/pgd macros directly in an attempt
>>> to reduce code duplication. Unfortunately, the reality is that there is
>>> an awful lot of code required to make this work, and at the end of the
>>> day you're limited to creating page-tables with the same configuration
>>> as the host kernel. Furthermore, lifting the page-table code to run
>>> directly at EL2 on a non-VHE system (as we plan to do in future
>>> patches) is practically impossible due to the number of dependencies it
>>> has on the core kernel.
>>>
>>> Introduce a framework for walking Armv8 page-tables configured
>>> independently from the host kernel.
>>>
>>> Cc: Marc Zyngier <maz at kernel.org>
>>> Cc: Quentin Perret <qperret at google.com>
>>> Signed-off-by: Will Deacon <will at kernel.org>
>>> ---
>>>    arch/arm64/include/asm/kvm_pgtable.h | 101 ++++++++++
>>>    arch/arm64/kvm/hyp/Makefile          |   2 +-
>>>    arch/arm64/kvm/hyp/pgtable.c         | 290 +++++++++++++++++++++++++++
>>>    3 files changed, 392 insertions(+), 1 deletion(-)
>>>    create mode 100644 arch/arm64/include/asm/kvm_pgtable.h
>>>    create mode 100644 arch/arm64/kvm/hyp/pgtable.c
> 
> [...]
> 
>>> +struct kvm_pgtable_walk_data {
>>> +	struct kvm_pgtable		*pgt;
>>> +	struct kvm_pgtable_walker	*walker;
>>> +
>>> +	u64				addr;
>>> +	u64				end;
>>> +};
>>> +
>>
>> Some of the following functions might be worth inlining, considering
>> their complexity :)
> 
> I'll leave that for the compiler to figure out :)
> 

Ok :)

>>> +static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
>>> +{
>>> +	struct kvm_pgtable pgt = {
>>> +		.ia_bits	= ia_bits,
>>> +		.start_level	= start_level,
>>> +	};
>>> +
>>> +	return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
>>> +}
>>> +
>>
>> It seems @pgt.start_level is assigned the wrong value here.
>> For example, @start_level is 2 when @ia_bits and PAGE_SIZE
>> are 40 and 64KB respectively. In this case, __kvm_pgd_page_idx()
>> always returns zero. However, the extra page covers up the
>> issue. I think something like below might be needed:
>>
>> 	struct kvm_pgtable pgt = {
>> 		.ia_bits	= ia_bits,
>> 		.start_level	= KVM_PGTABLE_MAX_LEVELS - start_level + 1,
>> 	};
> 
> Hmm, we're pulling the start_level right out of the vtcr, so I don't see
> how it can be wrong. In your example, a start_level of 2 seems correct to
> me, as we'll translate 13 bits there, then 13 bits at level 3 which covers
> the 24 bits you need (with a 16-bit offset within the page).
> 
> Your suggestion would give us a start_level of 1, which has a redundant
> level of translation. Maybe you're looking at the levels upside-down? The
> top level is level 0 and each time you walk to a new level, that number
> increases.
> 
> But perhaps I'm missing something. Please could you elaborate if you think
> there's a problem here?
> 

Thanks for the explanation. I think I was misunderstanding the code.
On this particular path, __kvm_pgd_page_idx() is used to calculate
how many pages are needed to hold the PGDs. If I understand correctly,
at most 16 pages are needed for the PGDs (the architectural limit on
stage-2 table concatenation), so the current implementation looks
correct to me.
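
For example (my own numbers, not from the patch): with 4KB pages,
@ia_bits = 43 and @start_level = 1, a single level-1 table would need
8192 entries, so the PGD is spread over the architectural maximum of
16 concatenated pages. I haven't run this configuration, but plugging
the numbers in:

    // PAGE_SIZE:     4KB, so PAGE_SHIFT = 12 and PTRS_PER_PTE = 512
    // @ia_bits:      43
    // @start_level:  1
    //
    // shift = kvm_granule_shift(1 - 1) = ((4 * 9) + 3) = 39
    // mask  = ((1UL << 43) - 1)
    // __kvm_pgd_page_idx(&pgt, -1ULL) = (0x000007ff_ffffffff >> 39) = 15
    // kvm_pgd_pages(43, 1) = 15 + 1 = 16 pages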

There is another question, which might not be relevant. I added some
logging and hopefully my calculation makes sense. I have the following
configuration (values) in my experiment; I'm including the kernel log
to make the information complete:

    [ 5089.107147] kvm_arch_init_vm: kvm@0xfffffe0028460000, type=0x0
    [ 5089.112973] kvm_arm_setup_stage2: kvm@0xfffffe0028460000, type=0x0
    [ 5089.119157]    kvm_ipa_limit=0x2c, phys_shift=0x28
    [ 5089.123936]    kvm->arch.vtcr=0x00000000802c7558
    [ 5089.128552] kvm_init_stage2_mmu: kvm@0xfffffe0028460000
    [ 5089.133765] kvm_pgtable_stage2_init: kvm@0xfffffe0028460000, ia_bits=0x28,start_level=0x2

    PAGE_SIZE:       64KB
    @kvm->arch.vtcr: 0x00000000_802c7558
    @ipa_bits:       40
    @start_level:    2

    #define KVM_PGTABLE_MAX_LEVELS            4U

    static u64 kvm_granule_shift(u32 level)
    {
         return (KVM_PGTABLE_MAX_LEVELS - level) * (PAGE_SHIFT - 3) + 3;
    }

    static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
    {
         u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
         u64 mask = BIT(pgt->ia_bits) - 1;

         return (addr & mask) >> shift;

         // shift = kvm_granule_shift(2 - 1) = ((3 * 13) + 3) = 42
         // mask  = ((1UL << 40) - 1)
         // return (0x000000ff_ffffffff >> 42) = 0
         //
         // QUESTION: Since @ipa_bits is 40 bits, why do we need to shift by 42 bits here?
    }
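
Answering my own question tentatively: I guess the shift isn't meant
to match @ipa_bits. kvm_granule_shift(start_level - 1) is the range
covered by one full page of start-level entries, so a shift larger
than @ipa_bits simply means a single PGD page is enough:

    // One 64KB page holds PTRS_PER_PTE = 2^(16 - 3) = 8192 entries.
    // Each level-2 entry maps kvm_granule_shift(2) = ((2 * 13) + 3) = 29
    // bits, i.e. 512MB, so one page of level-2 entries maps:
    //
    //     8192 * 512MB = 2^13 * 2^29 = 2^42 bytes
    //
    // which is exactly kvm_granule_shift(1) = 42. Since 42 > 40, the
    // whole 40-bit IPA space falls into PGD page 0.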

I was also thinking about the following case, which makes sense
to me. Note that I didn't add debug logging for this case.

    PAGE_SIZE:     4KB
    @ipa_bits:     40
    @start_level:  1

    static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
    {
         u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
         u64 mask = BIT(pgt->ia_bits) - 1;

         return (addr & mask) >> shift;

         // shift = kvm_granule_shift(1 - 1) = ((4 * 9) + 3) = 39
         // mask  = ((1UL << 40) - 1)
         // return (0x000000ff_ffffffff >> 39) = 1
    }
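
If my arithmetic is right, kvm_pgd_pages() would therefore report two
concatenated PGD pages for this configuration:

    // A level-1 table for a 40-bit IPA with 4KB pages needs
    // 2^(40 - 30) = 1024 entries, but one 4KB page only holds
    // PTRS_PER_PTE = 512, hence:
    //
    //     kvm_pgd_pages(40, 1) = __kvm_pgd_page_idx(&pgt, -1ULL) + 1
    //                          = 1 + 1 = 2 pages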

>>> +static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
>>> +{
>>> +	u32 idx;
>>> +	int ret = 0;
>>> +	struct kvm_pgtable *pgt = data->pgt;
>>> +	u64 limit = BIT(pgt->ia_bits);
>>> +
>>> +	if (data->addr > limit || data->end > limit)
>>> +		return -ERANGE;
>>> +
>>> +	if (!pgt->pgd)
>>> +		return -EINVAL;
>>> +
>>> +	for (idx = kvm_pgd_page_idx(data); data->addr < data->end; ++idx) {
>>> +		kvm_pte_t *ptep = &pgt->pgd[idx * PTRS_PER_PTE];
>>> +
>>> +		ret = __kvm_pgtable_walk(data, ptep, pgt->start_level);
>>> +		if (ret)
>>> +			break;
>>> +	}
>>> +
>>> +	return ret;
>>> +}
>>> +
>>
>> I guess we need to bail on the following condition:
>>
>>          if (data->addr >= limit || data->end >= limit)
>>              return -ERANGE;
> 
> What's wrong with the existing check? In particular, I think we _want_
> to support data->end == limit (it's exclusive). If data->addr == limit,
> then we'll have a size of zero and the loop won't run.
> 

I was thinking @limit is exclusive, so we would need to bail when
hitting the ceiling. @limit is derived from @ia_bits: for example,
it's 0x00000100_00000000 when @ia_bits is 40, which is an invalid
address for the guest. But I was wrong in this case too :)

Thanks,
Gavin



