[PATCH v11 17/43] KVM: arm64: nv: Support multiple nested Stage-2 mmu structures

Ganapatrao Kulkarni gankulkarni at os.amperecomputing.com
Wed Jan 31 01:39:34 PST 2024


Hi Marc,

On 25-01-2024 02:28 pm, Marc Zyngier wrote:
> On Thu, 25 Jan 2024 08:14:32 +0000,
> Ganapatrao Kulkarni <gankulkarni at os.amperecomputing.com> wrote:
>>
>>
>> Hi Marc,
>>
>> On 23-01-2024 07:56 pm, Marc Zyngier wrote:
>>> Hi Ganapatrao,
>>>
>>> On Tue, 23 Jan 2024 09:55:32 +0000,
>>> Ganapatrao Kulkarni <gankulkarni at os.amperecomputing.com> wrote:
>>>>
>>>> Hi Marc,
>>>>
>>>>> +void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	if (is_hyp_ctxt(vcpu)) {
>>>>> +		vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
>>>>> +	} else {
>>>>> +		write_lock(&vcpu->kvm->mmu_lock);
>>>>> +		vcpu->arch.hw_mmu = get_s2_mmu_nested(vcpu);
>>>>> +		write_unlock(&vcpu->kvm->mmu_lock);
>>>>> +	}
>>>>
>>>> Due to a race, a non-existent L2's MMU table gets loaded on some
>>>> vCPUs while booting L1 (noticed when booting L1 with a large number
>>>> of vCPUs). This happens because, at this early stage, E2H
>>>> (hyp-context) is not yet set, so the trap on the ERET of the L1
>>>> boot-strap code results in a context switch as if it were returning
>>>> to L2 (guest entry), loading an uninitialized MMU table on those
>>>> vCPUs and causing unrecoverable traps and aborts.
>>>
>>> I'm not sure I understand the problem you're describing here.
>>>
>>
>> IIUC, when the S2 fault happens, the faulting vCPU gets the pages
>> from the QEMU process, maps them in S2, and copies the code to the
>> allocated memory. Meanwhile, the other vCPUs racing to come online,
>> once they switch over to the dummy S2, find the mapping and return to
>> L1; subsequent execution does not fault but instead fetches from
>> memory where no code exists yet (for some of them), generates a
>> stage 1 instruction abort, and jumps to an abort handler where no
>> code exists either, so they keep aborting. This happens on random
>> vCPUs (no pattern).
> 
> Why is that any different from the way we handle faults in the
> non-nested case? If there is a case where we can map the PTE at S2
> before the data is available, this is a generic bug that can trigger
> irrespective of NV.
> 
>>
>>> What is the race exactly? Why isn't the shadow S2 good enough? Not
>>> having HCR_EL2.VM set doesn't mean we can use the same S2, as the TLBs
>>> are tagged by a different VMID, so staying on the canonical S2 seems
>>> wrong.
>>
>> IMO, it is unnecessary to switch over on the first ERET while L1 is
>> booting and to repeat the faults and page allocations, which become
>> pointless anyway once L1 switches to E2H.
> 
> It is mandated by the architecture. EL1 is, by definition, a different
> translation regime from EL2. So we *must* have a different S2, because
> that defines the boundaries of TLB creation and invalidation. The
> fact that these are the same pages is totally irrelevant.
> 
>> Let L1 always use its S2, which is created by L0. We should even
>> consider avoiding the entry created for L1 in the array of S2-MMUs
>> (the first entry in the array), to avoid unnecessary iteration/lookup
>> when unmapping NestedVMs.
> 
> I'm sorry, but this is just wrong. You are merging the EL1 and EL2
> translation regimes, which is not acceptable.
> 
>> I am anticipating that this unwanted switch-over won't happen once we
>> have NV2-only support in V12?
> 
> V11 is already NV2 only, so I really don't get what you mean here.
> Everything stays the same, and there is nothing to change here.
> 

I am still using V10, since V11 (and also V12/nv-6.9-sr-enforcement) has
issues booting with QEMU. I tried V11 with my local branch of QEMU,
which is 7.2 based, and also with Eric's QEMU[1], which is rebased on
8.2. The issue is that QEMU crashes at the very beginning. I am not sure
about the cause and have yet to debug it.

[1] https://github.com/eauger/qemu/tree/v8.2-nv

> What you describe looks like a terrible bug somewhere on the
> page-fault path that has the potential to impact non-NV, and I'd like
> to focus on that.

I found the bug/issue and fixed it.
The problem was quite random and was happening when booting L1 with a
large number of cores (200 to 300+).

I have implemented a fix (yet to be sent to the ML for review) for the
performance issue[2] caused by the unmapping of shadow tables: a lookup
table is used to unmap only the shadow IPAs that are actually mapped,
instead of unmapping the complete shadow S2 of all active NestedVMs.
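
To illustrate the idea, here is a minimal, self-contained C sketch. All
names in it (shadow_ipa_entry, shadow_ipa_record(),
shadow_ipa_unmap_range()) are hypothetical, not the actual KVM
structures or helpers; the point is only that the table records which
shadow IPA was populated for a given canonical IPA, so unmapping an L1
range only has to visit the recorded entries instead of tearing down
every shadow S2.

#include <stdint.h>
#include <stdlib.h>

struct shadow_ipa_entry {
        uint64_t l1_ipa;        /* IPA in the canonical (L0-managed) S2 */
        uint64_t shadow_ipa;    /* IPA as seen through the shadow S2 */
        void *shadow_mmu;       /* stand-in for a struct kvm_s2_mmu * */
        struct shadow_ipa_entry *next;
};

static struct shadow_ipa_entry *shadow_ipa_table;

/* Record a shadow mapping when the shadow S2 fault is handled. */
static void shadow_ipa_record(void *shadow_mmu, uint64_t shadow_ipa,
                              uint64_t l1_ipa)
{
        struct shadow_ipa_entry *e = malloc(sizeof(*e));

        if (!e)
                return;
        e->l1_ipa = l1_ipa;
        e->shadow_ipa = shadow_ipa;
        e->shadow_mmu = shadow_mmu;
        e->next = shadow_ipa_table;
        shadow_ipa_table = e;
}

/*
 * Unmap only the shadow translations covering the canonical range
 * [l1_ipa, l1_ipa + size) instead of tearing down every shadow S2 of
 * every active NestedVM.
 */
static void shadow_ipa_unmap_range(uint64_t l1_ipa, uint64_t size)
{
        struct shadow_ipa_entry **p = &shadow_ipa_table;

        while (*p) {
                struct shadow_ipa_entry *e = *p;

                if (e->l1_ipa >= l1_ipa && e->l1_ipa - l1_ipa < size) {
                        /* the actual S2 unmap of e->shadow_ipa in
                         * e->shadow_mmu would go here */
                        *p = e->next;
                        free(e);
                } else {
                        p = &e->next;
                }
        }
}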

This lookup table was not adding the mappings created for L1 while it is
using the shadow S2-MMU (my bad, I missed that L1 hops between vEL2 and
EL1 at the booting stage). Hence, when a page migration happened, the
unmap was not being done for those pages, resulting in some of the vCPUs
of L1 accessing stale pages/memory.

I have modified the check performed when adding a Shadow-IPA to PA
mapping to the lookup table, so that it now covers both cases: the page
being mapped for a NestedVM, and the page being mapped for L1 while it
is using a shadow S2 (a rough sketch of the check follows below).

[2] https://www.spinics.net/lists/kvm/msg326638.html
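
A rough sketch of the modified check, reusing the hypothetical
shadow_ipa_record() from the sketch above (the predicates
is_nested_guest and l1_on_shadow_s2 are likewise made-up names, not the
real KVM helpers; assumes <stdbool.h> is included):

static void handle_shadow_s2_fault(void *shadow_mmu, uint64_t shadow_ipa,
                                   uint64_t l1_ipa, bool is_nested_guest,
                                   bool l1_on_shadow_s2)
{
        /*
         * Previously only the NestedVM (L2) case was recorded, so pages
         * mapped for L1 through a shadow S2 were never tracked and thus
         * never unmapped on page migration, leaving stale translations.
         */
        if (is_nested_guest || l1_on_shadow_s2)
                shadow_ipa_record(shadow_mmu, shadow_ipa, l1_ipa);
}

With both cases recorded, the selective unmap above also catches the
pages L1 touched while hopping between vEL2 and EL1 during boot.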

> 
> I've been booting my L1 with a fairly large number of vcpus (32 vcpu
> for 6 physical CPUs), and I don't see this.
> 
> Since you seem to have a way to trigger it on your HW, can you please
> pinpoint the situation where we map the page without having the
> corresponding data?
> 
> Thanks,
> 
> 	M.
> 

Thanks,
Ganapat


