[RFC] KVM: arm64: Don't force broadcast tlbi when guest is running

Thu Oct 22 21:26:00 EDT 2020

Hi Marc，

在 2020/10/22 20:03, Marc Zyngier 写道:
> On 2020-10-22 02:57, Shaokun Zhang wrote:
>> From: Nianyao Tang <tangnianyao at huawei.com>
>>
>> Now HCR_EL2.FB is set to force broadcast tlbi to all online pcpus,
>> even those vcpu do not resident on. It would get worse as system
>> gets larger and more pcpus get online.
>> Let's disable force-broadcast. We flush tlbi when move vcpu to a
>> new pcpu, in case old page mapping still exists on new pcpu.
>>
>> Cc: Marc Zyngier <maz at kernel.org>
>> Signed-off-by: Nianyao Tang <tangnianyao at huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun at hisilicon.com>
>> ---
>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>  arch/arm64/kvm/arm.c             | 4 +++-
>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 64ce29378467..f85ea9c649cb 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -75,7 +75,7 @@
>>   * PTW:        Take a stage2 fault if a stage1 walk steps in device memory
>>   */
>>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>> -             HCR_BSU_IS | HCR_FB | HCR_TAC | \
>> +             HCR_BSU_IS | HCR_TAC | \
>>               HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>>               HCR_FMO | HCR_IMO | HCR_PTW )
>>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index acf9a993dfb6..845be911f885 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>      /*
>>       * We might get preempted before the vCPU actually runs, but
>>       * over-invalidation doesn't affect correctness.
>> +     * Dirty tlb might still exist when vcpu ran on other pcpu
>> +     * and modified page mapping.
>>       */
>> -    if (*last_ran != vcpu->vcpu_id) {
>> +    if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>>          kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>>          *last_ran = vcpu->vcpu_id;
>>      }
> 
> This breaks uniprocessor semantics for I-cache invalidation. What could

Oops, It shall consider this.

> possibly go wrong? You also fail to provide any data that would back up
> your claim, as usual.

Test Unixbench spawn in each 4U8G vm on Kunpeng 920:
(1) 24 4U8G VMs, each vcpu is taskset to one pcpu to make sure
each vm seperated.
(2) 1 4U8G VM
Result:
(1) 800 * 24
(2) 3200
By restricting DVM broadcasting across NUMA, result (1) can
be improved to 2300 * 24. In spawn testcase, tlbiis used
in creating process.
Further, we consider restricting tlbi broadcast range and make
tlbi more accurate.
One scheme is replacing tlbiis with ipi + tlbi + HCR_EL2.FB=0.
Consider I-cache invalidation, KVM also needs to invalid icache
when moving vcpu to a new pcpu.
Do we miss any cases that need HCR_EL2.FB == 1?
Eventually we expect HCR_EL2.FB == 0 if it is possible.

Thanks,
Shaokun

> 
>         M.