[RFC] KVM: arm64: Don't force broadcast tlbi when guest is running

Shaokun Zhang zhangshaokun at hisilicon.com
Tue Nov 3 06:31:38 EST 2020


Hi Marc,

Apologies for the late reply; I was out of the office for some days.

On 2020/10/23 17:17, Marc Zyngier wrote:
> On 2020-10-23 02:26, Shaokun Zhang wrote:
>> Hi Marc,
>>
>> On 2020/10/22 20:03, Marc Zyngier wrote:
>>> On 2020-10-22 02:57, Shaokun Zhang wrote:
>>>> From: Nianyao Tang <tangnianyao at huawei.com>
>>>>
>>>> Currently HCR_EL2.FB is set, forcing broadcast of TLBIs to all
>>>> online pCPUs, even those the vCPUs do not reside on. This gets
>>>> worse as the system grows larger and more pCPUs come online.
>>>> Let's disable forced broadcast and instead flush the local TLB
>>>> when a vCPU moves to a new pCPU, in case stale page mappings
>>>> still exist on the new pCPU.
>>>>
>>>> Cc: Marc Zyngier <maz at kernel.org>
>>>> Signed-off-by: Nianyao Tang <tangnianyao at huawei.com>
>>>> Signed-off-by: Shaokun Zhang <zhangshaokun at hisilicon.com>
>>>> ---
>>>>  arch/arm64/include/asm/kvm_arm.h | 2 +-
>>>>  arch/arm64/kvm/arm.c             | 4 +++-
>>>>  2 files changed, 4 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>>>> index 64ce29378467..f85ea9c649cb 100644
>>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>>> @@ -75,7 +75,7 @@
>>>>   * PTW:        Take a stage2 fault if a stage1 walk steps in device memory
>>>>   */
>>>>  #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWE | HCR_TWI | HCR_VM | \
>>>> -             HCR_BSU_IS | HCR_FB | HCR_TAC | \
>>>> +             HCR_BSU_IS | HCR_TAC | \
>>>>               HCR_AMO | HCR_SWIO | HCR_TIDCP | HCR_RW | HCR_TLOR | \
>>>>               HCR_FMO | HCR_IMO | HCR_PTW )
>>>>  #define HCR_VIRT_EXCP_MASK (HCR_VSE | HCR_VI | HCR_VF)
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index acf9a993dfb6..845be911f885 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -334,8 +334,10 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>      /*
>>>>       * We might get preempted before the vCPU actually runs, but
>>>>       * over-invalidation doesn't affect correctness.
>>>> +     * Stale TLB entries might still exist if the vCPU last ran
>>>> +     * on another pCPU and modified the page mappings.
>>>>       */
>>>> -    if (*last_ran != vcpu->vcpu_id) {
>>>> +    if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
>>>>          kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
>>>>          *last_ran = vcpu->vcpu_id;
>>>>      }
>>>
>>> This breaks uniprocessor semantics for I-cache invalidation. What could
>>
>> Oops, we should have considered this.
>>
>>> possibly go wrong? You also fail to provide any data that would back up
>>> your claim, as usual.
>>
>> We tested UnixBench spawn in 4U8G VMs on a Kunpeng 920:
>> (1) 24 4U8G VMs, with each vCPU pinned to one pCPU via taskset
>> to keep the VMs separate.
> 
> That's a very narrow use case.
> 
>> (2) 1 4U8G VM
>> Result:
>> (1) 800 * 24
>> (2) 3200
> 
> I don't know what these numbers are.

These are UnixBench scores; higher is better.

> 
>> By restricting DVM broadcast across NUMA nodes, result (1) can
>> be improved to 2300 * 24. In the spawn test case, TLBI IS is
>> used when creating processes.
> 
> Linux always uses TLBI IS, except in some very rare cases.
> If your HW broadcasts invalidations across the whole system
> without any filtering, it is bound to be bad.
> 
>> Further, we are considering restricting the TLBI broadcast range
>> to make invalidation more precise.
>> One scheme is to replace TLBI IS with IPI + local TLBI + HCR_EL2.FB=0.
> 
> Ah. That old thing again. No, thank you. The right thing to do
> is to build CPUs and interconnects that properly deal with
> IS invalidations by not sending DVM messages to places where
> things don't run (snoop filters?). I also doubt the arm64
> maintainers would be sympathetic to the idea anyway.

We also ran the same test on an Intel Xeon 6248 and saw better
results: less performance drop in the multi-VM case compared with
the single-VM case. x86 uses an IPI + local TLB flush scheme there;
do you think such a scheme is meaningful on Arm architecture hardware?

> 
>> Regarding I-cache invalidation, KVM would also need to invalidate
>> the I-cache when moving a vCPU to a new pCPU.
>> Are we missing any other cases that need HCR_EL2.FB == 1?
>> Eventually we hope to run with HCR_EL2.FB == 0 if possible.
> 
> I have no intention to ever set FB to 0, as this would result
> in over-invalidation in the general case, and worse performance
> on well designed systems.

The reason we want to disable FB is that the IPI-based solution
would require it, if that approach is accepted.
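
If FB were really cleared, the I-cache point above would need handling
as well. A minimal, untested sketch of kvm_arch_vcpu_load() with a
local I-cache invalidation added; the "ic iallu" placement and the
barriers are only our assumption, not part of the posted patch:

	if (*last_ran != vcpu->vcpu_id || vcpu->cpu != cpu) {
		kvm_call_hyp(__kvm_tlb_flush_local_vmid, mmu);
		/*
		 * With FB == 0 the guest's local IC ops are no longer
		 * upgraded to broadcast, so instructions cached on this
		 * pCPU may be stale after a migration: drop them all.
		 */
		asm volatile("ic iallu");
		dsb(nsh);
		isb();
		*last_ran = vcpu->vcpu_id;
	}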

Thanks,
Shaokun

>         M.


