[External] Re: [v2 0/6] KVM: arm64: implement vcpu_is_preempted check

Mon Dec 5 05:43:12 PST 2022

On 24/11/2022 13:55, Usama Arif wrote:
> 
> 
> On 18/11/2022 00:20, Marc Zyngier wrote:
>> On Mon, 07 Nov 2022 12:00:44 +0000,
>> Usama Arif <usama.arif at bytedance.com> wrote:
>>>
>>>
>>>
>>> On 06/11/2022 16:35, Marc Zyngier wrote:
>>>> On Fri, 04 Nov 2022 06:20:59 +0000,
>>>> Usama Arif <usama.arif at bytedance.com> wrote:
>>>>>
>>>>> This patchset adds support for vcpu_is_preempted in arm64, which
>>>>> allows the guest to check if a vcpu was scheduled out, which is
>>>>> useful to know incase it was holding a lock. vcpu_is_preempted can
>>>>> be used to improve performance in locking (see owner_on_cpu usage in
>>>>> mutex_spin_on_owner, mutex_can_spin_on_owner, rtmutex_spin_on_owner
>>>>> and osq_lock) and scheduling (see available_idle_cpu which is used
>>>>> in several places in kernel/sched/fair.c for e.g. in wake_affine to
>>>>> determine which CPU can run soonest):
>>>>
>>>> [...]
>>>>
>>>>> pvcy shows a smaller overall improvement (50%) compared to
>>>>> vcpu_is_preempted (277%).  Host side flamegraph analysis shows that
>>>>> ~60% of the host time when using pvcy is spent in kvm_handle_wfx,
>>>>> compared with ~1.5% when using vcpu_is_preempted, hence
>>>>> vcpu_is_preempted shows a larger improvement.
>>>>
>>>> And have you worked out *why* we spend so much time handling WFE?
>>>>
>>>>     M.
>>>
>>> Its from the following change in pvcy patchset:
>>>
>>> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
>>> index e778eefcf214..915644816a85 100644
>>> --- a/arch/arm64/kvm/handle_exit.c
>>> +++ b/arch/arm64/kvm/handle_exit.c
>>> @@ -118,7 +118,12 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
>>>          }
>>>
>>>          if (esr & ESR_ELx_WFx_ISS_WFE) {
>>> -               kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>> +               int state;
>>> +               while ((state = kvm_pvcy_check_state(vcpu)) == 0)
>>> +                       schedule();
>>> +
>>> +               if (state == -1)
>>> +                       kvm_vcpu_on_spin(vcpu, vcpu_mode_priv(vcpu));
>>>          } else {
>>>                  if (esr & ESR_ELx_WFx_ISS_WFxT)
>>>                          vcpu_set_flag(vcpu, IN_WFIT);
>>>
>>>
>>> If my understanding is correct of the pvcy changes, whenever pvcy
>>> returns an unchanged vcpu state, we would schedule to another
>>> vcpu. And its the constant scheduling where the time is spent. I guess
>>> the affects are much higher when the lock contention is very
>>> high. This can be seem from the pvcy host side flamegraph as well with
>>> (~67% of the time spent in the schedule() call in kvm_handle_wfx), For
>>> reference, I have put the graph at:
>>> https://uarif1.github.io/pvlock/perf_host_pvcy_nmi.svg
>>
>> The real issue here is that we don't try to pick the right vcpu to
>> run, and strictly rely on schedule() to eventually pick something that
>> can run.
>>
>> An interesting to do would be to try and fit the directed yield
>> mechanism there. It would be a lot more interesting than the one-off
>> vcpu_is_preempted hack, as it gives us a low-level primitive on which
>> to construct things (pvcy is effectively a mwait-like primitive).
> 
> We could use kvm_vcpu_yield_to to yield to a specific vcpu, but how 
> would we determine which vcpu to yield to?
> 
> IMO vcpu_is_preempted is very well integrated in a lot of core kernel 
> code, i.e. mutex, rtmutex, rwsem and osq_lock. It is also used in 
> scheduler to determine better which vCPU we can run on soonest, select 
> idle core, etc. I am not sure if all of these cases will be optimized by 
> pvcy? Also, with vcpu_is_preempted, some of the lock heavy benchmarks 
> come down from spending around 50% of the time in lock to less than 1% 
> (so not sure how much more room is there for improvement).
> 
> We could also use vcpu_is_preempted to optimize IPI performance (along 
> with directed yield to target IPI vCPU) similar to how its done in x86 
> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/). 
> This case definitely wont be covered by pvcy.
> 
> Considering all the above, i.e. the core kernel integration already 
> present and possible future usecases of vcpu_is_preempted, maybe its 
> worth making vcpu_is_preempted work on arm independently of pvcy?
> 

Hi,

Just wanted to check if there are any comments on above? I can send a v3 
with the doc and code fixes suggested in the earlier reviews if it makes 
sense?

Thanks,
Usama

> Thanks,
> Usama
> 
>>
>>     M.
>>