[PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE
Anup Patel
anup at brainfault.org
Wed Oct 9 11:17:36 EDT 2013
On Wed, Oct 9, 2013 at 8:40 PM, Anup Patel <anup at brainfault.org> wrote:
> On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <marc.zyngier at arm.com> wrote:
>> On 09/10/13 15:50, Anup Patel wrote:
>>> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier at arm.com> wrote:
>>>> On 09/10/13 14:26, Gleb Natapov wrote:
>>>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb at redhat.com> wrote:
>>>>>>
>>>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier at arm.com> wrote:
>>>>>>>>
>>>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier at arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>>>> running at all.
>>>>>>>>>>>
>>>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>>>
>>>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>>>
>>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 host (baseline): 1.843s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/o patch: 2.083s
>>>>>>>>>>> 4xA15 guest w/o patch: 80.212s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/ patch: 2.072s
>>>>>>>>>>> 4xA15 guest w/ patch: 3.202s
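For readers following along, a rough sketch of the kind of exit handler being
described here (the function name kvm_handle_wfx and the HSR_WFI_IS_WFE bit
used to tell WFE from WFI are illustrative guesses, not necessarily what the
patch itself uses):

/*
 * Sketch: handle a trapped WFI/WFE from the guest. A WFE that trapped
 * would have blocked anyway (see the ARM ARM quote further down), so the
 * vcpu is effectively busy-waiting; hint this to the scheduler so other
 * vcpus get to run instead of napping on the physical CPU.
 */
static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run)
{
        if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
                kvm_vcpu_on_spin(vcpu);   /* yield: we are spinning on something */
        else
                kvm_vcpu_block(vcpu);     /* WFI: genuinely wait for an interrupt */

        return 1;
}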
>>>>>>>>>>
>>>>>>>>>> I'm confused. You went from 2.083s when not exiting on spin locks to
>>>>>>>>>> 2.072s when exiting on _every_ spin lock that didn't immediately
>>>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>>>> better. I assume it's within jitter, but I'm still puzzled why you don't
>>>>>>>>>> see any significant drop in performance.
>>>>>>>>>
>>>>>>>>> The key is in the ARM ARM:
>>>>>>>>>
>>>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>>>> permit the processor to suspend execution."
>>>>>>>>>
>>>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>>>> quickly.
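In other words, the trap only fires when the WFE would genuinely have put the
core to sleep. A sketch of how that trapping is switched on (the bit position
follows the ARMv7 HCR layout; the macro and field names here are assumptions,
not necessarily what the patch does):

/*
 * HCR.TWE (bit 14 of the ARMv7 Hyp Configuration Register) traps
 * Non-secure WFE to Hyp mode, but per B1.14.9 only when the WFE
 * would otherwise have suspended execution (no event pending).
 */
#define HCR_TWE         (1U << 14)

static void enable_wfe_trapping(struct kvm_vcpu *vcpu)
{
        vcpu->arch.hcr |= HCR_TWE;      /* loaded into HCR on guest entry */
}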
>>>>>>>>
>>>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
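For comparison, x86 pause-loop exiting is tuned with a gap/window pair so a
PAUSE loop only causes an exit after it has spun for a while. A sketch along
the lines of the VMX setup (the default values shown are assumptions, and only
the PLE_GAP/PLE_WINDOW programming is shown):

/*
 * Sketch of x86/VMX pause-loop-exiting setup: the guest exits only once
 * PAUSE instructions have been executed back-to-back (within ple_gap TSC
 * cycles of each other) for longer than ple_window cycles in total, so
 * short, successful spins never leave the guest.
 */
static unsigned int ple_gap = 128;      /* typical default, an assumption here */
static unsigned int ple_window = 4096;  /* typical default, an assumption here */

static void setup_pause_loop_exiting(void)
{
        vmcs_write32(PLE_GAP, ple_gap);
        vmcs_write32(PLE_WINDOW, ple_window);
}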
>>>>>>>>
>>>>>>> It will hurt performance if the vcpu that holds the lock is running.
>>>>>>
>>>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>>>
>>>> Yes. I basically assume that contention should be rare, and that ending
>>>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>>>> (no event is pending).
>>>>
>>>>>>
>>>>> For non-contended locks it makes sense. We need to recheck if the x86
>>>>> assumption is still true there, but the x86 lock is ticketing, which
>>>>> has not only the lock holder preemption problem, but also the lock waiter
>>>>> preemption problem, which makes the overcommit problem even worse.
>>>>
>>>> Locks are ticketing on ARM as well. But there is one key difference here
>>>> with x86 (or at least what I understand of it, which is very close to
>>>> none): We only trap if we would have blocked anyway. In our case, it is
>>>> almost always better to give up the CPU to someone else rather than
>>>> waiting for some event to take the CPU out of sleep.
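A simplified C sketch of an ARM-style ticket lock (illustrative only, not the
kernel's actual arch_spin_lock, which is written in assembly) shows why the
trap never hits the uncontended path: the WFE only runs once the waiter has
found someone else's ticket being served:

/*
 * Simplified ticket lock, illustrative only: the WFE is reached only
 * when the lock is already held by someone else, i.e. exactly when the
 * vcpu would otherwise block, so trapping it costs nothing on the
 * uncontended fast path.
 */
struct ticket_lock {
        unsigned int next;      /* next ticket to hand out */
        unsigned int owner;     /* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *l)
{
        unsigned int ticket = __atomic_fetch_add(&l->next, 1, __ATOMIC_ACQUIRE);

        while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != ticket)
                asm volatile("wfe" ::: "memory");  /* sleeps; traps to Hyp if HCR.TWE is set */
}

static void ticket_unlock(struct ticket_lock *l)
{
        __atomic_fetch_add(&l->owner, 1, __ATOMIC_RELEASE);
        asm volatile("sev" ::: "memory");          /* wake waiters parked in WFE */
}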
>>>
>>> The benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
>>> 1. How the spin lock is implemented in the Guest OS? We cannot assume
>>> that the underlying Guest OS is always Linux.
>>> 2. How bad/good is spin
>>
>> We do *not* spin. We *sleep*. So instead of taking a nap on a physical
>> CPU (which is slightly less than useful), we go and run some real
>> workload. If your guest OS is executing WFE (I'm not implying a lock
>> here), *and* that WFE is blocking, then I maintain it will be a gain in
>> the vast majority of the cases.
>
> What if VCPU A was about to release the lock and VCPU B tries to grab the
> same lock? In this case VCPU B gets yielded due to the WFE, causing an
> unnecessary delay for VCPU B in acquiring the lock. This situation can
> happen quite often because spin locks are generally used to protect
> very small portions of code.
It will be interesting to see what hackbench numbers you get if you
don't restrict all Guest VCPUs to the same Host CPU. Let's say a Guest
with 8 VCPUs running on a Host (with > 2 CPUs).
>
>>
>>> It would be good if we could enable/disable "Yield CPU when vcpu executes a WFE
>>
>> Not until someone has shown me a (real) workload where this is actually
>> detrimental.
>
> The gains from "Yield CPU when vcpu executes a WFE" are not significant,
> and we don't see a consistent improvement when it is tried multiple times. Please
> look at the numbers you reported for multiple runs. Because of this it makes
> more sense to have a Kconfig option for this.
>
> --Anup
>
>>
>> M.
>> --
>> Jazz is not dead. It just smells funny...
>>