[PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE

Wed Oct 9 10:59:43 EDT 2013

On 09/10/13 15:50, Anup Patel wrote:
> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyngier at arm.com> wrote:
>> On 09/10/13 14:26, Gleb Natapov wrote:
>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>
>>>> On 07.10.2013, at 18:53, Gleb Natapov <gleb at redhat.com> wrote:
>>>>
>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyngier at arm.com> wrote:
>>>>>>
>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyngier at arm.com> wrote:
>>>>>>>>
>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>> running at all.
>>>>>>>>>
>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>
>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>
>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>
>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/o patch:  2.083s
>>>>>>>>> 4xA15 guest w/o patch:  80.212s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/ patch:   2.072s
>>>>>>>>> 4xA15 guest w/ patch:   3.202s
>>>>>>>>
>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>> better. I assume it's within jitter, but I'm still puzzled why you don't
>>>>>>>> see any significant drop in performance.
>>>>>>>
>>>>>>> The key is in the ARM ARM:
>>>>>>>
>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>> permit the processor to suspend execution."
>>>>>>>
>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>> quickly.
>>>>>>
>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have special hardware features to only ever exit after n number of turnarounds. I wonder why we have those when we could just as easily exit on every blocking path.
>>>>>>
>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>
>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. I'm not sure what exactly that means. Basically his logic is "if we spin, the holder must have been preempted". And it seems to work out surprisingly well.
>>
>> Yes. I basically assume that contention should be rare, and that ending
>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>> (no event is pending).
>>
>>>>
>>> For uncontended locks it makes sense. We need to recheck if the x86
>>> assumption is still true there, but the x86 lock is ticketing, which
>>> has not only the lock holder preemption problem but also the lock
>>> waiter preemption problem, which makes the overcommit problem even worse.
>>
>> Locks are ticketing on ARM as well. But there is one key difference here
>> with x86 (or at least what I understand of it, which is very close to
>> none): We only trap if we would have blocked anyway. In our case, it is
>> almost always better to give up the CPU to someone else rather than
>> waiting for some event to take the CPU out of sleep.
> 
> Benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
> 1. How spin lock is implemented in Guest OS? We cannot assume
>    that underlying Guest OS is always Linux.
> 2. How bad/good is spin

We do *not* spin. We *sleep*. So instead of taking a nap on a physical
CPU (which is slightly less than useful), we go and run some real
workload. If your guest OS is executing WFE (I'm not implying a lock
here), *and* that WFE is blocking, then I maintain it will be a gain in
the vast majority of the cases.
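
To make the condition concrete, here is a minimal sketch in C of the behaviour discussed above. It is a toy model, not the patch itself: `struct toy_vcpu`, `toy_wfe()` and the `wfe_result` values are hypothetical names invented for illustration. It assumes only what the ARM ARM quote states (with HCR.TWE set, a WFE traps only when it would otherwise suspend execution) plus the architectural rule that a pending event makes WFE return immediately and clears the event.

```c
#include <stdbool.h>

/* Hypothetical toy model; none of these names are kernel or KVM API. */
struct toy_vcpu {
    bool hcr_twe;        /* models HCR.TWE: trap blocking WFEs to Hyp mode */
    bool event_pending;  /* models the event register */
    unsigned yields;     /* times the hypervisor gave up the physical CPU */
};

enum wfe_result {
    WFE_CONTINUES,       /* event was pending: WFE does not block, no trap */
    WFE_SLEEPS_ON_CPU,   /* no trapping: the guest naps on the physical CPU */
    WFE_TRAPS            /* blocking WFE trapped: hypervisor yields the CPU */
};

static enum wfe_result toy_wfe(struct toy_vcpu *v)
{
    if (v->event_pending) {
        /* WFE consumes the pending event and returns at once. */
        v->event_pending = false;
        return WFE_CONTINUES;
    }

    /* From here on the WFE would block. */
    if (!v->hcr_twe)
        return WFE_SLEEPS_ON_CPU;

    /* Stand-in for the hypervisor scheduling some other vcpu instead. */
    v->yields++;
    return WFE_TRAPS;
}
```

In this model a WFE that finds an event pending never reaches the hypervisor, which is why the uncontended 2xA15 numbers barely move; only the blocking case costs a trap.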

> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE

Not until someone has shown me a (real) workload where this is actually
detrimental.

	M.
-- 
Jazz is not dead. It just smells funny...