[v3 0/6] KVM: arm64: implement vcpu_is_preempted check

Tue Feb 14 08:06:26 PST 2023

On 17/01/2023 10:29, Usama Arif wrote:
> This patchset adds support for vcpu_is_preempted in arm64, which allows the guest
> to check if a vcpu was scheduled out, which is useful to know incase it was
> holding a lock. vcpu_is_preempted is well integrated in core kernel code and can
> be used to improve performance in locking (owner_on_cpu usage in mutex_spin_on_owner,
> mutex_can_spin_on_owner, rtmutex_spin_on_owner and osq_lock) and scheduling
> (available_idle_cpu which is used in several places in kernel/sched/fair.c
> for e.g. in wake_affine to determine which CPU can run soonest).
> 
> This patchset shows significant improvement on overcommitted hosts (vCPUs > pCPUS),
> as waiting for preempted vCPUs reduces performance.
> 

Hi,

Just wanted to check if there are any comments for this?

Thanks,
Usama

> If merged, vcpu_is_preempted could also be used to optimize IPI performance (along
> with directed yield to target IPI vCPU) similar to how its done in x86
> (https://lore.kernel.org/all/1560255830-8656-2-git-send-email-wanpengli@tencent.com/)
> 
> All the results in the below experiments are done on an aws r6g.metal instance
> which has 64 pCPUs.
> 
> The following table shows the index results of UnixBench running on a 128 vCPU VM
> with (6.0+vcpu_is_preempted) and without (6.0 base) the patchset.
> TestName                                6.0 base    6.0+vcpu_is_preempted      % improvement for vcpu_is_preempted
> Dhrystone 2 using register variables    187761      191274.7                   1.871368389
> Double-Precision Whetstone              96743.6     98414.4                    1.727039308
> Execl Throughput                        689.3       10426                      1412.548963
> File Copy 1024 bufsize 2000 maxblocks   549.5       3165                       475.978162
> File Copy 256 bufsize 500 maxblocks     400.7       2084.7                     420.2645371
> File Copy 4096 bufsize 8000 maxblocks   894.3       5003.2                     459.4543218
> Pipe Throughput                         76819.5     78601.5                    2.319723508
> Pipe-based Context Switching            3444.8      13414.5                    289.4130283
> Process Creation                        301.1       293.4                      -2.557289937
> Shell Scripts (1 concurrent)            1248.1      28300.6                    2167.494592
> Shell Scripts (8 concurrent)            781.2       26222.3                    3256.669227
> System Call Overhead                    3426        3729.4                     8.855808523
> 
> System Benchmarks Index Score           3053        11534                      277.7923354
> 
> This shows a 278% overall improvement using these patches.
> 
> The biggest improvement is in the shell scripts benchmark, which forks a lot of processes.
> This acquires rwsem lock where a large chunk of time is spent in base kernel.
> This can be seen from one of the callstack of the perf output of the shell
> scripts benchmark on base (pseudo NMI enabled for perf numbers below):
> - 33.79% el0_svc
>     - 33.43% do_el0_svc
>        - 33.43% el0_svc_common.constprop.3
>           - 33.30% invoke_syscall
>              - 17.27% __arm64_sys_clone
>                 - 17.27% __do_sys_clone
>                    - 17.26% kernel_clone
>                       - 16.73% copy_process
>                          - 11.91% dup_mm
>                             - 11.82% dup_mmap
>                                - 9.15% down_write
>                                   - 8.87% rwsem_down_write_slowpath
>                                      - 8.48% osq_lock
> 
> Just under 50% of the total time in the shell script benchmarks ends up being
> spent in osq_lock in the base kernel:
>    Children      Self  Command   Shared Object        Symbol
>     17.19%    10.71%  sh      [kernel.kallsyms]  [k] osq_lock
>      6.17%     4.04%  sort    [kernel.kallsyms]  [k] osq_lock
>      4.20%     2.60%  multi.  [kernel.kallsyms]  [k] osq_lock
>      3.77%     2.47%  grep    [kernel.kallsyms]  [k] osq_lock
>      3.50%     2.24%  expr    [kernel.kallsyms]  [k] osq_lock
>      3.41%     2.23%  od      [kernel.kallsyms]  [k] osq_lock
>      3.36%     2.15%  rm      [kernel.kallsyms]  [k] osq_lock
>      3.28%     2.12%  tee     [kernel.kallsyms]  [k] osq_lock
>      3.16%     2.02%  wc      [kernel.kallsyms]  [k] osq_lock
>      0.21%     0.13%  looper  [kernel.kallsyms]  [k] osq_lock
>      0.01%     0.00%  Run     [kernel.kallsyms]  [k] osq_lock
> 
> and this comes down to less than 1% total with 6.0+vcpu_is_preempted kernel:
>    Children      Self  Command   Shared Object        Symbol
>       0.26%     0.21%  sh      [kernel.kallsyms]  [k] osq_lock
>       0.10%     0.08%  multi.  [kernel.kallsyms]  [k] osq_lock
>       0.04%     0.04%  sort    [kernel.kallsyms]  [k] osq_lock
>       0.02%     0.01%  grep    [kernel.kallsyms]  [k] osq_lock
>       0.02%     0.02%  od      [kernel.kallsyms]  [k] osq_lock
>       0.01%     0.01%  tee     [kernel.kallsyms]  [k] osq_lock
>       0.01%     0.00%  expr    [kernel.kallsyms]  [k] osq_lock
>       0.01%     0.01%  looper  [kernel.kallsyms]  [k] osq_lock
>       0.00%     0.00%  wc      [kernel.kallsyms]  [k] osq_lock
>       0.00%     0.00%  rm      [kernel.kallsyms]  [k] osq_lock
> 
> To make sure, there is no change in performance when vCPUs < pCPUs, UnixBench
> was run on a 32 CPU VM. The kernel with vcpu_is_preempted implemented
> performed 0.9% better overall than base kernel, and the individual benchmarks
> were within +/-2% improvement over 6.0 base.
> Hence the patches have no negative affect when vCPUs < pCPUs.
> 
> The respective QEMU change to test this is at
> https://github.com/uarif1/qemu/commit/2da2c2927ae8de8f03f439804a0dad9cf68501b6.
> 
> Looking forward to your response!
> Thanks,
> Usama
> ---
> v2->v3
> - Updated the patchset from 6.0 to 6.2-rc3
> - Made pv_lock_init an early_initcall
> - Improved documentation
> - Changed pvlock_vcpu_state to aligned struct
> - Minor improvevments
> 
> RFC->v2
> - Fixed table and code referencing in pvlock documentation
> - Switched to using a single hypercall similar to ptp_kvm and made check
>    for has_kvm_pvlock simpler
> 
> Usama Arif (6):
>    KVM: arm64: Document PV-lock interface
>    KVM: arm64: Add SMCCC paravirtualised lock calls
>    KVM: arm64: Support pvlock preempted via shared structure
>    KVM: arm64: Provide VCPU attributes for PV lock
>    KVM: arm64: Support the VCPU preemption check
>    KVM: selftests: add tests for PV time specific hypercall
> 
>   Documentation/virt/kvm/arm/hypercalls.rst     |   3 +
>   Documentation/virt/kvm/arm/index.rst          |   1 +
>   Documentation/virt/kvm/arm/pvlock.rst         |  54 +++++++++
>   Documentation/virt/kvm/devices/vcpu.rst       |  25 ++++
>   arch/arm64/include/asm/kvm_host.h             |  25 ++++
>   arch/arm64/include/asm/paravirt.h             |   2 +
>   arch/arm64/include/asm/pvlock-abi.h           |  15 +++
>   arch/arm64/include/asm/spinlock.h             |  16 ++-
>   arch/arm64/include/uapi/asm/kvm.h             |   3 +
>   arch/arm64/kernel/paravirt.c                  | 113 ++++++++++++++++++
>   arch/arm64/kvm/Makefile                       |   2 +-
>   arch/arm64/kvm/arm.c                          |   8 ++
>   arch/arm64/kvm/guest.c                        |   9 ++
>   arch/arm64/kvm/hypercalls.c                   |   8 ++
>   arch/arm64/kvm/pvlock.c                       | 100 ++++++++++++++++
>   include/linux/arm-smccc.h                     |   8 ++
>   include/uapi/linux/kvm.h                      |   2 +
>   tools/arch/arm64/include/uapi/asm/kvm.h       |   1 +
>   tools/include/linux/arm-smccc.h               |   8 ++
>   .../selftests/kvm/aarch64/hypercalls.c        |   2 +
>   20 files changed, 403 insertions(+), 2 deletions(-)
>   create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
>   create mode 100644 arch/arm64/include/asm/pvlock-abi.h
>   create mode 100644 arch/arm64/kvm/pvlock.c
>