[RFC PATCH 0/4] KVM: arm64: Improve efficiency of stage2 page table

Tue Feb 23 21:35:07 EST 2021

Hi Alex,

On 2021/2/23 23:55, Alexandru Elisei wrote:
> Hi Yanan,
>
> I wanted to review the patches, but unfortunately I get an error when trying to
> apply the first patch in the series:
>
> Applying: KVM: arm64: Move the clean of dcache to the map handler
> error: patch failed: arch/arm64/kvm/hyp/pgtable.c:464
> error: arch/arm64/kvm/hyp/pgtable.c: patch does not apply
> error: patch failed: arch/arm64/kvm/mmu.c:882
> error: arch/arm64/kvm/mmu.c: patch does not apply
> Patch failed at 0001 KVM: arm64: Move the clean of dcache to the map handler
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> When you have resolved this problem, run "git am --continue".
> If you prefer to skip this patch, run "git am --skip" instead.
> To restore the original branch and stop patching, run "git am --abort".
>
> Tried this with Linux tags v5.11-rc1 to v5.11-rc7. It looks like pgtable.c and
> mmu.c from your patch are different from what is found on upstream master. Did you
> use another branch as the base for your patches?
Thanks for taking a look.
Indeed, this series was more or less based on patches I posted earlier
(Link:
https://lore.kernel.org/r/20210114121350.123684-4-wangyanan55@huawei.com).
They have already been merged into the up-to-date upstream master
(commit: 509552e65ae8287178a5cdea2d734dcd2d6380ab), but are not in tags
v5.11-rc1 to v5.11-rc7.
Could you please try the newest upstream master (at or after commit
509552e65ae8287178a5cdea2d734dcd2d6380ab)? I have tested this locally
and the series applies without errors.
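
In case it helps, here is a rough sketch of how the base could be checked
before running git am. The helper names has_base and apply_series are only
illustrative, and the mbox path is whatever file the series was saved to;
neither comes from the series itself.

```shell
# Base commit named in this thread.
BASE=509552e65ae8287178a5cdea2d734dcd2d6380ab

# Return 0 only if the given commit is an ancestor of HEAD.
has_base() {
    git merge-base --is-ancestor "$1" HEAD
}

# Apply an mbox only when the base commit is present in the current branch.
# The mbox path is supplied by the caller; no file name is implied here.
apply_series() {
    if has_base "$BASE"; then
        git am "$1"
    else
        echo "HEAD does not contain $BASE; update to upstream master first" >&2
        return 1
    fi
}
```

After sourcing this inside the kernel tree, "apply_series <path-to-mbox>"
would refuse to start git am on a tree that lacks the base commit.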

Thanks,

Yanan.

> Thanks,
>
> Alex
>
> On 2/8/21 11:22 AM, Yanan Wang wrote:
>> Hi,
>>
>> This series makes some efficiency improvements to the stage2 page table code.
>> The test results below show the performance changes; they were measured with
>> a kvm selftest [1] that I have posted:
>> [1] https://lore.kernel.org/lkml/20210208090841.333724-1-wangyanan55@huawei.com/
>>
>> About patch 1:
>> We currently uniformly clean dcache in user_mem_abort() before calling the
>> fault handlers, if we take a translation fault and the pfn is cacheable.
>> But if there are concurrent translation faults on the same page or block,
>> only the first clean of dcache is necessary while the others are not.
>>
>> By moving clean of dcache to the map handler, we can easily identify the
>> conditions where CMOs are really needed and avoid the unnecessary ones.
>> As performing CMOs is time-consuming, especially when flushing a block
>> range, this solution greatly reduces the load on KVM and improves the
>> efficiency of creating mappings.
>>
>> Test results:
>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM creating block mappings time: 52.83s -> 3.70s
>> KVM recover block mappings time(after dirty-logging): 52.0s -> 2.87s
>>
>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM creating block mappings time: 104.56s -> 3.70s
>> KVM recover block mappings time(after dirty-logging): 103.93s -> 2.96s
>>
>> About patch 2, 3:
>> When KVM needs to coalesce the normal page mappings into a block mapping,
>> we currently invalidate the old table entry first followed by invalidation
>> of TLB, then unmap the page mappings, and finally install the block entry.
>>
>> It takes a long time to unmap the numerous page mappings, which means
>> the table entry will be left invalid for a long time before installation of
>> the block entry, and this will cause many spurious translation faults.
>>
>> So let's install the block entry first to ensure uninterrupted
>> memory access of the other vCPUs, and then unmap the page mappings after
>> installation. This will greatly shorten the time during which the table
>> entry is invalid, and avoid most of the unnecessary translation faults.
>>
>> Test results based on patch 1:
>> (1) when 20 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM recover block mappings time(after dirty-logging): 2.87s -> 0.30s
>>
>> (2) when 40 vCPUs concurrently access 20G ram (all 1G hugepages):
>> KVM recover block mappings time(after dirty-logging): 2.96s -> 0.35s
>>
>> So combined with patch 1, it makes a big difference to KVM creating mappings
>> and recovering block mappings, without much code change.
>>
>> About patch 4:
>> A new method to distinguish cases of memcache allocations is introduced.
>> By comparing fault_granule and vma_pagesize, cases that require allocations
>> from memcache and cases that don't can be distinguished completely.
>>
>> ---
>>
>> Details of test results
>> platform: HiSilicon Kunpeng920 (FWB not supported)
>> host kernel: Linux mainline (v5.11-rc6)
>>
>> (1) performance change of patch 1
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>> 	   (20 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_CREATE_MAPPINGS: 52.8338s 52.8327s 52.8336s 52.8255s 52.8303s
>> After  patch: KVM_CREATE_MAPPINGS:  3.7022s  3.7031s  3.7028s  3.7012s  3.7024s
>>
>> Before patch: KVM_ADJUST_MAPPINGS: 52.0466s 52.0473s 52.0550s 52.0518s 52.0467s
>> After  patch: KVM_ADJUST_MAPPINGS:  2.8787s  2.8781s  2.8785s  2.8742s  2.8759s
>>
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>> 	   (40 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_CREATE_MAPPINGS: 104.560s 104.556s 104.554s 104.556s 104.550s
>> After  patch: KVM_CREATE_MAPPINGS:  3.7011s  3.7103s  3.7005s  3.7024s  3.7106s
>>
>> Before patch: KVM_ADJUST_MAPPINGS: 103.931s 103.936s 103.927s 103.942s 103.927s
>> After  patch: KVM_ADJUST_MAPPINGS:  2.9621s  2.9648s  2.9474s  2.9587s  2.9603s
>>
>> (2) performance change of patch 2, 3(based on patch 1)
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 1
>> 	   (1 vcpu, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_ADJUST_MAPPINGS: 2.8241s 2.8234s 2.8245s 2.8230s 2.8652s
>> After  patch: KVM_ADJUST_MAPPINGS: 0.2444s 0.2442s 0.2423s 0.2441s 0.2429s
>>
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 20
>> 	   (20 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_ADJUST_MAPPINGS: 2.8787s 2.8781s 2.8785s 2.8742s 2.8759s
>> After  patch: KVM_ADJUST_MAPPINGS: 0.3008s 0.3004s 0.2974s 0.2917s 0.2900s
>>
>> cmdline: ./kvm_page_table_test -m 4 -t 2 -g 1G -s 20G -v 40
>> 	   (40 vcpus, 20G memory, block mappings(granule 1G))
>> Before patch: KVM_ADJUST_MAPPINGS: 2.9621s 2.9648s 2.9474s 2.9587s 2.9603s
>> After  patch: KVM_ADJUST_MAPPINGS: 0.3541s 0.3694s 0.3656s 0.3693s 0.3687s
>>
>> ---
>>
>> Yanan Wang (4):
>>    KVM: arm64: Move the clean of dcache to the map handler
>>    KVM: arm64: Add an independent API for coalescing tables
>>    KVM: arm64: Install the block entry before unmapping the page mappings
>>    KVM: arm64: Distinguish cases of memcache allocations completely
>>
>>   arch/arm64/include/asm/kvm_mmu.h | 16 -------
>>   arch/arm64/kvm/hyp/pgtable.c     | 82 +++++++++++++++++++++-----------
>>   arch/arm64/kvm/mmu.c             | 39 ++++++---------
>>   3 files changed, 69 insertions(+), 68 deletions(-)
>>
> .