[RFC][PATCH] arm64: tlb: call kvm_call_hyp once during kvm_tlb_flush_vmid_range

yezhenyu (A) yezhenyu2 at huawei.com
Thu Feb 12 04:02:33 PST 2026


Thanks for your review.

On 2026/2/9 22:35, Marc Zyngier wrote:
> On Mon, 09 Feb 2026 13:14:07 +0000,
> "yezhenyu (A)" <yezhenyu2 at huawei.com> wrote:
>>
>>  From 9982be89f55bd99b3683337223284f0011ed248e Mon Sep 17 00:00:00 2001
>> From: eillon <yezhenyu2 at huawei.com>
>> Date: Mon, 9 Feb 2026 19:48:46 +0800
>> Subject: [RFC][PATCH v1] arm64: tlb: call kvm_call_hyp once during
>>   kvm_tlb_flush_vmid_range
>>
>> The kvm_tlb_flush_vmid_range() function is performance-critical
>> during live migration, but on systems that support TLB flush by
>> range it loops whenever the size exceeds MAX_TLBI_RANGE_PAGES.
>>
>> This results in frequent entries into kvm_call_hyp(), and a large
> 
> What is the cost of kvm_call_hyp()?
> 

Most of the cost of kvm_tlb_flush_vmid_range() is in __tlb_switch_to_host(),
which is called on every __kvm_tlb_flush_vmid/__kvm_tlb_flush_vmid_range
hypercall.

>> amount of time is spent in kvm_clear_dirty_log_protect() during
>> migration (more than 50%).
> 
> 50% of what time? The guest's run-time? The time spent doing TLBIs
> compared to the time spent in kvm_clear_dirty_log_protect()?
> 

kvm_clear_dirty_log_protect() accounts for more than 50% of the time
spent in ram_find_and_save_block(), though not on every iteration.
I captured a flame graph during the live migration; the distribution
of several key functions is as follows (sorry, I cannot transfer the
SVG files outside my company):

     ram_find_and_save_block(): 84.01%
         memory_region_clear_dirty_bitmap(): 33.40%
             kvm_clear_dirty_log_protect(): 26.74%
                 kvm_arch_flush_remote_tlbs_range(): 9.67%
                     __tlb_switch_to_host(): 9.51%
                 kvm_arch_mmu_enable_log_dirty_pt_masked(): 9.38%
         ram_save_target_page_legacy(): 43.41%

memory_region_clear_dirty_bitmap() accounts for about 40% of
ram_find_and_save_block(), and kvm_arch_flush_remote_tlbs_range()
for about 29% of memory_region_clear_dirty_bitmap().

After the patch is applied, the distribution of the same functions is
as follows:

     ram_find_and_save_block(): 53.84%
         memory_region_clear_dirty_bitmap(): 2.28%
             kvm_clear_dirty_log_protect(): 1.75%
                 kvm_arch_flush_remote_tlbs_range(): 0.03%
                     __tlb_switch_to_host(): 0.03%
                 kvm_arch_mmu_enable_log_dirty_pt_masked(): 0.96%
         ram_save_target_page_legacy(): 38.97%

memory_region_clear_dirty_bitmap() now accounts for about 4% of
ram_find_and_save_block(), and kvm_arch_flush_remote_tlbs_range()
for about 1% of memory_region_clear_dirty_bitmap().

>> So, when the address range is larger than
>> MAX_TLBI_RANGE_PAGES, directly call __kvm_tlb_flush_vmid to
>> optimize performance.
> 
> Multiple things here:
> 
> - there is no SoB, which means that patch cannot be considered for
>    merging
> 
If there are no other issues with this patch, I can resend it with the
SoB (Signed-off-by) tag.


> - there is no data showing how this change improves the situation for
>    a large enough set of workloads
> 
> - there is no description of a test that could be run on multiple
>    implementations to check whether this change has a positive or
>    negative impact

This patch affects the migration bandwidth during live migration.
With the same physical bandwidth, the effect of this optimization
can be observed by monitoring the actual live-migration bandwidth.

I tested this in an RDMA-like environment with a physical bandwidth
of about 100 GBps; without this patch, the migration bandwidth stays
below 10 GBps, and with this patch applied it can reach 50 GBps.

> 
> If you want to progress this sort of things, you will need to address
> these points.
> 
> Thanks,
> 
> 	M.
> 
