[PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()
Nikita Kalyazin
kalyazin at amazon.com
Mon Feb 16 09:53:53 PST 2026
On 13/02/2026 23:20, Sean Christopherson wrote:
> On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
>>
>>
>> On 09/09/2025 11:00, Keir Fraser wrote:
>>> Device MMIO registration may happen quite frequently during VM boot,
>>> and the SRCU synchronization each time has a measurable effect
>>> on VM startup time. In our experiments it can account for around 25%
>>> of a VM's startup time.
>>>
>>> Replace the synchronization with a deferred free of the old kvm_io_bus
>>> structure.
>>
>>
>> Hi,
>>
>> We noticed that this change introduced a regression of ~20 ms to the first
>> KVM_CREATE_VCPU call of a VM, which is significant for our use case.
>>
>> Before the patch:
>> 45726 14:45:32.914330 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.000137>
>> 45726 14:45:32.914533 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000046>
>>
>> After the patch:
>> 30295 14:47:08.057412 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.025182>
>> 30295 14:47:08.082663 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000031>
>>
>> The reason, as I understand it, is that the call_srcu() invocations made
>> from kvm_io_bus_register_dev() add callbacks to be run after a normal
>> GP, which is 10 ms with HZ=100. The subsequent synchronize_srcu_expedited()
>> called from kvm_swap_active_memslots() (from KVM_CREATE_VCPU) then has to
>> wait for the normal GP to complete before making progress. I don't fully
>> understand why the delay is consistently greater than 1 GP, but that's
>> what we see across our testing scenarios.
>>
>> I verified that the problem is mitigated if the GP is shortened by
>> configuring HZ=1000. In that case, the regression is on the order of 1 ms.
>>
>> It looks like in our case we don't benefit much from the intended
>> optimisation, as the number of device MMIO registrations is limited and
>> they don't cost us much (each takes at most 16 us, but most commonly ~6 us):
>
> Maybe differences in platforms for arm64 vs x86?
Tested on ARM, and indeed the kvm_io_bus_register_dev() calls occur after
KVM_CREATE_VCPU, so the patch produces a visible optimisation there:
Without the patch (15-23 us per call):
firecracker 19916 [033] 404.518430: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
firecracker 19916 [033] 404.518446: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
firecracker 19916 [033] 404.518462: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.518495: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a198c)
firecracker 19916 [032] 404.518498: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [033] 404.518521: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a198c)
firecracker 19916 [033] 404.518524: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.518539: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a6d2c)
firecracker 19916 [032] 404.526900: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [033] 404.526924: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff800080060168)
firecracker 19916 [033] 404.526926: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
firecracker 19916 [032] 404.526941: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff800080060168)
fc_vcpu 0 19924 [035] 404.530829: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
fc_vcpu 0 19924 [035] 404.530848: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff80008009f6b4)
With the patch (1-6 us per call):
firecracker 22806 [032] 427.687157: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
firecracker 22806 [032] 427.687174: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
firecracker 22806 [032] 427.687193: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687196: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a19cc)
firecracker 22806 [032] 427.687196: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687197: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a19cc)
firecracker 22806 [032] 427.687201: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [032] 427.687202: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a6d6c)
firecracker 22806 [029] 427.707660: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [029] 427.707666: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800601a8)
firecracker 22806 [029] 427.707667: probe:kvm_io_bus_register_dev: (ffff80008005f128)
firecracker 22806 [029] 427.707668: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800601a8)
fc_vcpu 0 22829 [030] 427.711642: probe:kvm_io_bus_register_dev: (ffff80008005f128)
fc_vcpu 0 22829 [030] 427.711645: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff80008009f6f4)
Also, on ARM it is KVM_SET_USER_MEMORY_REGION (not KVM_CREATE_VCPU) that
takes the hit (but seemingly for the same reason):
45736 17:30:10.251430 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0x80000000, memory_size=12884901888, userspace_addr=0xfffcbedd6000}) = 0 <0.021021>
vs
30694 17:33:01.128985 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0x80000000, memory_size=12884901888, userspace_addr=0xfffc91fc9000}) = 0 <0.000016>
>
>> I am not aware of a way to make it fast for both use cases and would be
>> more than happy to hear about possible solutions.
>
> What if we key off of vCPUS being created? The motivation for Keir's change was
> to avoid stalling during VM boot, i.e. *after* initial VM creation.
It doesn't work as is on x86, because the delay we're seeing occurs after
created_vcpus gets incremented, so the check cannot differentiate between
the two cases (below is kvm_vm_ioctl_create_vcpu):
        kvm->created_vcpus++; // <===== incremented here
        mutex_unlock(&kvm->lock);

        vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
        if (!vcpu) {
                r = -ENOMEM;
                goto vcpu_decrement;
        }

        BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
        page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
        if (!page) {
                r = -ENOMEM;
                goto vcpu_free;
        }
        vcpu->run = page_address(page);

        kvm_vcpu_init(vcpu, kvm, id);

        r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here
firecracker 583 [001] 151.297145:
probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
6512de ioctl+0x32 (/mnt/host/firecracker)
d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)
Also, given that on ARM the stall happens after KVM_CREATE_VCPU (in
KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.
>
> --
> From: Sean Christopherson <seanjc at google.com>
> Date: Fri, 13 Feb 2026 15:15:01 -0800
> Subject: [PATCH] KVM: Synchronize SRCU on I/O device registration if vCPUs
> haven't been created
>
> TODO: Write a changelog if this works.
>
> Fixes: 7d9a0273c459 ("KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()")
> Reported-by: Nikita Kalyazin <kalyazin at amazon.com>
> Closes: https://lkml.kernel.org/r/a84ddba8-12da-489a-9dd1-ccdf7451a1ba%40amazon.com
> Cc: stable at vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc at google.com>
> ---
> virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 571cf0d6ec01..043b1c3574ab 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -6027,7 +6027,30 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> memcpy(new_bus->range + i + 1, bus->range + i,
> (bus->dev_count - i) * sizeof(struct kvm_io_range));
> rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> - call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> +
> + /*
> + * To optimize VM creation *and* boot time, use different tactics for
> + * safely freeing the old bus based on where the VM is at in its
> + * lifecycle. If vCPUs haven't yet been created, simply synchronize
> + * and free, as there are unlikely to be active SRCU readers; if not,
> + * defer freeing the bus via SRCU callback.
> + *
> + * If there are active SRCU readers, synchronizing will stall until the
> + * current grace period completes, which can meaningfully impact boot
> + * time for VMs that trigger a large number of registrations.
> + *
> + * If there aren't SRCU readers, using an SRCU callback can be a net
> + * negative due to starting a grace period of its own, which in turn
> + * can unnecessarily cause a future synchronization to stall. E.g. if
> + * devices are registered before memslots are created, then creating
> + * the first memslot will have to wait for a superfluous grace period.
> + */
> + if (!READ_ONCE(kvm->created_vcpus)) {
> + synchronize_srcu_expedited(&kvm->srcu);
> + kfree(bus);
> + } else {
> + call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> + }
>
> return 0;
> }
>
> base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
> --
More information about the linux-arm-kernel mailing list