[PATCH v4 4/4] KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()

Nikita Kalyazin kalyazin at amazon.com
Mon Feb 16 09:53:53 PST 2026



On 13/02/2026 23:20, Sean Christopherson wrote:
> On Fri, Feb 13, 2026, Nikita Kalyazin wrote:
>>
>>
>> On 09/09/2025 11:00, Keir Fraser wrote:
>>> Device MMIO registration may happen quite frequently during VM boot,
>>> and the SRCU synchronization each time has a measurable effect
>>> on VM startup time. In our experiments it can account for around 25%
>>> of a VM's startup time.
>>>
>>> Replace the synchronization with a deferred free of the old kvm_io_bus
>>> structure.
>>
>>
>> Hi,
>>
>> We noticed that this change introduced a regression of ~20 ms to the first
>> KVM_CREATE_VCPU call of a VM, which is significant for our use case.
>>
>> Before the patch:
>> 45726 14:45:32.914330 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.000137>
>> 45726 14:45:32.914533 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000046>
>>
>> After the patch:
>> 30295 14:47:08.057412 ioctl(25, KVM_CREATE_VCPU, 0) = 28 <0.025182>
>> 30295 14:47:08.082663 ioctl(25, KVM_CREATE_VCPU, 1) = 30 <0.000031>
>>
>> The reason this happens, as I understand it, is that call_srcu(), invoked
>> from kvm_io_bus_register_dev(), queues callbacks to run after a normal
>> GP, which is 10 ms with HZ=100.  The subsequent synchronize_srcu_expedited()
>> called from kvm_swap_active_memslots() (from KVM_CREATE_VCPU) then has to
>> wait for the normal GP to complete before making progress.  I don't fully
>> understand why the delay is consistently greater than 1 GP, but that's what
>> we see across our testing scenarios.
>>
>> I verified that the problem is mitigated if the GP is shortened by
>> configuring HZ=1000.  In that case, the regression is on the order of 1 ms.
>>
>> It looks like in our case we don't benefit much from the intended
>> optimisation, as the number of device MMIO registrations is limited and
>> they don't cost us much (each takes at most 16 us, but most commonly ~6 us):
> 
> Maybe differences in platforms for arm64 vs x86?

Tested on ARM, and indeed the kvm_io_bus_register_dev() calls occur after 
KVM_CREATE_VCPU, and the patch produces a visible optimisation:

Without the patch (15-23 us per call):

      firecracker 19916 [033]   404.518430: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
      firecracker 19916 [033]   404.518446: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b18)
      firecracker 19916 [033]   404.518462: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
      firecracker 19916 [032]   404.518495: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a198c)
      firecracker 19916 [032]   404.518498: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
      firecracker 19916 [033]   404.518521: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a198c)
      firecracker 19916 [033]   404.518524: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
      firecracker 19916 [032]   404.518539: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff8000800a6d2c)
      firecracker 19916 [032]   404.526900: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
      firecracker 19916 [033]   404.526924: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff800080060168)
      firecracker 19916 [033]   404.526926: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
      firecracker 19916 [032]   404.526941: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff800080060168)
        fc_vcpu 0 19924 [035]   404.530829: probe:kvm_io_bus_register_dev: (ffff80008005f0e8)
        fc_vcpu 0 19924 [035]   404.530848: probe:kvm_io_bus_register_dev__return: (ffff80008005f0e8 <- ffff80008009f6b4)

With the patch (1-6 us per call):

      firecracker 22806 [032]   427.687157: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
      firecracker 22806 [032]   427.687174: probe:kvm_vm_ioctl_create_vcpu: (ffff800080059b38)
      firecracker 22806 [032]   427.687193: probe:kvm_io_bus_register_dev: (ffff80008005f128)
      firecracker 22806 [032]   427.687196: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a19cc)
      firecracker 22806 [032]   427.687196: probe:kvm_io_bus_register_dev: (ffff80008005f128)
      firecracker 22806 [032]   427.687197: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a19cc)
      firecracker 22806 [032]   427.687201: probe:kvm_io_bus_register_dev: (ffff80008005f128)
      firecracker 22806 [032]   427.687202: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800a6d6c)
      firecracker 22806 [029]   427.707660: probe:kvm_io_bus_register_dev: (ffff80008005f128)
      firecracker 22806 [029]   427.707666: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800601a8)
      firecracker 22806 [029]   427.707667: probe:kvm_io_bus_register_dev: (ffff80008005f128)
      firecracker 22806 [029]   427.707668: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff8000800601a8)
        fc_vcpu 0 22829 [030]   427.711642: probe:kvm_io_bus_register_dev: (ffff80008005f128)
        fc_vcpu 0 22829 [030]   427.711645: probe:kvm_io_bus_register_dev__return: (ffff80008005f128 <- ffff80008009f6f4)


Also, it is KVM_SET_USER_MEMORY_REGION (not KVM_CREATE_VCPU) that is 
hit on ARM (though seemingly for the same reason):

45736 17:30:10.251430 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0x80000000, memory_size=12884901888, userspace_addr=0xfffcbedd6000}) = 0 <0.021021>

vs

30694 17:33:01.128985 ioctl(17, KVM_SET_USER_MEMORY_REGION, {slot=0, flags=0, guest_phys_addr=0x80000000, memory_size=12884901888, userspace_addr=0xfffc91fc9000}) = 0 <0.000016>

> 
>> I am not aware of a way to make it fast for both use cases and would be more
>> than happy to hear about possible solutions.
> 
> What if we key off of vCPUS being created?  The motivation for Keir's change was
> to avoid stalling during VM boot, i.e. *after* initial VM creation.

It doesn't work as is on x86 because the delay we're seeing occurs after 
kvm->created_vcpus is incremented, so the check can't differentiate 
the two cases (below is kvm_vm_ioctl_create_vcpu):

	kvm->created_vcpus++; // <===== incremented here
	mutex_unlock(&kvm->lock);

	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
	if (!vcpu) {
		r = -ENOMEM;
		goto vcpu_decrement;
	}

	BUILD_BUG_ON(sizeof(struct kvm_run) > PAGE_SIZE);
	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
	if (!page) {
		r = -ENOMEM;
		goto vcpu_free;
	}
	vcpu->run = page_address(page);

	kvm_vcpu_init(vcpu, kvm, id);

	r = kvm_arch_vcpu_create(vcpu); // <===== the delay is here


firecracker   583 [001]   151.297145: probe:synchronize_srcu_expedited: (ffffffff813e5cf0)
     ffffffff813e5cf1 synchronize_srcu_expedited+0x1 ([kernel.kallsyms])
     ffffffff81234986 kvm_swap_active_memslots+0x136 ([kernel.kallsyms])
     ffffffff81236cdd kvm_set_memslot+0x1cd ([kernel.kallsyms])
     ffffffff81237518 kvm_set_memory_region.part.0+0x478 ([kernel.kallsyms])
     ffffffff81264dbc __x86_set_memory_region+0xec ([kernel.kallsyms])
     ffffffff8127e2dc kvm_alloc_apic_access_page+0x5c ([kernel.kallsyms])
     ffffffff812b9ed3 vmx_vcpu_create+0x193 ([kernel.kallsyms])
     ffffffff8126788a kvm_arch_vcpu_create+0x1da ([kernel.kallsyms])
     ffffffff8123c54c kvm_vm_ioctl+0x5fc ([kernel.kallsyms])
     ffffffff8167b331 __x64_sys_ioctl+0x91 ([kernel.kallsyms])
     ffffffff8251a89c do_syscall_64+0x4c ([kernel.kallsyms])
     ffffffff8100012b entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms])
               6512de ioctl+0x32 (/mnt/host/firecracker)
                d99a7 std::rt::lang_start+0x37 (/mnt/host/firecracker)


Also, given that the stall occurs after KVM_CREATE_VCPU on ARM (in 
KVM_SET_USER_MEMORY_REGION), it doesn't look like a universal solution.


> 
> --
> From: Sean Christopherson <seanjc at google.com>
> Date: Fri, 13 Feb 2026 15:15:01 -0800
> Subject: [PATCH] KVM: Synchronize SRCU on I/O device registration if vCPUs
>   haven't been created
> 
> TODO: Write a changelog if this works.
> 
> Fixes: 7d9a0273c459 ("KVM: Avoid synchronize_srcu() in kvm_io_bus_register_dev()")
> Reported-by: Nikita Kalyazin <kalyazin at amazon.com>
> Closes: https://lkml.kernel.org/r/a84ddba8-12da-489a-9dd1-ccdf7451a1ba%40amazon.com
> Cc: stable at vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc at google.com>
> ---
>   virt/kvm/kvm_main.c | 25 ++++++++++++++++++++++++-
>   1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 571cf0d6ec01..043b1c3574ab 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -6027,7 +6027,30 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
>          memcpy(new_bus->range + i + 1, bus->range + i,
>                  (bus->dev_count - i) * sizeof(struct kvm_io_range));
>          rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
> -       call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> +
> +       /*
> +        * To optimize VM creation *and* boot time, use different tactics for
> +        * safely freeing the old bus based on where the VM is at in its
> +        * lifecycle.  If vCPUs haven't yet been created, simply synchronize
> +        * and free, as there are unlikely to be active SRCU readers; if not,
> +        * defer freeing the bus via SRCU callback.
> +        *
> +        * If there are active SRCU readers, synchronizing will stall until the
> +        * current grace period completes, which can meaningfully impact boot
> +        * time for VMs that trigger a large number of registrations.
> +        *
> +        * If there aren't SRCU readers, using an SRCU callback can be a net
> +        * negative due to starting a grace period of its own, which in turn
> +        * can unnecessarily cause a future synchronization to stall.  E.g. if
> +        * devices are registered before memslots are created, then creating
> +        * the first memslot will have to wait for a superfluous grace period.
> +        */
> +       if (!READ_ONCE(kvm->created_vcpus)) {
> +               synchronize_srcu_expedited(&kvm->srcu);
> +               kfree(bus);
> +       } else {
> +               call_srcu(&kvm->srcu, &bus->rcu, __free_bus);
> +       }
> 
>          return 0;
>   }
> 
> base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
> --



