[PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown
Shanker Donthineni
sdonthineni at nvidia.com
Wed Jan 18 11:24:01 PST 2023
On 1/18/23 05:54, Marc Zyngier wrote:
>
> Shanker,
>
> Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
>
> On Wed, 18 Jan 2023 02:23:48 +0000,
> Shanker Donthineni <sdonthineni at nvidia.com> wrote:
>>
>> We are seeing intermittent CPU soft lockups during virtual machine
>> teardown on a system with GICv4 features enabled. The function
>> __synchronize_hardirq() waits forever for IRQD_IRQ_INPROGRESS to be
>> cleared with the current implementation.
>>
>> The CPU is stuck here for a long time, leading to the soft lockup:
>>         while (irqd_irq_inprogress(&desc->irq_data))
>>                 cpu_relax();
>
> Is it a soft-lockup from which the system recovers? or a livelock
> which leaves the system dead?
>
The system does not recover; I had to power cycle it.
[ 211.018214] watchdog: BUG: soft lockup - CPU#8 stuck for 45s! [qemu-system-aar:1261]
[ 211.026164] Modules linked in:
[ 211.029283] CPU: 8 PID: 1261 Comm: qemu-system-aar Kdump: loaded Not tainted 6.0.0 #7
[ 211.037338] Hardware name: /th500_silicon, BIOS v1.1.1-4638c6c9 01/17/2023
[ 211.044453] pstate: 62400009 (nZCv daif +PAN -UAO +TCO -DIT -SSBS BTYPE=--)
[ 211.051586] pc : __synchronize_hardirq+0x54/0x140
[ 211.056421] lr : free_irq+0x138/0x424
[ 211.060160] sp : ffff80001504f990
[ 211.063551] x29: ffff80001504f990 x28: 0000000000000008 x27: ffff00036c880000
[ 211.070850] x26: ffff00036f888c00 x25: 0000000000000000 x24: 0000000000000001
[ 211.078219] x23: ffff0000835659d0 x22: 0000000000000001 x21: 0000000000000000
[ 211.085513] x20: ffff00008356591c x19: ffff0000835658a0 x18: 000000000000003f
[ 211.092807] x17: ffff001feafabf28 x16: ffffffc0012c3388 x15: ffff80001504fab0
[ 211.100103] x14: 0000000000000008 x13: ffff800008b41718 x12: 00000000f5257d14
[ 211.107489] x11: 0000000000000040 x10: ffff8000091f1680 x9 : ffff8000091f1678
[ 211.114783] x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000001
[ 211.122115] x5 : ffff8000083524e4 x4 : ffffffc000db5e20 x3 : 0000000081550146
[ 211.129409] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000010071200
[ 211.136736] Call trace:
[ 211.139232] __synchronize_hardirq+0x54/0x140
[ 211.143746] free_irq+0x138/0x424
[ 211.147139] vgic_v4_teardown+0xa4/0xe0
[ 211.151072] __kvm_vgic_destroy+0x18c/0x194
[ 211.155351] kvm_vgic_destroy+0x28/0x3c
[ 211.159310] kvm_arch_destroy_vm+0x24/0x44
[ 211.163579] kvm_destroy_vm+0x158/0x2c4
[ 211.167495] kvm_vm_release+0x6c/0x98
[ 211.171257] __fput+0x70/0x220
[ 211.174373] ____fput+0x10/0x20
[ 211.177588] task_work_run+0xb4/0x23c
[ 211.181327] do_exit+0x2bc/0x8a4
[ 211.184622] do_group_exit+0x34/0xb0
[ 211.188271] get_signal+0x878/0x8a0
[ 211.191856] do_notify_resume+0x138/0x1530
[ 211.196075] el0_svc+0xdc/0xf0
[ 211.199211] el0t_64_sync_handler+0xf0/0x11c
[ 211.203635] el0t_64_sync+0x18c/0x190
[ 211.207374] Kernel panic - not syncing: softlockup: hung tasks
[ 211.213335] CPU: 8 PID: 1261 Comm: qemu-system-aar Kdump: loaded Tainted: G L 6.0.0 #7
[ 211.222851] Hardware name: /th500_silicon, BIOS v1.1.1-4638c6c9 01/17/2023
[ 211.229993] Call trace:
[ 211.232498] dump_backtrace.part.0+0xc8/0xe0
[ 211.236890] show_stack+0x18/0x54
[ 211.240330] dump_stack_lvl+0x64/0x7c
[ 211.244097] dump_stack+0x18/0x30
[ 211.247479] panic+0x160/0x328
[ 211.250622] watchdog_nmi_enable+0x0/0x14
[ 211.254818] __hrtimer_run_queues+0x118/0x244
[ 211.259288] hrtimer_interrupt+0x114/0x300
[ 211.263471] arch_timer_handler_phys+0x2c/0x40
[ 211.268034] handle_percpu_devid_irq+0x84/0x138
[ 211.272671] generic_handle_domain_irq+0x2c/0x40
[ 211.277394] gic_handle_irq+0x50/0x128
[ 211.281221] call_on_irq_stack+0x2c/0x38
[ 211.285226] do_interrupt_handler+0x7c/0x88
[ 211.289552] el1_interrupt+0x34/0x54
[ 211.293239] el1h_64_irq_handler+0x18/0x20
[ 211.297433] el1h_64_irq+0x64/0x68
[ 211.300935] __synchronize_hardirq+0x54/0x140
[ 211.305385] free_irq+0x138/0x424
[ 211.308767] vgic_v4_teardown+0xa4/0xe0
[ 211.312709] __kvm_vgic_destroy+0x18c/0x194
[ 211.317007] kvm_vgic_destroy+0x28/0x3c
[ 211.320927] kvm_arch_destroy_vm+0x24/0x44
[ 211.325204] kvm_destroy_vm+0x158/0x2c4
[ 211.329125] kvm_vm_release+0x6c/0x98
[ 211.332872] __fput+0x70/0x220
[ 211.335988] ____fput+0x10/0x20
[ 211.339192] task_work_run+0xb4/0x23c
[ 211.342980] do_exit+0x2bc/0x8a4
[ 211.346310] do_group_exit+0x34/0xb0
[ 211.349960] get_signal+0x878/0x8a0
[ 211.353520] do_notify_resume+0x138/0x1530
[ 211.357703] el0_svc+0xdc/0xf0
[ 211.360818] el0t_64_sync_handler+0xf0/0x11c
[ 211.365197] el0t_64_sync+0x18c/0x190
[ 211.368947] SMP: stopping secondary CPUs
> What kernel version is that? Please provide all the relevant context
> that could help analysing the issue.
>
Tested with v6.0+ and v6.2-rc4.
>>
>> Call trace from the lockup CPU:
>> [ 87.238866] watchdog: BUG: soft lockup - CPU#37 stuck for 23s!
>> [ 87.250025] CPU: 37 PID: 1031 Comm: qemu-system-aarch64
>> [ 87.358397] Call trace:
>> [ 87.360891] __synchronize_hardirq+0x48/0x140
>> [ 87.365343] free_irq+0x138/0x424
>> [ 87.368727] vgic_v4_teardown+0xa4/0xe0
>> [ 87.372649] __kvm_vgic_destroy+0x18c/0x194
>> [ 87.376922] kvm_vgic_destroy+0x28/0x3c
>> [ 87.380839] kvm_arch_destroy_vm+0x24/0x44
>> [ 87.385024] kvm_destroy_vm+0x158/0x2c4
>> [ 87.388943] kvm_vm_release+0x6c/0x98
>> [ 87.392681] __fput+0x70/0x220
>> [ 87.395800] ____fput+0x10/0x20
>> [ 87.399005] task_work_run+0xb4/0x23c
>> [ 87.402746] do_exit+0x2bc/0x8a4
>> [ 87.406042] do_group_exit+0x34/0xb0
>> [ 87.409693] get_signal+0x878/0x8a0
>> [ 87.413254] do_notify_resume+0x138/0x1530
>> [ 87.417440] el0_svc+0xdc/0xf0
>> [ 87.420559] el0t_64_sync_handler+0xf0/0x11c
>> [ 87.424919] el0t_64_sync+0x18c/0x190
>>
>> The IRQD_IRQ_INPROGRESS state is lost inside irq_domain_activate_irq(),
>> which happens before free_irq() is called. I instrumented the code and
>> confirmed that the IRQD state changes from 0x10401400 to 0x10441600
>> instead of 0x10401600, causing the problem.
>>
>> Call trace from irqd_set_activated():
>> [ 78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS
>> old=0x10401400, new=0x10441600
>> [ 78.992093] CPU: 19 PID: 1511 Comm: qemu-system-aarch64
>> [ 79.008461] Call trace:
>> [ 79.010956] dump_backtrace.part.0+0xc8/0xe0
>> [ 79.015328] show_stack+0x18/0x54
>> [ 79.018713] dump_stack_lvl+0x64/0x7c
>> [ 79.022459] dump_stack+0x18/0x30
>> [ 79.025842] irq_domain_activate_irq+0x88/0x94
>> [ 79.030385] vgic_v3_save_pending_tables+0x260/0x29c
>> [ 79.035463] vgic_set_common_attr+0xac/0x23c
>> [ 79.039826] vgic_v3_set_attr+0x48/0x60
>> [ 79.043742] kvm_device_ioctl+0x120/0x19c
>> [ 79.047840] __arm64_sys_ioctl+0x42c/0xe00
>> [ 79.052027] invoke_syscall.constprop.0+0x50/0xe0
>> [ 79.056835] do_el0_svc+0x58/0x180
>> [ 79.060308] el0_svc+0x38/0xf0
>> [ 79.063425] el0t_64_sync_handler+0xf0/0x11c
>> [ 79.067785] el0t_64_sync+0x18c/0x190
>
> Are these two traces an indication of concurrent events? Or are they
> far apart?
>
Concurrent.
>>
>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>> {
>>         irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>         raw_spin_unlock(&desc->lock);
>>
>>         ret = handle_irq_event_percpu(desc);
>>
>>         raw_spin_lock(&desc->lock);
>>         irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>> }
>
> How is that relevant to this trace? Do you see this function running
> concurrently with the teardown? If it matters here, it must be a VPE
> doorbell, right? But you claim that this is on a GICv4 platform, while
> this would only affect GICv4.1... Or are you using GICv4.1?
>
handle_irq_event() is running concurrently with irq_domain_activate_irq(),
which happens before free_irq() is called. The corruption occurred at
[78.983544] and the teardown started at [87.360891]:
[ 78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600
[ 87.360891] __synchronize_hardirq+0x48/0x140
Yes, I'm using GICv4.1. I used these two functions to trace the issue:
static inline void irqd_set_activated(struct irq_data *d)
{
        atomic_t *state = (atomic_t *)(&__irqd_to_state(d));
        unsigned int new, old = __irqd_to_state(d);

        new = atomic_xchg(state, old | IRQD_ACTIVATED);
        if ((old ^ new) & IRQD_IRQ_INPROGRESS) {
                pr_err("%s: lost IRQD_IRQ_INPROGRESS old=0x%x new=0x%x\n",
                       __func__, old, new);
                dump_stack();
        }
}

static inline void irqd_clr_activated(struct irq_data *d)
{
        atomic_t *state = (atomic_t *)(&__irqd_to_state(d));
        unsigned int new, old = __irqd_to_state(d);

        new = atomic_xchg(state, old & ~IRQD_ACTIVATED);
        if ((old ^ new) & IRQD_IRQ_INPROGRESS) {
                pr_err("%s: lost IRQD_IRQ_INPROGRESS old=0x%x new=0x%x\n",
                       __func__, old, new);
                dump_stack();
        }
}
>>
>> In this particular failing case, based on the traces, the two functions
>> irqd_set_activated() and handle_irq_event() concurrently modify the IRQD
>> state without both holding desc->lock. The irqd_set_activated() path
>> reads 'state_use_accessors' in between the set and clear of the
>> IRQD_IRQ_INPROGRESS state and writes the modified value back after
>> 'irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS)' has executed.
>>
>> To fix the lockup issue, hold desc->lock when calling
>> irq_domain_activate_irq() and irq_domain_deactivate_irq().
>
> For that particular issue, the main problem is that we are abusing the
> interrupt startup/teardown code. The locking is only a consequence of
> this.
>
>>
>> Signed-off-by: Shanker Donthineni <sdonthineni at nvidia.com>
>> ---
>> arch/arm64/kvm/vgic/vgic-v3.c | 6 ++++++
>> arch/arm64/kvm/vgic/vgic-v4.c | 4 ++++
>> 2 files changed, 10 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
>> index 2074521d4a8c..e6aa909fcbe2 100644
>> --- a/arch/arm64/kvm/vgic/vgic-v3.c
>> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
>> @@ -353,22 +353,28 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
>>  static void unmap_all_vpes(struct vgic_dist *dist)
>>  {
>>          struct irq_desc *desc;
>> +        unsigned long flags;
>>          int i;
>>
>>          for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>>                  desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
>> +                raw_spin_lock_irqsave(&desc->lock, flags);
>>                  irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
>> +                raw_spin_unlock_irqrestore(&desc->lock, flags);
>
> I guess this is the guilty one, based on your analysis, and assuming
> this is a v4.1 issue, not v4.
>
>>          }
>>  }
>>
>>  static void map_all_vpes(struct vgic_dist *dist)
>>  {
>>          struct irq_desc *desc;
>> +        unsigned long flags;
>>          int i;
>>
>>          for (i = 0; i < dist->its_vm.nr_vpes; i++) {
>>                  desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
>> +                raw_spin_lock_irqsave(&desc->lock, flags);
>>                  irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
>> +                raw_spin_unlock_irqrestore(&desc->lock, flags);
>>          }
>>  }
>>
>> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
>> index ad06ba6c9b00..a01b8313e82c 100644
>> --- a/arch/arm64/kvm/vgic/vgic-v4.c
>> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
>> @@ -139,8 +139,10 @@ static void vgic_v4_enable_vsgis(struct kvm_vcpu *vcpu)
>>                  /* Transfer the full irq state to the vPE */
>>                  vgic_v4_sync_sgi_config(vpe, irq);
>>                  desc = irq_to_desc(irq->host_irq);
>> +                raw_spin_lock(&desc->lock);
>>                  ret = irq_domain_activate_irq(irq_desc_get_irq_data(desc),
>>                                                false);
>> +                raw_spin_unlock(&desc->lock);
>
> This one looks wrong. The interrupt never fires on the host (that's
> the whole point of this stuff). There isn't even a handler attached to
> it. How can that result in a problem?
Thanks for catching the mistake.
>
> My take on the whole thing is that we should stop playing games with
> the underlying interrupt infrastructure. How about the following hack,
> which is only compile-tested. Let me know if that helps.
>
> M.
>
> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
> index 2074521d4a8c..2624963cb95b 100644
> --- a/arch/arm64/kvm/vgic/vgic-v3.c
> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
> @@ -350,26 +350,23 @@ int vgic_v3_lpi_sync_pending_status(struct kvm *kvm, struct vgic_irq *irq)
>   * The deactivation of the doorbell interrupt will trigger the
>   * unmapping of the associated vPE.
>   */
> -static void unmap_all_vpes(struct vgic_dist *dist)
> +static void unmap_all_vpes(struct kvm *kvm)
>  {
> -        struct irq_desc *desc;
> +        struct vgic_dist *dist = &kvm->arch.vgic;
>          int i;
>
> -        for (i = 0; i < dist->its_vm.nr_vpes; i++) {
> -                desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> -                irq_domain_deactivate_irq(irq_desc_get_irq_data(desc));
> -        }
> +        for (i = 0; i < dist->its_vm.nr_vpes; i++)
> +                free_irq(dist->its_vm.vpes[i]->irq, kvm_get_vcpu(kvm, i));
>  }
>
> -static void map_all_vpes(struct vgic_dist *dist)
> +static void map_all_vpes(struct kvm *kvm)
>  {
> -        struct irq_desc *desc;
> +        struct vgic_dist *dist = &kvm->arch.vgic;
>          int i;
>
> -        for (i = 0; i < dist->its_vm.nr_vpes; i++) {
> -                desc = irq_to_desc(dist->its_vm.vpes[i]->irq);
> -                irq_domain_activate_irq(irq_desc_get_irq_data(desc), false);
> -        }
> +        for (i = 0; i < dist->its_vm.nr_vpes; i++)
> +                WARN_ON(vgic_v4_request_vpe_irq(kvm_get_vcpu(kvm, i),
> +                                                dist->its_vm.vpes[i]->irq));
>  }
>
>  /**
> @@ -394,7 +391,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
>           * and enabling of the doorbells have already been done.
>           */
>          if (kvm_vgic_global_state.has_gicv4_1) {
> -                unmap_all_vpes(dist);
> +                unmap_all_vpes(kvm);
>                  vlpi_avail = true;
>          }
>
> @@ -444,7 +441,7 @@ int vgic_v3_save_pending_tables(struct kvm *kvm)
>
>  out:
>          if (vlpi_avail)
> -                map_all_vpes(dist);
> +                map_all_vpes(kvm);
>
>          return ret;
>  }
> diff --git a/arch/arm64/kvm/vgic/vgic-v4.c b/arch/arm64/kvm/vgic/vgic-v4.c
> index ad06ba6c9b00..a413718be92b 100644
> --- a/arch/arm64/kvm/vgic/vgic-v4.c
> +++ b/arch/arm64/kvm/vgic/vgic-v4.c
> @@ -222,6 +222,11 @@ void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val)
>          *val = !!(*ptr & mask);
>  }
>
> +int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq)
> +{
> +        return request_irq(irq, vgic_v4_doorbell_handler, 0, "vcpu", vcpu);
> +}
> +
>  /**
>   * vgic_v4_init - Initialize the GICv4 data structures
>   * @kvm: Pointer to the VM being initialized
> @@ -283,8 +288,7 @@ int vgic_v4_init(struct kvm *kvm)
>                  irq_flags &= ~IRQ_NOAUTOEN;
>                  irq_set_status_flags(irq, irq_flags);
>
> -                ret = request_irq(irq, vgic_v4_doorbell_handler,
> -                                  0, "vcpu", vcpu);
> +                ret = vgic_v4_request_vpe_irq(vcpu, irq);
>                  if (ret) {
>                          kvm_err("failed to allocate vcpu IRQ%d\n", irq);
>                          /*
> diff --git a/arch/arm64/kvm/vgic/vgic.h b/arch/arm64/kvm/vgic/vgic.h
> index 0c8da72953f0..23e280fa0a16 100644
> --- a/arch/arm64/kvm/vgic/vgic.h
> +++ b/arch/arm64/kvm/vgic/vgic.h
> @@ -331,5 +331,6 @@ int vgic_v4_init(struct kvm *kvm);
>  void vgic_v4_teardown(struct kvm *kvm);
>  void vgic_v4_configure_vsgis(struct kvm *kvm);
>  void vgic_v4_get_vlpi_state(struct vgic_irq *irq, bool *val);
> +int vgic_v4_request_vpe_irq(struct kvm_vcpu *vcpu, int irq);
>
>  #endif
>
I ran a stress test launching and tearing down multiple VMs for 3 hours
with this change; the issue is not reproducible. The same test fails
within 10-30 minutes without the code changes.
> --
> Without deviation from the norm, progress is not possible.