[PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown
Shanker Donthineni
sdonthineni at nvidia.com
Thu Jan 19 05:00:49 PST 2023
On 1/19/23 01:11, Marc Zyngier wrote:
> External email: Use caution opening links or attachments
>
>
> [dropping the now dead old kvmarm list]
>
> On Wed, 18 Jan 2023 19:24:01 +0000,
> Shanker Donthineni <sdonthineni at nvidia.com> wrote:
>>
>>
>>
>> On 1/18/23 05:54, Marc Zyngier wrote:
>>>
>>>
>>> Shanker,
>>>
>>> Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
>>>
>>> On Wed, 18 Jan 2023 02:23:48 +0000,
>>> Shanker Donthineni <sdonthineni at nvidia.com> wrote:
>>>>
>>>> We are seeing intermittent CPU soft lockups during virtual machine
>>>> teardown on a system with GICv4 features enabled. As currently
>>>> implemented, __synchronize_hardirq() waits forever for
>>>> IRQD_IRQ_INPROGRESS to be cleared.
>>>>
>>>> The CPU is stuck here for a long time, leading to the soft lockup:
>>>>
>>>>     while (irqd_irq_inprogress(&desc->irq_data))
>>>>         cpu_relax();
>>>
>>> Is it a soft-lockup from which the system recovers? or a livelock
>>> which leaves the system dead?
>>>
The system does not recover; I had to power-cycle it.
>
> This isn't a soft-lockup then. This is at best a livelock.
>
>>> Are these two traces an indication of concurrent events? Or are they
>>> far apart?
>>>
>> Concurrent.
>
> So you can see the VM being torn down while the vgic save sequence is
> still in progress?
>
> If you can actually see that, then this is a much bigger bug than the
> simple race you are describing, and we're missing a reference on the
> kvm structure. This would be a *MAJOR* bug.
>
How do we know the vGIC save sequence is in progress while the VM is
being torn down? I'm launching and terminating ~32 VMs in a loop to
reproduce the issue.
> Please post the full traces, not snippets. The absolutely full kernel
> log, the configuration, what you run, how you run it, *EVERYTHING*. I
> need to be able to reproduce this.
Sure, I'll share the complete host kernel boot log from the next run.
>
>>
>>>>
>>>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>>>> {
>>>>     irqreturn_t ret;
>>>>
>>>>     irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>     raw_spin_unlock(&desc->lock);
>>>>
>>>>     ret = handle_irq_event_percpu(desc);
>>>>
>>>>     raw_spin_lock(&desc->lock);
>>>>     irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>     return ret;
>>>> }
>>>
>>> How is that relevant to this trace? Do you see this function running
>>> concurrently with the teardown? If it matters here, it must be a VPE
>>> doorbell, right? But you claim that this is on a GICv4 platform, while
>>> this would only affect GICv4.1... Or are you using GICv4.1?
>>>
>> handle_irq_event() runs concurrently with irq_domain_activate_irq(),
>> which happens before free_irq() is called. The corruption is at
>> [78.983544] and the teardown started at [87.360891].
>
> But that doesn't match the description you made of concurrent
> events. Does it take more than 9 seconds for the vgic state to be
> saved to memory?
Are there any other ways the IRQD_IRQ_INPROGRESS state bit could be
corrupted, other than concurrent accesses?
In another experiment I changed the IRQD state bit manipulation to
atomic operations, and the issue is no longer reproducible:
 struct irq_common_data {
-	unsigned int		__private state_use_accessors;
+	atomic_t		__private state_use_accessors;

+#define __irqd_to_state(d)		ACCESS_PRIVATE((d)->common, state_use_accessors)
+#define __irqd_get_state(d)		atomic_read(&__irqd_to_state(d))
+#define __irqd_set_state(d, m)		atomic_or((m), &__irqd_to_state(d))
+#define __irqd_clear_state(d, m)	atomic_andnot((m), &__irqd_to_state(d))
>
>>
>> [ 78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600
>>
>> [ 87.360891] __synchronize_hardirq+0x48/0x140
>>
>> Yes, I'm using GICv4.1; I used these two functions to trace the issue.
>
> Then *please* be precise in your descriptions. You send people in the
> wrong direction.
>
It was my mistake not to mention the exact GIC version.
> [...]
>
>> I ran the stress test, launching and tearing down multiple VMs, for
>> 3 hours, and the issue is not reproducible. The same test fails in
>> 10-30 minutes without the code changes.
>
> That doesn't add up with the earlier description of concurrent events,
> with the VM teardown and the vgic saving running in parallel. My patch
> doesn't prevent this.
>
> So either your testing is insufficient, or your description of
> concurrent events is inaccurate.
>
I'm using unmodified v6.2-rc3/v6.0 kernels from kernel.org and QEMU to
reproduce the issue.
I use the steps below to launch and terminate 32 VMs in a loop. The
failure is intermittent. The same issue is also reproducible with
kvmtool.
numvms=32
socketcnt=1
iterations=100

while [ ${iterations} -ne 0 ]; do
    iterations=$(( iterations - 1 ))

    for idx in `seq 1 ${numvms}`; do
        vmname=vm${idx}
        cpunode=$((idx % socketcnt))
        cpuset=`lscpu | grep "NUMA node${cpunode} CPU(s)" | awk -F' ' '{ print $4 }'`
        qemu-img create -f qcow2 -F raw -b /var/lib/libvirt/images/cloudimg.raw /var/lib/libvirt/images/${vmname}.qcow2
        virt-sysprep -a /var/lib/libvirt/images/${vmname}.qcow2 --hostname ${vmname}
        virt-install \
            --name ${vmname} \
            --numatune ${cpunode},mode=preferred \
            --vcpus vcpus=16,maxvcpus=16,cpuset=${cpuset} \
            --memory 4096 \
            --import \
            --noautoconsole \
            --os-variant name=ubuntu22.04 \
            --disk path=/var/lib/libvirt/images/${vmname}.qcow2 \
            --boot loader=/usr/share/AAVMF/AAVMF_CODE.fd,loader_ro=yes,loader_type=pflash
        if [ $? -ne 0 ]; then
            echo "Failed to create vm${idx}"
            while true; do sleep 1; done
        fi
        sleep 5
    done

    for idx in `seq 1 ${numvms}`; do
        echo -n "Checking vm${idx} boot: "
        while true; do
            ghostname=`virsh guestinfo vm${idx} --hostname 2>/dev/null | awk '{ print $3 }'`
            [ "${ghostname}" == "vm${idx}" ] && echo "done" && break
            sleep 1
        done
    done

    for idx in `seq 1 ${numvms}`; do
        virsh domstate vm${idx} >/dev/null 2>&1 || continue
        echo "Started shutdown & undefine vm${idx}..."
        virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
        while true; do
            virsh domstate vm${idx} 2>/dev/null | grep -w "^shut off" >/dev/null 2>&1
            [ $? -eq 0 ] && break
            sleep 1
            virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
        done
        virsh undefine --nvram --remove-all-storage vm${idx} 2>/dev/null
    done
done
> M.
>
> --
> Without deviation from the norm, progress is not possible.