[PATCH] KVM: arm64: nv: Set ISTATUS for emulated timers, If timer expired
Eric Auger
eauger at redhat.com
Mon Dec 9 08:46:30 PST 2024
Hi,
On 12/9/24 16:39, Ganapatrao Kulkarni wrote:
>
>
> On 09-12-2024 06:50 pm, Marc Zyngier wrote:
>> On Mon, 09 Dec 2024 12:25:34 +0000,
>> Ganapatrao Kulkarni <gankulkarni at os.amperecomputing.com> wrote:
>>>>>
>>>>> During automated testing of Nested Virtualization using avocado-vt,
>>>>
>>>> Which is not merged upstream. So what branch are you using? Based on
>>>> what kernel version? On what HW? With which virtualisation features?
>>>>
>>>
>>> Testing is done on Ampere's AmpereOne platform using 6.10 based kernel
>>> with NV patches from your repo.
>>
>> Grmbl... *Which* patches? At least give me the SHA1 of the branch,
>> because I have no idea what you are running. And 6.10 is definitely
>> not something I care about. If you're using the NV patches, the
>> *minimum* you should run is 6.13-rc1, because that's what the current
>> code is based on.
>>
>
> I tried 6.13-rc1 based nv-next branch today, which failed to boot
> UEFI as L1. Yet to debug this.
I confirm am stuck with the same issue with nv-next on AmpereOne.
Thanks
Eric
>
>
> QEMU from Eric's repo is used for the testing.
> https://github.com/eauger/qemu/tree/v9.0-nv-rfcv3
>
>> Also, does this machine have FEAT_ECV?
>
> Yes!
>>
>>>
>>>>> it has been observed that during some boot test iterations,
>>>>> the Guest-Hypervisor boot was getting crashed with a
>>>>> synchronous exception while it is still booting EDK2.
>>>>>
>>>>> The test is launching Multiple instances of Guest-Hypervisor boot
>>>>
>>>> Is the multiple instance aspect really relevant to the reproduction of
>>>> the problem?
>>>
>>> Not really, but it requires multiple attempts/iterations to hit the
>>> issue. Even with automated test, it was seen at some iteration out of
>>> 10 to 15 iterations.
>>>
>>>>
>>>>> and while booting, QEMU monitor issued the command "info register"
>>>>> at regular intervals to take a register dump. To execute this
>>>>> command, QEMU stops the run and does the register read of various
>>>>> registers. While resuming the run, the function kvm_arm_timer_write()
>>>>> writes back the saved CNTV_CTL_EL0 register with ISTATUS cleared
>>>>> always
>>>>
>>>> It is userspace that causes this write-back, right? AFAICT, KVM never
>>>> does that on its own.
>>>>
>>>>> and resulting in the loss of pending interrupt for emulated timers.
>>>>
>>>> How does a missing interrupt result in a synchronous exception in
>>>> EDK2? In my experience, EDK2 panics if it sees spurious interrupts,
>>>> not when it is missing interrupts (it just locks up, which is
>>>> expected).
>>>
>>> Not sure, why it is hitting exception, rather than hang at EDK2.
>>> However, EDK2 timer handler code is ignoring the interrupt since
>>> ISTATUS is not set and not moving CVAL forward.
>>
>> How is EDK2 getting this exception? Is this injected by KVM? Or is
>> that some EDK2 bug?
>>
>>>
>>>>
>>>>> In hardware based timers, ISTATUS is a RO/WI bit and gets set by the
>>>>> h/w, if the condition is still met. However, in Nested-Virtualization
>>>>> case, the Guest-Hypervisor's EDK2 is using an emulated virtual timer
>>>>> and losing ISTATUS state and the interrupt forever.
>>>>
>>>> Why is this specific to NV? Can't the same thing happen to the
>>>> physical timer in a non-VHE configuration?
>>>>
>>>
>>> You mean, emulated v-timer in non-VHE boot?
>>
>> Emulated *physical* timer.
>>
>>> It might impact non-VHE case as well, not tried though.
>>
>> Can you please try?
>
> Sure, I will try non-VHE as well.
>>
>> [...]
>>
>>>> But overall, this very much looks like it is only papering over the
>>>> real issue, which is that the *emulation* should regenerate the
>>>> pending bit, and not rely on the userspace interface.
>>>>
>>>> As far as I can tell, we already correctly compute the status bit on
>>>> read (read_timer_ctl()), so the guest should always observe something
>>>> consistent when it traps. We also *never* use the status bit as an
>>>> input to the emulation, and always recompute it from scratch (it is
>>>> only there for the benefit of the guest or userspace).
>>>>
>>>
>>> For emulated timers, we are not asserting again by calling
>>> kvm_timer_update_irq in timer_emulate() until the level is down and
>>> ready for trigger again. This was done to fix high rate of spurious
>>> interrupts getting generated to V-Timer. Hence we are not able to
>>> recover, if once ISTATUS is lost.
>>
>> Again, a trapping read should see the correct value, since we populate
>> that bit at read time.
>>
>>>> So I can't see how upstream is broken in at the moment, and you need
>>>> to explain how this actually triggers. Ideally, with a standalone
>>>> reproducer or a selftest.
>>>
>>> We could reproduce the issue with the simple test/script.
>>> On one shell, launch L1 using qemu with add-on option
>>>
>>> "-monitor unix:gh_monitor,server,nowait
>>>
>>> On another shell, while L1 boots and still in UEFI, run repeatedly the
>>> command (or put in a while 1 loop script)
>>>
>>> "echo "info registers" | socat - unix-connect:gh_monitor >
>>> /tmp/info_registers"
>>>
>>> With above steps we were able to hit the issue within few attempts.
>>
>> That's not a standalone reproducer. QEMU doesn't support NV, and
>> kvmtool doesn't have this sort of interface. I was asking for a bit of
>> C code that I could run directly, not something that requires me to
>> drag even more experimental code.
>>
>> So here's my current guess, since you don't give me the needed
>> information. For what you describe to happen, I can only see two
>> possibilities:
>>
>> - either your HW doesn't have FEAT_ECV, in which case the guest
>> directly reads from memory
>>
>
> We do have the FEAT_ECV on AmpereOne, I was the one reported/fixed bug
> with FEAT_ECV(CNTPOFF offset issue) in the past.
>
>> - or you are running with something like this patch [1], and we serve
>> the guest by reading from memory very early, without returning to
>> the bulk of the emulation code
>
> I see the kernel I am testing has this patch[1].
>>
>> In either case, we only publish the updated status if the current IRQ
>> state is different from the computed output of the timer while
>> performing the emulation.
>>
>> So if you were writing back a status bit set to 0 while the interrupt
>> was already pending, we'd deliver an interrupt, but not recompute the
>> status. The guest would consider the interrupt as spurious, not touch
>> the timer, and we'd never make forward progress. Rinse, repeat.
>>
>> Assuming I got the analysis right,
>
> Yes, this is what I tried to explain. LR shows pending, but UEFI is not
> consuming and treating it as spurious since ISTATUS is not set.
>
> it would only be a matter of
>> hoisting the publication of the status into timer_emulate(), so that
>> it is made up to date on load.
>>
>> Please give the fixup below a go.
>>
>
> Sure, I will give a try with below diff and let you know tomorrow.
> This should work, I remember, this was the one of the option/fix that we
> tried as fix while debugging.
>
>>
>> [1] https://lore.kernel.org/all/20241202172134.384923-6-maz@kernel.org/
>>
>> From 2bbd6f9b41a20ad573376c20c158ff3c12db5009 Mon Sep 17 00:00:00 2001
>> From: Marc Zyngier <maz at kernel.org>
>> Date: Mon, 9 Dec 2024 10:58:08 +0000
>> Subject: [PATCH] fixup! KVM: arm64: nv: Publish emulated timer
>> interrupt state
>> in the in-memory state
>>
>> ---
>> arch/arm64/kvm/arch_timer.c | 32 +++++++++++++-------------------
>> 1 file changed, 13 insertions(+), 19 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/arch_timer.c b/arch/arm64/kvm/arch_timer.c
>> index 895f09658ef83..91bda986c344b 100644
>> --- a/arch/arm64/kvm/arch_timer.c
>> +++ b/arch/arm64/kvm/arch_timer.c
>> @@ -432,25 +432,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu
>> *vcpu, bool new_level,
>> {
>> int ret;
>> - /*
>> - * Paper over NV2 brokenness by publishing the interrupt status
>> - * bit. This still results in a poor quality of emulation (guest
>> - * writes will have no effect until the next exit).
>> - *
>> - * But hey, it's fast, right?
>> - */
>> - if (is_hyp_ctxt(vcpu) &&
>> - (timer_ctx == vcpu_vtimer(vcpu) || timer_ctx ==
>> vcpu_ptimer(vcpu))) {
>> - u32 ctl = timer_get_ctl(timer_ctx);
>> -
>> - if (new_level)
>> - ctl |= ARCH_TIMER_CTRL_IT_STAT;
>> - else
>> - ctl &= ~ARCH_TIMER_CTRL_IT_STAT;
>> -
>> - timer_set_ctl(timer_ctx, ctl);
>> - }
>> -
>> timer_ctx->irq.level = new_level;
>> trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_irq(timer_ctx),
>> timer_ctx->irq.level);
>> @@ -471,6 +452,19 @@ static void timer_emulate(struct
>> arch_timer_context *ctx)
>> trace_kvm_timer_emulate(ctx, should_fire);
>> + /*
>> + * Paper over NV2 brokenness by publishing the interrupt status
>> + * bit. This still results in a poor quality of emulation (guest
>> + * writes will have no effect until the next exit).
>> + *
>> + * But hey, it's fast, right?
>> + */
>> + if (is_hyp_ctxt(ctx->vcpu)) {
>> + unsigned long val = timer_get_ctl(ctx);
>> + __assign_bit(__ffs(ARCH_TIMER_CTRL_IT_STAT), &val, should_fire);
>> + timer_set_ctl(ctx, val);
>> + }
>> +
>> if (should_fire != ctx->irq.level) {
>> kvm_timer_update_irq(ctx->vcpu, should_fire, ctx);
>> return;
>
More information about the linux-arm-kernel
mailing list