Possible nohz-full/RCU issue in arm64 KVM

Mon Dec 20 06:28:30 PST 2021

On Fri, 17 Dec 2021 13:21:39 +0000,
Mark Rutland <mark.rutland at arm.com> wrote:
> 
> On Fri, Dec 17, 2021 at 12:51:57PM +0100, Nicolas Saenz Julienne wrote:
> > Hi All,
> 
> Hi,
> 
> > arm64's guest entry code does the following:
> > 
> > int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > {
> > 	[...]
> > 
> > 	guest_enter_irqoff();
> > 
> > 	ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> > 
> > 	[...]
> > 
> > 	local_irq_enable();
> > 
> > 	/*
> > 	 * We do local_irq_enable() before calling guest_exit() so
> > 	 * that if a timer interrupt hits while running the guest we
> > 	 * account that tick as being spent in the guest.  We enable
> > 	 * preemption after calling guest_exit() so that if we get
> > 	 * preempted we make sure ticks after that is not counted as
> > 	 * guest time.
> > 	 */
> > 	guest_exit();
> > 	[...]
> > }
> > 
> > 
> > On a nohz-full CPU, guest_{enter,exit}() delimit an RCU extended quiescent
> > state (EQS). Any interrupt happening between local_irq_enable() and
> > guest_exit() should disable that EQS. Now, AFAICT all el0 interrupt handlers
> > do the right thing if triggered in this context, but el1's won't. Is it
> > possible to hit an el1 handler (for example __el1_irq()) there?
> 
> I think you're right that the EL1 handlers can trigger here and
> won't exit the EQS.
> 
> I'm not immediately sure what we *should* do here. What does x86 do
> for an IRQ taken from guest mode? I couldn't spot any handling of
> that case, but I'm not familiar enough with the x86 exception model
> to know if I'm looking in the right place.
> 
> Note that the EL0 handlers *cannot* trigger for an exception taken
> from a guest. We use separate vectors while running a guest (for
> both VHE and nVHE modes), and from the main kernel's PoV we return
> from kvm_call_hyp_ret(). We can only take an IRQ from EL1 *after* that
> returns.
> 
> We *might* need to audit the KVM vector handlers to make sure they're not
> dependent on RCU protection (I assume they're not, but it's possible something
> has leaked into the VHE code).

The *intent* certainly is that whatever is used in the VHE code to
handle exceptions arising whilst running in guest context must be
independent from RCU, if only because we share a bunch with the !VHE
code, and RCU is, unfortunately, not a thing there.
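
To make the original concern concrete (the host-side window between
local_irq_enable() and guest_exit(), as opposed to the guest vectors
discussed above), here is a toy user-space model. It is only a sketch:
every model_* name is made up for illustration, none of it is kernel
API, and it assumes the IRQ handler runs an RCU read-side section.

```c
#include <stdbool.h>

/* Toy user-space model. None of these names are kernel APIs. */
static bool cpu_in_eqs;       /* true while RCU is not watching this CPU */
static int  rcu_misuse_count; /* RCU read-side sections entered while in EQS */

/* Stand-ins for guest_enter_irqoff() / guest_exit() on a nohz-full CPU. */
static void model_guest_enter(void) { cpu_in_eqs = true; }
static void model_guest_exit(void)  { cpu_in_eqs = false; }

/* Stand-in for any code that relies on RCU read-side protection. */
static void model_rcu_read_section(void)
{
	if (cpu_in_eqs)
		rcu_misuse_count++; /* RCU is not watching: this use is unsafe */
}

/* Handler that leaves the EQS around its body and restores it afterwards. */
static void model_irq_eqs_aware(void)
{
	bool was_in_eqs = cpu_in_eqs;

	cpu_in_eqs = false;
	model_rcu_read_section();
	cpu_in_eqs = was_in_eqs;
}

/* Handler that assumes RCU is already watching (the suspected problem). */
static void model_irq_eqs_unaware(void)
{
	model_rcu_read_section();
}

/*
 * Replays the window from the quoted code: an IRQ is taken after
 * local_irq_enable() but before guest_exit(). Returns the number of
 * unsafe RCU uses observed.
 */
static int model_run_window(void (*irq_handler)(void))
{
	rcu_misuse_count = 0;
	model_guest_enter();
	/* ... guest runs and returns, host IRQs are re-enabled ... */
	irq_handler();
	model_guest_exit();
	return rcu_misuse_count;
}
```

In this model, model_run_window(model_irq_eqs_aware) reports no misuse,
while model_run_window(model_irq_eqs_unaware) reports one, which is the
case Nicolas is asking about.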

My most immediate concern is that the VHE/nVHE split now allows all
sorts of instrumentation in VHE, which may rely on RCU. At the very
least, we should make most of the VHE switch code noinstr.
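
As a rough user-space analogy of what that would buy us, the sketch
below uses the compiler's no_instrument_function attribute together
with the -finstrument-functions hooks. This is not the kernel's
definition of noinstr, which does more than this (it also places the
code in a dedicated section and opts out of further tracing and
sanitizer instrumentation); model_noinstr and the model_switch_*
functions are hypothetical names.

```c
/* Hypothetical stand-in for the kernel's noinstr; NOT the kernel definition. */
#define model_noinstr __attribute__((noinline, no_instrument_function))

/* Bumped by the entry hook, but only in -finstrument-functions builds. */
static int hook_calls;

/*
 * Hooks that GCC and Clang call on function entry/exit when the file is
 * built with -finstrument-functions. They must not be instrumented
 * themselves, or they would recurse.
 */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
	(void)this_fn;
	(void)call_site;
	hook_calls++;
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
	(void)this_fn;
	(void)call_site;
}

/* Never calls the hooks, whatever the build flags are. */
static model_noinstr int model_switch_noinstr(int x)
{
	return x + 1;
}

/* Calls the hooks on entry and exit in an instrumented build. */
static __attribute__((noinline)) int model_switch_plain(int x)
{
	return x + 1;
}
```

Built with -finstrument-functions, only model_switch_plain() goes
through the hooks. The analogy is that anything the hooks depend on
(RCU, in our case) is never reached from the annotated function.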

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.