[PATCH] KVM: arm64: Avoid corrupting vCPU context register in guest exit

Will Deacon will at kernel.org
Fri Feb 26 14:05:52 EST 2021


On Fri, Feb 26, 2021 at 06:35:42PM +0000, Marc Zyngier wrote:
> On 2021-02-26 18:12, Will Deacon wrote:
> > Commit 7db21530479f ("KVM: arm64: Restore hyp when panicking in guest
> > context") tracks the currently running vCPU, clearing the pointer to
> > NULL on exit from a guest.
> > 
> > Unfortunately, the use of 'set_loaded_vcpu' clobbers x1 to point at the
> > kvm_hyp_ctxt instead of the vCPU context, causing the subsequent RAS
> > code to go off into the weeds when it saves the DISR assuming that the
> > CPU context is embedded in a struct vCPU.
> > 
> > Leave x1 alone and use x3 as a temporary register instead when clearing
> > the vCPU on the guest exit path.
> > 
> > Cc: Marc Zyngier <maz at kernel.org>
> > Cc: Andrew Scull <ascull at google.com>
> > Cc: <stable at vger.kernel.org>
> > Fixes: 7db21530479f ("KVM: arm64: Restore hyp when panicking in guest
> > context")
> > Suggested-by: Quentin Perret <qperret at google.com>
> > Signed-off-by: Will Deacon <will at kernel.org>
> > ---
> > 
> > This was pretty awful to debug!
> > 
> >  arch/arm64/kvm/hyp/entry.S | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> > index b0afad7a99c6..0c66a1d408fd 100644
> > --- a/arch/arm64/kvm/hyp/entry.S
> > +++ b/arch/arm64/kvm/hyp/entry.S
> > @@ -146,7 +146,7 @@ SYM_INNER_LABEL(__guest_exit, SYM_L_GLOBAL)
> >  	// Now restore the hyp regs
> >  	restore_callee_saved_regs x2
> > 
> > -	set_loaded_vcpu xzr, x1, x2
> > +	set_loaded_vcpu xzr, x2, x3
> > 
> >  alternative_if ARM64_HAS_RAS_EXTN
> >  	// If we have the RAS extensions we can consume a pending error
> 
> Grmbl... How come we have never seen that for the past 5 months,
> including on CPUs that implement RAS?

I think it's probably a combination of (a) not having a massive testing
community, (b) not having tools that would scream about this (e.g. I don't
think you could detect this with KASAN), and (c) the nature of the
corruption being mostly benign in practice.

We found it in pKVM development because the stray DISR write landed on
the vtcr we were restoring when coming out of suspend. The page-table
code then went wonky on the next stage-2 fault: it got the wrong start
level and kept returning -EAGAIN because it thought a table was a leaf.
So even then, the failure mode is horribly subtle.
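
For anyone trying to picture the corruption, here is a rough userspace
model in C. It is only a sketch: the struct names, field names and
layout are made up for illustration and do not match the kernel's real
definitions. The one thing it takes from the patch is the shape of the
bug, namely a store at a fixed offset from a base register that is
assumed to point at a context embedded in a vCPU, but which points at a
standalone hyp context instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; names and layout are illustrative only. */
struct cpu_context {
	uint64_t regs[4];
};

struct vcpu {
	struct cpu_context ctxt;	/* context embedded in the vCPU */
	uint64_t fault_disr;		/* where the RAS path saves DISR */
};

struct hyp_state {
	struct cpu_context hyp_ctxt;	/* standalone, not inside a vcpu */
	uint64_t vtcr;			/* unrelated neighbouring field */
};

/* Fixed offset the exit path applies to whatever the base points at. */
#define DISR_OFF (offsetof(struct vcpu, fault_disr) - offsetof(struct vcpu, ctxt))

/* In this model the neighbour sits exactly where the store lands. */
_Static_assert(offsetof(struct hyp_state, vtcr) == DISR_OFF,
	       "model layout: vtcr must sit at the DISR offset");

/* Models a store of DISR at [base + DISR_OFF]; base plays the role of x1. */
static void save_disr(unsigned char *base, uint64_t disr)
{
	memcpy(base + DISR_OFF, &disr, sizeof(disr));
}

/* Base points at a context embedded in a vcpu: DISR is saved correctly. */
static uint64_t store_with_vcpu_base(void)
{
	struct vcpu v = { .fault_disr = 0 };

	save_disr((unsigned char *)&v, 0xdead);	/* ctxt is at offset 0 */
	return v.fault_disr;
}

/* Base points at the hyp context: the same store overwrites vtcr. */
static uint64_t store_with_hyp_base(void)
{
	struct hyp_state hyp = { .vtcr = 0x1234 };

	save_disr((unsigned char *)&hyp, 0xdead); /* hyp_ctxt is at offset 0 */
	return hyp.vtcr;
}
```

Both helpers return 0xdead: the first because DISR was saved where it
belongs, the second because the neighbouring field was silently
overwritten. Nothing faults at the time of the store, which is roughly
why this sort of thing only shows up later, somewhere unrelated.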

Will


