[PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation

Alexandru Elisei alexandru.elisei at arm.com
Mon Jul 15 08:30:19 PDT 2024


Hi Marc,

On Thu, Jul 11, 2024 at 01:16:42PM +0100, Marc Zyngier wrote:
> On Thu, 11 Jul 2024 11:56:13 +0100,
> Alexandru Elisei <alexandru.elisei at arm.com> wrote:
> > 
> > Hi,
> > 
> > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > In order to plug the brokenness of our current AT implementation,
> > > we need a SW walker that is going to... err.. walk the S1 tables
> > > and tell us what it finds.
> > > 
> > > Of course, it builds on top of our S2 walker, and share similar
> > > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > > it is able to bring back pages that have been otherwise evicted.
> > > 
> > > This is then plugged in the two AT S1 emulation functions as
> > > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > > [..]
> > >  	switch (op) {
> > >  	case OP_AT_S1E1RP:
> > >  	case OP_AT_S1E1WP:
> > > +		retry_slow = false;
> > >  		fail = check_at_pan(vcpu, vaddr, &par);
> > >  		break;
> > >  	default:
> > >  		goto nopan;
> > >  	}
> > 
> > For context, this is what check_at_pan() does:
> > 
> > static int check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
> > {
> >         u64 par_e0;
> >         int error;
> > 
> >         /*
> >          * For PAN-involved AT operations, perform the same translation,
> >          * using EL0 this time. Twice. Much fun.
> >          */
> >         error = __kvm_at(OP_AT_S1E0R, vaddr);
> >         if (error)
> >                 return error;
> > 
> >         par_e0 = read_sysreg_par();
> >         if (!(par_e0 & SYS_PAR_EL1_F))
> >                 goto out;
> > 
> >         error = __kvm_at(OP_AT_S1E0W, vaddr);
> >         if (error)
> >                 return error;
> > 
> >         par_e0 = read_sysreg_par();
> > out:
> >         *res = par_e0;
> >         return 0;
> > }
> > 
> > I'm having a hard time understanding why KVM is doing both AT S1E0R and AT S1E0W
> > regardless of the type of the access (read/write) in the PAN-aware AT
> > instruction. Would you mind elaborating on that?
> 
> Because that's the very definition of an AT S1E1{W,R}P instruction
> when PAN is set. If *any* EL0 permission is set, then the translation
> must equally fail. Just like a load or a store from EL1 would fail if
> any EL0 permission is set when PSTATE.PAN is set.
> 
> Since we cannot check for both permissions at once, we do it twice.
> It is worth noting that we don't quite handle the PAN3 case correctly
> (because we can't retrieve the *execution* property using AT). I'll
> add that to the list of stuff to fix.
> 
> > 
> > > +	if (fail) {
> > > +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> > > +		goto nopan;
> > > +	}
> > > [..]
> > > +	if (par & SYS_PAR_EL1_F) {
> > > +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> > > +
> > > +		/*
> > > +		 * If we get something other than a permission fault, we
> > > +		 * need to retry, as we're likely to have missed in the PTs.
> > > +		 */
> > > +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> > > +			retry_slow = true;
> > > +	} else {
> > > +		/*
> > > +		 * The EL0 access succeded, but we don't have the full
> > > +		 * syndrom information to synthetize the failure. Go slow.
> > > +		 */
> > > +		retry_slow = true;
> > > +	}
> > 
> > This is what PSTATE.PAN controls:
> > 
> > If the Effective value of PSTATE.PAN is 1, then a privileged data access from
> > any of the following Exception levels to a virtual memory address that is
> > accessible to data accesses at EL0 generates a stage 1 Permission fault:
> > 
> > - A privileged data access from EL1.
> > - If HCR_EL2.E2H is 1, then a privileged data access from EL2.
> > 
> > With that in mind, I am really struggling to understand the logic.
> 
> I don't quite see what you don't understand, you'll have to be more
> precise. Are you worried about the page tables we're looking at, the
> value of PSTATE.PAN, the permission fault, or something else?
> 
> It also doesn't help that you're looking at the patch that contains
> the integration with the slow-path, which is pretty hard to read (I
> have a reworked version that's a bit better). You probably want to
> look at the "fast" path alone.

I was referring to checking both unprivileged read and write permissions.

And you are right, sorry, I managed to get myself terribly confused. For
completeness sake, this matches AArch64.S1DirectBasePermissions(), where if PAN
&& (UnprivRead || UnprivWrite) then PrivRead = False and PrivWrite = False. So
you need to check that both UnprivRead and UnprivWrite are false for the PAN
variants of AT to succeed.

> 
> > 
> > If AT S1E0{R,W} (from check_at_pan()) failed, doesn't that mean that the virtual
> > memory address is not accessible to EL0? Add that to the fact that the AT
> > S1E1{R,W} (from the beginning of __kvm_at_s1e01()) succeeded, doesn't that mean
> > that AT S1E1{R,W}P should succeed, and furthermore the PAR_EL1 value should be
> > the one KVM got from AT S1E1{R,W}?
> 
> There are plenty of ways for AT S1E0 to fail when AT S1E1 succeeded:
> 
> - no EL0 permission: that's the best case, and the PAR_EL1 obtained
>   from the AT S1E1 is the correct one. That's what we return.

Yes, that is correct, the place where VCPUs PAR_EL1 register is set is far
enough from this code that I didn't make the connection.

> 
> - The EL0 access failed, but for another reason than a permission
>   fault. This contradicts the EL1 walk, and is a sure sign that
>   someone is playing behind our back. We fail.
> 
> - exception from AT S1E0: something went wrong (again the guest
>   playing with the PTs behind our back). We fail as well.
> 
> Do you at least agree with these as goals? If you do, what in
> the implementation does not satisfy these goals? If you don't, what in
> these goals seem improper to you?

I agree with the goals.

In this patch, if I'm reading the code right (and I'm starting to doubt myself)
if PAR_EL1.F is set and PAR_EL1 doesn't indicate a permissions fault, then KVM
falls back to walking the S1 tables:

        if (par & SYS_PAR_EL1_F) {
                u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);

                /*
                 * If we get something other than a permission fault, we
                 * need to retry, as we're likely to have missed in the PTs.
                 */
                if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
                        retry_slow = true;
	}

I suppose that's because KVM cannot distinguish between two very different
reasons for AT failing: 1, because of something being wrong with the stage 1
tables when the AT S1E0* instruction was executed and 2, because of missing
entries at stage 2, as per the comment. Is that correct?

Thanks,
Alex



More information about the linux-arm-kernel mailing list