[PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
Alexandru Elisei
alexandru.elisei at arm.com
Mon Jul 15 08:30:19 PDT 2024
Hi Marc,
On Thu, Jul 11, 2024 at 01:16:42PM +0100, Marc Zyngier wrote:
> On Thu, 11 Jul 2024 11:56:13 +0100,
> Alexandru Elisei <alexandru.elisei at arm.com> wrote:
> >
> > Hi,
> >
> > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > In order to plug the brokenness of our current AT implementation,
> > > we need a SW walker that is going to... err.. walk the S1 tables
> > > and tell us what it finds.
> > >
> > > Of course, it builds on top of our S2 walker, and share similar
> > > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > > it is able to bring back pages that have been otherwise evicted.
> > >
> > > This is then plugged in the two AT S1 emulation functions as
> > > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > > [..]
> > > switch (op) {
> > > case OP_AT_S1E1RP:
> > > case OP_AT_S1E1WP:
> > > + retry_slow = false;
> > > fail = check_at_pan(vcpu, vaddr, &par);
> > > break;
> > > default:
> > > goto nopan;
> > > }
> >
> > For context, this is what check_at_pan() does:
> >
> > static int check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
> > {
> > u64 par_e0;
> > int error;
> >
> > /*
> > * For PAN-involved AT operations, perform the same translation,
> > * using EL0 this time. Twice. Much fun.
> > */
> > error = __kvm_at(OP_AT_S1E0R, vaddr);
> > if (error)
> > return error;
> >
> > par_e0 = read_sysreg_par();
> > if (!(par_e0 & SYS_PAR_EL1_F))
> > goto out;
> >
> > error = __kvm_at(OP_AT_S1E0W, vaddr);
> > if (error)
> > return error;
> >
> > par_e0 = read_sysreg_par();
> > out:
> > *res = par_e0;
> > return 0;
> > }
> >
> > I'm having a hard time understanding why KVM is doing both AT S1E0R and AT S1E0W
> > regardless of the type of the access (read/write) in the PAN-aware AT
> > instruction. Would you mind elaborating on that?
>
> Because that's the very definition of an AT S1E1{W,R}P instruction
> when PAN is set. If *any* EL0 permission is set, then the translation
> must equally fail. Just like a load or a store from EL1 would fail if
> any EL0 permission is set when PSTATE.PAN is set.
>
> Since we cannot check for both permissions at once, we do it twice.
> It is worth noting that we don't quite handle the PAN3 case correctly
> (because we can't retrieve the *execution* property using AT). I'll
> add that to the list of stuff to fix.
>
> >
> > > + if (fail) {
> > > + vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> > > + goto nopan;
> > > + }
> > > [..]
> > > + if (par & SYS_PAR_EL1_F) {
> > > + u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> > > +
> > > + /*
> > > + * If we get something other than a permission fault, we
> > > + * need to retry, as we're likely to have missed in the PTs.
> > > + */
> > > + if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> > > + retry_slow = true;
> > > + } else {
> > > + /*
> > > + * The EL0 access succeded, but we don't have the full
> > > + * syndrom information to synthetize the failure. Go slow.
> > > + */
> > > + retry_slow = true;
> > > + }
> >
> > This is what PSTATE.PAN controls:
> >
> > If the Effective value of PSTATE.PAN is 1, then a privileged data access from
> > any of the following Exception levels to a virtual memory address that is
> > accessible to data accesses at EL0 generates a stage 1 Permission fault:
> >
> > - A privileged data access from EL1.
> > - If HCR_EL2.E2H is 1, then a privileged data access from EL2.
> >
> > With that in mind, I am really struggling to understand the logic.
>
> I don't quite see what you don't understand, you'll have to be more
> precise. Are you worried about the page tables we're looking at, the
> value of PSTATE.PAN, the permission fault, or something else?
>
> It also doesn't help that you're looking at the patch that contains
> the integration with the slow-path, which is pretty hard to read (I
> have a reworked version that's a bit better). You probably want to
> look at the "fast" path alone.
I was referring to checking both unprivileged read and write permissions.
And you are right, sorry, I managed to get myself terribly confused. For
completeness sake, this matches AArch64.S1DirectBasePermissions(), where if PAN
&& (UnprivRead || UnprivWrite) then PrivRead = False and PrivWrite = False. So
you need to check that both UnprivRead and UnprivWrite are false for the PAN
variants of AT to succeed.
>
> >
> > If AT S1E0{R,W} (from check_at_pan()) failed, doesn't that mean that the virtual
> > memory address is not accessible to EL0? Add that to the fact that the AT
> > S1E1{R,W} (from the beginning of __kvm_at_s1e01()) succeeded, doesn't that mean
> > that AT S1E1{R,W}P should succeed, and furthermore the PAR_EL1 value should be
> > the one KVM got from AT S1E1{R,W}?
>
> There are plenty of ways for AT S1E0 to fail when AT S1E1 succeeded:
>
> - no EL0 permission: that's the best case, and the PAR_EL1 obtained
> from the AT S1E1 is the correct one. That's what we return.
Yes, that is correct, the place where VCPUs PAR_EL1 register is set is far
enough from this code that I didn't make the connection.
>
> - The EL0 access failed, but for another reason than a permission
> fault. This contradicts the EL1 walk, and is a sure sign that
> someone is playing behind our back. We fail.
>
> - exception from AT S1E0: something went wrong (again the guest
> playing with the PTs behind our back). We fail as well.
>
> Do you at least agree with these as goals? If you do, what in
> the implementation does not satisfy these goals? If you don't, what in
> these goals seem improper to you?
I agree with the goals.
In this patch, if I'm reading the code right (and I'm starting to doubt myself)
if PAR_EL1.F is set and PAR_EL1 doesn't indicate a permissions fault, then KVM
falls back to walking the S1 tables:
if (par & SYS_PAR_EL1_F) {
u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
/*
* If we get something other than a permission fault, we
* need to retry, as we're likely to have missed in the PTs.
*/
if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
retry_slow = true;
}
I suppose that's because KVM cannot distinguish between two very different
reasons for AT failing: 1, because of something being wrong with the stage 1
tables when the AT S1E0* instruction was executed and 2, because of missing
entries at stage 2, as per the comment. Is that correct?
Thanks,
Alex
More information about the linux-arm-kernel
mailing list