[PATCH v2 13/17] KVM: arm64: nv: Add SW walker for AT S1 emulation
Alexandru Elisei
alexandru.elisei at arm.com
Mon Aug 12 08:11:02 PDT 2024
Hi Marc,
On Sat, Aug 10, 2024 at 11:16:15AM +0100, Marc Zyngier wrote:
> Hi Alex,
>
> On Fri, 09 Aug 2024 13:43:33 +0100,
> Alexandru Elisei <alexandru.elisei at arm.com> wrote:
> >
> > Hi Marc,
> >
> > Finally managed to go through the patch. Bunch of nitpicks from me (can be
> > safely ignored), and some corner cases where KVM deviates from the spec.
>
> Thanks for taking the time to go through this mess.
>
> [...]
>
> > > + break;
> > > + default:
> > > + return (vcpu_el2_e2h_is_set(vcpu) &&
> > > + vcpu_el2_tge_is_set(vcpu)) ? TR_EL20 : TR_EL10;
> >
> > This also looks correct to me. Following the pseudocode was not trivial, so I'm
> > leaving this here in case I made a mistake somewhere.
> >
> > For the S1E0* variants: AArch64.AT(el_in=EL0) => TranslationRegime(el=EL0) =>
> > ELIsInHost(el=EL0); ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}, and in this
> > case TranslationRegime(el=EL0) returns Regime_EL20, otherwise Regime_EL10.
> >
> > For the S1E1* variants: AArch64.AT(el_in=EL1), where:
> >
> > - if ELIsInHost(el=EL0) is true, then 'el' is changed to EL2, where
> > ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}. In this case,
> > TranslationRegime(el=EL2) will return Regime_EL20.
> >
> > - if ELIsInHost(el=EL0) is false, then 'el' remains EL1, and
> > TranslationRegime(el=EL1) returns Regime_EL10.
>
> Yup. Somehow, the C version is far easier to understand!
>
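Agreed. For the archives, here's roughly what the C version boils down to
(a sketch only; the S1E2* opcodes in the first arm are my assumption, the
default arm is quoted above):

	static enum trans_regime compute_translation_regime(struct kvm_vcpu *vcpu,
							    u32 op)
	{
		switch (op) {
		case OP_AT_S1E2R:
		case OP_AT_S1E2W:
			/* AT S1E2*: Regime_EL20 if E2H is set, Regime_EL2 otherwise */
			return vcpu_el2_e2h_is_set(vcpu) ? TR_EL20 : TR_EL2;
		default:
			/* AT S1E0* and S1E1*: Regime_EL20 iff {E2H,TGE} == {1,1} */
			return (vcpu_el2_e2h_is_set(vcpu) &&
				vcpu_el2_tge_is_set(vcpu)) ? TR_EL20 : TR_EL10;
		}
	}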
> >
> > > + }
> > > +}
> > > +
> > > +static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
> > > + struct s1_walk_result *wr, u64 va)
> > > +{
> > > + u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > > + unsigned int stride, x;
> > > + bool va55, tbi, lva, as_el0;
> > > +
> > > + wi->regime = compute_translation_regime(vcpu, op);
> > > + as_el0 = (op == OP_AT_S1E0R || op == OP_AT_S1E0W);
> > > +
> > > + va55 = va & BIT(55);
> > > +
> > > + if (wi->regime == TR_EL2 && va55)
> > > + goto addrsz;
> > > +
> > > + wi->s2 = wi->regime == TR_EL10 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> >
> > According to AArch64.NSS2TTWParams(), stage 2 is enabled if HCR_EL2.VM or
> > HCR_EL2.DC.
> >
> > From the description of HCR_EL2.DC, when the bit is set:
> >
> > 'The PE behaves as if the value of the HCR_EL2.VM field is 1 for all purposes
> > other than returning the value of a direct read of HCR_EL2.'
>
> Yup, good point. And as you noticed further down, the HCR_EL2.DC
> handling is currently a mess.
>
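For the record, I'm guessing the fixed line ends up as something like:

	wi->s2 = wi->regime == TR_EL10 &&
		 (__vcpu_sys_reg(vcpu, HCR_EL2) & (HCR_VM | HCR_DC));

i.e. treating DC as VM when deciding whether stage 2 is in play.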
> [...]
>
> > > + /* Let's put the MMU disabled case aside immediately */
> > > + if (!(sctlr & SCTLR_ELx_M) ||
> > > + (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> >
> > I think the condition doesn't match AArch64.S1Enabled():
> >
> > - if regime is Regime_EL20 or Regime_EL2, then S1 is disabled if and only if
> > SCTLR_EL2.M is not set; it doesn't matter if HCR_EL2.DC is set, because "[..]
> > this field has no effect on the EL2, EL2&0, and EL3 translation regimes."
> > (HCR_EL2.DC bit field description).
> >
> > - if regime is Regime_EL10, then S1 is disabled if SCTLR_EL1.M == 0 or
> > HCR_EL2.TGE = 0 or the effective value of HCR_EL2.DC* is 0.
> >
> > From the description of HCR_EL2.TGE, when the bit is set:
> >
> > 'If EL1 is using AArch64, the SCTLR_EL1.M field is treated as being 0 for all
> > purposes other than returning the result of a direct read of SCTLR_EL1.'
> >
> > From the description of HCR_EL2.DC, when the bit is set:
> >
> > 'When EL1 is using AArch64, the PE behaves as if the value of the SCTLR_EL1.M
> > field is 0 for all purposes other than returning the value of a direct read of
> > SCTLR_EL1.'
> >
> > *The description of HCR_EL2.DC states:
> >
> > 'When the Effective value of HCR_EL2.{E2H,TGE} is {1,1}, this field behaves as
> > 0 for all purposes other than a direct read of the value of this field.'
> >
> > But if {E2H,TGE} = {1,1} then the regime is Regime_EL20, which ignores the DC
> > bit. If we're looking at the DC bit, then that means that the regime is EL10,
> > ({E2H,TGE} != {1,1} in compute_translation_regime()) and the effective value of
> > HCR_EL2.DC is the same as the DC bit.
>
> Yup. That's what I've stashed on top:
>
> @@ -136,12 +137,22 @@ static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
> va = (u64)sign_extend64(va, 55);
>
> /* Let's put the MMU disabled case aside immediately */
> - if (!(sctlr & SCTLR_ELx_M) ||
> - (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> + switch (wi->regime) {
> + case TR_EL10:
> + if (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)
> + wr->level = S1_MMU_DISABLED;
In compute_translation_regime(), for AT instructions other than AT S1E2*, when
{E2H,TGE} = {0,1}, regime is Regime_EL10. As far as I can tell, when regime is
Regime_EL10 and TGE is set, stage 1 is disabled, according to
AArch64.S1Enabled() and the description of the TGE bit.
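So I think the TR_EL10 case also needs to look at TGE, something like
(untested):

	case TR_EL10:
		/* TGE makes SCTLR_EL1.M behave as 0, just like DC */
		if (__vcpu_sys_reg(vcpu, HCR_EL2) & (HCR_DC | HCR_TGE))
			wr->level = S1_MMU_DISABLED;
		fallthrough;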
> + fallthrough;
> + case TR_EL2:
> + case TR_EL20:
> + if (!(sctlr & SCTLR_ELx_M))
> + wr->level = S1_MMU_DISABLED;
> + break;
> + }
> +
> + if (wr->level == S1_MMU_DISABLED) {
> if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
> goto addrsz;
>
> - wr->level = S1_MMU_DISABLED;
> wr->pa = va;
> return 0;
> }
>
> [...]
>
> > > + /* R_PLCGL, R_YXNYW */
> > > + if ((!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47) &&
> > > + wi->txsz > 39) ||
> > > + (wi->pgshift == 16 && wi->txsz > 47) || wi->txsz > 48)
> > > + goto transfault_l0;
> >
> > It's probably just me, but I find this hard to parse. The first condition
> > (!FEAT_TTST && TxSZ > 39) can evaluate to false for three different reasons,
> > yet the other two conditions are only relevant when FEAT_TTST is implemented
> > and wi->txsz is greater than 39.
> >
> > I find this easier to read:
> >
> > if (!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47)) {
> > if (wi->txsz > 39)
> > goto transfault_l0;
> > } else {
> > if (wi->txsz > 48 || (wi->pgshift == 16 && wi->txsz > 47))
> > goto transfault_l0;
> > }
> >
> > What do you think?
>
> Sure, happy to rewrite it like this if it helps.
Cool, thanks.
>
> [...]
>
> > > + /* R_YYVYV, I_THCZK */
> > > + if ((!va55 && va > GENMASK(ia_bits - 1, 0)) ||
> > > + (va55 && va < GENMASK(63, ia_bits)))
> > > + goto transfault_l0;
> > > +
> > > + /* R_ZFSYQ */
> >
> > This is rather pedantic, but I think that should be I_ZFSYQ.
>
> Well, these references are supposed to help. If they are incorrect,
> they serve no purpose. Thanks for spotting this.
>
> [...]
>
> > > + if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
> > > + goto transfault;
> >
> > The check for block mapping at level 3 is replicated below, when the level of
> > the block is checked for correctness.
> >
> > Also, would you consider rewriting the valid descriptor check to
> > (desc & BIT(0)), to match the block level check?
> >
> > > +
> > > + /* We found a leaf, handle that */
> > > + if ((desc & 3) == 1 || level == 3)
> > > + break;
> >
> > Here we already know that (desc & 1) is true; do you think rewriting the
> > check to !(desc & BIT(1)), again matching the block level check, would be
> > a good idea?
>
> Other than the BIT() stuff, I'm not completely clear what you are
> asking for. Are you objecting to the fact that the code is slightly
> redundant? If so, I may be able to clean things up for you:
Yes, I was referring to the fact that the code is slightly redundant.
>
> @@ -309,13 +323,19 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> else
> desc = le64_to_cpu((__force __le64)desc);
>
> - if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
> + /* Invalid descriptor */
> + if (!(desc & BIT(0)))
> goto transfault;
>
> - /* We found a leaf, handle that */
> - if ((desc & 3) == 1 || level == 3)
> + /* Block mapping, check validity down the line */
> + if (!(desc & BIT(1)))
> + break;
> +
> + /* Page mapping */
> + if (level == 3)
> break;
>
> + /* Table handling */
> if (!wi->hpd) {
> wr->APTable |= FIELD_GET(S1_TABLE_AP, desc);
> wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);
>
> Do you like this one better? Each bit only gets tested once.
	switch (desc & GENMASK_ULL(1, 0)) {
	case 0b00:
	case 0b10:
		/* Invalid descriptor */
		goto transfault;
	case 0b01:
		/* Block mapping, validity checked down the line */
		break;
	default:
		/* Page at level 3, table otherwise */
		if (level == 3)
			break;
	}
Is this better? Perhaps slightly easier to match against the descriptor
layouts, although the leaf cases would still need a goto or a flag to
actually exit the walk loop (break only leaves the switch), so I'm not sure
it's an improvement over your suggestion. Up to you, no point in
bikeshedding over it.
>
> [...]
>
> > > +
> > > + if (check_output_size(desc & GENMASK(47, va_bottom), wi))
> > > + goto addrsz;
> > > +
> > > + wr->failed = false;
> > > + wr->level = level;
> > > + wr->desc = desc;
> > > + wr->pa = desc & GENMASK(47, va_bottom);
> > > + if (va_bottom > 12)
> > > + wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> >
> > I'm looking at StageOA(), and I don't think this matches what happens when the
> > contiguous bit is set and the contiguous OA isn't aligned as per Table D8-98.
> > Yes, I know, that's something super niche and unlikely to happen in practice.
> >
> > Let's assume 4K pages, level = 3 (so va_bottom = 12), the first page in the
> > contiguous OA range is 0x23_000 (so not aligned to 64K), and the input address
> > that maps to the first page from the contiguous OA range is 0xf0_000 (aligned to
> > 64K).
> >
> > According to the present code:
> >
> > wr->pa = 0x23_000
> >
> > According to StageOA():
> >
> > tsize = 12
> > csize = 4
> > oa = 0x23_000 & GENMASK_ULL(47, 16) | 0xf0_000 & GENMASK_ULL(15, 0) = 0x20_000
> > wr->pa = oa & ~GENMASK_ULL(11, 0) = 0x20_000
> >
> > If stage 1 is correctly programmed and the OA is aligned to the required
> > alignment, the two outputs match.
>
> Huh, that's another nice catch. I had the (dumb) idea that if we
> didn't use the Contiguous bit as a TLB hint, we didn't need to do
> anything special when it came to the alignment of the OA.
>
> But clearly, the spec says that this alignment must be honoured, and
> there is no way out of it. Which means that the S2 walker also has a
> bit of a problem on that front.
>
> > On a different topic, but still regarding wr->pa. I guess the code aligns wr->pa
> > to 4K because that's how the OA in PAR_EL1 is reported.
> >
> > Would it make more sense to have wr->pa represent the real output address, i.e,
> > also contain the 12 least significant bits of the input address? It wouldn't
> > change how PAR_EL1 is computed (bits 11:0 are already masked out), but it would
> > make wr->pa consistent with what the output address of a given input address
> > should be (according to StageOA()).
>
> Yup. That'd be consistent with the way wr->pa is reported when S1 is disabled.
>
> I ended up with this (untested):
>
> @@ -354,12 +374,24 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> if (check_output_size(desc & GENMASK(47, va_bottom), wi))
> goto addrsz;
>
> + if (desc & PTE_CONT) {
> + switch (BIT(wi->pgshift)) {
> + case SZ_4K:
> + va_bottom += 4;
> + break;
> + case SZ_16K:
> + va_bottom += level == 2 ? 5 : 7;
> + break;
> + case SZ_64K:
> + va_bottom += 5;
> + }
> + }
> +
Matches ContiguousSize(): 16 contiguous entries for 4K (4 bits of VA), 32 at
level 2 and 128 at level 3 for 16K (5 and 7 bits), 32 for 64K (5 bits).
> wr->failed = false;
> wr->level = level;
> wr->desc = desc;
> wr->pa = desc & GENMASK(47, va_bottom);
> - if (va_bottom > 12)
> - wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> + wr->pa |= va & GENMASK_ULL(va_bottom - 1, 0);
Matches StageOA().
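Running the earlier example through the new code (4K pages, level 3,
PTE_CONT set, so va_bottom becomes 16):

	wr->pa = (0x23000 & GENMASK(47, 16)) | (0xf0000 & GENMASK_ULL(15, 0))
	       = 0x20000

which is exactly what StageOA() computes.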
>
> return 0;
>
>
> Note that I'm still checking for the address size before the
> contiguous bit alignment, as per R_JHQPP.
Nicely spotted.
>
> [...]
>
> > > + if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > > + par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> > > + par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> >
> > s/0b10/ATTR_OSH ?
> >
> > > + } else {
> > > + par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> > > + MEMATTR(WbRaWa, WbRaWa));
> > > + par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> >
> > s/0b00/ATTR_NSH ?
> >
> > > + }
> >
> > HCR_EL2.DC applies only to Regime_EL10, I think the bit should be ignored for
> > the EL2 and EL20 regimes.
>
> Yup, now fixed.
>
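Presumably by making the DC check conditional on the regime, i.e. something
along the lines of (just guessing at the shape of the fix):

	if (regime != TR_EL10 ||
	    !(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
		par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0);	 /* nGnRnE */
		par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
	} else {
		...
	}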
> >
> > > + } else {
> > > + u64 mair, sctlr;
> > > + u8 sh;
> > > +
> > > + mair = (regime == TR_EL10 ?
> > > + vcpu_read_sys_reg(vcpu, MAIR_EL1) :
> > > + vcpu_read_sys_reg(vcpu, MAIR_EL2));
> > > +
> > > + mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> > > + mair &= 0xff;
> > > +
> > > + sctlr = (regime == TR_EL10 ?
> > > + vcpu_read_sys_reg(vcpu, SCTLR_EL1) :
> > > + vcpu_read_sys_reg(vcpu, SCTLR_EL2));
> > > +
> > > + /* Force NC for memory if SCTLR_ELx.C is clear */
> > > + if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> > > + mair = MEMATTR(NC, NC);
> > > +
> > > + par = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> > > + par |= wr->pa & GENMASK_ULL(47, 12);
> > > +
> > > + sh = compute_sh(mair, wr->desc);
> > > + par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> > > + }
> >
> > When PAR_EL1.F = 0 and !FEAT_RME, bit 11 (NSE) is RES1, according to the
> > description of the register and EncodePAR().
>
> Fixed.
>
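(I assume that's simply setting bit 11 unconditionally in the non-faulting
path, e.g. par |= BIT(11); or a suitably named SYS_PAR_EL1_* mask.)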
> [...]
>
> > > void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> > > {
> > > u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
> > >
> > > + /*
> > > + * If we see a permission fault at S1, then the fast path
> > > + * succeedded. By failing. Otherwise, take a walk on the wild
> > > + * side.
> >
> > This is rather ambiguous. Maybe something along the lines would be more helpful:
> >
> > 'If PAR_EL1 encodes a permission fault, we know for sure that the hardware
> > translation table walker was able to walk the stage 1 tables and there's nothing
> > else that KVM needs to do. On the other hand, if the AT instruction failed for
> > any other reason, then KVM must walk the guest stage 1 tables (and possibly the
> > virtual stage 2 tables) to emulate the instruction.'
>
> Sure. I've adopted a slightly less verbose version of this:
>
> @@ -930,9 +966,12 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
>
> /*
> - * If we see a permission fault at S1, then the fast path
> - * succeedded. By failing. Otherwise, take a walk on the wild
> - * side.
> + * If PAR_EL1 reports that AT failed on a S1 permission fault, we
> + * know for sure that the PTW was able to walk the S1 tables and
> + * there's nothing else to do.
> + *
> + * If AT failed for any other reason, then we must walk the guest S1
> + * to emulate the instruction.
Looks good.
Thanks,
Alex
> */
> if ((par & SYS_PAR_EL1_F) && !par_check_s1_perm_fault(par))
> par = handle_at_slow(vcpu, op, vaddr);
>
>
> I'll retest things over the weekend and post a new version early next
> week.
>
> Thanks again for your review, this is awesome work!
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.