[PATCH v2 00/19] arm64: Enable LPA2 support for 4k and 16k pages

Ard Biesheuvel ardb at kernel.org
Thu Dec 1 05:43:14 PST 2022


On Thu, 1 Dec 2022 at 13:22, Ryan Roberts <ryan.roberts at arm.com> wrote:
>
> Hi Ard,
>
> I wanted to provide a quick update on the debugging I've been doing from my end.
> See below...
>
>
> On 29/11/2022 16:56, Ard Biesheuvel wrote:
> > On Tue, 29 Nov 2022 at 17:36, Ryan Roberts <ryan.roberts at arm.com> wrote:
> >>
> >> On 29/11/2022 15:47, Ard Biesheuvel wrote:
> >>> On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts at arm.com> wrote:
> >>>>
> >>>> Hi Ard,
> >>>>
> >>>> As promised, I ran your patch set through my test set up and have noticed a few
> >>>> issues. Sorry it turned into rather a long email...
> >>>>
> >>>
> >>> No worries, and thanks a lot for going through the trouble.
> >>>
> >>>> First, a quick explanation of the test suite: For all valid combinations of the
> >>>> below parameters, boot the host kernel on the FVP, then boot the guest kernel in
> >>>> a VM, check that booting succeeds all the way to the guest shell, then power off
> >>>> the guest followed by the host to check that shutdown is clean.
> >>>>
> >>>> Parameters:
> >>>>  - hw_pa:               [48, lpa, lpa2]
> >>>>  - hw_va:               [48, 52]
> >>>>  - kvm_mode:            [vhe, nvhe, protected]
> >>>>  - host_page_size:      [4KB, 16KB, 64KB]
> >>>>  - host_pa:             [48, 52]
> >>>>  - host_va:             [48, 52]
> >>>>  - host_load_addr:      [low, high]
> >>>>  - guest_page_size:     [64KB]
> >>>>  - guest_pa:            [52]
> >>>>  - guest_va:            [52]
> >>>>  - guest_load_addr:     [low, high]
> >>>>
> >>>> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space.
> >>>> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means
> >>>> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to
> >>>> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at
> >>>> 4GB. The FVP only allows RAM at certain locations and having a contiguous region
> >>>> cross the 48 bit boundary is not an option. So I chose these values to ensure
> >>>> that the linear map size is within 51 bits, which is a requirement for
> >>>> nvhe/protected mode kvm.
> >>>>
> >>>> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the
> >>>> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's
> >>>> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately
> >>>> fixed up for the 2 different memory layouts.
> >>>>
> >>>> Given this was designed to test my KVM changes, I was previously running these
> >>>> without the host_load_addr=high option for the 4k and 16k host kernels (since
> >>>> this requires your patch set). In this situation there are 132 valid configs and
> >>>> all of them pass.
> >>>>
> >>>> I then rebased my changes on top of yours and added in the host_load_addr=high
> >>>> option. Now there are 186 valid configs, 64 of which fail. (some of these
> >>>> failures are regressions). From a quick initial triage, there are 3 failure modes:
> >>>>
> >>>>
> >>>> 1) 18 FAILING TESTS: Host kernel never outputs anything to console
> >>>>
> >>>>   TF-A runs successfully, says it is jumping to the kernel, then nothing further
> >>>>   is seen. I'm pretty confident that the blobs are loaded into memory correctly
> >>>>   because the same framework is working for the other configs (including 64k
> >>>>   kernel loaded into high memory). This affects all configs where a host kernel
> >>>>   with 4k or 16k pages built with LPA2 support is loaded into high memory.
> >>>>
> >>>
> >>> Not sure how to interpret this in combination with your explanation
> >>> above, but if 'loaded high' means that the kernel itself is not in
> >>> 48-bit addressable physical memory, this failure is expected.
> >>
> >> Sorry - my wording was confusing. host_load_addr=high means what I said at the
> >> top; the kernel image is loaded at 0x880000000000 in a block of memory sized to
> >> hold the kernel image only (actually it's forward aligned to 2MB). The dtb and
> >> initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing
> >> this is to ensure that when I create a VM, the memory used for it (at least the
> >> vast majority) is coming from the region at 52 bits. I want to do this to prove
> >> that the stage2 implementation is correctly handling the 52 OA case.
> >>
> >>>
> >>> Given that we have no way of informing the bootloader or firmware
> >>> whether or not a certain kernel image supports running from such a
> >>> high offset, it must currently assume it cannot. We've just queued up
> >>> a documentation fix to clarify this in the boot protocol, i.e., that
> >>> the kernel must be loaded in 48-bit addressable physical memory.
> >>
> >> OK, but I think what I'm doing complies with this. Unless the DTB also has to be
> >> below 48 bits?
> >>
> >
> > Ahh yes, good point. Yes, this is actually implied but we should
> > clarify this. Or fix it.
> >
> > But the same reasoning applies: currently, a loader cannot know from
> > looking at a certain binary whether or not it supports addressing any
> > memory outside of the 48-bit addressable range, so any asset loading
> > into physical memory is affected by the same limitation.
> >
> > This has implications for other firmware aspects as well, i.e., ACPI tables etc.
> >
> >>>
> >>> The fact that you had to doctor your boot environment to get around
> >>> this kind of proves my point, and unless someone is silly enough to
> >>> ship a SoC that cannot function without this, I don't think we should
> >>> add this support.
> >>>
> >>> I understand how this is an interesting case for completeness from a
> >>> validation pov, but the reality is that adding support for this would
> >>> mean introducing changes amounting to dead code to fragile boot
> >>> sequence code that is already hard to maintain.
> >>
> >> I'm not disagreeing. But I think what I'm doing should conform with the
> >> requirements? (Previously I had the tests set up to just have a single region of
> >> memory above 52 bits and the kernel image was placed there. That works/worked
> >> for the 64KB kernel. But I brought the kernel image to below 48 bits to align
> >> with the requirements of this patch set.)
> >>
> >> If you see an easier way for me to validate 52 bit OAs in the stage 2 (and
> >> ideally hyp stage 1), then I'm all ears!
> >>
> >
> > There is a Kconfig knob CONFIG_ARM64_FORCE_52BIT which was introduced
> > for a similar purpose, i.e., to ensure that the 52-bit range gets
> > utilized. I could imagine adding a similar control for KVM in
> > particular or for preferring allocations from outside the 48-bit PA
> > range in general.
>
> I managed to get these all booting after moving the kernel and dtb to low
> memory. I've kept the initrd in high memory for now, which works fine. (I think
> the initrd memory will get freed and I didn't want any free low memory floating
> around to ensure that kvm guests get high memory).

The DT's early mapping is via the ID map, while the initrd is only
mapped much later via the kernel mappings in TTBR1, so this is why it
works.

However, for the boot protocol, this distinction doesn't really
matter: if the boot stack cannot be certain that a given kernel
image was built to support 52-bit physical addressing, it simply
must never place anything there.

> Once booting, some of these
> get converted to passes, and others remain failures but for other reasons (see
> below).
>
>
> >
> >>>
> >>>>
> >>>> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM
> >>>>
> >>>>   During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
> >>>>   failing tests are configured for protected KVM, and are built with LPA2
> >>>>   support, running on non-LPA2 HW.
> >>>>
> >>>
> >>> I will try to reproduce this locally.
>
> It turns out the same issue is hit when running your patches without mine on
> top. The root cause in both cases is an assumption that kvm_get_parange() makes
> that ID_AA64MMFR0_EL1_PARANGE_MAX will always be 48 for 4KB and 16KB PAGE_SIZE.
> That is no longer true with your patches. It causes the VTCR_EL2 to be
> programmed incorrectly for the host stage2, then on return to the host, bang.
>

Right. I made an attempt at replacing 'u32 level' with 's32 level'
throughout that code, along with some related changes, but I didn't
spot this issue.
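
For anyone following along, the clamp in question is roughly this
(paraphrased from the asm/sysreg.h and asm/kvm_arm.h headers, not
quoted verbatim):

#ifdef CONFIG_ARM64_PA_BITS_52
#define ID_AA64MMFR0_EL1_PARANGE_MAX	ID_AA64MMFR0_EL1_PARANGE_52
#else
#define ID_AA64MMFR0_EL1_PARANGE_MAX	ID_AA64MMFR0_EL1_PARANGE_48
#endif

static inline u64 kvm_get_parange(u64 mmfr0)
{
	u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
				ID_AA64MMFR0_EL1_PARANGE_SHIFT);
	if (parange > ID_AA64MMFR0_EL1_PARANGE_MAX)
		parange = ID_AA64MMFR0_EL1_PARANGE_MAX;

	return parange;
}

Once CONFIG_ARM64_PA_BITS_52 becomes selectable for 4k/16k pages, the
clamp stops limiting parange to 48, so VTCR_EL2.PS ends up describing
an output range that the stage-2 code cannot actually handle yet.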

> This demonstrates that kernel stage1 support for LPA2 depends on kvm support for
> LPA2, since for protected kvm, the host stage2 needs to be able to id map the
> full physical range that the host kernel sees prior to deprivilege. So I don't
> think it's fixable in your series. I have a fix in my series for this.
>

The reference to ID_AA64MMFR0_EL1_PARANGE_MAX should be fixable in
isolation, no? Even if it results in a KVM that cannot use 52-bit PAs
while the host can.
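
Something along these lines might do as a stopgap, I suppose (untested
sketch, assuming KVM's stage 2 keeps using the classic descriptor
format for 4k/16k until it gains LPA2 support):

static inline u64 kvm_get_parange(u64 mmfr0)
{
	u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
				ID_AA64MMFR0_EL1_PARANGE_SHIFT);

	/*
	 * 52-bit output addresses at stage 2 need FEAT_LPA (64k pages)
	 * or FEAT_LPA2 (4k/16k pages). Until KVM's page table code
	 * handles the LPA2 descriptor format, cap what we report at
	 * 48 bits for 4k/16k, even if the host stage 1 was built with
	 * 52-bit PAs.
	 */
	if (PAGE_SIZE != SZ_64K && parange > ID_AA64MMFR0_EL1_PARANGE_48)
		parange = ID_AA64MMFR0_EL1_PARANGE_48;

	if (parange > ID_AA64MMFR0_EL1_PARANGE_MAX)
		parange = ID_AA64MMFR0_EL1_PARANGE_MAX;

	return parange;
}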

> I also found another dependency problem (hit by some of the tests that were
> previously failing at the first issue) where kvm uses its page table library to
> walk a user space page table created by the kernel (see
> get_user_mapping_size()). In the case where the kernel creates an LPA2 page
> table, kvm can't walk it without my patches. I've also added a fix for this to
> my series.
>

OK
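
For reference, the incompatibility boils down to the descriptor
encoding: with FEAT_LPA2 on 4k pages, OA[51:50] live in descriptor
bits [9:8] (which used to carry the shareability field), so a walker
that only understands the classic layout computes a bogus output
address. Illustrative helpers only - these are not the kernel's
actual accessors:

static inline u64 pte_to_phys_classic_4k(u64 pte)
{
	return pte & GENMASK_ULL(47, 12);		/* OA[47:12] */
}

static inline u64 pte_to_phys_lpa2_4k(u64 pte)
{
	return (pte & GENMASK_ULL(49, 12)) |		/* OA[49:12] */
	       ((pte & GENMASK_ULL(9, 8)) << 42);	/* OA[51:50] */
}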

>
> >>>
> >>>>
> >>>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console
> >>>>
> >>>>   Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
> >>>>   There is no error reported, but the guest never outputs anything. Haven't
> >>>>   worked out which config options are common to all failures yet.
> >>>>
> >>>
> >>> This goes a bit beyond what I am currently set up for in terms of
> >>> testing, but I'm happy to help narrow this down.
>
> I don't have a root cause for this yet. I'll try to take a look this afternoon.
> Will keep you posted.
>

Thanks

> >>>
> >>>>
> >>>> Finally, I removed my code, and ran with your patch set as provided. For this I
> >>>> hacked up my test suite to boot the host, and ignore booting a guest. I also
> >>>> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46
> >>>> valid configs here, of which 4 failed. They were all the same failure mode as
> >>>> (1) above. Failing configs were:
> >>>>
> >>>> id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
> >>>> ------------------------------------------------------------------
> >>>> 40  lpa    52     4k              52       52       high
> >>>> 45  lpa    52     16k             52       52       high
> >>>> 55  lpa2   52     4k              52       52       high
> >>>> 60  lpa2   52     16k             52       52       high
> >>>>
> >>>
> >>> Same point as above then, I guess.
> >>>>
> >>>> So on the balance of probabilities, I think failure mode (1) is very likely to
> >>>> be due to a bug in your code. (2) and (3) could be my issue or your issue: I
> >>>> propose to dig into those a bit further and will get back to you on them. I
> >>>> don't plan to look any further into (1).
> >>>>
> >>>
> >>> Thanks again. (1) is expected, and (2) is something I will investigate further.
> >>
>
> Once I have all the tests passing, I'll post my series, then hopefully we can
> move it all forwards as one?
>

That would be great, yes, although my work depends on a sizable rework
of the early boot code that has seen very little review as of yet.

So for the time being, let's keep aligned but let's not put any eggs
in each other's baskets :-)

> As part of my debugging, I've got a patch to sort out the tlbi code to support
> LPA2 properly - I think I raised that comment on one of the patches. Are you
> happy for me to post as part of my series?
>

I wasn't sure where to look tbh. The generic 5-level paging stuff
seems to work fine - is this specific to KVM?


