[PATCH v2 00/19] arm64: Enable LPA2 support for 4k and 16k pages

Ryan Roberts ryan.roberts at arm.com
Fri Nov 25 02:07:14 PST 2022


On 25/11/2022 09:35, Ard Biesheuvel wrote:
> On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>
>> On 24/11/2022 17:14, Ard Biesheuvel wrote:
>>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>>>
>>>> Hi Ard,
>>>>
>>>> Thanks for including me on this. I'll plan to do a review over the next week or
>>>> so, but in the meantime, I have a couple of general questions/comments:
>>>>
>>>> On 24/11/2022 12:39, Ard Biesheuvel wrote:
>>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former
>>>>> case, this requires 5 level paging with a runtime fallback to 4 on
>>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
>>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
>>>>> on non-LPA2 configurations.
>>>>
>>>> It seems odd to me that on a non-LPA2 system, if you run a kernel that is
>>>> compiled for 16KB pages and 48 VA bits, you get 48 VA bits, but if you run a
>>>> kernel that is compiled for 16KB pages and 52 VA bits, you only get 47 VA
>>>> bits. Wouldn't that pose a potential user space compat issue?
>>>>
>>>
>>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level
>>> paging at all cost, I don't think that is a universal concern.
>>
>> Well, presumably the Android kernel is always explicitly compiled for 39 VA bits,
>> so that's what user space is used to? I was really just making the point that in
>> the (admittedly exotic and unlikely) case of a 16KB kernel previously compiled
>> for 48 VA bits, if you "upgrade" it to 52 VA bits now that the option is
>> available, on HW without LPA2 this will actually be observed as a "downgrade" to
>> 47 bits. If you previously wanted to limit to 3 levels of lookup with 16KB you
>> would already have been compiling for 47 VA bits.
>>
> 
> I am not debating that. I'm just saying that, without any hardware in
> existence, it is difficult to predict which of these concerns is going
> to dominate, and so I opted for the least messy and most symmetrical
> approach.

OK fair enough. My opinion is logged ;-).
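
As an aside, for anyone following along, this is how the 16k granule arithmetic
works out for the configurations being compared above. Just an illustrative
sketch of the translation geometry, not code from the series:

/*
 * 16k granule: PAGE_SHIFT = 14, so each translation level resolves
 * PAGE_SHIFT - 3 = 11 VA bits (a full table is 2048 8-byte entries).
 */
#include <stdio.h>

static void geometry(int va_bits)
{
	int bits_per_level = 14 - 3;		/* 11 for 16k pages */
	int to_translate = va_bits - 14;	/* bits above the page offset */
	int levels = (to_translate + bits_per_level - 1) / bits_per_level;
	int top_bits = to_translate - (levels - 1) * bits_per_level;

	printf("%2d VA bits: %d levels, top-level table has %4d entries (%5d bytes)\n",
	       va_bits, levels, 1 << top_bits, (1 << top_bits) * 8);
}

int main(void)
{
	geometry(47);	/* 3 levels, full 2048-entry table             */
	geometry(48);	/* 4 levels, 2-entry level 0 table             */
	geometry(52);	/* 4 levels, 32-entry level 0 table (needs LPA2) */
	return 0;
}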

> 
>>>
>>> The benefit of this approach is that you can decide at runtime whether
>>> you want to take the performance hit of 4 (or 5) level paging to get
>>> access to the extended VA space.
>>>
>>>>> (Falling back to 48 bits would involve
>>>>> finding a workaround for the fact that we cannot construct a level 0
>>>>> table covering 52 bits of VA space that appears aligned to its size in
>>>>> memory, while also having the top 2 entries (which represent the 48-bit
>>>>> region) appear at an alignment of 64 bytes, as required by the
>>>>> architecture for TTBR address values.
>>>>
>>>> I'm not sure I've understood this. The level 0 table would need 32 entries for
>>>> 52 VA bits, so the table size is 256 bytes, naturally aligned to 256 bytes. 64
>>>> is a factor of 256, so surely the top 2 entries are guaranteed to meet the
>>>> constraint for the fallback path too?
>>>>
>>>
>>> The top 2 entries are 16 bytes combined and end on a 256-byte aligned
>>> boundary, so I don't see how they can start on a 64-byte aligned
>>> boundary at the same time.
>>
>> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
>> boundary? I guess I should go and read your patch before making assumptions, but
>> my assumption from your description here was that you were optimistically
>> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
>> reuse that table for the 2 entry/16 byte case if HW turns out not to support
>> LPA2. In which case, surely the 2 entry table would be overlaid at the start
>> (low address) of the allocated 32 entry table, and therefore its alignment is
>> 256 bytes, which meets the HW's 64 byte alignment requirement?
>>
> 
> No, it's at the end, that is the point. I am specifically referring to
> TTBR1 upper region page tables here.
> 
> Please refer to the existing ttbr1_offset asm macro, which implements
> this today for 64k pages + LVA. In this case, however, the condensed
> table covers 6 bits of translation so it is naturally aligned to the
> TTBR minimum alignment.

Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch
you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing
isn't it. I'm keen to understand this better if you can point me to the right
location?

Regardless, the Arm ARM states this for TTBR1_EL1.BADDR:

"""
BADDR[47:1], bits [47:1]

Translation table base address:
• Bits A[47:x] of the stage 1 translation table base address bits are in
register bits[47:x].
• Bits A[(x-1):0] of the stage 1 translation table base address are zero.

Address bit x is the minimum address bit required to align the translation table
to the size of the table. The smallest permitted value of x is 6. The AArch64
Virtual Memory System Architecture chapter describes how x is calculated based
on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule
size.

Note
A translation table is required to be aligned to the size of the table. If a
table contains fewer than eight entries, it must be aligned on a 64 byte address
boundary.
"""

I don't see how that is referring to the alignment of the *end* of the table?
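
To make my reading of the numbers concrete, here is a rough sketch of the
arithmetic (my own illustration, not code from the series), assuming a 16k
granule, 8-byte descriptors, and that the fallback 2-entry table is carved out
of the optimistically allocated 52-bit level 0 table:

#include <stdio.h>

int main(void)
{
	unsigned int entries_52 = 1u << (52 - 47);	/* level 0 resolves VA[51:47]: 32 entries */
	unsigned int tbl_size   = entries_52 * 8;	/* 256 bytes, naturally 256-byte aligned  */
	unsigned int low_off    = 0 * 8;		/* 2-entry sub-table at entries 0..1       */
	unsigned int high_off   = (entries_52 - 2) * 8;	/* 2-entry sub-table at entries 30..31     */

	printf("52-bit level 0 table: %u entries, %u bytes\n", entries_52, tbl_size);
	printf("2-entry table at the start: offset %3u, 64-byte aligned: %s\n",
	       low_off, (low_off % 64) ? "no" : "yes");
	printf("2-entry table at the end:   offset %3u, 64-byte aligned: %s\n",
	       high_off, (high_off % 64) ? "no" : "yes");
	return 0;
}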



> 
>>>
>>> My RFC had a workaround for this, but it is a bit nasty because we
>>> need to copy those two entries at the right time and keep them in
>>> sync.
>>>
>>>>> Also, using an additional level of
>>>>> paging to translate a single VA bit is wasteful in terms of TLB
>>>>> efficiency)
>>>>>
>>>>> This means support for falling back to 3 levels of paging at runtime
>>>>> when configured for 4 is also needed.
>>>>>
>>>>> Another thing worth noting is that the repurposed physical address bits
>>>>> in the page table descriptors were not RES0 before, and so there is now
>>>>> a big global switch (called TCR.DS) which controls how all page table
>>>>> descriptors are interpreted. This requires some extra care in the PTE
>>>>> conversion helpers, and additional handling in the boot code to ensure
>>>>> that we set TCR.DS safely if supported (and not overridden).
>>>>>
>>>>> Note that this series is mostly orthogonal to work by Anshuman done last
>>>>> year: this series assumes that 52-bit physical addressing is never
>>>>> needed to map the kernel image itself, and therefore that we never need
>>>>> ID map range extension to cover the kernel with a 5th level when running
>>>>> with 4.
>>>>
>>>> This limitation will certainly make it trickier to test the LPA2 stage2
>>>> implementation that I have done. I've got scripts that construct host systems
>>>> with all the RAM above 48 bits so that the output addresses in the stage2 page
>>>> tables are guaranteed to contain OAs > 48 bits. I think the workaround here
>>>> would be to place the RAM so that it straddles the 48 bit boundary, with the
>>>> portion below the boundary sized to match the kernel image, and place the
>>>> kernel image in it. This will then ensure that the VM's memory still uses the
>>>> RAM above the threshold. Or is there a simpler approach?
>>>>
>>>
>>> No, that sounds reasonable. I'm using QEMU, which happily lets you put
>>> the start of DRAM at any address you can imagine (if you recompile it).
>>
>> I'm running on FVP, which will let me do this with runtime parameters. Anyway,
>> I'll update my tests to cope with this constraint and run this patch set
>> through, and I'll let you know if it spots anything.
> 
> Excellent, thanks.
> 
>>>
>>> Another approach could be to simply stick a memblock_reserve()
>>> somewhere that covers all 48-bit addressable memory, but you will need
>>> some of both in any case.
>>>
>>>>> And given that the LPA2 architectural feature covers both the
>>>>> virtual and physical range extensions, where enabling the latter is
>>>>> required to enable the former, we can simplify things further by only
>>>>> enabling them as a pair. (I.e., 52-bit physical addressing cannot be
>>>>> enabled for 48-bit VA space or smaller)
>>>>>
>>>>> [...]
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>
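
P.S. For completeness, my understanding of the memblock_reserve() approach you
mention for the testing case is something like the below, as a test-only hack.
Not part of the series; the call site and helper name are guesses on my part:

#include <linux/init.h>
#include <linux/memblock.h>

/* Hypothetical test hack, e.g. called after arm64_memblock_init(). */
static void __init reserve_low_pa_for_lpa2_testing(void)
{
	/*
	 * Reserve every 48-bit addressable physical address so that all
	 * later allocations (and hence guest memory) have to come from
	 * above the 2^48 boundary. Regions below the boundary that are
	 * already reserved (kernel image, early reservations) just get
	 * merged, so re-reserving them here is harmless.
	 */
	memblock_reserve(0, 1UL << 48);
}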



