[PATCH v2 00/19] arm64: Enable LPA2 support for 4k and 16k pages

Fri Nov 25 06:12:50 PST 2022

On 25/11/2022 10:36, Ard Biesheuvel wrote:
> On Fri, 25 Nov 2022 at 11:07, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>
>> On 25/11/2022 09:35, Ard Biesheuvel wrote:
>>> On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>>>
>>>> On 24/11/2022 17:14, Ard Biesheuvel wrote:
>>>>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts at arm.com> wrote:
>>>>>>
>>>>>> Hi Ard,
>>>>>>
>>>>>> Thanks for including me on this. I'll plan to do a review over the next week or
>>>>>> so, but in the meantime, I have a couple of general questions/comments:
>>>>>>
>>>>>> On 24/11/2022 12:39, Ard Biesheuvel wrote:
>>>>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former
>>>>>>> case, this requires 5 level paging with a runtime fallback to 4 on
>>>>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
>>>>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
>>>>>>> on non-LPA2 configurations.
>>>>>>
>>>>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
>>>>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
>>>>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
>>>>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
>>>>>>
>>>>>
>>>>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level
>>>>> paging at all cost, I don't think that is a universal concern.
>>>>
>>>> Well presumably the Android kernel is always explicitly compiled for 39 VA bits
>>>> so that's what user space is used to? I was really just making the point that if
>>>> you have (the admittedly exotic and unlikely) case of having a 16KB kernel
>>>> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that
>>>> the option is available, on HW without LPA2, this will actually be observed as a
>>>> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup
>>>> with 16KB you would already have been compiling for 47 VA bits.
>>>>
>>>
>>> I am not debating that. I'm just saying that, without any hardware in
>>> existence, it is difficult to predict which of these concerns is going
>>> to dominate, and so I opted for the least messy and most symmetrical
>>> approach.
>>
>> OK fair enough. My opinion is logged ;-).
>>
>>>
>>>>>
>>>>> The benefit of this approach is that you can decide at runtime whether
>>>>> you want to take the performance hit of 4 (or 5) level paging to get
>>>>> access to the extended VA space.
>>>>>
>>>>>>> (Falling back to 48 bits would involve
>>>>>>> finding a workaround for the fact that we cannot construct a level 0
>>>>>>> table covering 52 bits of VA space that appears aligned to its size in
>>>>>>> memory, and has the top 2 entries that represent the 48-bit region
>>>>>>> appearing at an alignment of 64 bytes, which is required by the
>>>>>>> architecture for TTBR address values.
>>>>>>
>>>>>> I'm not sure I've understood this. The level 0 table would need 32 entries for
>>>>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
>>>>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
>>>>>> constraint for the fallback path too?
>>>>>>
>>>>>
>>>>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
>>>>> boundary so I don't see how they can start on a 64 byte aligned
>>>>> boundary at the same time.
>>>>
>>>> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
>>>> boundary? I guess I should go and read your patch before making assumptions, but
>>>> my assumption from your description here was that you were optimistically
>>>> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
>>>> reuse that table for the 2 entry/16 byte case if HW turns out not to support
>>>> LPA2. In which case, surely the 2 entry table would be overlaid at the start
>>>> (low address) of the allocated 32 entry table, and therefore its alignment is
>>>> 256 bytes, which meets the HW's 64 byte alignment requirement?
>>>>
>>>
>>> No, it's at the end, that is the point. I am specifically referring to
>>> TTBR1 upper region page tables here.
>>>
>>> Please refer to the existing ttbr1_offset asm macro, which implements
>>> this today for 64k pages + LVA. In this case, however, the condensed
>>> table covers 6 bits of translation so it is naturally aligned to the
>>> TTBR minimum alignment.
>>
>> Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch
>> you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing
>> isn't it. I'm keen to understand this better if you can point me to the right
>> location?
>>
> 
> Apologies, I got the name wrong:
> 
> /*
>  * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD.
>  * orr is used as it can cover the immediate value (and is idempotent).
>  * In future this may be nop'ed out when dealing with 52-bit kernel VAs.
>  *      ttbr: Value of ttbr to set, modified.
>  */
>         .macro  offset_ttbr1, ttbr, tmp
> #ifdef CONFIG_ARM64_VA_BITS_52
>         mrs_s   \tmp, SYS_ID_AA64MMFR2_EL1
>         and     \tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
>         cbnz    \tmp, .Lskipoffs_\@
>         orr     \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
> .Lskipoffs_\@ :
> #endif
>         .endm
> 
>> Regardless, the Arm ARM states this for TTBR1_EL1.BADDR:
>>
>> """
>> BADDR[47:1], bits [47:1]
>>
>> Translation table base address:
>> • Bits A[47:x] of the stage 1 translation table base address bits are in
>> register bits[47:x].
>> • Bits A[(x-1):0] of the stage 1 translation table base address are zero.
>>
>> Address bit x is the minimum address bit required to align the translation table
>> to the size of the table. The smallest permitted value of x is 6. The AArch64
>> Virtual Memory System Architecture chapter describes how x is calculated based
>> on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule
>> size.
>>
>> Note
>> A translation table is required to be aligned to the size of the table. If a
>> table contains fewer than eight entries, it must be aligned on a 64 byte address
>> boundary.
>> """
>>
>> I don't see how that is referring to the alignment of the *end* of the table?
>>
> 
> It refers to the address poked into the register.
> 
> When you create a 256 byte aligned 32 entry 52-bit level 0 table for
> 16k pages, entry #0 covers the start of the 52-bit addressable VA
> space, and entry #30 covers the start of the 48-bit addressable VA
> space.
> 
> When LPA2 is not supported, the walk must start at entry #30 so that
> is where TTBR1_EL1 should point, but doing so violates the
> architecture's alignment requirement.
> 
> So what we might do is double the size of the table, and clone entries
> #30 and #31 to positions #32 and #33, for instance (and remember to
> keep them in sync, which is not that hard)

OK, I get it now - thanks for explaining. The key part that I didn't appreciate
is that the table is always allocated and populated as if we have 52 VA bits.
If the HW turns out to support only 48, you just fix up TTBR1 to point part way
down the table.
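
To make the arithmetic concrete, here is a rough sketch of how I now understand
the offset calculation. This is not code from the series: pgd_entries(),
ttbr1_offset_bytes(), ttbr_aligned() and the two constants are hypothetical
names of my own, and I'm assuming 8-byte descriptors and a top-level shift of
47 for 16k pages (42 for 64k pages).

```c
#include <assert.h>

#define DESC_SIZE       8u   /* assumed bytes per translation table descriptor */
#define TTBR_MIN_ALIGN  64u  /* smallest table alignment permitted for TTBR */

/* Number of top-level entries needed to cover va_bits of VA space. */
static unsigned int pgd_entries(unsigned int va_bits, unsigned int pgdir_shift)
{
        assert(va_bits > pgdir_shift);
        return 1u << (va_bits - pgdir_shift);
}

/*
 * Byte offset, from the start of a table sized for full_va_bits, of the
 * first entry used by a TTBR1 walk that only resolves va_bits. The reduced
 * region sits at the top of the VA space, so it uses the last entries.
 */
static unsigned int ttbr1_offset_bytes(unsigned int full_va_bits,
                                       unsigned int va_bits,
                                       unsigned int pgdir_shift)
{
        return (pgd_entries(full_va_bits, pgdir_shift) -
                pgd_entries(va_bits, pgdir_shift)) * DESC_SIZE;
}

/* Non-zero if a table base at this byte offset meets the minimum alignment. */
static int ttbr_aligned(unsigned int offset)
{
        return (offset % TTBR_MIN_ALIGN) == 0;
}
```

With 16k pages, ttbr1_offset_bytes(52, 48, 47) comes out as 30 * 8 = 240 bytes,
which is not a multiple of 64, so TTBR1 cannot point at entry #30. With 64k
pages, ttbr1_offset_bytes(52, 48, 42) is 960 * 8 = 7680 bytes, which is a
multiple of 64, matching your point about the existing offset_ttbr1 case.
Cloning entries #30 and #31 to #32 and #33 would put the 16k start at byte 256,
which is aligned.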