[v3 PATCH 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Yang Shi
yang at os.amperecomputing.com
Thu May 29 10:50:38 PDT 2025
On 5/29/25 10:01 AM, Ryan Roberts wrote:
> On 29/05/2025 17:37, Yang Shi wrote:
>>
>> On 5/29/25 12:36 AM, Ryan Roberts wrote:
>>> On 28/05/2025 16:18, Yang Shi wrote:
>>>> On 5/28/25 6:13 AM, Ryan Roberts wrote:
>>>>> On 28/05/2025 01:00, Yang Shi wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> I got a new spin ready in my local tree on top of v6.15-rc4. I noticed there
>>>>>> were some more comments on Miko's BBML2 patch, so it looks like a new spin is
>>>>>> needed. But AFAICT there should be no significant change to how I advertise
>>>>>> AmpereOne BBML2 in my patches. We will keep using the MIDR list to check whether
>>>>>> BBML2 is advertised or not, and the erratum still seems to be needed to fix up
>>>>>> the AA64MMFR2 BBML2 bits for AmpereOne IIUC.
>>>>> Yes, I agree this should not impact you too much.
>>>>>
>>>>>> You also mentioned Dev was working on patches to have __change_memory_common()
>>>>>> apply permission changes on a contiguous range instead of on a per-page basis (the
>>>>>> status quo). But I have not seen the patches on the mailing list yet. However I
>>>>>> don't think this will result in any significant change to my patches either,
>>>>>> particularly the split primitive and linear map repainting.
>>>>> I think you would need Dev's series to be able to apply the permissions change
>>>>> without needing to split the whole range to pte mappings? So I guess your
>>>>> change
>>>>> must either be implementing something similar to what Dev is working on or you
>>>>> are splitting the entire range to ptes? If the latter, then I'm not keen on
>>>>> that
>>>>> approach.
>>>> I don't think Dev's series is a mandatory prerequisite for my patches. IIUC, how
>>>> the split primitive keeps a block mapping if it is fully contained is independent
>>>> of how the permission change is applied to it.
>>>> The new spin implements keeping block mappings that are fully contained, as we
>>>> discussed earlier. I suppose Dev's series just needs to check whether the
>>>> mapping is a block or not when applying the permission change.
>>> The way I was thinking the split primitive would work, you would need Dev's
>>> change as a prerequisite, so I suspect we both have a slightly different idea of
>>> how this will work.
>>>
>>>> The flow just looks like as below conceptually:
>>>>
>>>> split_mapping(start, end)
>>>> apply_permission_change(start, end)
>>> The flow I was thinking of would be this:
>>>
>>> split_mapping(start)
>>> split_mapping(end)
>>> apply_permission_change(start, end)
>>>
>>> split_mapping() takes a virtual address that is at least page aligned and when
>>> it returns, ensures that the address is at the start of a leaf mapping. And it
>>> will only break the leaf mappings down so that they are the maximum size that
>>> can still meet the requirement.
>>>
>>> As an example, let's suppose you initially start with a region that is composed
>>> entirely of 2M mappings. Then you want to change permissions of a region [2052K,
>>> 6208K).
>>>
>>> Before any splitting, you have:
>>>
>>> - 2M x4: [0, 8192K)
>>>
>>> Then you call split_mapping(start=2052K):
>>>
>>> - 2M x1: [0, 2048K)
>>> - 4K x16: [2048K, 2112K) << start is the start of the second 4K leaf mapping
>>> - 64K x31: [2112K, 4096K)
>>> - 2M: x2: [4096K, 8192K)
>>>
>>> Then you call split_mapping(end=6208K):
>>>
>>> - 2M x1: [0, 2048K)
>>> - 4K x16: [2048K, 2112K)
>>> - 64K x31: [2112K, 4096K)
>>> - 2M: x1: [4096K, 6144K)
>>> - 64K x32: [6144K, 8192K) << end is the end of the first 64K leaf mapping
>>>
>>> So then when you call apply_permission_change(start=2052K, end=6208K), the
>>> following leaf mappings' permissions will be modified:
>>>
>>> - 4K x15: [2052K, 2112K)
>>> - 64K x31: [2112K, 4096K)
>>> - 2M: x1: [4096K, 6144K)
>>> - 64K x1: [6144K, 6208K)
>>>
>>> Since there are block mappings in this range, Dev's change is required to change
>>> the permissions.
>>>
>>> This approach means that we only ever split the minimum required number of
>>> mappings and we only split them to the largest size that still provides the
>>> alignment requirement.
>> I see your point. I believe we are on the same page: keep the block mappings in
>> the range as much as possible. My implementation actually ends up having the
>> same result as your example shows. I guess we just have different ideas about
>> how to implement it.
> OK great!
>
>> However I have a hard time understanding why we don't just use
>> split_mapping(start, end).
> I don't really understand why you need to pass a range here. It's not like we
> want to visit every leaf mapping in the range. We just want to walk down through
> the page tables until we get to a leaf mapping that contains the address, then
> keep splitting and walking deeper until the address is the start of a leaf
> mapping. That's my thinking anyway. But you're the one doing the actual work
> here so you probably have better insight than me.
split_mapping(start, end) actually does the same thing, and we just need
one call instead of two.
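To make that concrete, here is a tiny user-space model of the boundary split
(just an illustration assuming a 4K granule with 2M PMD blocks and 64K cont-PTE
mappings, ignoring the cont-PMD level; the names are made up and this is not the
code from my series). It reports how deep a split has to go before an address
lands on a leaf boundary, using the 2052K/6208K numbers from your example:

    #include <stdio.h>

    #define SZ_4K   (4UL << 10)
    #define SZ_64K  (64UL << 10)
    #define SZ_2M   (2UL << 20)

    /* Leaf sizes from largest to smallest: PMD block, cont-PTE, base page. */
    static const unsigned long leaf_sizes[] = { SZ_2M, SZ_64K, SZ_4K };

    static void model_split(unsigned long addr)
    {
            for (unsigned int i = 0; i < 3; i++) {
                    if (addr % leaf_sizes[i] == 0) {
                            /* Aligned: addr is already a boundary at this level. */
                            printf("%luK is a %luK leaf boundary; stop splitting\n",
                                   addr >> 10, leaf_sizes[i] >> 10);
                            return;
                    }
                    /* Not aligned: the leaf at this level gets split one level down. */
                    printf("%luK is inside a %luK leaf; split it\n",
                           addr >> 10, leaf_sizes[i] >> 10);
            }
    }

    int main(void)
    {
            model_split(2052UL << 10);      /* start from the example above */
            model_split(6208UL << 10);      /* end from the example above */
            return 0;
    }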
>
>> We can reuse some of the existing code easily with "end", because the
>> existing code already calculates the page table (PUD/PMD/CONT PMD/CONT PTE)
>> boundary, so I reused it. Basically my implementation just skips to the next page
>> table if:
>> * The start address is at a page table boundary, and
>> * The "end" is greater than the page table boundary
>>
>> The logic may be a little bit convoluted; I'm not sure whether I articulated it
>> clearly. Anyway the code will explain everything.
> OK I think I understand; I think you're saying that if you pass in end, there is
> an optimization you can do for the case where end is contained within the same
> (ultimate) leaf mapping as start to avoid rewalking the pgtables?
Yes, we can just skip from that page table to the next one because we know
the "end".
>
>>>> The split_mapping() guarantees keeping a block mapping if it is fully contained in
>>>> the range between start and end; this is my series's responsibility. I know the
>>>> current code calls apply_to_page_range() to apply permission changes and it just
>>>> does it on a PTE basis. So IIUC Dev's series will modify it or provide a new API,
>>>> then __change_memory_common() will call it to change permissions. There should be
>>>> some overlap between mine and Dev's, but I don't see a strong dependency.
>>> But if you have a block mapping in the region you are calling
>>> __change_memory_common() on, today that will fail because it can only handle
>>> page mappings.
>> IMHO letting __change_memory_common() operate on a contiguous address range is
>> another story and should not be a part of the split primitive.
> I 100% agree that it should not be part of the split primitive.
>
> But your series *depends* upon __change_memory_common() being able to change
> permissions on block mappings. Today it can only change permissions on page
> mappings.
I don't think the split primitive depends on it. Changing permissions on
block mappings is just a user of the new split primitive IMHO. We just
have no real user right now.
>
> Your original v1 series solved this by splitting *all* of the mappings in a
> given range to page mappings before calling __change_memory_common(), right?
Yes, but if the range is contiguous, the new split primitive doesn't
have to split to page mappings.
>
> Remember it's not just vmalloc areas that are passed to
> __change_memory_common(); virtually contiguous linear map regions can be passed
> in as well. See (for example) set_direct_map_invalid_noflush(),
> set_direct_map_default_noflush(), set_direct_map_valid_noflush(),
> __kernel_map_pages(), realm_set_memory_encrypted(), realm_set_memory_decrypted().
Yes, no matter who the caller is, as long as the caller passes in a
contiguous address range, the split primitive can keep block mappings.
>
>
>> For example, we need to use vmalloc_huge() instead of vmalloc() to allocate huge
>> memory, then do:
>> split_mapping(start, start+HPAGE_PMD_SIZE);
>> change_permission(start, start+HPAGE_PMD_SIZE);
>>
>> The split primitive will guarantee (start, start+HPAGE_PMD_SIZE) is kept as a PMD
>> mapping so that change_permission() can change it on a PMD basis too.
>>
>> But this requires other kernel subsystems, modules for example, to allocate huge
>> memory with the proper APIs, for example vmalloc_huge().
> The longer term plan is to have vmalloc() always allocate using the
> VM_ALLOW_HUGE_VMAP flag on systems that support BBML2. So there will be no need
> to migrate users to vmalloc_huge(). We will just detect if we can split live
> mappings safely and use huge mappings in that case.
Anyway this is a potential user of the new split primitive.
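As a concrete (hypothetical) illustration, such a user could look roughly like
the sketch below. vmalloc_huge() is an existing API, while split_mapping() and
change_permission() are just the conceptual names we have been using in this
thread, so they only appear in a comment:

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/module.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    static void *buf;

    static int __init huge_user_init(void)
    {
            /* Ask vmalloc for a PMD-sized, PMD-mapped allocation. */
            buf = vmalloc_huge(PMD_SIZE, GFP_KERNEL);
            if (!buf)
                    return -ENOMEM;

            /*
             * With the primitives discussed in this thread, a permission change
             * over the whole block could keep the PMD mapping intact:
             *
             *      split_mapping(addr, addr + PMD_SIZE);
             *      change_permission(addr, addr + PMD_SIZE);
             */
            return 0;
    }

    static void __exit huge_user_exit(void)
    {
            vfree(buf);
    }

    module_init(huge_user_init);
    module_exit(huge_user_exit);
    MODULE_LICENSE("GPL");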
Thanks,
Yang
>
> Thanks,
> Ryan
>
>> Thanks,
>> Yang
>>
>>>>> Regarding the linear map repainting, I had a chat with Catalin, and he reminded
>>>>> me of a potential problem; if you are doing the repainting with the machine
>>>>> stopped, you can't allocate memory at that point; it's possible a CPU was
>>>>> inside
>>>>> the allocator when it stopped. And I think you need to allocate intermediate
>>>>> pgtables, right? Do you have a solution to that problem? I guess one approach
>>>>> would be to figure out how much memory you will need and pre-allocate prior to
>>>>> stopping the machine?
>>>> OK, I don't remember us discussing this problem before. I think we can do
>>>> something like what kpti does. When creating the linear map we know how many PUD
>>>> and PMD mappings are created; we can record the number, and that tells us how many
>>>> pages we need for repainting the linear map.
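For example, with a 4K granule the arithmetic would be roughly as below (just an
illustration of the bookkeeping, not the actual patch): splitting one 1G PUD
block down to PTEs needs one PMD table plus 512 PTE tables, and splitting one 2M
PMD block needs one PTE table, so the counts recorded at creation time translate
directly into a page budget.

    #include <stdio.h>

    #define PTRS_PER_TABLE  512UL   /* entries per table with a 4K granule */

    /*
     * Pagetable pages needed to repaint 'pud_blocks' 1G blocks and
     * 'pmd_blocks' 2M blocks all the way down to PTE mappings.
     */
    static unsigned long repaint_table_pages(unsigned long pud_blocks,
                                             unsigned long pmd_blocks)
    {
            /* Each PUD block: one PMD table plus one PTE table per PMD entry. */
            return pud_blocks * (1 + PTRS_PER_TABLE) + pmd_blocks;
    }

    int main(void)
    {
            /* e.g. 4 PUD blocks and 128 PMD blocks recorded at creation time */
            printf("%lu pagetable pages to preallocate\n",
                   repaint_table_pages(4, 128));
            return 0;
    }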
>>> I saw a separate reply you sent for this. I'll read that and respond in that
>>> context.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>>> So I plan to post v4 patches to the mailing list. We can focus on reviewing
>>>>>> the
>>>>>> split primitive and linear map repainting. Does it sound good to you?
>>>>> That works assuming you have a solution for the above.
>>>> I think the only missing part is preallocating page tables for repainting. I
>>>> will add this, then post the new spin to the mailing list.
>>>>
>>>> Thanks,
>>>> Yang
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>
>>>>>> On 5/7/25 2:16 PM, Yang Shi wrote:
>>>>>>> On 5/7/25 12:58 AM, Ryan Roberts wrote:
>>>>>>>> On 05/05/2025 22:39, Yang Shi wrote:
>>>>>>>>> On 5/2/25 4:51 AM, Ryan Roberts wrote:
>>>>>>>>>> On 14/04/2025 22:24, Yang Shi wrote:
>>>>>>>>>>> On 4/14/25 6:03 AM, Ryan Roberts wrote:
>>>>>>>>>>>> On 10/04/2025 23:00, Yang Shi wrote:
>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know you may have a lot of things to follow up on after LSF/MM. Just a
>>>>>>>>>>>>> gentle ping; hopefully we can resume the review soon.
>>>>>>>>>>>> Hi, I'm out on holiday at the moment, returning on the 22nd April. But
>>>>>>>>>>>> I'm very
>>>>>>>>>>>> keen to move this series forward so will come back to you next week.
>>>>>>>>>>>> (although
>>>>>>>>>>>> TBH, I thought I was waiting for you to respond to me... :-| )
>>>>>>>>>>>>
>>>>>>>>>>>> FWIW, having thought about it a bit more, I think some of the
>>>>>>>>>>>> suggestions I
>>>>>>>>>>>> previously made may not have been quite right, but I'll elaborate next
>>>>>>>>>>>> week.
>>>>>>>>>>>> I'm
>>>>>>>>>>>> keen to build a pgtable splitting primitive here that we can reuse with
>>>>>>>>>>>> vmalloc
>>>>>>>>>>>> as well to enable huge mappings by default with vmalloc too.
>>>>>>>>>>> Sounds good. I think the patches can support splitting vmalloc page table
>>>>>>>>>>> too.
>>>>>>>>>>> Anyway we can discuss more after you are back. Enjoy your holiday.
>>>>>>>>>> Hi Yang,
>>>>>>>>>>
>>>>>>>>>> Sorry I've taken so long to get back to you. Here's what I'm currently
>>>>>>>>>> thinking:
>>>>>>>>>> I'd eventually like to get to the point where the linear map and most
>>>>>>>>>> vmalloc
>>>>>>>>>> memory is mapped using the largest possible mapping granularity (i.e.
>>>>>>>>>> block
>>>>>>>>>> mappings at PUD/PMD, and contiguous mappings at PMD/PTE level).
>>>>>>>>>>
>>>>>>>>>> vmalloc has history with trying to do huge mappings by default; it
>>>>>>>>>> ended up
>>>>>>>>>> having to be turned into an opt-in feature (instead of the original
>>>>>>>>>> opt-out
>>>>>>>>>> approach) because there were problems with some parts of the kernel
>>>>>>>>>> expecting
>>>>>>>>>> page mappings. I think we might be able to overcome those issues on arm64
>>>>>>>>>> with
>>>>>>>>>> BBML2.
>>>>>>>>>>
>>>>>>>>>> arm64 can already support vmalloc PUD and PMD block mappings, and I have a
>>>>>>>>>> series (that should make v6.16) that enables contiguous PTE mappings in
>>>>>>>>>> vmalloc
>>>>>>>>>> too. But these are currently limited to when VM_ALLOW_HUGE is specified.
>>>>>>>>>> To be
>>>>>>>>>> able to use that by default, we need to be able to change permissions on
>>>>>>>>>> sub-regions of an allocation, which is where BBML2 and your series come
>>>>>>>>>> in.
>>>>>>>>>> (there may be other things we need to solve as well; TBD).
>>>>>>>>>>
>>>>>>>>>> I think the key thing we need is a function that can take a page-aligned
>>>>>>>>>> kernel
>>>>>>>>>> VA, will walk to the leaf entry for that VA and if the VA is in the
>>>>>>>>>> middle of
>>>>>>>>>> the leaf entry, it will split it so that the VA is now on a boundary. This
>>>>>>>>>> will
>>>>>>>>>> work for PUD/PMD block entries and contiguous-PMD/contiguous-PTE entries.
>>>>>>>>>> The
>>>>>>>>>> function can assume BBML2 is present. And it will return 0 on success, -
>>>>>>>>>> EINVAL
>>>>>>>>>> if the VA is not mapped or -ENOMEM if it couldn't allocate a pgtable to
>>>>>>>>>> perform
>>>>>>>>>> the split.
>>>>>>>>> OK, the v3 patches already handle page table allocation failure by returning
>>>>>>>>> -ENOMEM, and BUG_ON if the VA is not mapped, because the kernel assumes the
>>>>>>>>> linear mapping is always present. It is easy to return -EINVAL instead of
>>>>>>>>> BUG_ON. However I'm wondering what use cases you are thinking about? Could
>>>>>>>>> splitting a vmalloc area run into an unmapped VA?
>>>>>>>> I don't think BUG_ON is the right behaviour; crashing the kernel should be
>>>>>>>> discouraged. I think even for vmalloc under correct conditions we shouldn't
>>>>>>>> see
>>>>>>>> any unmapped VA. But vmalloc does handle it gracefully today; see (e.g.)
>>>>>>>> vunmap_pmd_range(), which skips the pmd if it's none.
>>>>>>>>
>>>>>>>>>> Then we can use that primitive on the start and end address of any
>>>>>>>>>> range for
>>>>>>>>>> which we need exact mapping boundaries (e.g. when changing permissions on
>>>>>>>>>> part
>>>>>>>>>> of linear map or vmalloc allocation, when freeing part of a vmalloc
>>>>>>>>>> allocation,
>>>>>>>>>> etc). This way we only split enough to ensure the boundaries are precise,
>>>>>>>>>> and
>>>>>>>>>> keep larger mappings inside the range.
>>>>>>>>> Yeah, makes sense to me.
>>>>>>>>>
>>>>>>>>>> Next we need to reimplement __change_memory_common() to not use
>>>>>>>>>> apply_to_page_range(), because that assumes page mappings only. Dev
>>>>>>>>>> Jain has
>>>>>>>>>> been working on a series that converts this to use
>>>>>>>>>> walk_page_range_novma() so
>>>>>>>>>> that we can change permissions on the block/contig entries too. That's not
>>>>>>>>>> posted publicly yet, but it's not huge so I'll ask if he is comfortable
>>>>>>>>>> with
>>>>>>>>>> posting an RFC early next week.
>>>>>>>>> OK, so the new __change_memory_common() will change the permissions of the
>>>>>>>>> page table leaf entries, right?
>>>>>>>> It will change permissions of all the leaf entries in the range of VAs it is
>>>>>>>> passed. Currently it assumes that all the leaf entries are PTEs. But we will
>>>>>>>> generalize to support all the other types of leaf entries too.
>>>>>>>>
>>>>>>>>> If I remember correctly, you suggested changing permissions in
>>>>>>>>> __create_pgd_mapping_locked() for v3. So can I disregard that?
>>>>>>>> Yes I did. I think this made sense (in my head at least) because in the
>>>>>>>> context
>>>>>>>> of the linear map, all the PFNs are contiguous so it kind-of makes sense to
>>>>>>>> reuse that infrastructure. But it doesn't generalize to vmalloc because
>>>>>>>> vmalloc
>>>>>>>> PFNs are not contiguous. So for that reason, I think it's preferable to
>>>>>>>> have an
>>>>>>>> independent capability.
>>>>>>> OK, sounds good to me.
>>>>>>>
>>>>>>>>> The current code assumes the address range passed in by change_memory_common()
>>>>>>>>> is *NOT* physically contiguous, so __change_memory_common() handles page table
>>>>>>>>> permissions on a per-page basis. I suppose Dev's patches will handle this, so my
>>>>>>>>> patch can safely assume the linear mapping address range for splitting is
>>>>>>>>> physically contiguous too; otherwise I can't keep large mappings inside the
>>>>>>>>> range. Splitting a vmalloc area doesn't need to worry about this.
>>>>>>>> I'm not sure I fully understand the point you're making here...
>>>>>>>>
>>>>>>>> Dev's series aims to use walk_page_range_novma() similar to riscv's
>>>>>>>> implementation so that it can walk a VA range and update the permissions on
>>>>>>>> each
>>>>>>>> leaf entry it visits, regardless of which level the leaf entry is at. This
>>>>>>>> doesn't make any assumption of the physical contiguity of neighbouring leaf
>>>>>>>> entries in the page table.
>>>>>>>>
>>>>>>>> So if we are changing permissions on the linear map, we have a range of
>>>>>>>> VAs to
>>>>>>>> walk and convert all the leaf entries, regardless of their size. The same
>>>>>>>> goes
>>>>>>>> for vmalloc... But for vmalloc, we will also want to change the underlying
>>>>>>>> permissions in the linear map, so we will have to figure out the contiguous
>>>>>>>> pieces of the linear map and call __change_memory_common() for each;
>>>>>>>> there is
>>>>>>>> definitely some detail to work out there!
>>>>>>> Yes, this is my point. When changing the underlying linear map permissions for
>>>>>>> vmalloc, the linear map addresses may not be contiguous. This is why
>>>>>>> change_memory_common() calls __change_memory_common() on a per-page basis.
>>>>>>>
>>>>>>> But thinking about it further, how Dev's patches work should have no impact on how
>>>>>>> I implement the split primitive. It should be the caller's responsibility to
>>>>>>> make sure __create_pgd_mapping_locked() is called for a contiguous linear map
>>>>>>> address range.
>>>>>>>
>>>>>>>>>> You'll still need to repaint the whole linear map with page mappings for the
>>>>>>>>>> !BBML2 case, but I'm hoping __create_pgd_mapping_locked() (potentially with
>>>>>>>>>> minor modifications?) can do that repainting on the live mappings; similar to
>>>>>>>>>> how you are doing it in v3.
>>>>>>>>> Yes, when repainting I need to split the page table all the way down to PTE
>>>>>>>>> level. Off the top of my head, a simple flag should be good enough to tell
>>>>>>>>> __create_pgd_mapping_locked() to do the right thing.
>>>>>>>> Perhaps it may be sufficient to reuse the NO_BLOCK_MAPPINGS and
>>>>>>>> NO_CONT_MAPPINGS
>>>>>>>> flags? For example, if you find a leaf mapping and NO_BLOCK_MAPPINGS is
>>>>>>>> set,
>>>>>>>> then you need to split it?
>>>>>>> Yeah, sounds feasible. Anyway I will figure it out.
>>>>>>>
>>>>>>>>>> Miko's BBML2 series should hopefully get imminently queued for v6.16.
>>>>>>>>> Great! Anyway my series is based on his patch advertising BBML2.
>>>>>>>>>
>>>>>>>>>> So in summary, what I'm asking for in your large block mapping of the linear
>>>>>>>>>> map series is:
>>>>>>>>>> - Paint linear map using blocks/contig if boot CPU supports BBML2
>>>>>>>>>> - Repaint linear map using page mappings if secondary CPUs don't
>>>>>>>>>> support BBML2
>>>>>>>>> OK, I just need to add a simple tweak to v3 to split down to PTE level.
>>>>>>>>>
>>>>>>>>>> - Integrate Dev's __change_memory_common() series
>>>>>>>>> OK, I think I have to do my patches on top of it, because Dev's patches need to
>>>>>>>>> guarantee the linear mapping address range is physically contiguous.
>>>>>>>>>
>>>>>>>>>> - Create primitive to ensure mapping entry boundary at a given page-
>>>>>>>>>> aligned VA
>>>>>>>>>> - Use primitive when changing permissions on linear map region
>>>>>>>>> Sure.
>>>>>>>>>
>>>>>>>>>> This will be mergable on its own, but will also provide a great starting
>>>>>>>>>> base
>>>>>>>>>> for adding huge-vmalloc-by-default.
>>>>>>>>>>
>>>>>>>>>> What do you think?
>>>>>>>>> Definitely makes sense to me.
>>>>>>>>>
>>>>>>>>> If I remember correctly, we still have some unresolved comments/questions for v3
>>>>>>>>> in my replies on March 17, particularly:
>>>>>>>>> https://lore.kernel.org/linux-arm-kernel/2b715836-b566-4a9e-b344-9401fa4c0feb at os.amperecomputing.com/
>>>>>>>> Ahh sorry about that. I'll take a look now...
>>>>>>> No problem.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yang
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Yang
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Ryan
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 3/13/25 10:40 AM, Yang Shi wrote:
>>>>>>>>>>>>>> On 3/13/25 10:36 AM, Ryan Roberts wrote:
>>>>>>>>>>>>>>> On 13/03/2025 17:28, Yang Shi wrote:
>>>>>>>>>>>>>>>> Hi Ryan,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I saw Miko posted a new spin of his patches. There are some slight changes
>>>>>>>>>>>>>>>> that have an impact on my patches (basically checking the new boot parameter).
>>>>>>>>>>>>>>>> Do you prefer that I rebase my patches on top of his new spin right now and
>>>>>>>>>>>>>>>> restart the review from the new spin, or review the current patches and then
>>>>>>>>>>>>>>>> address the new review comments and rebase onto Miko's new spin together?
>>>>>>>>>>>>>>> Hi Yang,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry I haven't got to reviewing this version yet, it's in my queue!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm happy to review against v3 as it is. I'm familiar with Miko's
>>>>>>>>>>>>>>> series
>>>>>>>>>>>>>>> and am
>>>>>>>>>>>>>>> not too bothered about the integration with that; I think it's pretty
>>>>>>>>>>>>>>> straightforward. I'm more interested in how you are handling the splitting,
>>>>>>>>>>>>>>> which I
>>>>>>>>>>>>>>> think is the bulk of the effort.
>>>>>>>>>>>>>> Yeah, sure, thank you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm hoping to get to this next week before heading out to LSF/MM the
>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>> week (might I see you there?)
>>>>>>>>>>>>>> Unfortunately I can't make it this year. Have fun!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Ryan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Yang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 3/4/25 2:19 PM, Yang Shi wrote:
>>>>>>>>>>>>>>>>> Changelog
>>>>>>>>>>>>>>>>> =========
>>>>>>>>>>>>>>>>> v3:
>>>>>>>>>>>>>>>>> * Rebased to v6.14-rc4.
>>>>>>>>>>>>>>>>> * Based on Miko's BBML2 cpufeature patch
>>>>>>>>>>>>>>>>>   (https://lore.kernel.org/linux-arm-kernel/20250228182403.6269-3-miko.lenczewski at arm.com/).
>>>>>>>>>>>>>>>>>   Also included in this series in order to have the complete patchset.
>>>>>>>>>>>>>>>>> * Enhanced __create_pgd_mapping() to handle split as well, per Ryan.
>>>>>>>>>>>>>>>>> * Supported CONT mappings per Ryan.
>>>>>>>>>>>>>>>>> * Supported asymmetric systems by splitting the kernel linear mapping
>>>>>>>>>>>>>>>>>   if such a system is detected, per Ryan. I don't have such a system to
>>>>>>>>>>>>>>>>>   test, so the testing was done by hacking the kernel to call linear
>>>>>>>>>>>>>>>>>   mapping repainting unconditionally. The linear mapping doesn't have
>>>>>>>>>>>>>>>>>   any block or cont mappings after booting.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> RFC v2:
>>>>>>>>>>>>>>>>> * Used an allowlist to advertise BBM lv2 on the CPUs which can handle
>>>>>>>>>>>>>>>>>   TLB conflicts gracefully, per Will Deacon
>>>>>>>>>>>>>>>>> * Rebased onto v6.13-rc5
>>>>>>>>>>>>>>>>> * https://lore.kernel.org/linux-arm-kernel/20250103011822.1257189-1-yang at os.amperecomputing.com/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> RFC v1: https://lore.kernel.org/lkml/20241118181711.962576-1-yang at os.amperecomputing.com/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Description
>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>> When rodata=full, the kernel linear mapping is mapped by PTE due to
>>>>>>>>>>>>>>>>> arm's break-before-make rule.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A number of performance issues arise when the kernel linear map is
>>>>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>> PTE entries due to arm's break-before-make rule:
>>>>>>>>>>>>>>>>> - performance degradation
>>>>>>>>>>>>>>>>> - more TLB pressure
>>>>>>>>>>>>>>>>>   - memory waste for kernel page tables
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These issues can be avoided by specifying rodata=on on the kernel command
>>>>>>>>>>>>>>>>> line, but this disables the alias checks on page table permissions and
>>>>>>>>>>>>>>>>> therefore compromises security somewhat.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With FEAT_BBM level 2 support it is no longer necessary to
>>>>>>>>>>>>>>>>> invalidate the
>>>>>>>>>>>>>>>>> page table entry when changing page sizes. This allows the
>>>>>>>>>>>>>>>>> kernel to
>>>>>>>>>>>>>>>>> split large mappings after boot is complete.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This series adds support for splitting large mappings when FEAT_BBM level 2
>>>>>>>>>>>>>>>>> is available and rodata=full is used. This functionality will be used
>>>>>>>>>>>>>>>>> when modifying page permissions for individual page frames.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Without FEAT_BBM level 2 we will keep the kernel linear map using
>>>>>>>>>>>>>>>>> PTEs
>>>>>>>>>>>>>>>>> only.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If the system is asymmetric, the kernel linear mapping may be repainted once
>>>>>>>>>>>>>>>>> the BBML2 capability is finalized on all CPUs. See patch #6 for more details.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We saw significant performance increases in some benchmarks with
>>>>>>>>>>>>>>>>> rodata=full without compromising the security features of the
>>>>>>>>>>>>>>>>> kernel.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Testing
>>>>>>>>>>>>>>>>> =======
>>>>>>>>>>>>>>>>> The test was done on AmpereOne machine (192 cores, 1P) with 256GB
>>>>>>>>>>>>>>>>> memory and
>>>>>>>>>>>>>>>>> 4K page size + 48 bit VA.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Function test (4K/16K/64K page size)
>>>>>>>>>>>>>>>>>   - Kernel boot. The kernel needs to change kernel linear mapping permissions
>>>>>>>>>>>>>>>>>     at boot stage; if the patch didn't work, the kernel typically didn't boot.
>>>>>>>>>>>>>>>>>   - Module stress from stress-ng. Kernel module loading changes permissions
>>>>>>>>>>>>>>>>>     for the linear mapping.
>>>>>>>>>>>>>>>>>   - A test kernel module which allocates 80% of total memory via vmalloc(),
>>>>>>>>>>>>>>>>>     then changes the vmalloc area permission to RO (this also changes the
>>>>>>>>>>>>>>>>>     linear mapping permission to RO), then changes it back before vfree().
>>>>>>>>>>>>>>>>>     Then launch a VM which consumes almost all physical memory.
>>>>>>>>>>>>>>>>>   - VM with the patchset applied in the guest kernel too.
>>>>>>>>>>>>>>>>>   - Kernel build in a VM with a guest kernel which has this series applied.
>>>>>>>>>>>>>>>>>   - rodata=on. Make sure other rodata modes are not broken.
>>>>>>>>>>>>>>>>>   - Boot on a machine which doesn't support BBML2.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Performance
>>>>>>>>>>>>>>>>> ===========
>>>>>>>>>>>>>>>>> Memory consumption
>>>>>>>>>>>>>>>>> Before:
>>>>>>>>>>>>>>>>> MemTotal: 258988984 kB
>>>>>>>>>>>>>>>>> MemFree: 254821700 kB
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After:
>>>>>>>>>>>>>>>>> MemTotal: 259505132 kB
>>>>>>>>>>>>>>>>> MemFree: 255410264 kB
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Around 500MB more memory is free to use. The larger the machine, the
>>>>>>>>>>>>>>>>> more memory is saved.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Performance benchmarking
>>>>>>>>>>>>>>>>> * Memcached
>>>>>>>>>>>>>>>>> We saw performance degradation when running the Memcached benchmark with
>>>>>>>>>>>>>>>>> rodata=full vs rodata=on. Our profiling pointed to kernel TLB pressure.
>>>>>>>>>>>>>>>>> With this patchset, ops/sec is increased by around 3.5% and P99
>>>>>>>>>>>>>>>>> latency is reduced by around 9.6%.
>>>>>>>>>>>>>>>>> The gain mainly came from reduced kernel TLB misses. The kernel
>>>>>>>>>>>>>>>>> TLB
>>>>>>>>>>>>>>>>> MPKI is reduced by 28.5%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The benchmark data is now on par with rodata=on too.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * Disk encryption (dm-crypt) benchmark
>>>>>>>>>>>>>>>>> Ran the fio benchmark with the below command on a 128G ramdisk (ext4) with
>>>>>>>>>>>>>>>>> disk encryption (by dm-crypt).
>>>>>>>>>>>>>>>>> fio --directory=/data --random_generator=lfsr --norandommap --randrepeat 1 \
>>>>>>>>>>>>>>>>>     --status-interval=999 --rw=write --bs=4k --loops=1 --ioengine=sync \
>>>>>>>>>>>>>>>>>     --iodepth=1 --numjobs=1 --fsync_on_close=1 --group_reporting --thread \
>>>>>>>>>>>>>>>>>     --name=iops-test-job --eta-newline=1 --size 100G
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The IOPS is increased by 90% - 150% (the variance is high, but the worst
>>>>>>>>>>>>>>>>> number of the good case is around 90% more than the best number of the bad
>>>>>>>>>>>>>>>>> case). The bandwidth is increased and the avg clat is reduced proportionally.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * Sequential file read
>>>>>>>>>>>>>>>>> Read a 100G file sequentially on XFS (xfs_io read with page cache
>>>>>>>>>>>>>>>>> populated). The bandwidth is increased by 150%.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mikołaj Lenczewski (1):
>>>>>>>>>>>>>>>>> arm64: Add BBM Level 2 cpu feature
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yang Shi (5):
>>>>>>>>>>>>>>>>>       arm64: cpufeature: add AmpereOne to BBML2 allow list
>>>>>>>>>>>>>>>>>       arm64: mm: make __create_pgd_mapping() and helpers non-void
>>>>>>>>>>>>>>>>>       arm64: mm: support large block mapping when rodata=full
>>>>>>>>>>>>>>>>>       arm64: mm: support split CONT mappings
>>>>>>>>>>>>>>>>>       arm64: mm: split linear mapping if BBML2 is not supported on
>>>>>>>>>>>>>>>>>         secondary CPUs
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    arch/arm64/Kconfig                  |  11 +++++
>>>>>>>>>>>>>>>>>    arch/arm64/include/asm/cpucaps.h    |   2 +
>>>>>>>>>>>>>>>>>    arch/arm64/include/asm/cpufeature.h |  15 ++++++
>>>>>>>>>>>>>>>>>    arch/arm64/include/asm/mmu.h        |   4 ++
>>>>>>>>>>>>>>>>>    arch/arm64/include/asm/pgtable.h    |  12 ++++-
>>>>>>>>>>>>>>>>>    arch/arm64/kernel/cpufeature.c      |  95 ++++++++++++++++++++++++++++
>>>>>>>>>>>>>>>>>    arch/arm64/mm/mmu.c                 | 397 ++++++++++++++++++++++++++++++++++++++++++++++-----------------
>>>>>>>>>>>>>>>>>    arch/arm64/mm/pageattr.c            |  37 ++++++++++++---
>>>>>>>>>>>>>>>>>    arch/arm64/tools/cpucaps            |   1 +
>>>>>>>>>>>>>>>>>    9 files changed, 518 insertions(+), 56 deletions(-)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>