[PATCH v7 0/6] arm64: support FEAT_BBM level 2 and large block mapping when rodata=full
Ryan Roberts
ryan.roberts at arm.com
Tue Sep 9 09:32:36 PDT 2025
On 09/09/2025 16:32, Yang Shi wrote:
>
>
> On 9/9/25 7:36 AM, Ryan Roberts wrote:
>> On 08/09/2025 19:31, Yang Shi wrote:
>>>
>>> On 9/8/25 9:34 AM, Ryan Roberts wrote:
>>>> On 04/09/2025 22:49, Yang Shi wrote:
>>>>> On 9/4/25 10:47 AM, Yang Shi wrote:
>>>>>> On 9/4/25 6:16 AM, Ryan Roberts wrote:
>>>>>>> On 04/09/2025 14:14, Ryan Roberts wrote:
>>>>>>>> On 03/09/2025 01:50, Yang Shi wrote:
>>>>>>>>>>>> I am wondering whether we can just have a warn_on_once or something
>>>>>>>>>>>> for the case when we fail to allocate a pagetable page. Or, Ryan had
>>>>>>>>>>>> suggested in an off-the-list conversation that we can maintain a cache
>>>>>>>>>>>> of PTE tables for every PMD block mapping, which will give us the same
>>>>>>>>>>>> memory consumption as we do today, but not sure if this is worth it.
>>>>>>>>>>>> x86 can already handle splitting but due to the callchains I have
>>>>>>>>>>>> described above, it has the same problem, and the code has been working
>>>>>>>>>>>> for years :)
>>>>>>>>>>> I think it's preferable to avoid having to keep a cache of pgtable
>>>>>>>>>>> memory if we can...
>>>>>>>>>> Yes, I agree. We simply don't know how many pages we need to cache, and
>>>>>>>>>> it still can't guarantee 100% allocation success.
>>>>>>>>> This is wrong... We can know how many pages will be needed for splitting
>>>>>>>>> the linear mapping to PTEs in the worst case once the linear mapping is
>>>>>>>>> finalized. But it may require a few hundred megabytes of memory to
>>>>>>>>> guarantee allocation success. I don't think it is worth it for such a
>>>>>>>>> rare corner case.
>>>>>>>> Indeed, we know exactly how much memory we need for pgtables to map the
>>>>>>>> linear map by pte - that's exactly what we are doing today. So we _could_
>>>>>>>> keep a cache. We would still get the benefit of improved performance but
>>>>>>>> we would lose the benefit of reduced memory.
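(For a concrete feel of the numbers being discussed, a back-of-the-envelope
sketch, assuming 4K pages and one PTE table per 2M PMD block of the linear map:)

/*
 * Worst case for pre-allocating enough PTE tables to split the whole
 * linear map: one 4K PTE table covers one 2M PMD block, i.e. roughly
 * RAM/512 (~0.2% of memory).
 *
 *    64 GiB RAM:  64 GiB / 2 MiB =  32768 tables * 4 KiB = 128 MiB
 *   256 GiB RAM: 256 GiB / 2 MiB = 131072 tables * 4 KiB = 512 MiB
 *
 * which is where the "few hundred megabytes" figure comes from.
 */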
>>>>>>>>
>>>>>>>> I think we need to solve the vm_reset_perms() problem somehow, before we
>>>>>>>> can enable this.
>>>>>>> Sorry I realise this was not very clear... I am saying I think we need to
>>>>>>> fix it somehow. A cache would likely work. But I'd prefer to avoid it if we
>>>>>>> can find a better solution.
>>>>>> Took a deeper look at vm_reset_perms(). It was introduced by commit
>>>>>> 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions").
>>>>>> The VM_FLUSH_RESET_PERMS flag is supposed to be set if the vmalloc memory
>>>>>> is RO and/or ROX, so set_memory_ro() or set_memory_rox() is supposed to
>>>>>> follow the vmalloc(). So the page table should already be split before
>>>>>> reaching vfree(). I think this is why vm_reset_perms() doesn't check the
>>>>>> return value.
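For illustration, the expected pattern for such a caller is something like the
below (a minimal sketch, not any specific call site; the function name is made
up):

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <linux/vmalloc.h>

static void *alloc_rox_buffer(size_t size)
{
	void *p = vmalloc(size);

	if (!p)
		return NULL;

	/* ask vfree() to reset/flush the direct map when this is freed */
	set_vm_flush_reset_perms(p);

	/* ... populate p while it is still writable ... */

	/*
	 * This is where the linear map gets split (and where a failure can
	 * still be reported cleanly), before the memory becomes ROX.
	 */
	if (set_memory_rox((unsigned long)p, PAGE_ALIGN(size) >> PAGE_SHIFT)) {
		vfree(p);
		return NULL;
	}

	return p;
}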
>>>> If vm_reset_perms() is assuming it can't/won't fail, I think it should at least
>>>> output a warning if it does?
>>> It should. Anyway, a warning will be raised if the split fails, so we have
>>> some mitigation.
>>>
>>>>>> I scrutinized all the callsites with VM_FLUSH_RESET_PERMS flag set.
>>>> Just checking; I think you made a comment before about there only being a few
>>>> sites that set VM_FLUSH_RESET_PERMS. But one of them is the helper,
>>>> set_vm_flush_reset_perms(). So just making sure you also followed through to
>>>> the places that use that helper?
>>> Yes, I did.
>>>
>>>>>> Most of them have set_memory_ro() or set_memory_rox() called afterwards.
>>>> And are all callsites calling set_memory_*() for the entire cell that was
>>>> allocated by vmalloc? If there are cases where it only calls that for a
>>>> portion of it, then it's not guaranteed that the memory is correctly split.
>>> Yes, all callsites call set_memory_*() for the entire range.
>>>
>>>>>> But there are 3 places where I don't see set_memory_ro()/set_memory_rox()
>>>>>> being called.
>>>>>>
>>>>>> 1. BPF trampoline allocation. The BPF trampoline calls
>>>>>> arch_protect_bpf_trampoline(). The generic implementation does call
>>>>>> set_memory_rox(), but the x86 and arm64 implementations just return 0.
>>>>>> For x86, it is because the execmem cache is used and it does call
>>>>>> set_memory_rox(). ARM64 doesn't need to split the page table before this
>>>>>> series, so it should never fail. I think we just need to use the generic
>>>>>> implementation (remove the arm64 implementation) if this series is merged.
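For reference, the generic weak implementation that would then take effect is,
if I remember the current kernel/bpf/trampoline.c correctly, roughly:

int __weak arch_protect_bpf_trampoline(void *image, unsigned int size)
{
	WARN_ON_ONCE(size > PAGE_SIZE);
	return set_memory_rox((long)image, 1);
}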
>>>> I know zero about BPF. But it looks like the allocation happens in
>>>> arch_alloc_bpf_trampoline(), which for arm64, calls bpf_prog_pack_alloc(). And
>>>> for small sizes, it grabs some memory from a "pack". So doesn't this mean that
>>>> you are calling set_memory_rox() for a sub-region of the cell, so that doesn't
>>>> actually help at vm_reset_perms()-time?
>>> Took a deeper look at bpf pack allocator. The "pack" is allocated by
>>> alloc_new_pack(), which does:
>>> bpf_jit_alloc_exec()
>>> set_vm_flush_reset_perms()
>>> set_memory_rox()
>>>
>>> If the size is greater than the pack size, it calls:
>>> bpf_jit_alloc_exec()
>>> set_vm_flush_reset_perms()
>>> set_memory_rox()
>>>
>>> So it looks like the bpf trampoline is good, and we don't need to do anything.
>>> It should be removed from the list. I didn't look deeply enough at the bpf
>>> pack allocator in the first place.
>>>
>>>>>> 2. BPF dispatcher. It calls execmem_alloc(), which has VM_FLUSH_RESET_PERMS
>>>>>> set. But it is used for RW allocation, so VM_FLUSH_RESET_PERMS should be
>>>>>> unnecessary IIUC. So it doesn't matter even if vm_reset_perms() fails.
>>>>>>
>>>>>> 3. kprobe. S390's alloc_insn_page() does call set_memory_rox(), x86 also
>>>>>> called set_memory_rox() before switching to execmem cache. The execmem cache
>>>>>> calls set_memory_rox(). I don't know why ARM64 doesn't call it.
>>>>>>
>>>>>> So I think we just need to fix #1 and #3 per the above analysis. If this
>>>>>> analysis looks correct to you guys, I will prepare two patches to fix them.
>>>> This all seems quite fragile. I find it interesting that vm_reset_perms() is
>>>> doing break-before-make; it sets the PTEs as invalid, then flushes the TLB,
>>>> then sets them to default. But for arm64, at least, I think break-before-make
>>>> is not required. We are only changing the permissions so that can be done on
>>>> live mappings; essentially change the sequence to: set default, flush TLB.
>>> Yeah, I agree it is a little bit fragile. I think this is the "contract" for
>>> vmalloc users: if you allocate ROX memory via vmalloc, you are required to
>>> call set_memory_*(). But there is nothing to guarantee the "contract" is
>>> followed. I don't think this is the only such case in the kernel, though.
>>>
>>>> If we do that, then if the memory was already default, there is no need to do
>>>> anything (so no chance of allocation failure). If the memory was not default,
>>>> then it must have already been split to make it non-default, in which case we
>>>> can also guarantee that no allocations are required.
>>>>
>>>> What am I missing?
>>> The comment says:
>>> "Set direct map to something invalid so that it won't be cached if there are
>>> any accesses after the TLB flush, then flush the TLB and reset the direct map
>>> permissions to the default."
>>>
>>> IIUC, it guarantees the direct map can't be cached in the TLB after the TLB
>>> flush from _vm_unmap_aliases() by setting the entries invalid, because the TLB
>>> never caches invalid entries. Skipping setting the direct map to invalid seems
>>> to break this. Or can "changing permission on live mappings" on ARM64 achieve
>>> the same goal?
>> Here's my understanding of the intent of the code:
>>
>> Let's say we start with some memory that has been mapped RO. Our goal is to
>> reset the memory back to RW and ensure that no TLB entry remains in the TLB for
>> the old RO mapping. There are 2 ways to do that:
>
>
>
>>
>> Approach 1 (used in current code):
>> 1. set PTE to invalid
>> 2. invalidate any TLB entry for the VA
>> 3. set the PTE to RW
>>
>> Approach 2:
>> 1. set the PTE to RW
>> 2. invalidate any TLB entry for the VA
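In terms of the direct-map helpers, the difference is roughly the following
(a per-page sketch with made-up function names; the real vm_reset_perms()
applies the set_direct_map_*() calls over the whole area and batches the TLB
flush via _vm_unmap_aliases()):

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/* Approach 1: break-before-make, as the current code does */
static void reset_page_bbm(struct page *page)
{
	unsigned long addr = (unsigned long)page_address(page);

	set_direct_map_invalid_noflush(page);		/* 1. invalidate the PTE */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);	/* 2. flush the TLB      */
	set_direct_map_default_noflush(page);		/* 3. set back to RW     */
}

/* Approach 2: permission-only change on the live mapping */
static void reset_page_live(struct page *page)
{
	unsigned long addr = (unsigned long)page_address(page);

	set_direct_map_default_noflush(page);		/* 1. set to RW (default) */
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);	/* 2. flush the TLB       */
}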
>
> IIUC, the intent of the code is "reset direct map permission *without* leaving
> a RW+X window". The TLB flush call actually flushes both the vmalloc VA and the
> direct map together. So if this is the intent, approach #2 may leave the VA with
> X permission while the direct map is already RW at the same time. That seems to
> break the intent.
Ahh! Thanks, it's starting to make more sense now.
Though on first sight it seems a bit mad to me to form a tlb flush range that
covers all the direct map pages and all the lazy vunmap regions. Is that
intended to be a perf optimization or something else? It's not clear from the
history.
Could this be split into 2 operations?
1. unmap the aliases (+ tlbi the aliases).
2. set the direct map memory back to default (+ tlbi the direct map region).
The only 2 potential problems I can think of are:
- Performance: 2 tlbis instead of 1, but conversely we probably avoid flushing
a load of TLB entries that we didn't really need to.
- Given there is now no lock around the tlbis (currently it's under
vmap_purge_lock), is there a race where a new alias can appear between steps 1
and 2? I don't think so, because the memory is allocated to the current mapping,
so how is it going to get re-mapped?
Could this solve it?
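To make that concrete, something like the below is what I have in mind - a
rough, untested sketch against the current vm_reset_perms() (the
range-computation loop and the helpers are from mm/vmalloc.c; whether
_vm_unmap_aliases(ULONG_MAX, 0, 0) gives exactly the "aliases only" flush
intended here would need checking):

static void vm_reset_perms(struct vm_struct *area)
{
	unsigned long start = ULONG_MAX, end = 0;
	unsigned int page_order = vm_area_page_order(area);
	int flush_dmap = 0;
	int i;

	/* Work out the direct map range covered by this area's pages */
	for (i = 0; i < area->nr_pages; i += 1U << page_order) {
		unsigned long addr = (unsigned long)page_address(area->pages[i]);

		if (addr) {
			unsigned long page_size = PAGE_SIZE << page_order;

			start = min(addr, start);
			end = max(addr + page_size, end);
			flush_dmap = 1;
		}
	}

	/* 1. Unmap the lazy vmalloc aliases and flush (only) their VAs */
	_vm_unmap_aliases(ULONG_MAX, 0, 0);

	/*
	 * 2. Reset the direct map permissions on the live mapping, then
	 *    flush its range. The entries are either already default or
	 *    already split, so no pgtable allocation should be needed.
	 */
	set_area_direct_map(area, set_direct_map_default_noflush);
	if (flush_dmap)
		flush_tlb_kernel_range(start, end);
}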
>
> Thanks,
> Yang
>
>>
>> The benefit of approach 1 is that it is guaranteed that it is impossible for
>> different CPUs to have different translations for the same VA in their
>> respective TLBs. But for approach 2, it's possible that between steps 1 and 2,
>> one CPU has a RO entry and another CPU has a RW entry. But that will get fixed
>> once the TLB is flushed - it's not really an issue.
>>
>> (There is probably also an obscure way to end up with 2 TLB entries (one with RO
>> and one with RW) for the same CPU, but the arm64 architecture permits that as
>> long as it's only a permission mismatch).
>>
>> Anyway, approach 2 is used when changing memory permissions on user mappings, so
>> I don't see why we can't take the same approach here. That would solve this
>> whole class of issue for us.
>>
>> Thanks,
>> Ryan
>>
>>
>>> Thanks,
>>> Yang
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>>> Tested the below patch with bpftrace kfunc (which allocates a bpf trampoline)
>>>>> and kprobes. It seems to work well.
>>>>>
>>>>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
>>>>> index 0c5d408afd95..c4f8c4750f1e 100644
>>>>> --- a/arch/arm64/kernel/probes/kprobes.c
>>>>> +++ b/arch/arm64/kernel/probes/kprobes.c
>>>>> @@ -10,6 +10,7 @@
>>>>>
>>>>> #define pr_fmt(fmt) "kprobes: " fmt
>>>>>
>>>>> +#include <linux/execmem.h>
>>>>> #include <linux/extable.h>
>>>>> #include <linux/kasan.h>
>>>>> #include <linux/kernel.h>
>>>>> @@ -41,6 +42,17 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>>>>> static void __kprobes
>>>>> post_kprobe_handler(struct kprobe *, struct kprobe_ctlblk *, struct pt_regs *);
>>>>>
>>>>> +void *alloc_insn_page(void)
>>>>> +{
>>>>> + void *page;
>>>>> +
>>>>> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>>>>> + if (!page)
>>>>> + return NULL;
>>>>> + set_memory_rox((unsigned long)page, 1);
>>>>> + return page;
>>>>> +}
>>>>> +
>>>>> static void __kprobes arch_prepare_ss_slot(struct kprobe *p)
>>>>> {
>>>>> kprobe_opcode_t *addr = p->ainsn.xol_insn;
>>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
>>>>> index 52ffe115a8c4..3e301bc2cd66 100644
>>>>> --- a/arch/arm64/net/bpf_jit_comp.c
>>>>> +++ b/arch/arm64/net/bpf_jit_comp.c
>>>>> @@ -2717,11 +2717,6 @@ void arch_free_bpf_trampoline(void *image, unsigned int size)
>>>>> bpf_prog_pack_free(image, size);
>>>>> }
>>>>>
>>>>> -int arch_protect_bpf_trampoline(void *image, unsigned int size)
>>>>> -{
>>>>> - return 0;
>>>>> -}
>>>>> -
>>>>> int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *ro_image,
>>>>> void *ro_image_end, const struct btf_func_model *m,
>>>>> u32 flags, struct bpf_tramp_links *tlinks,
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Yang
>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Yang
>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ryan
>>>>>>>>>>>
>>>>>>>>>>>
>