[PATCH bpf-next v4 0/3] bpf, arm64: use BPF prog pack allocator in BPF JIT

Mon Jul 3 09:40:21 PDT 2023

Hi Mark,

On 6/26/23 10:58 AM, Puranjay Mohan wrote:
> BPF programs currently consume a page each on ARM64. For systems with many BPF
> programs, this adds significant pressure to the instruction TLB. High iTLB
> pressure usually slows down the whole system.
> 
> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue.
> It packs multiple BPF programs into a single huge page. It is currently only
> enabled for the x86_64 BPF JIT.
> 
> This patch series enables the BPF prog pack allocator for the ARM64 BPF JIT.
> 
> ====================================================
> Performance Analysis of prog pack allocator on ARM64
> ====================================================
> 
> To test the performance of the BPF prog pack allocator on ARM64, a stresser
> tool[2] was built. This tool loads 8 BPF programs on the system and triggers
> 5 of them in an infinite loop by doing system calls.
> 
> The runner script starts 20 instances of the above which loads 8*20=160 BPF
> programs on the system, 5*20=100 of which are being constantly triggered.
> 
> In the above environment we build Python-3.8.4 and collect iTLB metrics for the
> compilation done by gcc-12.2.0.
> 
> The source code[3] is configured with the following command:
> ./configure --enable-optimizations --with-ensurepip=install
> 
> Then the runner script is executed with the following command:
> ./run.sh "perf stat -e ITLB_WALK,L1I_TLB,INST_RETIRED,iTLB-load-misses -a make -j32"
> 
> This builds Python while 160 BPF programs are loaded and 100 are being constantly
> triggered and measures iTLB related metrics.
> 
> The output of the above command is discussed below before and after enabling the
> BPF prog pack allocator.
> 
> The tests were run on qemu-system-aarch64 with 32 cpus, 4G memory, -machine virt,
> -cpu host, and -enable-kvm.
> 
> Results
> -------
> 
> Before enabling prog pack allocator:
> ------------------------------------
> 
> Performance counter stats for 'system wide':
> 
>           333278635      ITLB_WALK
>       6762692976558      L1I_TLB
>      25359571423901      INST_RETIRED
>         15824054789      iTLB-load-misses
> 
>       189.029769053 seconds time elapsed
> 
> After enabling prog pack allocator:
> -----------------------------------
> 
> Performance counter stats for 'system wide':
> 
>           190333544      ITLB_WALK
>       6712712386528      L1I_TLB
>      25278233304411      INST_RETIRED
>          5716757866      iTLB-load-misses
> 
>       185.392650561 seconds time elapsed
> 
> Improvements in metrics
> -----------------------
> 
> Compilation time                             ---> 1.92% faster
> iTLB-load-misses/Sec (Less is better)        ---> 63.16% decrease
> ITLB_WALK/1000 INST_RETIRED (Less is better) ---> 42.71% decrease
> ITLB_WALK/L1I_TLB (Less is better)           ---> 42.47% decrease
> 
> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/
> [2] https://github.com/puranjaymohan/BPF-Allocator-Bench
> [3] https://www.python.org/ftp/python/3.8.4/Python-3.8.4.tgz
> 
> Changes in V3 => V4: Changes only in 3rd patch
> 1. Fix the I-cache maintenance: Clean the data cache and invalidate the I-cache
>     only *after* the instructions have been copied to the ROX region.

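For anyone who wants to double-check the percentages in the "Improvements in
metrics" table, here is a minimal Python sketch that recomputes them from the
raw perf counters quoted above. The dictionary keys and helper names
(pct_decrease, derived, improvements) are illustrative assumptions and are not
part of the series or of the benchmark tool[2].

```python
# Raw counters copied from the two perf stat outputs quoted above.
BEFORE = {
    "ITLB_WALK": 333278635,
    "L1I_TLB": 6762692976558,
    "INST_RETIRED": 25359571423901,
    "iTLB-load-misses": 15824054789,
    "seconds": 189.029769053,
}
AFTER = {
    "ITLB_WALK": 190333544,
    "L1I_TLB": 6712712386528,
    "INST_RETIRED": 25278233304411,
    "iTLB-load-misses": 5716757866,
    "seconds": 185.392650561,
}


def pct_decrease(old, new):
    """Relative decrease from old to new, in percent."""
    return (old - new) / old * 100.0


def derived(stats):
    """Normalised metrics used in the improvements table (hypothetical names)."""
    return {
        "misses_per_sec": stats["iTLB-load-misses"] / stats["seconds"],
        "walks_per_1000_inst": stats["ITLB_WALK"] / stats["INST_RETIRED"] * 1000,
        "walks_per_l1i_access": stats["ITLB_WALK"] / stats["L1I_TLB"],
    }


def improvements(before, after):
    """Percent decrease of elapsed time and of each derived metric."""
    d_before, d_after = derived(before), derived(after)
    result = {"seconds": pct_decrease(before["seconds"], after["seconds"])}
    for name in d_before:
        result[name] = pct_decrease(d_before[name], d_after[name])
    return result


if __name__ == "__main__":
    for name, value in improvements(BEFORE, AFTER).items():
        print(f"{name:22s} {value:6.2f}% decrease")
```

Normalising by elapsed time, retired instructions, or L1I TLB accesses keeps
the comparison fair even though the two runs did slightly different amounts of
work.
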
If you get a chance, could you take another look at the v4 changes from
Puranjay? If they look good to you, a reply with an Ack would be great.

Thanks,
Daniel