[PATCH] arm64: Implement clear_pages()

Will Deacon will at kernel.org
Tue Mar 3 06:46:34 PST 2026


On Tue, Mar 03, 2026 at 11:06:13AM +0100, Linus Walleij wrote:
> A recent patch introduced clear_pages() and made it possible to
> provide per-architecture assembly optimizations for it, as is
> already done for clear_page().
> 
> This augments the existing clear_page() optimization on arm64
> to accept any number of pages, in the following way:
> 
> - Make clear_page() a static inline special case of clear_pages()
>   (see the sketch after this list)
> 
> - Implement clear_pages() as a static inline that just calculates
>   the total number of bytes in the page set and passes this number
>   to the assembly routine clear_pages_asm().
> 
> - The old clear_page assembly is rewritten as clear_pages_asm,
>   which takes a page-aligned start address and the number of
>   bytes to clear from that address.
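
(For reference: per the first bullet, the clear_page() special case
presumably ends up as something like the sketch below; the
corresponding hunk is not shown in the quoted diff.)

	static inline void clear_page(void *to)
	{
		/* Sketch: a single page as a special case of clear_pages() */
		clear_pages(to, 1);
	}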
> 
> This is similar to the optimization provided for x86.
> 
> Performance improvements:
> 
> The baseline is the current v7.0-rc1, which calls the existing
> clear_page() assembly optimization in a loop (see <linux/mm.h>).
> Any improvement comes from avoiding that outer loop; in most
> cases the clearing is linear, so the savings are small and only
> noticeable on really big clearing operations.
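
(For reference, the generic fallback described above is roughly the
following loop; the exact code in <linux/mm.h> may differ:)

	/* Sketch of the generic fallback: one clear_page() call per page */
	static inline void clear_pages(void *addr, unsigned int npages)
	{
		unsigned int i;

		for (i = 0; i < npages; i++)
			clear_page(addr + i * PAGE_SIZE);
	}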
> 
> We boot the kernel with a cmdline like this:
> "default_hugepagesz=1G hugepagesz=1G hugepages=32" to make sure
> we have ample hugepages. This was then tested with the same
> command as the original series:
> 
> perf bench mem mmap -p 1GB -f demand -s 32GB -l 5
> 
> The first run was discarded, as the memory hierarchy is cold at
> that point. I then ran the above command 5 times and averaged
> the throughput, which shows a small but consistent improvement:
> 
> On QEMU:
> 
> Before this patch:     After this patch:
> 2.38 GB/s              2.41 GB/s

I really don't think we should pay attention to performance under QEMU
as it doesn't necessarily have any correlation with real hardware.

> On Radxa Orion O6 hardware we see this on *some* cores and no
> change on others:
> 
> Before this patch:     After this patch:
> 43.3 GB/s              45.3 GB/s
> 
> There is a small but consistent improvement in throughput, as
> expected.
> 
> Tested-by: James Clark <james.clark2 at arm.com>
> Signed-off-by: Linus Walleij <linusw at kernel.org>
> ---
>  arch/arm64/include/asm/page.h                  | 13 ++++++++++++-
>  arch/arm64/kernel/image-vars.h                 |  2 +-
>  arch/arm64/kvm/hyp/nvhe/Makefile               |  2 +-
>  arch/arm64/lib/Makefile                        |  2 +-
>  arch/arm64/lib/{clear_page.S => clear_pages.S} | 18 +++++++++---------
>  5 files changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index b39cc1127e1f..916a3e7c9a19 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -20,7 +20,18 @@ struct page;
>  struct vm_area_struct;
>  
>  extern void copy_page(void *to, const void *from);
> -extern void clear_page(void *to);
> +extern void clear_pages_asm(void *addr, unsigned int nbytes);
> +
> +static inline void clear_pages(void *addr, unsigned int npages)
> +{
> +	clear_pages_asm(addr, npages * PAGE_SIZE);
> +}
> +#define clear_pages clear_pages

Hmm. From what I can tell, this just turns a branch in C code into a
branch in assembly, so it's hard to correlate that meaningfully with
the performance improvement you see.

If we have CPUs that are this sensitive to branches, perhaps we'd be
better off taking the opposite approach and moving more code into C
so that the compiler can optimise the control flow for us?
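
Something like this, for example (a completely untested sketch,
assuming we keep the existing single-page assembly routine as the
primitive):

	/* Untested sketch: keep the loop in C so the compiler owns the control flow */
	static inline void clear_pages(void *addr, unsigned int npages)
	{
		do {
			clear_page(addr);
			addr += PAGE_SIZE;
		} while (--npages);
	}

That way the compiler sees the trip count and can, e.g., drop the
loop entirely for the common npages == 1 case.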

Will


