[PATCH] arm64/mm: save memory access in check_and_switch_context() fast switch path

Mark Rutland mark.rutland at arm.com
Fri Jul 3 06:13:36 EDT 2020


On Fri, Jul 03, 2020 at 01:44:39PM +0800, Pingfan Liu wrote:
> The cpu_number and __per_cpu_offset cost two different cache lines, and may
> not exist after a heavy user space load.
> 
> By replacing per_cpu(active_asids, cpu) with this_cpu_ptr(&active_asids) in
> fast path, register is used and these memory access are avoided.

How about:

| On arm64, smp_processor_id() reads a per-cpu `cpu_number` variable,
| using the per-cpu offset stored in the tpidr_el1 system register. In
| some cases we generate a per-cpu address with a sequence like:
|
| | cpu_ptr = &per_cpu(ptr, smp_processor_id());
|
| Which potentially incurs a cache miss for both `cpu_number` and the
| in-memory `__per_cpu_offset` array. This can be written more optimally
| as:
|
| | cpu_ptr = this_cpu_ptr(ptr);
|
| ... which only needs the offset from tpidr_el1, and does not need to
| load from memory.

> By replacing per_cpu(active_asids, cpu) with this_cpu_ptr(&active_asids) in
> fast path, register is used and these memory access are avoided.

Do you have any numbers that show a benefit here? It's not clear to me
how often the above case applies while the caches are still hot for
everything else we need, and numbers would help to justify the change.

> Signed-off-by: Pingfan Liu <kernelfans at gmail.com>
> Cc: Catalin Marinas <catalin.marinas at arm.com>
> Cc: Will Deacon <will at kernel.org>
> Cc: Steve Capper <steve.capper at arm.com>
> Cc: Mark Rutland <mark.rutland at arm.com>
> Cc: Vladimir Murzin <vladimir.murzin at arm.com>
> Cc: Jean-Philippe Brucker <jean-philippe at linaro.org>
> To: linux-arm-kernel at lists.infradead.org
> ---
>  arch/arm64/include/asm/mmu_context.h |  6 ++----
>  arch/arm64/mm/context.c              | 10 ++++++----
>  2 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index ab46187..808c3be 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -175,7 +175,7 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp)
>   * take CPU migration into account.
>   */
>  #define destroy_context(mm)		do { } while(0)
> -void check_and_switch_context(struct mm_struct *mm, unsigned int cpu);
> +void check_and_switch_context(struct mm_struct *mm);
>  
>  #define init_new_context(tsk,mm)	({ atomic64_set(&(mm)->context.id, 0); 0; })
>  
> @@ -214,8 +214,6 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
>  
>  static inline void __switch_mm(struct mm_struct *next)
>  {
> -	unsigned int cpu = smp_processor_id();
> -
>  	/*
>  	 * init_mm.pgd does not contain any user mappings and it is always
>  	 * active for kernel addresses in TTBR1. Just set the reserved TTBR0.
> @@ -225,7 +223,7 @@ static inline void __switch_mm(struct mm_struct *next)
>  		return;
>  	}
>  
> -	check_and_switch_context(next, cpu);
> +	check_and_switch_context(next);
>  }
>  
>  static inline void
> diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c
> index d702d60..a206655 100644
> --- a/arch/arm64/mm/context.c
> +++ b/arch/arm64/mm/context.c
> @@ -198,9 +198,10 @@ static u64 new_context(struct mm_struct *mm)
>  	return idx2asid(asid) | generation;
>  }
>  
> -void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
> +void check_and_switch_context(struct mm_struct *mm)
>  {
>  	unsigned long flags;
> +	unsigned int cpu;
>  	u64 asid, old_active_asid;
>  
>  	if (system_supports_cnp())
> @@ -222,9 +223,9 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>  	 *   relaxed xchg in flush_context will treat us as reserved
>  	 *   because atomic RmWs are totally ordered for a given location.
>  	 */
> -	old_active_asid = atomic64_read(&per_cpu(active_asids, cpu));
> +	old_active_asid = atomic64_read(this_cpu_ptr(&active_asids));
>  	if (old_active_asid && asid_gen_match(asid) &&
> -	    atomic64_cmpxchg_relaxed(&per_cpu(active_asids, cpu),
> +	    atomic64_cmpxchg_relaxed(this_cpu_ptr(&active_asids),
>  				     old_active_asid, asid))
>  		goto switch_mm_fastpath;
>  
> @@ -236,10 +237,11 @@ void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
>  		atomic64_set(&mm->context.id, asid);
>  	}
>  
> +	cpu = smp_processor_id();
>  	if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))
>  		local_flush_tlb_all();
>  
> -	atomic64_set(&per_cpu(active_asids, cpu), asid);
> +	atomic64_set(this_cpu_ptr(&active_asids), asid);
>  	raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);

FWIW, this looks sound to me.
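
In case it helps review, the fast path after this change reads roughly
as below (pieced together from the hunks above, not compile-tested):

| 	old_active_asid = atomic64_read(this_cpu_ptr(&active_asids));
| 	if (old_active_asid && asid_gen_match(asid) &&
| 	    atomic64_cmpxchg_relaxed(this_cpu_ptr(&active_asids),
| 				     old_active_asid, asid))
| 		goto switch_mm_fastpath;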

Mark.


