[PATCH] arm64/mm: save memory access in check_and_switch_context() fast switch path

Pingfan Liu kernelfans at gmail.com
Mon Jul 6 21:50:58 EDT 2020


On Mon, Jul 6, 2020 at 4:10 PM Pingfan Liu <kernelfans at gmail.com> wrote:
>
> On Fri, Jul 3, 2020 at 6:13 PM Mark Rutland <mark.rutland at arm.com> wrote:
> >
> > On Fri, Jul 03, 2020 at 01:44:39PM +0800, Pingfan Liu wrote:
> > > The cpu_number and __per_cpu_offset variables occupy two different cache
> > > lines, and may no longer be cached after a heavy user space load.
> > >
> > > By replacing per_cpu(active_asids, cpu) with this_cpu_ptr(&active_asids) in
> > > the fast path, a register is used and these memory accesses are avoided.
> >
> > How about:
> >
> > | On arm64, smp_processor_id() reads a per-cpu `cpu_number` variable,
> > | using the per-cpu offset stored in the tpidr_el1 system register. In
> > | some cases we generate a per-cpu address with a sequence like:
> > |
> > | | cpu_ptr = &per_cpu(ptr, smp_processor_id());
> > |
> > | Which potentially incurs a cache miss for both `cpu_number` and the
> > | in-memory `__per_cpu_offset` array. This can be written more optimally
> > | as:
> > |
> > | | cpu_ptr = this_cpu_ptr(ptr);
> > |
> > | ... which only needs the offset from tpidr_el1, and does not need to
> > | load from memory.
> I appreciate the clear write-up.
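
To make the difference concrete, here is a minimal sketch of the two access
patterns (the per-cpu variable `example_counter` and the two helpers are
hypothetical, not part of the patch; the caller is assumed to have preemption
disabled):

    #include <linux/percpu.h>
    #include <linux/smp.h>
    #include <linux/types.h>

    static DEFINE_PER_CPU(u64, example_counter);

    /* Loads `cpu_number`, then indexes the in-memory __per_cpu_offset[]. */
    static void bump_slow(void)
    {
            u64 *p = &per_cpu(example_counter, smp_processor_id());

            *p += 1;
    }

    /* Only needs the per-cpu offset held in tpidr_el1 on arm64. */
    static void bump_fast(void)
    {
            u64 *p = this_cpu_ptr(&example_counter);

            *p += 1;
    }
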
> >
> > > By replacing per_cpu(active_asids, cpu) with this_cpu_ptr(&active_asids) in
> > > the fast path, a register is used and these memory accesses are avoided.
> >
> > Do you have any numbers that show benefit here? It's not clear to me how
> > often the above case would apply where the caches would also be hot for
> > everything else we need, and numbers would help to justify that.
> Initially, I was just drawn to the implementation of the macro
> __my_cpu_offset, and that led me to this question. But following your
> reasoning, I realized that data is needed to make things clear.
>
> I have run a test with the 5.8.0-rc4 kernel on a 46-CPU Qualcomm machine.
> command: time -p make all -j138
>
> Before this patch:
> real 291.86
> user 11050.18
> sys 362.91
>
> After this patch
> real 291.11
> user 11055.62
> sys 363.39
>
> As the data shows, the improvement is very small.
A single run may be affected by random factors and is therefore less
persuasive, so I repeated the test with perf stat.
#cat b.sh
make clean && make all -j138

#perf stat --repeat 10 --null --sync sh b.sh

- before this patch
 Performance counter stats for 'sh b.sh' (10 runs):

            298.62 +- 1.86 seconds time elapsed  ( +-  0.62% )


- after this patch
 Performance counter stats for 'sh b.sh' (10 runs):

           297.734 +- 0.954 seconds time elapsed  ( +-  0.32% )


As the mean values (298.62 vs. 297.734 seconds) show, this trivial change does
bring a small improvement in performance.
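
For reference, the fast-path change under discussion is roughly of the
following shape (context abbreviated from arch/arm64/mm/context.c around
v5.8; the posted patch may differ in its exact surroundings):

    -	old_active_asid = atomic64_read(&per_cpu(active_asids, cpu));
    +	old_active_asid = atomic64_read(this_cpu_ptr(&active_asids));
     	if (old_active_asid &&
     	    !((asid ^ atomic64_read(&asid_generation)) >> asid_bits) &&
    -	    atomic64_cmpxchg_relaxed(&per_cpu(active_asids, cpu),
    -				     old_active_asid, asid))
    +	    atomic64_cmpxchg_relaxed(this_cpu_ptr(&active_asids),
    +				     old_active_asid, asid))
     		goto switch_mm_fastpath;
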
>
> Thanks,
> Pingfan


