[PATCHv2 06/11] arm64: entry: move el1 irq/nmi logic to C
Mark Rutland
mark.rutland at arm.com
Fri May 7 02:41:13 PDT 2021
On Fri, May 07, 2021 at 11:25:31AM +0800, He Ying wrote:
> On 2021/5/6 18:58, Mark Rutland wrote:
> > On Thu, May 06, 2021 at 06:25:40PM +0800, He Ying wrote:
> > > On 2021/5/6 17:16, Mark Rutland wrote:
> > > > On Thu, May 06, 2021 at 04:28:09PM +0800, He Ying wrote:
> > > > > Hi Mark,
> > > > Hi,
> > > >
> > > > > I have faced a performance regression in IPI handling since this
> > > > > commit.
> > > > >
> > > > > I calculate the cycles from the entry of el1_irq to the entry of
> > > > > gic_handle_irq.
> > > > >
> > > > > From my test, this commit adds an average overhead of about 200
> > > > > cycles. Do you have any ideas about this? Looking forward to your
> > > > > reply.
> > > > On that path, the only meaningful difference is the call to
> > > > enter_el1_irq_or_nmi(), since that's now unconditional, and it's an
> > > > extra layer in the callchain.
> > > >
> > > > When either CONFIG_ARM64_PSEUDO_NMI or CONFIG_TRACE_IRQFLAGS are
> > > > selected, enter_el1_irq_or_nmi() is a wrapper for functions we'd already
> > > > call, and I'd expect the cost of the callees to dominate.
> > > >
> > > > When neither CONFIG_ARM64_PSEUDO_NMI nor CONFIG_TRACE_IRQFLAGS are
> > > > selected, this should add a trivial function that immediately returns,
> > > > and so 200 cycles seems excessive.
> > > >
> > > > Building that commit with defconfig, I see that GCC 10.1.0 generates:
> > > >
> > > > | ffff800010dfc864 <enter_el1_irq_or_nmi>:
> > > > | ffff800010dfc864: d503233f paciasp
> > > > | ffff800010dfc868: d50323bf autiasp
> > > > | ffff800010dfc86c: d65f03c0 ret
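(For context, the C function behind that disassembly at 105fc3352077bba5
is roughly the sketch below; I'm reproducing it from memory, so treat it
as illustrative rather than the exact source:

  asmlinkage void noinstr enter_el1_irq_or_nmi(struct pt_regs *regs)
  {
          if (IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) && !interrupts_enabled(regs))
                  nmi_enter();
          else
                  trace_hardirqs_off();
  }

With neither CONFIG_ARM64_PSEUDO_NMI nor CONFIG_TRACE_IRQFLAGS selected,
both branches compile to nothing, leaving just the PACIASP/AUTIASP/RET
seen above.)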
> > > CONFIG_ARM64_PSEUDO_NMI is not set in my test. And I generate a
> > > different object from yours:
> > >
> > > 00000000000002b8 <enter_el1_irq_or_nmi>:
> > >  2b8: d503233f paciasp
> > >  2bc: a9bf7bfd stp x29, x30, [sp, #-16]!
> > >  2c0: 91052000 add x0, x0, #0x148
> > >  2c4: 910003fd mov x29, sp
> > >  2c8: 97ffff57 bl 24 <enter_from_kernel_mode.isra.6>
> > >  2cc: a8c17bfd ldp x29, x30, [sp], #16
> > >  2d0: d50323bf autiasp
> > >  2d4: d65f03c0 ret
> > Which commit are you testing with?
> >
> > The call to enter_from_kernel_mode() was introduced later in commit:
> >
> > 7cd1ea1010acbede ("arm64: entry: fix non-NMI kernel<->kernel transitions")
> >
> > ... and doesn't exist in commit:
> >
> > 105fc3352077bba5 ("arm64: entry: move el1 irq/nmi logic to C")
> >
> > Do you see the 200 cycle penalty with 105fc3352077bba5 alone? ... or
> > only after the whole series is applied?
> Sorry, I didn't point that out. The truth is it shows up after the whole
> series is applied.
Ok. In future it would be very helpful to be more precise, as otherwise
people can end up wasting time investigating with the wrong information.
What you initially said:
| I have faced a performance regression for handling IPIs since this
| commit.
... is somewhat misleading.
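
For reference, with the whole series applied (i.e. including
7cd1ea1010acbede), enter_el1_irq_or_nmi() ends up looking roughly like
the sketch below (again quoting from memory, so minor details may
differ), which matches the `bl` to enter_from_kernel_mode in your
disassembly above:

  asmlinkage void noinstr enter_el1_irq_or_nmi(struct pt_regs *regs)
  {
          if (IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) && !interrupts_enabled(regs))
                  arm64_enter_nmi(regs);
          else
                  enter_from_kernel_mode(regs);
  }

With CONFIG_ARM64_PSEUDO_NMI unset, the IS_ENABLED() check folds away
and every EL1 interrupt goes through enter_from_kernel_mode().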
> > If enter_from_kernel_mode() is what's taking the bulk of the cycles,
> > then this is likely unavoidable work that was previously (erroneously)
> > omitted.
> Unavoidable work? No, please...
> >
> > > > ... so perhaps the PACIASP and AUTIASP have an impact?
> > > I'm not sure...
> > > > I have a few questions:
> > > >
> > > > * Which CPU do you see this on?
> > > Hisilicon hip05-d02.
> > > > * Does that CPU implement pointer authentication?
> > > I'm not sure. How to check?
> > Does the dmesg contain "Address authentication" anywhere?
>
> I don't find "Address authentication" in dmesg. But I find
> CONFIG_ARM64_PTR_AUTH is set to y in our config.
>
> Does enabling CONFIG_ARM64_PTR_AUTH impact performance?
If your HW implements pointer authentication, then there will be some
(small) impact. If your HW does not, then the cost should just be a few
NOPs, and is not expected to be measurable.
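
If you want to check from userspace rather than dmesg, one option is to
look at the hwcaps. A minimal sketch, assuming a kernel new enough to
expose HWCAP_PACA (defined locally below in case your headers predate
it):

  #include <stdio.h>
  #include <sys/auxv.h>

  #ifndef HWCAP_PACA
  #define HWCAP_PACA (1UL << 30) /* from arch/arm64/include/uapi/asm/hwcap.h */
  #endif

  int main(void)
  {
          /* HWCAP_PACA is advertised iff the CPUs implement address auth */
          if (getauxval(AT_HWCAP) & HWCAP_PACA)
                  printf("Address authentication: present\n");
          else
                  printf("Address authentication: not present\n");
          return 0;
  }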
> > > > * What kernel config are you using? e.g. is this seen with defconfig?
> > > Our own. But CONFIG_ARM64_PSEUDO_NMI is not set.
> > >
> > > Should I provide it as an attachment?
> > From your attachment I see that TRACE_IRQFLAGS and LOCKDEP aren't
> > selected either, so IIUC the only non-trivial bits in
> > enter_from_kernel_mode() will be the RCU accounting.
>
> From my other tests, the following code contributes most to the overhead:
>
> static void noinstr enter_from_kernel_mode(struct pt_regs *regs)
> {
>         regs->exit_rcu = false;
>         ...
> }
The logic manipulating regs->exit_rcu and calling rcu_irq_enter() is
necessary to correctly handle taking interrupts (or other exceptions)
from idle sequences. Without this, RCU isn't guaranteed to be watching,
and is unsafe to use.
So this isn't something that can be easily removed.
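
For reference, the function is roughly the following (sketched from
memory of the post-series arch/arm64/kernel/entry-common.c; it mirrors
the kernel-mode half of the generic irqentry_enter()):

  static void noinstr enter_from_kernel_mode(struct pt_regs *regs)
  {
          regs->exit_rcu = false;

          /*
           * If we took the exception from idle, RCU may not have been
           * watching, so we have to tell it to start.
           */
          if (!IS_ENABLED(CONFIG_TINY_RCU) && is_idle_task(current)) {
                  lockdep_hardirqs_off(CALLER_ADDR0);
                  rcu_irq_enter();
                  trace_hardirqs_off_finish();

                  regs->exit_rcu = true;
                  return;
          }

          lockdep_hardirqs_off(CALLER_ADDR0);
          rcu_irq_enter_check_tick();
          trace_hardirqs_off_finish();
  }

Even with LOCKDEP and TRACE_IRQFLAGS disabled (where the lockdep and
trace calls compile out), the is_idle_task() check and
rcu_irq_enter_check_tick() remain, and that's the RCU accounting
referred to above.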
> > > > * What's the total cycle count from el1_irq to gic_handle_irq?
> > > Applying the patchset: 249 cycles.
> > >
> > > Reverting the patchset: 77 cycles.
> > >
> > > Maybe ~170 cycles is the more accurate figure for the overhead.
> > >
> > > > * Does this measurably impact a real workload?
> > > It has some impact on a scheduling perf test.
> > Does it affect a real workload? i.e. not a microbenchmark?
> We just run some benchmarks. I'm not sure how it affects a real workload.
I appreciate that you can measure this with a microbenchmark, but unless
this affects a real workload in a measurable way I don't think that we
should make any changes here.
Thanks
Mark.