[PATCHv2 06/11] arm64: entry: move el1 irq/nmi logic to C

Thu May 6 20:25:31 PDT 2021

在 2021/5/6 18:58, Mark Rutland 写道:
> On Thu, May 06, 2021 at 06:25:40PM +0800, He Ying wrote:
>> 在 2021/5/6 17:16, Mark Rutland 写道:
>>> On Thu, May 06, 2021 at 04:28:09PM +0800, He Ying wrote:
>>>> Hi Mark,
>>> Hi,
>>>
>>>> I have faced a performance regression for handling IPIs since this commit.
>>>>
>>>> I caculate the cycles from the entry of el1_irq to the entry of
>>>> gic_handle_irq.
>>>>
>>>>   From my test, this commit may overhead an average of 200 cycles. Do you
>>>>
>>>> have any ideas about this? Looking forward to your reply.
>>> On that path, the only meaningfull difference is the call to
>>> enter_el1_irq_or_nmi(), since that's now unconditional, and it's an
>>> extra layer in the callchain.
>>>
>>> When either CONFIG_ARM64_PSEUDO_NMI or CONFIG_TRACE_IRQFLAGS are
>>> selected, enter_el1_irq_or_nmi() is a wrapper for functions we'd already
>>> call, and I'd expectthe cost of the callees to dominate.
>>>
>>> When neither CONFIG_ARM64_PSEUDO_NMI nor CONFIG_TRACE_IRQFLAGS are
>>> selected, this should add a trivial function that immediately returns,
>>> and so 200 cycles seems excessive.
>>>
>>> Building that commit with defconfig, I see that GCC 10.1.0 generates:
>>>
>>> | ffff800010dfc864 <enter_el1_irq_or_nmi>:
>>> | ffff800010dfc864:       d503233f        paciasp
>>> | ffff800010dfc868:       d50323bf        autiasp
>>> | ffff800010dfc86c:       d65f03c0        ret
>> CONFIG_ARM64_PSEUDO_NMI is not set in my test. And I generate a different
>> object
>>
>> from yours:
>>
>> 00000000000002b8 <enter_el1_irq_or_nmi>:
>>
>>   2b8:        d503233f         paciasp
>>   2bc:        a9bf7bfd          stp  x29, x30, [sp, #-16]!
>>   2c0:        91052000         add x0, x0, #0x148
>>   2c4:        910003fd          mov x29, sp
>>   2c8:        97ffff57             bl 24 <enter_from_kernel_mode.isra.6>
>>   2cc:        a8c17bfd           ldp x29, x30, [sp], #16
>>   2d0:       d50323bf           autiasp
>>   2d4:       d65f03c0            ret
> Which commit are you testing with?
>
> The call to enter_from_kernel_mode() was introduced later in commit:
>
>    7cd1ea1010acbede ("rm64: entry: fix non-NMI kernel<->kernel transitions")
>
> ... and doesn't exist in commit:
>
>    105fc3352077bba5 ("arm64: entry: move el1 irq/nmi logic to C")
>
> Do you see the 200 cycle penalty with 105fc3352077bba5 alone? ... or
> only only after the whole series is applied?
Sorry I didn't point it out. The truth is after the whole series is applied.
>
> If enter_from_kernel_mode() is what's taking the bulk of the cycles,
> then this is likely unavoidable work that previously (erroneously)
> omitted.
Unavoided work? No, please...
>
>>> ... so perhaps the PACIASP and AUTIASP have an impact?
>> I'm not sure...
>>> I have a few questions:
>>>
>>> * Which CPU do you see this on?
>> Hisilicon hip05-d02.
>>> * Does that CPU implement pointer authentication?
>> I'm not sure. How to check?
> Does the dmesg contain "Address authentication" anywhere?

I don't find "Address authentication" in dmesg. But I find 
CONFIG_ARM64_PTR_AUTH is set to y in our config.

Does the config CONFIG_ARM64_PTR_AUTH impact the performance?

>
>>> * What kernel config are you using? e.g. is this seen with defconfig?
>> Our own. But CONFIG_ARM64_PSEUDO_NMI is not set.
>>
>> Should I provide it as an attachment?
> >From your attachment I see that TRACE_IRQFLAGS and LOCKDEP aren't
> selected either, so IIUC the only non-trivial bits in
> enter_from_kernel_mode() will be the RCU accounting.

 From my other tests, the following code contriutes most to the overhead.

static void noinstr enter_from_kernel_mode(struct pt_regs *regs)

{

   regs->exit_rcu = false;

   ...

}

>
>>> * What's the total cycle count from el1_irq to gic_handle_irq?
>> Applying the patchset:   249 cycles.
>>
>> Reverting the patchset:  77 cycles.
>>
>> Maybe 170 cycles is more correct.
>>
>>> * Does this measurably impact a real workload?
>> Have some impact to scheduling perf test.
> Does it affect a real workload? i.e. not a microbenchmark?
We just run some benchmarks. I'm not sure how it affects a real workload.
>
> Thanks,
> Mark.
> .