[PATCHv2 06/11] arm64: entry: move el1 irq/nmi logic to C
He Ying
heying24 at huawei.com
Fri May 7 03:02:13 PDT 2021
On 2021/5/7 17:41, Mark Rutland wrote:
> On Fri, May 07, 2021 at 11:25:31AM +0800, He Ying wrote:
>> On 2021/5/6 18:58, Mark Rutland wrote:
>>> On Thu, May 06, 2021 at 06:25:40PM +0800, He Ying wrote:
>>>> On 2021/5/6 17:16, Mark Rutland wrote:
>>>>> On Thu, May 06, 2021 at 04:28:09PM +0800, He Ying wrote:
>>>>>> Hi Mark,
>>>>> Hi,
>>>>>
>>>>>> I have seen a performance regression in IPI handling since this commit.
>>>>>>
>>>>>> I measure the cycles from the entry of el1_irq to the entry of
>>>>>> gic_handle_irq.
>>>>>>
>>>>>> In my test, this commit adds an average overhead of about 200 cycles.
>>>>>> Do you have any ideas about this? Looking forward to your reply.
>>>>> On that path, the only meaningful difference is the call to
>>>>> enter_el1_irq_or_nmi(), since that's now unconditional, and it's an
>>>>> extra layer in the callchain.
>>>>>
>>>>> When either CONFIG_ARM64_PSEUDO_NMI or CONFIG_TRACE_IRQFLAGS are
>>>>> selected, enter_el1_irq_or_nmi() is a wrapper for functions we'd already
>>>>> call, and I'd expect the cost of the callees to dominate.
>>>>>
>>>>> When neither CONFIG_ARM64_PSEUDO_NMI nor CONFIG_TRACE_IRQFLAGS are
>>>>> selected, this should add a trivial function that immediately returns,
>>>>> and so 200 cycles seems excessive.
>>>>>
>>>>> Building that commit with defconfig, I see that GCC 10.1.0 generates:
>>>>>
>>>>> | ffff800010dfc864 <enter_el1_irq_or_nmi>:
>>>>> | ffff800010dfc864: d503233f paciasp
>>>>> | ffff800010dfc868: d50323bf autiasp
>>>>> | ffff800010dfc86c: d65f03c0 ret
>>>> CONFIG_ARM64_PSEUDO_NMI is not set in my test. And I generate a
>>>> different object from yours:
>>>>
>>>> 00000000000002b8 <enter_el1_irq_or_nmi>:
>>>>
>>>> 2b8: d503233f paciasp
>>>> 2bc: a9bf7bfd stp x29, x30, [sp, #-16]!
>>>> 2c0: 91052000 add x0, x0, #0x148
>>>> 2c4: 910003fd mov x29, sp
>>>> 2c8: 97ffff57 bl 24 <enter_from_kernel_mode.isra.6>
>>>> 2cc: a8c17bfd ldp x29, x30, [sp], #16
>>>> 2d0: d50323bf autiasp
>>>> 2d4: d65f03c0 ret
>>> Which commit are you testing with?
>>>
>>> The call to enter_from_kernel_mode() was introduced later in commit:
>>>
>>> 7cd1ea1010acbede ("arm64: entry: fix non-NMI kernel<->kernel transitions")
>>>
>>> ... and doesn't exist in commit:
>>>
>>> 105fc3352077bba5 ("arm64: entry: move el1 irq/nmi logic to C")
>>>
>>> Do you see the 200 cycle penalty with 105fc3352077bba5 alone? ... or
>>> only after the whole series is applied?
>> Sorry, I didn't point that out. The truth is it's after the whole series
>> is applied.
> Ok. In future it would be very helpful to be more precise, as otherwise
> people can end up wasting time investigating with the wrong information.
>
> What you initially said:
>
> | I have faced a performance regression for handling IPIs since this
> | commit.
>
> ... is somewhat misleading.
Sorry about that. I'll be more careful about that in the future.
>
>>> If enter_from_kernel_mode() is what's taking the bulk of the cycles,
>>> then this is likely unavoidable work that was previously (erroneously)
>>> omitted.
>> Unavoidable work? No, please...
>>>>> ... so perhaps the PACIASP and AUTIASP have an impact?
>>>> I'm not sure...
>>>>> I have a few questions:
>>>>>
>>>>> * Which CPU do you see this on?
>>>> Hisilicon hip05-d02.
>>>>> * Does that CPU implement pointer authentication?
>>>> I'm not sure. How to check?
>>> Does the dmesg contain "Address authentication" anywhere?
>> I don't find "Address authentication" in dmesg. But I find
>> CONFIG_ARM64_PTR_AUTH is set to y in our config.
>>
>> Does the config CONFIG_ARM64_PTR_AUTH impact the performance?
> If your HW implements pointer authentication, then there will be some
> (small) impact. If your HW does not, then the cost should just be a few
> NOPs, and is not expected to be measurable.
OK.
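For reference, one quick way to check from userspace, besides grepping dmesg, is to look for the pointer-authentication hwcaps in /proc/cpuinfo. This is a sketch under the assumption that "paca" (address auth) and "pacg" (generic auth) are the hwcap names the kernel reports; on a machine without pointer authentication it simply prints "absent".

```shell
# Check whether the CPU advertises pointer authentication.
# "paca" is the arm64 hwcap name for address authentication shown in
# the Features line of /proc/cpuinfo.
if grep -qw paca /proc/cpuinfo 2>/dev/null; then
    echo "pointer auth: present"
else
    echo "pointer auth: absent"
fi
```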
>
>>>>> * What kernel config are you using? e.g. is this seen with defconfig?
>>>> Our own. But CONFIG_ARM64_PSEUDO_NMI is not set.
>>>>
>>>> Should I provide it as an attachment?
>>> From your attachment I see that TRACE_IRQFLAGS and LOCKDEP aren't
>>> selected either, so IIUC the only non-trivial bits in
>>> enter_from_kernel_mode() will be the RCU accounting.
>> From my other tests, the following code contributes most to the overhead.
>>
>> static void noinstr enter_from_kernel_mode(struct pt_regs *regs)
>> {
>> 	regs->exit_rcu = false;
>> 	...
>> }
> The logic manipulating regs->exit_rcu and calling rcu_irq_enter() is
> necessary to correctly handle taking interrupts (or other exceptions)
> from idle sequences. Without this, RCU isn't guaranteed to be watching,
> and is unsafe to use.
>
> So this isn't something that can be easily removed.
OK.
>
>>>>> * What's the total cycle count from el1_irq to gic_handle_irq?
>>>> Applying the patchset: 249 cycles.
>>>> Reverting the patchset: 77 cycles.
>>>>
>>>> So a difference of about 170 cycles is probably the more accurate figure.
>>>>
>>>>> * Does this measurably impact a real workload?
>>>> It has some impact on a scheduling perf test.
>>> Does it affect a real workload? i.e. not a microbenchmark?
>> We just run some benchmarks. I'm not sure how it affects a real workload.
> I appreciate that you can measure this with a microbenchmark, but unless
> this affects a real workload in a measurable way I don't think that we
> should make any changes here.
I see. If I find that it affects a real workload in a measurable way, I'll
contact you again. Thanks a lot for all your replies.
>
> Thanks
> Mark.