armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

Mark Rutland mark.rutland at arm.com
Wed Nov 29 08:35:01 PST 2023


Hi William,

On Mon, Nov 20, 2023 at 10:32:10PM +0000, Ashley, William wrote:
> Adding linux-arm-kernel at lists.infradead.org and linux-kernel at vger.kernel.org,
> sorry for the noise.

Thanks for this!

For the benefit of others, the original mail (and attachment) can be found at:

  https://lore.kernel.org/linux-perf-users/950001BD-490C-4BAC-8EEA-CDB9F7C4ADFC@amazon.com/

For some reason, links and whitespace seem to have been mangled in the resend;
I'm not sure what happened there.

I've added Will Deacon, as Will and I co-maintain the ARM PMU drivers.

> On 11/20/23, 12:36 PM, "Ashley, William" <wash at amazon.com> wrote:
> 
> 
> An issue [1] was opened in the rr-debugger project reporting occasional missed
> perf event overflow signals on arm64. I've been digging into this and think I
> understand what's happening, but wanted to confirm my understanding.
> 
> The attached example application, derived from an rr-debugger test case,
> reports when the value of a counter doesn't increase by the expected period
> +/- some tolerance. When it is ping-ponged between cores (e.g. with taskset)
> at a high frequency, it frequently reports increases of ~2x the expected
> period. I've confirmed this same behavior on kernels 5.4, 5.10, 6.1 and 6.5.
> 
> I found armv8pmu_disable_intens [2], which is called as part of event
> de-scheduling and contains:
> 
>   /* Clear the overflow flag in case an interrupt is pending. */
>   write_pmovsclr(mask);
> 
> which results in any pending overflow interrupt being dropped. I added some
> debug output here, and there is indeed a correlation between this bit being
> set at the point of the above code and the reproducer identifying a missed
> signal.

I think you're right: if an overflow is asserted at this point, we throw away
the occurrence of that overflow (and so never call perf_event_overflow() to
generate a sample, etc).
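
For reference, the function in question looks roughly like this (a paraphrase
of drivers/perf/arm_pmuv3.c around the commit you linked; exact names and
types vary across the kernel versions you tested):

  static void armv8pmu_disable_intens(u32 mask)
  {
          /* Mask the overflow interrupt(s) for the counter(s) in 'mask'. */
          write_pmintenclr(mask);
          isb();
          /* Clear the overflow flag in case an interrupt is pending. */
          write_pmovsclr(mask);   /* a latched overflow is discarded here */
          isb();
  }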

It looks like we only lose the occurrence of the overflow; the actual counts
will get sampled correctly, and when we next reprogram the event,
armpmu_event_set_period() should set up the next overflow period.
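
To illustrate why reprogramming re-arms things: a minimal sketch of the usual
period programming scheme (simplified from armpmu_event_set_period() in
drivers/perf/arm_pmu.c, eliding the clamping and error handling) preloads the
counter with -left so that it overflows after 'left' further events:

  s64 left = local64_read(&hwc->period_left);

  if (left <= 0) {
          /* Start a fresh period if the previous one was consumed. */
          left += hwc->sample_period;
          local64_set(&hwc->period_left, left);
  }

  /* Remember where we started, and arm the counter to wrap at 0. */
  local64_set(&hwc->prev_count, (u64)-left);
  armpmu->write_counter(event, (u64)(-left) & max_period);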

> This behavior does not occur with pseudo-NMIs (irqchip.gicv3_pseudo_nmi=1)
> enabled.

That's interesting, because it implies that the PMU overflow interrupt is being
recognised by the CPU while regular interrupts are disabled. There are some
narrow races where that could occur (e.g. taking a timer or scheduler IRQ
*just* as an overflow occurs), and some other cases I'd expect RR to avoid by
construction (e.g. if RR isn't using mode exclusion and also counts kernel
events). It's also worth noting that this means there are races even with
pseudo-NMI where overflows could be lost.

How often do you see the overflow being lost?

Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
intentionally count events that occur within the kernel?
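
For reference, these are the mode-exclusion bits I mean; a hypothetical
counter opened as below (illustrative values, not RR's actual configuration)
never counts, and so never overflows, while the CPU is in the kernel:

  struct perf_event_attr attr = {
          .type           = PERF_TYPE_HARDWARE,
          .config         = PERF_COUNT_HW_INSTRUCTIONS,
          .sample_period  = 100000,
          .exclude_kernel = 1,    /* don't count at EL1 */
          .exclude_hv     = 1,    /* don't count at EL2 */
  };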

> When an event is not being explicitly torn down (e.g. being closed), this seems
> like an undesirable behavior.

I agree it's undesirable, though my understanding is that for most other users
this isn't any worse than losing samples for other reasons (e.g. the perf ring
buffer being full), and for the general case of sample-based profiling, losing
a sample every so often isn't the end of the world.

> I haven't attempted to demo it yet, but I suspect
> an application disabling an event temporarily could occasionally see the same
> missed overflow signals. Is my understanding here correct?

That sounds right to me, though I haven't checked that end-to-end yet.

> Does anyone have thoughts on how this could be addressed without creating
> other issues?

We should be able to detect overflow from the counter value alone, so we might
be able to account for that when we actually read the event out, or when we
schedule it back in and reprogram the period.
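
As a sketch of the idea (illustrative only, and assuming a full-width counter
programmed to start at (u64)-left as in the period-programming sketch above):
the counter overflows when it wraps through zero, so an unsigned comparison of
the raw value against the previously-read raw value reveals an overflow even
though the interrupt was discarded:

  static bool counter_has_wrapped(u64 prev_raw, u64 now_raw)
  {
          /* Counting up from (u64)-left, a wrap leaves now < prev. */
          return now_raw < prev_raw;
  }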

I'm not sure if we can reasonably do that when scheduling the event out, since
if we're switching tasks, that'll queue up IRQ work which will be triggered in
the context of the next task.

We might be able to figure out that we have an overflow when we schedule the
event in under armpmu_start(), but I'll need to go digging to see whether
there really are any free-running counters, as the comment above the call to
armpmu_event_set_period() suggests, or whether that's a historical artifact.
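
One hypothetical shape for that (not a real patch; the placement and the
helper below are illustrative, reusing counter_has_wrapped() from the sketch
above):

  /* In armpmu_start(), before re-arming the period on sched-in: */
  if (counter_has_wrapped(local64_read(&hwc->prev_count),
                          armpmu->read_counter(event))) {
          /* Account for the overflow whose interrupt was discarded. */
          armpmu_handle_missed_overflow(event);   /* hypothetical helper */
  }

  armpmu_event_set_period(event);
  armpmu->enable(event);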

I suspect we might need to restructure the code somewhat to be able to catch
overflows more consistently. I'll see if I can come up with something, but we
might not be able to guarantee this in all cases.

Mark.

> [1] https://github.com/rr-debugger/rr/issues/3607
> [2] https://github.com/torvalds/linux/blob/c42d9eeef8e5ba9292eda36fd8e3c11f35ee065c/drivers/perf/arm_pmuv3.c#L652C20-L652C43


