[PATCH 3/3] perf/arm-ni: Support sharing IRQs within an NI instance

Mon May 19 07:57:31 PDT 2025

On 19/05/2025 1:29 pm, Will Deacon wrote:
> On Tue, May 13, 2025 at 04:39:00PM +0100, Robin Murphy wrote:
>> From: Shouping Wang <allen.wang at hj-micro.com>
>>
>> NI-700 has a distinct PMU interrupt output for each Clock Domain,
>> however some integrations may still combine these together externally.
>> The initial driver didn't attempt to support this, in anticipation of a
>> more general solution for IRQ sharing between system PMU instances, but
>> that's still a way off, so let's make this intermediate step for now to
>> at least allow sharing IRQs within an individual NI instance.
>>
>> Now that CPU affinity and migration are cleaned up, it's fairly
>> straightforward to adopt similar logic to arm-cmn, to identify CDs with
>> a common interrupt and loop over them directly in the handler.
>>
>> Signed-off-by: Shouping Wang <allen.wang at hj-micro.com>
>> [ rm: Rework for affinity handling, cosmetics, new commit message ]
>> Signed-off-by: Robin Murphy <robin.murphy at arm.com>
>> ---
>>   drivers/perf/arm-ni.c | 78 ++++++++++++++++++++++++++++---------------
>>   1 file changed, 51 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/perf/arm-ni.c b/drivers/perf/arm-ni.c
>> index 168750e78fc4..ca033c8785ff 100644
>> --- a/drivers/perf/arm-ni.c
>> +++ b/drivers/perf/arm-ni.c
>> @@ -102,6 +102,7 @@ struct arm_ni_unit {
>>   struct arm_ni_cd {
>>   	void __iomem *pmu_base;
>>   	u16 id;
>> +	s8 irq_friend;
> 
> Why not just store the pointer to the cd as opposed to a relative offset?

Same reason as in arm-cmn - we know we're referencing another entry 
within the same array, so a full pointer costs 8 bytes of memory to be 
mostly redundant, while the offset fits in existing padding so costs 
nothing. In this case we could actually encode an absolute index just as 
easily as a relative offset since we have at most 32 CDs, but since you 
and Mark tend to prefer consistency across drivers where they do similar 
things, following the exact same pattern as arm-cmn feels like the 
clearest option (and in that existing case it is more significant, since 
arm-cmn's offset range is only +/-3, but somewhere in the middle of an 
array of many hundreds.)

>>   	int num_units;
>>   	int irq;
>>   	struct pmu pmu;
>> @@ -448,33 +449,37 @@ static irqreturn_t arm_ni_handle_irq(int irq, void *dev_id)
>>   {
>>   	struct arm_ni_cd *cd = dev_id;
>>   	irqreturn_t ret = IRQ_NONE;
>> -	u32 reg = readl_relaxed(cd->pmu_base + NI_PMOVSCLR);
>>   
>> -	if (reg & (1U << NI_CCNT_IDX)) {
>> -		ret = IRQ_HANDLED;
>> -		if (!(WARN_ON(!cd->ccnt))) {
>> -			arm_ni_event_read(cd->ccnt);
>> -			arm_ni_init_ccnt(cd);
>> +	for (;;) {
> 
> Could this be a do { ... } while (cd) loop instead?

I did try, but it still ends up looking just as clunky if not more so - 
I assume I must have gone through the same exercise 5 years ago for 
arm-cmn to have ended up this way :)

>> +		u32 reg = readl_relaxed(cd->pmu_base + NI_PMOVSCLR);
>> +
>> +		if (reg & (1U << NI_CCNT_IDX)) {
>> +			ret = IRQ_HANDLED;
>> +			if (!(WARN_ON(!cd->ccnt))) {
>> +				arm_ni_event_read(cd->ccnt);
>> +				arm_ni_init_ccnt(cd);
>> +			}
>>   		}
>> -	}
>> -	for (int i = 0; i < NI_NUM_COUNTERS; i++) {
>> -		if (!(reg & (1U << i)))
>> -			continue;
>> -		ret = IRQ_HANDLED;
>> -		if (!(WARN_ON(!cd->evcnt[i]))) {
>> -			arm_ni_event_read(cd->evcnt[i]);
>> -			arm_ni_init_evcnt(cd, i);
>> +		for (int i = 0; i < NI_NUM_COUNTERS; i++) {
>> +			if (!(reg & (1U << i)))
>> +				continue;
>> +			ret = IRQ_HANDLED;
>> +			if (!(WARN_ON(!cd->evcnt[i]))) {
>> +				arm_ni_event_read(cd->evcnt[i]);
>> +				arm_ni_init_evcnt(cd, i);
>> +			}
>>   		}
>> +		writel_relaxed(reg, cd->pmu_base + NI_PMOVSCLR);
>> +		if (!cd->irq_friend)
>> +			return ret;
>> +		cd += cd->irq_friend;
>>   	}
>> -	writel_relaxed(reg, cd->pmu_base + NI_PMOVSCLR);
>> -	return ret;
>>   }
>>   
>>   static int arm_ni_init_cd(struct arm_ni *ni, struct arm_ni_node *node, u64 res_start)
>>   {
>>   	struct arm_ni_cd *cd = ni->cds + node->id;
>>   	const char *name;
>> -	int err;
>>   
>>   	cd->id = node->id;
>>   	cd->num_units = node->num_components;
>> @@ -534,20 +539,11 @@ static int arm_ni_init_cd(struct arm_ni *ni, struct arm_ni_node *node, u64 res_s
>>   		       cd->pmu_base + NI_PMCR);
>>   	writel_relaxed(U32_MAX, cd->pmu_base + NI_PMCNTENCLR);
>>   	writel_relaxed(U32_MAX, cd->pmu_base + NI_PMOVSCLR);
>> -	writel_relaxed(U32_MAX, cd->pmu_base + NI_PMINTENSET);
>>   
>>   	cd->irq = platform_get_irq(to_platform_device(ni->dev), cd->id);
>>   	if (cd->irq < 0)
>>   		return cd->irq;
>>   
>> -	err = devm_request_irq(ni->dev, cd->irq, arm_ni_handle_irq,
>> -			       IRQF_NOBALANCING | IRQF_NO_THREAD,
>> -			       dev_name(ni->dev), cd);
>> -	if (err)
>> -		return err;
>> -
>> -	irq_set_affinity(cd->irq, cpumask_of(ni->cpu));
>> -
>>   	cd->pmu = (struct pmu) {
>>   		.module = THIS_MODULE,
>>   		.parent = ni->dev,
>> @@ -593,6 +589,30 @@ static void arm_ni_probe_domain(void __iomem *base, struct arm_ni_node *node)
>>   	node->num_components = readl_relaxed(base + NI_CHILD_NODE_INFO);
>>   }
>>   
>> +static int arm_ni_init_irqs(struct arm_ni *ni)
>> +{
>> +	int err;
>> +
>> +	ni_for_each_cd(ni, cd) {
>> +		for (struct arm_ni_cd *prev = cd; prev-- > ni->cds; ) {
>> +			if (prev->irq == cd->irq) {
>> +				prev->irq_friend = cd - prev;
> 
> Can't this race with the read of `irq_friend` in the interrupt handler?

Not in any way that matters. Any IRQ at this point would be a spurious 
one left latched at the interrupt controller after we've already reset 
all the PMUs. The handler can hardly observe a torn partial store of 1 
byte, so either it'll see a valid irq_friend or none. Either way we'll 
eventually return IRQ_NONE, EOI the phantom IRQ and be done.

I'm not entertaining the idea of somehow being preempted here for long 
enough for userspace to notice the already-registered PMUs (even though 
the module load hasn't yet finished...), open an event and have it count 
enough to genuinely overflow, because that would be in the order of at 
least tens of seconds if not minutes.

> Similarly, how do you handle the race between the interrupt firing on
> one CD and the friend CD being torn down on e.g. the remove path?

Again, we've already disabled *all* the PMUs (and explicitly cleared 
their PMINTENs to leave no doubt) in arm_ni_remove(). After that, devres 
ordering will then free the IRQs themselves before the CDs ever get freed.

>> +				goto enable_irq;
>> +			}
>> +		}
>> +		err = devm_request_irq(ni->dev, cd->irq, arm_ni_handle_irq,
>> +				       IRQF_NOBALANCING | IRQF_NO_THREAD,
>> +				       dev_name(ni->dev), cd);
> 
> What's the reason not to use IRQF_SHARED and register multiple handlers?
> I'm sure there is one, but capturing that in a comment would help to
> justify the manual multiplexing.

The usual perf reason - IRQF_SHARED allows any *other* random driver to 
also request the same IRQ and then mess with its affinity behind our back.

Thanks,
Robin.