[RFC PATCH] irq: allow percpu_devid interrupts to be requested with target mask

Fri Nov 27 11:15:52 PST 2015

Will,

On Fri, 27 Nov 2015, Will Deacon wrote:

> On multi-cluster platforms, a per-cpu interrupt may actually be wired on
> a per-cluster basis, meaning that different devices may have the same
> interrupt number depending on the CPU.
> 
> This is problematic for drivers using the percpu_device interface as
> there is currently no way to request an interrupt of this type for a
> subset of CPUs. Furthermore, interrupt sharing is not permitted.
> 
> This patch adds the ability to provide a CPU mask to the percpu
> interrupt request/free functions such that the interrupt is only enabled
> on those CPUs set in the mask. Each irqaction has a copy of the mask to
> allow the percpu_devid flow handler to dispatch to the correct driver
> when an interrupt occurs. Whilst interrupt sharing is still forbidden
> in the usual sense, multiple actions are permitted providing that their
> target CPU masks do not intersect.

I'm not too excited about this approach. That looks pretty much bolted
on the existing percpu_irq machinery and you add the overhead of the
action loop plus the mask check to each per cpu interrupt. Not really
efficient.

I assume that 

> meaning that different devices may have the same interrupt number
> depending on the CPU.

means: They have the same HW interrupt number and therefor the same
linux irq number due to the way percpu_irq works today.

So the scenario is:

|----------------------|   |--------------------|
|Cluster 0             |   |Cluster1            |
|----------------------|   |--------------------|
|PerCPU irqs 	       |   |PerCPU irqs         |
|----------------------|   |--------------------|
|PerClusterPerCPU irqs |   |PerCluster irqs     |
|----------------------|   |--------------------|
|Global shared irqs    |   |Global shared irqs  |
|----------------------|   |--------------------|

1) PerCPU irqs:

   Same HWirq number on all CPUs. Interrupt is raised locally on the
   CPU from a CPU local facility (local timer, IPI receive, perf ...)

   Covered by the existing percpu interface.

2) Shared irqs:

   Unique HWirq number for a single device interrupt. Used by globally
   available devices. irq can be routed to any CPU.

3) PerCluster irqs:

   Similar to PerCPU irqs, but the same HWirq number is only shared on
   a per cluster basis.

   So we might end up with hwirqX on cluster 0 wired to the local
   timer while it is wired to perf on cluster 1 and vice versa.

   Now you want to use the same Linux irq number for the local timer
   or whatever on all CPUs independent of the cluster, right?

   So the per cpu timer interrupt is requested once as a per cpu
   interrupt with the linux irq number of cluster 0. But when you call
   enable_per_cpu_irq() on a cpu from cluster 1 you want to use the
   same linux irq number. But with the current scheme this ends up at
   the wrong hw irq.

   That's why you introduced that cpumask stuff.

We can do better than that. Hierarchical irq domains to the rescue.

|------------|
| GIC domain |<--- Global PerCPU irqs
|	     |
|	     |<--- Global shared irqs
|	     |
|	     |<--- [ITS domain] <--- [MSI domain]
|	     |
|	     |<--- [PerCluster PerCPU domain]
|------------|

Now that PerCluster PerCPU domain does the following:

When the irq is allocated and mapped it requests N consective Linux
irq numbers, where N is the number of clusters. The return value is
the first irq number, i.e. the one which maps to cluster 0.

The allocater/mapping sets up irq + 0 for cluster 0, irq + 1 for
cluster 1,.... Each irq has a different irq to hwirq mapping.

The domain also needs a cluster aware mapping mechanism. See further
below.

All handling functions setup/request/free/remove_percpu_irq()
now do:

    struct irq_desc *desc = irq_to_desc(irq);
    struct irqdomain *domain = irq_desc_get_domain(desc);
    int i;

    __setup_irq(desc, handler);

    if (!domain->is_clustered)
       return;

    for (i = 1; i < domain->nr_clusters: i++) {
    	desc = irq_to_desc(irq + i);
	desc->handler = handler;
    }

    This might also require some special handling in your domains
    activate() callback, which is invoked from __setup_irq() but that's
    a detail.

enable/disable_percpu_irq() does:

    struct irq_desc *desc = irq_to_desc(irq);
    struct irqdomain *domain = irq_desc_get_domain(desc);

    if (domain && domain->nr_clusters > 1)
	  desc = irq_to_desc(irq + domain->get_cluster(smp_processor_id());

    enable/disable(desc);

So lets look at an example

setup_local_timer()
  irq = map_irq();
      	 Allocates irq10 and irq11, returns irq 10
	 Maps irq10 to hwirq 20 and irq11 to hwirq 21
  ...

  request_percpu_irq(10, timer_handler)
  	irq10->handler = timer_handler;
  	irq11->handler = timer_handler;

  enable_percpu_irq(10) <- On cluster 0
  	enable_irq(irq10) -> hwirq20

  enable_percpu_irq(10) <- On cluster 1
  	enable_irq(irq11) -> hwirq21

setup_perf()
  irq = map_irq();
      	 Allocates irq12 and irq13, returns irq 12
	 Maps irq12 to hwirq 21 and irq13 to hwirq 20
  ...

  request_percpu_irq(12, perf_handler)
  	irq12->handler = perf_handler;
  	irq12->handler = perf_handler;

  enable_percpu_irq(12) <- On cluster 0
  	enable_irq(irq12) -> hwirq21

  enable_percpu_irq(12) <- On cluster 1
  	enable_irq(irq13) -> hwirq20

We can do that, because all operations have to be invoked on the
target cpu.

So far so good. Now we have the interrupt entry code which wants to
translate hwirqX to a linux irq number. So we need some modification
to irq_find_mapping() as well. We could do a domain specific
find_mapping() callback, or add something to the core code.

	if (domain->is_clustered_mapped) {
	      map = domain->map[domain->get_cluster(cpuid)];
	      return map[hwirq];
	}			  

works for a linear domain. A domain specific find_mapping() callback
might be better suited, but thats a detail.

That might be nice counterpart to the work Qais is doing:

   http://marc.info/?l=linux-kernel&m=144845325611158&w=2

The way MIPS sets up IPIs and per cpu irqs is not clustered as in your
case. It needs one hwirq per CPU, but we still want to operate from a
single base irq. It's a bit different, as it does not need that extra
reverse mapping function because he operates on a globally unique
hwirq number, but still needs to setup N irq descriptors/handlers per
PERCPU irq as in your cluster scenario. So we might be able to share
that part and just handle the IPI allocation/send mechanism in Qais
new code.

Thoughts?

Thanks,

	tglx