[PATCH RESEND] lib/group_cpus: make group CPU cluster aware

Tue Nov 11 04:08:39 PST 2025

On Tue, Nov 11, 2025 at 01:31:04PM +0800, Guo, Wangyang wrote:
> On 11/11/2025 11:25 AM, Ming Lei wrote:
> > On Tue, Nov 11, 2025 at 10:06:08AM +0800, Wangyang Guo wrote:
> > > As CPU core counts increase, the number of NVMe IRQs may be smaller than
> > > the total number of CPUs. This forces multiple CPUs to share the same
> > > IRQ. If the IRQ affinity and the CPU’s cluster do not align, a
> > > performance penalty can be observed on some platforms.
> > 
> > Can you add details on why/how the CPU cluster isn't aligned with IRQ
> > affinity, and how the performance penalty is caused?
> 
> The Intel Xeon E platform packs 4 CPU cores into 1 module (cluster) that
> shares the L2 cache. Say there are 40 CPUs in 1 NUMA domain and 11 IRQs to
> dispatch. The existing algorithm maps the first 7 IRQs to 4 CPUs each and
> the remaining 4 IRQs to 3 CPUs each. The last 4 IRQs may have a
> cross-cluster issue. For example, if the 9th IRQ is pinned to CPU32, then
> CPU31 will have cross-L2 memory access.
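
To make the arithmetic in the example above concrete, here is a minimal
sketch in Python. It is not the lib/group_cpus.c code. It assumes CPUs are
numbered contiguously, that cluster N holds CPUs 4N..4N+3, and that the
leftover CPUs go to the first groups, as the quoted example describes. The
helper names split_evenly() and crossing_groups() are made up for
illustration.

```python
def split_evenly(ncpus, nirqs):
    """Spread ncpus contiguous CPUs over nirqs groups, larger groups first."""
    base, extra = divmod(ncpus, nirqs)
    groups = []
    start = 0
    for i in range(nirqs):
        # The first `extra` groups get one more CPU than the rest.
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def crossing_groups(groups, cluster_size):
    """Return indices of groups whose CPUs span more than one cluster."""
    return [
        i
        for i, cpus in enumerate(groups)
        if len({cpu // cluster_size for cpu in cpus}) > 1
    ]


groups = split_evenly(40, 11)
crossing = crossing_groups(groups, cluster_size=4)

for i, cpus in enumerate(groups):
    mark = "  <- spans two clusters" if i in crossing else ""
    print(f"IRQ {i + 1:2d}: CPUs {cpus[0]}-{cpus[-1]}{mark}")
```

Under these assumptions the 9th IRQ covers CPU31-33 and the 10th covers
CPU34-36, and each of those groups straddles two L2 clusters.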

Usually only a small number of CPUs share an L2, and it is common to see one
queue mapping that includes CPUs from different L2 domains.

So how much does crossing L2 hurt IO perf?

They should still share the same L3 cache, so cpus_share_cache() should be
true when the IO completes on a CPU that belongs to a different L2 than the
submission CPU, and remote completion via IPI won't be triggered.
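
A simplified model of that decision, as a sketch only: it treats
cpus_share_cache() as "both CPUs have the same last-level cache ID" and
ignores queue flags and every other condition the block layer checks. The
function name needs_remote_completion() and the two topology maps are
hypothetical and exist only for illustration.

```python
def needs_remote_completion(completion_cpu, submission_cpu, llc_of):
    """Return True if completion would be bounced to the submission CPU."""
    if completion_cpu == submission_cpu:
        return False
    # Stand-in for cpus_share_cache(): same last-level cache means local completion.
    return llc_of[completion_cpu] != llc_of[submission_cpu]


# Hypothetical topology 1: 40 CPUs behind a single L3 (L2 clusters of 4).
single_l3 = {cpu: 0 for cpu in range(40)}

# Hypothetical topology 2: 16 CPUs with one L3 per 8 CPUs.
split_l3 = {cpu: cpu // 8 for cpu in range(16)}

# CPU31 submits, CPU32 completes: different L2, same L3, so no IPI.
print(needs_remote_completion(32, 31, single_l3))

# CPU7 submits, CPU8 completes: different L3, so remote completion is needed.
print(needs_remote_completion(8, 7, split_l3))
```

In this model crossing an L2 boundary inside one L3 never triggers a remote
completion, while crossing an L3 boundary always does.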

From my observation, remote completion does hurt NVMe IO perf very much,
for example with queue mappings that cross L3 on AMD.

Thanks,
Ming