[PATCH RESEND] lib/group_cpus: make group CPU cluster aware

Wed Nov 12 17:38:56 PST 2025

On Wed, Nov 12, 2025 at 11:02:47AM +0800, Guo, Wangyang wrote:
> On 11/11/2025 8:08 PM, Ming Lei wrote:
> > On Tue, Nov 11, 2025 at 01:31:04PM +0800, Guo, Wangyang wrote:
> > > On 11/11/2025 11:25 AM, Ming Lei wrote:
> > > > On Tue, Nov 11, 2025 at 10:06:08AM +0800, Wangyang Guo wrote:
> > > > > As CPU core counts increase, the number of NVMe IRQs may be smaller than
> > > > > the total number of CPUs. This forces multiple CPUs to share the same
> > > > > IRQ. If the IRQ affinity and the CPU’s cluster do not align, a
> > > > > performance penalty can be observed on some platforms.
> > > > 
> > > > Can you add details why/how CPU cluster isn't aligned with IRQ
> > > > affinity? And how performance penalty is caused?
> > > 
> > > The Intel Xeon E platform packs 4 CPU cores into 1 module (cluster) that
> > > shares the L2 cache. Say there are 40 CPUs in 1 NUMA domain and 11 IRQs
> > > to dispatch. The existing algorithm maps the first 7 IRQs to 4 CPUs each
> > > and the remaining 4 IRQs to 3 CPUs each. The last 4 IRQs may cross
> > > cluster boundaries. For example, with the 9th IRQ pinned to CPU32, CPU31
> > > will have cross-L2 memory access.
> > 
> > 
> > The number of CPUs sharing an L2 is usually small, and it is common to see
> > one queue mapping include CPUs from different L2s.
> > 
> > So how much does crossing L2 hurt IO perf?
> We see a 15%+ performance difference in FIO libaio/randread/bs=8k.

As I mentioned, it is common to see CPUs crossing L2 in the same group, but
why does it make a difference here? You mentioned that only some platforms
are affected.
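
For reference, the 40-CPU / 11-IRQ example quoted above can be worked through
with a small user-space sketch. It is not the code in lib/group_cpus.c; it
only models the even split described in the example, and it assumes that each
cluster is 4 CPUs with contiguous IDs. All names below are hypothetical.

```c
#define NR_CPUS      40  /* CPUs in the NUMA node (from the example) */
#define NR_GROUPS    11  /* IRQs to spread over them (from the example) */
#define CLUSTER_SIZE 4   /* cores sharing one L2; contiguous IDs assumed */

/*
 * First CPU of group g when CPUs are split as evenly as possible:
 * the first (NR_CPUS % NR_GROUPS) groups get one extra CPU.
 */
int group_start(int g)
{
	int base = NR_CPUS / NR_GROUPS;
	int extra = NR_CPUS % NR_GROUPS;

	return g * base + (g < extra ? g : extra);
}

/* Number of CPUs in group g. */
int group_size(int g)
{
	return group_start(g + 1) - group_start(g);
}

/* Non-zero if group g contains CPUs from more than one cluster. */
int group_crosses_cluster(int g)
{
	int first = group_start(g);
	int last = first + group_size(g) - 1;

	return first / CLUSTER_SIZE != last / CLUSTER_SIZE;
}
```

Under these assumptions, group index 8 (the 9th IRQ) covers CPU31-33 and so
spans the clusters holding CPU28-31 and CPU32-35, which matches the CPU31 vs
CPU32 case in the example.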

> > They should still share the same L3 cache, so cpus_share_cache() should be
> > true when the IO completes on a CPU that belongs to a different L2 than the
> > submission CPU, and remote completion via IPI won't be triggered.
> Yes, the remote IPI is not triggered.

OK, in my test on AMD Zen 4, NVMe performance can drop to 1/2 - 1/3 if a
remote IPI is triggered when crossing L3, which is understandable.
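
The distinction between the L2 and L3 cases can be illustrated with a toy
model. This is a simplified sketch, not the block layer's actual completion
path, which checks more conditions. The llc_id table and both function names
are hypothetical; only the idea that CPUs sharing a last-level cache complete
locally comes from the discussion above.

```c
#include <stdbool.h>

/*
 * Hypothetical topology table: llc_id[cpu] is the ID of the last-level
 * cache (L3) that the CPU belongs to.
 */
bool model_cpus_share_cache(const int *llc_id, int a, int b)
{
	return llc_id[a] == llc_id[b];
}

/*
 * Simplified decision: a completion is bounced to the submission CPU
 * via IPI only when the two CPUs do not share a last-level cache.
 * Crossing L2 inside one L3 therefore never triggers the IPI.
 */
bool model_needs_remote_completion(const int *llc_id, int submit_cpu,
				   int complete_cpu)
{
	if (submit_cpu == complete_cpu)
		return false;

	return !model_cpus_share_cache(llc_id, submit_cpu, complete_cpu);
}
```

With a table such as {0, 0, 1, 1}, CPU0 and CPU1 complete locally, while a
request submitted on CPU1 and completed on CPU2 would take the remote path.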

I will check whether the topology cluster can cover L3. If so, the patch can
still be simplified a lot by introducing a sub-node spread: change
build_node_to_cpumask() and add nr_sub_nodes.

Thanks,
Ming