NVMe and IRQ Affinity, another problem

Young Yu young.yu at northwestern.edu
Wed Apr 4 17:28:05 PDT 2018


Hello,

I know this is another take on an old topic, but I'm still wondering
what the right way is to bind the IRQs of NVMe PCI devices to cores in
their local NUMA node.  I'm using kernel 4.16.0-1.el7 on CentOS 7.4,
and the machine has 2 NUMA nodes:

$ lscpu|grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23

I have 16 NVMe devices, 8 per NUMA node: nvme0 to nvme7 are attached
to NUMA node 0 and nvme8 to nvme15 to NUMA node 1. irqbalance was on
by default.  The IRQs of these devices are all bound to cores 0 and 1,
regardless of where the devices are physically attached. affinity_hint
still looks invalid (all zeros); however, there is an
effective_affinity that matches the cores some of the interrupts are
bound to. The cpu_list under mq points to the wrong cores for the
NVMe devices on NUMA node 1. I read that this was fixed in kernel 4.3,
so I'm not sure whether I’m looking at it the right way.

Ultimately, I’d like to know whether there is a way to distribute the
IRQs of the NVMe devices so that each device uses a different local
core in the NUMA node it is attached to,
e.g. nvme0 - cpu 0
     nvme1 - cpu 2
     ...
     nvme8 - cpu 1
     nvme9 - cpu 3
     ...
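
A rough sketch of the intended mapping, assuming bash 4, GNU sort, one
namespace (n1) per controller, and that
/sys/devices/system/node/nodeN/cpulist lists the CPUs of node N. The
function names expand_cpulist, pick_cpu and plan_mapping are made up
for illustration. It only prints a plan and does not touch any IRQ;
strided cpulist entries such as "0-22:2" are not handled.

```shell
#!/usr/bin/env bash

# Expand a kernel cpulist such as "0-2,8,10-11" into space-separated CPU ids.
expand_cpulist() {
    local part lo hi c out=()
    local IFS=','
    for part in $1; do
        if [[ $part == *-* ]]; then
            lo=${part%-*}; hi=${part#*-}
            for ((c = lo; c <= hi; c++)); do out+=("$c"); done
        else
            out+=("$part")
        fi
    done
    IFS=' '
    echo "${out[*]}"
}

# Pick the CPU for the k-th device (0-based) on a node, wrapping around
# when there are more devices than CPUs.
pick_cpu() {
    local cpus
    cpus=($(expand_cpulist "$1"))
    echo "${cpus[$2 % ${#cpus[@]}]}"
}

# Print "nvmeX - cpu Y" for every NVMe namespace found in sysfs.
plan_mapping() {
    local -A placed=()          # number of devices already placed per node
    local dev name node cpulist
    for dev in $(ls -d /sys/block/nvme*n1 2>/dev/null | sort -V); do
        name=${dev##*/}
        node=$(cat "$dev/device/device/numa_node" 2>/dev/null)
        # Skip devices whose node is unknown (empty or -1).
        [[ -n $node && $node -ge 0 ]] || continue
        cpulist=$(cat "/sys/devices/system/node/node$node/cpulist")
        echo "${name%n1} - cpu $(pick_cpu "$cpulist" "${placed[$node]:-0}")"
        placed[$node]=$(( ${placed[$node]:-0} + 1 ))
    done
}
```

With the two node CPU lists shown above, pick_cpu gives cpu 2 for the
second device on node 0 and cpu 1 for the first device on node 1.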

Here is the relevant output.

$ cat /sys/block/nvme0n1/device/device/numa_node 
0

$ cat /sys/block/nvme8n1/device/device/numa_node 
1

$ cat /proc/interrupts |grep nvme0q
 143: 1777 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331648-edge nvme0q0, nvme0q1
 152: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331649-edge nvme0q2
 157: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331650-edge nvme0q3
 160: 0 12773 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331651-edge nvme0q4
 161: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331652-edge nvme0q5
 162: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331653-edge nvme0q6
 163: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331654-edge nvme0q7

$ cat /proc/interrupts |grep nvme8q
  51: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827462-edge nvme8q7
  54: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827457-edge nvme8q2
  65: 13931 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827456-edge nvme8q0, nvme8q1
  76: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827458-edge nvme8q3
  87: 0 13380 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827459-edge nvme8q4
 102: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827460-edge nvme8q5
 117: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827461-edge nvme8q6
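
The per-CPU counter columns are hard to read by eye. A small filter
along these lines could list, for each IRQ, the CPU columns with a
non-zero count. This is only a sketch: active_cpus is a hypothetical
helper name, and it assumes the counters are the numeric fields
directly after the IRQ number, as in the output above.

```shell
#!/usr/bin/env bash

# Read /proc/interrupts-style lines on stdin and print, for each IRQ,
# the 0-based CPU columns whose counter is non-zero ("-" if none).
active_cpus() {
    awk '{
        out = ""; sep = ""
        for (i = 2; i <= NF; i++) {
            if ($i !~ /^[0-9]+$/) break    # first non-numeric field ends the counters
            if ($i + 0 > 0) {
                cpu = i - 2
                out = out sep cpu
                sep = ","
            }
        }
        if (out == "") out = "-"
        print $1, out
    }'
}
```

Usage would be "grep nvme8q /proc/interrupts | active_cpus"; for the
output above, IRQ 87 (nvme8q4) has its only non-zero counter in the
CPU 1 column.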

$ for i in $(grep nvme0q /proc/interrupts | cut -d":" -f1 | sed "s/ //g"); do
    echo "IRQ: $i"
    echo -n "HINT: " && cat /proc/irq/$i/affinity_hint
    echo -n "SMP: " && cat /proc/irq/$i/smp_affinity &&
    echo -n "EFF: " && cat /proc/irq/$i/effective_affinity
  done

IRQ:  143
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00055555,55555555,55555555
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  152
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000015,55555555,55555555,55500000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  157
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  555555,55555555,55555540,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  160
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,2aaaaaaa,aaaaaaaa
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  161
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,0aaaaaaa,aaaaaaaa,80000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  162
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,02aaaaaa,aaaaaaaa,a0000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  163
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  aaaaaa,aaaaaaaa,a8000000,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001

$ for i in $(grep nvme8q /proc/interrupts | cut -d":" -f1 | sed "s/ //g"); do
    echo "IRQ: $i"
    echo -n "HINT: " && cat /proc/irq/$i/affinity_hint
    echo -n "SMP: " && cat /proc/irq/$i/smp_affinity &&
    echo -n "EFF: " && cat /proc/irq/$i/effective_affinity
  done

IRQ:  51
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  aaaaaa,aaaaaaaa,a8000000,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  54
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000015,55555555,55555555,55500000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  65
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00055555,55555555,55555555
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  76
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  555555,55555555,55555540,00000000,00000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  87
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,2aaaaaaa,aaaaaaaa
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  102
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,0aaaaaaa,aaaaaaaa,80000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  117
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,02aaaaaa,aaaaaaaa,a0000000,00000000,00000000,00000000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001

$ cat /sys/block/nvme8n1/mq/*/cpu_list
0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82
84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164
166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242, 244, 246
1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61
63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99, 101, 103, 105, 107, 109, 111, 113, 115, 117, 119, 121, 123
125, 127, 129, 131, 133, 135, 137, 139, 141, 143, 145, 147, 149, 151, 153, 155, 157, 159, 161, 163, 165, 167, 169, 171, 173, 175, 177, 179, 181, 183, 185
187, 189, 191, 193, 195, 197, 199, 201, 203, 205, 207, 209, 211, 213, 215, 217, 219, 221, 223, 225, 227, 229, 231, 233, 235, 237, 239, 241, 243, 245, 247

$ echo 000000,00000000,00000000,00000000,00000000,000aaaaa,aaaaaaaa,aaaaaaaa > /proc/irq/143/smp_affinity
bash: echo: write error: Input/output error
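
For reference, a mask like the one written above can be generated
from a CPU list instead of being typed by hand. A minimal sketch,
assuming all CPU numbers are below 63 so the mask fits in a bash
integer; cpus_to_mask is a hypothetical helper name. It only builds
the hex string and does nothing to make the write itself succeed.

```shell
#!/usr/bin/env bash

# Convert a comma-separated CPU list (no ranges) into a hex affinity mask.
# Limitation: bash integers are 64-bit signed, so CPU numbers must be < 63.
cpus_to_mask() {
    local mask=0 cpu
    local IFS=','
    for cpu in $1; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}
```

For the node 1 CPUs listed by lscpu above (1,3,...,23) this yields
aaaaaa. /proc/irq/N/smp_affinity_list takes the CPU list form
directly, so the hex mask is only needed for smp_affinity itself.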


After looking at this, I built a 4.13.16 kernel myself from source to
see whether it behaves any differently from the one from ELRepo.
However, the hint was still invalid, and interrupts are still bound to
cores in the other NUMA node, although they are more evenly
distributed. I was not able to set smp_affinity manually on either
kernel.


$ cat /proc/interrupts |grep nvme0q
  62: 3687 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331648-edge nvme0q0, nvme0q1
 123: 0 0 0 0 0 0 0 0 6642 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331649-edge nvme0q2
 129: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 335 0 0 0 0 0 0 0 PCI-MSI 50331650-edge nvme0q3
 142: 0 6426 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331651-edge nvme0q4
 155: 0 0 0 0 0 0 0 4842 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331652-edge nvme0q5
 167: 0 0 0 0 0 0 0 0 0 0 0 0 0 2895 0 0 0 0 0 0 0 0 0 0 PCI-MSI 50331653-edge nvme0q6
 179: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3063 0 0 0 0 PCI-MSI 50331654-edge nvme0q7

$ cat /proc/interrupts |grep nvme8q
 134: 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827456-edge nvme8q0, nvme8q1
 147: 0 0 0 0 0 0 0 0 1102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827457-edge nvme8q2
 160: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 950 0 0 0 0 0 0 0 PCI-MSI 71827458-edge nvme8q3
 172: 0 468 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827459-edge nvme8q4
 181: 0 0 0 0 0 0 0 889 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827460-edge nvme8q5
 187: 0 0 0 0 0 0 0 0 0 0 0 0 0 552 0 0 0 0 0 0 0 0 0 0 PCI-MSI 71827461-edge nvme8q6
 191: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 470 0 0 0 0 PCI-MSI 71827462-edge nvme8q7

$ for i in $(grep nvme0q /proc/interrupts | cut -d":" -f1 | sed "s/ //g"); do
    echo "IRQ: $i"
    echo -n "HINT: " && cat /proc/irq/$i/affinity_hint
    echo -n "SMP: " && cat /proc/irq/$i/smp_affinity &&
    echo -n "EFF: " && cat /proc/irq/$i/effective_affinity
  done

IRQ:  62
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000055
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  123
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00005500
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
IRQ:  129
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00550000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
IRQ:  142
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0000002a
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  155
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000a80
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
IRQ:  167
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0002a000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
IRQ:  179
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00a80000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000

$ for i in $(grep nvme8q /proc/interrupts | cut -d":" -f1 | sed "s/ //g"); do
    echo "IRQ: $i"
    echo -n "HINT: " && cat /proc/irq/$i/affinity_hint
    echo -n "SMP: " && cat /proc/irq/$i/smp_affinity &&
    echo -n "EFF: " && cat /proc/irq/$i/effective_affinity
  done

IRQ:  134
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000055
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
IRQ:  147
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00005500
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000100
IRQ:  160
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00550000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00010000
IRQ:  172
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0000002a
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000002
IRQ:  181
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000a80
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
IRQ:  187
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,0002a000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00002000
IRQ:  191
HINT: 000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000
SMP:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00a80000
EFF:  000000,00000000,00000000,00000000,00000000,00000000,00000000,00080000

$ cat /sys/block/nvme8n1/mq/*/cpu_list 
0, 2, 4, 6
8, 10, 12, 14
16, 18, 20, 22
1, 3, 5
7, 9, 11
13, 15, 17
19, 21, 23
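
The SMP masks and the cpu_list output above can be cross-checked by
decoding each mask into CPU numbers. A minimal bash sketch;
mask_to_cpus is a hypothetical helper name, and the input is assumed
to be in the comma-grouped hex format printed by smp_affinity.

```shell
#!/usr/bin/env bash

# Decode a comma-grouped hex affinity mask into a comma-separated CPU list.
mask_to_cpus() {
    local hex=${1//,/}          # drop the group separators
    local len=${#hex} i bit digit out=()
    for ((i = 0; i < len; i++)); do
        # The i-th hex digit from the right covers CPUs 4*i .. 4*i+3.
        digit=$(( 16#${hex:len-1-i:1} ))
        for ((bit = 0; bit < 4; bit++)); do
            if (( digit & (1 << bit) )); then
                out+=("$(( 4 * i + bit ))")
            fi
        done
    done
    local IFS=','
    echo "${out[*]}"
}
```

For example, the SMP mask of IRQ 172 (ending in 0000002a) decodes to
1,3,5, and the one of IRQ 147 (ending in 00005500) decodes to
8,10,12,14, which agree with the cpu_list lines "1, 3, 5" and
"8, 10, 12, 14".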

# echo 000000,00000000,00000000,00000000,00000000,00000000,00000000,000000aa > /proc/irq/134/smp_affinity
bash: echo: write error: Input/output error


I have tried the manual configuration on one of our other machines,
but I still have the same problem, except on kernel 4.4, where I can
set smp_affinity manually.  With the same hardware setup, I cannot
get it to work on kernel 4.11 and still get the same Input/output
error.

Se-young Yu
Northwestern University

