[PATCH] nvme: find numa distance only if controller has valid numa id
Sagi Grimberg
sagi at grimberg.me
Mon Apr 15 01:55:44 PDT 2024
On 14/04/2024 14:02, Nilay Shroff wrote:
>
> On 4/14/24 14:00, Sagi Grimberg wrote:
>>
>> On 13/04/2024 12:04, Nilay Shroff wrote:
>>> On a NUMA-aware system where native nvme multipath is configured and the
>>> iopolicy is set to numa, but the nvme controller's NUMA node id is
>>> undefined or -1 (NUMA_NO_NODE), avoid calculating the node distance when
>>> finding the optimal io path. In such a case we may access the NUMA
>>> distance table with an invalid index, which may potentially refer to
>>> incorrect memory. So this patch ensures that if the nvme controller's
>>> NUMA node id is -1, then instead of calculating the node distance for
>>> finding the optimal io path, we set the node distance of such a
>>> controller to the default of 10 (LOCAL_DISTANCE).
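>>>
>>> A minimal sketch of the idea (variable names here are illustrative; the
>>> actual change is in the multipath io path selector in
>>> drivers/nvme/host/multipath.c):
>>>
>>>     if (iopolicy == NVME_IOPOLICY_NUMA &&
>>>         ns->ctrl->numa_node != NUMA_NO_NODE)
>>>             distance = node_distance(node, ns->ctrl->numa_node);
>>>     else
>>>             distance = LOCAL_DISTANCE;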
>> Patch looks ok to me, but it is not clear whether this fixes a real issue or not.
>>
> I think this patch does help fix a real issue. I have a NUMA-aware system with
> a multi-port/multi-controller NVMe PCIe disk attached. On this system, I found
> that sometimes the nvme controller's NUMA node id is set to -1 (NUMA_NO_NODE).
> The reason is that my system has processors and memory coming from one or more
> NUMA nodes, while the NVMe PCIe device sits on a different NUMA node. For
> example, we could have processors on node 0 and node 1, but the PCIe device on
> node 2; since no processor comes from node 2, there is no way for Linux to
> affinitize the PCIe device with a processor, and hence while enumerating the
> PCIe device the kernel sets its NUMA node id to -1. Later, if we hotplug a CPU
> on node 2, the kernel would assign NUMA node id 2 to the PCIe device.
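>
> (As a rough illustration, assuming the controller's node id is ultimately
> taken from dev_to_node() on the underlying PCI device, the driver simply
> ends up with:
>
>     /* illustrative: no CPU online on the device's node at enumeration time */
>     int node = dev_to_node(&pdev->dev);    /* returns NUMA_NO_NODE, i.e. -1 */
>
> and that -1 is what later feeds into the io path selection.)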
>
> For instance, I have a system with two NUMA nodes currently online. I also
> have a multi-controller NVMe PCIe disk attached to this system:
>
> # numactl -H
> available: 2 nodes (2-3)
> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node 2 size: 15290 MB
> node 2 free: 14200 MB
> node 3 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
> node 3 size: 16336 MB
> node 3 free: 15075 MB
> node distances:
> node   2   3
>   2:  10  20
>   3:  20  10
>
> As we can see above, NUMA nodes 2 and 3 are currently online on this system,
> and the CPUs come from nodes 2 and 3.
>
> # lspci
> 052e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
> 058e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
>
> # nvme list -v
> Subsystem        Subsystem-NQN                                                                                    Controllers
> ---------------- ------------------------------------------------------------------------------------------------ ----------------
> nvme-subsys3     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme1, nvme3
>
> Device   SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
> -------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
> nvme1    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0          nvme-subsys3 nvme3n1
> nvme3    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0          nvme-subsys3 nvme3n1, nvme3n2
>
> Device       Generic      NSID       Usage                      Format           Controllers
> ------------ ------------ ---------- -------------------------- ---------------- ----------------
> /dev/nvme3n1 /dev/ng3n1   0x1        5.75 GB / 5.75 GB          4 KiB + 0 B      nvme1, nvme3
> /dev/nvme3n2 /dev/ng3n2   0x2        5.75 GB / 5.75 GB          4 KiB + 0 B      nvme3
>
> # cat ./sys/devices/pci058e:78/058e:78:00.0/numa_node
> 2
> # cat ./sys/devices/pci052e:78/052e:78:00.0/numa_node
> -1
>
> # cat /sys/class/nvme/nvme3/numa_node
> 2
> # cat /sys/class/nvme/nvme1/numa_node
> -1
>
> As we can see above, I have a multi-controller NVMe disk attached to this
> system. This disk has 2 controllers. However, the NUMA node id assigned to one
> of the controllers (nvme1) is -1. This is because on this system there is
> currently no online processor on a NUMA node to which the nvme1 controller
> could be affinitized.
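>
> To make the risk concrete, here is a sketch (not the kernel's actual
> implementation) of a SLIT-style distance lookup; with the controller's node
> id being -1, the index underflows the table, which is exactly the access the
> patch avoids:
>
>     /* illustrative only: a SLIT-style table indexed by node ids */
>     extern int example_distance_table[MAX_NUMNODES][MAX_NUMNODES];
>
>     static int example_node_distance(int from, int to)
>     {
>             /* with to == NUMA_NO_NODE (-1) this reads out of bounds */
>             return example_distance_table[from][to];
>     }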
Thanks for the explanation. But what is the bug you see in this
configuration? A panic? Suboptimal performance?
Which is it? It is not clear from the patch description.