[PATCH] nvme: find numa distance only if controller has valid numa id

Nilay Shroff nilay at linux.ibm.com
Mon Apr 15 02:30:32 PDT 2024



On 4/15/24 14:25, Sagi Grimberg wrote:
> 
> 
> On 14/04/2024 14:02, Nilay Shroff wrote:
>>
>> On 4/14/24 14:00, Sagi Grimberg wrote:
>>>
>>> On 13/04/2024 12:04, Nilay Shroff wrote:
>>>> On a numa-aware system where native nvme multipath is configured and
>>>> the iopolicy is set to numa, if the nvme controller numa node id is
>>>> undefined or -1 (NUMA_NO_NODE) then avoid calculating the node distance
>>>> for finding the optimal io path. In such a case we may access the numa
>>>> distance table with an invalid index, which may refer to incorrect
>>>> memory. So this patch ensures that if the nvme controller numa node id
>>>> is -1 then, instead of calculating the node distance for finding the
>>>> optimal io path, we set the numa node distance of such a controller to
>>>> the default of 10 (LOCAL_DISTANCE).
>>> Patch looks ok to me, but it is not clear whether this fixes a real issue or not.
>>>
>> I think this patch does help fix a real issue. I have a numa-aware system with a
>> multi-port/multi-controller NVMe PCIe disk attached. On this system, I found that
>> sometimes the nvme controller numa node id is set to -1 (NUMA_NO_NODE). The reason
>> is that my system has processors and memory coming from one or more NUMA nodes,
>> while the NVMe PCIe device comes from a different NUMA node. For example, we could
>> have processors coming from node 0 and node 1, but the PCIe device coming from
>> node 2. Since no processor comes from node 2, there is no way for Linux to
>> affinitize the PCIe device with a processor, and hence the kernel sets the numa
>> node id of such a device to -1 while enumerating it. If we later hotplug a CPU on
>> node 2, the kernel would then assign numa node id 2 to the PCIe device.
>>
>> For instance, I have a system with two numa nodes currently online. I also have
>> a multi-controller NVMe PCIe disk attached to this system:
>>
>> # numactl -H
>> available: 2 nodes (2-3)
>> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>> node 2 size: 15290 MB
>> node 2 free: 14200 MB
>> node 3 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
>> node 3 size: 16336 MB
>> node 3 free: 15075 MB
>> node distances:
>> node   2   3
>>    2:  10  20
>>    3:  20  10
>>
>> As we can see above, numa nodes 2 and 3 are currently online on this system,
>> and I have CPUs coming from both node 2 and node 3.
>>
>> # lspci
>> 052e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
>> 058e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
>>
>> # nvme list -v
>> Subsystem        Subsystem-NQN                                                                                    Controllers
>> ---------------- ------------------------------------------------------------------------------------------------ ----------------
>> nvme-subsys3     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme1, nvme3
>>
>> Device   SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
>> -------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
>> nvme1    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0          nvme-subsys3 nvme3n1
>> nvme3    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0          nvme-subsys3 nvme3n1, nvme3n2
>>
>> Device       Generic      NSID       Usage                      Format           Controllers
>> ------------ ------------ ---------- -------------------------- ---------------- ----------------
>> /dev/nvme3n1 /dev/ng3n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme1, nvme3
>> /dev/nvme3n2 /dev/ng3n2   0x2          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme3
>>
>> # cat ./sys/devices/pci058e:78/058e:78:00.0/numa_node
>> 2
>> # cat ./sys/devices/pci052e:78/052e:78:00.0/numa_node
>> -1
>>
>> # cat /sys/class/nvme/nvme3/numa_node
>> 2
>> # cat /sys/class/nvme/nvme1/numa_node
>> -1
>>
>> As we can see above, I have a multi-controller NVMe disk attached to this system. This disk
>> has 2 controllers. However, the numa node id assigned to one of the controllers (nvme1) is -1.
>> This is because, currently, this system has no processor coming from a numa node with which
>> the nvme1 controller could be affinitized.
> 
> Thanks for the explanation. But what is the bug you see in this configuration? panic? suboptimal performance?
> which is it? it is not clear from the patch description.
> 
I didn't encounter a panic; however, the issue here is accessing the numa distance table with an incorrect index.

To calculate the distance between two nodes we invoke the function __node_distance(). This function
accesses the numa distance table, which is typically an array whose valid indexes start from 0. So
accessing this table with an index of -1 would dereference an incorrect memory location. Dereferencing
an incorrect memory location might have side effects, including a panic (though I didn't encounter one).
Furthermore, in such a case the calculated node distance could be incorrect, and that might cause nvme
multipath to choose a suboptimal IO path.
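
To make the indexing hazard concrete, here is a minimal, self-contained userspace sketch. It is
not the kernel's actual implementation (the real lookup lives in arch-specific code behind
node_distance()/__node_distance()); it only models a flat distance table and the kind of guard
being proposed:

#include <stdio.h>

#define NUMA_NO_NODE	(-1)
#define LOCAL_DISTANCE	10

static int nr_nodes = 2;
static int numa_distance[] = {	/* 2x2 table for nodes 0 and 1 */
	10, 20,
	20, 10,
};

/*
 * Simplified model of a flat numa distance table indexed by
 * (from * nr_nodes + to). Passing to == -1 computes index -1 for
 * from == 0, i.e. a read before the start of the array -- an
 * out-of-bounds access even when it happens not to fault.
 */
static int model_node_distance(int from, int to)
{
	return numa_distance[from * nr_nodes + to];
}

/*
 * Guarded variant mirroring the idea of the patch: fall back to
 * LOCAL_DISTANCE when either node id is NUMA_NO_NODE.
 */
static int safe_node_distance(int from, int to)
{
	if (from == NUMA_NO_NODE || to == NUMA_NO_NODE)
		return LOCAL_DISTANCE;
	return model_node_distance(from, to);
}

int main(void)
{
	printf("distance(0, 1) = %d\n", model_node_distance(0, 1));
	/* model_node_distance(0, NUMA_NO_NODE) would read numa_distance[-1] */
	printf("distance(0, NUMA_NO_NODE) = %d\n",
	       safe_node_distance(0, NUMA_NO_NODE));
	return 0;
}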

This patch may not help choose the optimal IO path (since we assume a node distance of
LOCAL_DISTANCE when the nvme controller numa node id is -1), but it ensures that we don't
access an invalid memory location when calculating the node distance.
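
For reference, the guard boils down to something like the following in __nvme_find_path() in
drivers/nvme/host/multipath.c. This is just a sketch of the hunk as described above (the
identifiers __nvme_find_path, nvme_path_is_disabled, NVME_IOPOLICY_NUMA etc. are the existing
kernel ones); please see the posted patch for the exact diff:

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		if (nvme_path_is_disabled(ns))
			continue;

		/*
		 * Consult the numa distance table only when the controller
		 * has a valid numa node id; otherwise assume LOCAL_DISTANCE.
		 */
		if (ns->ctrl->numa_node != NUMA_NO_NODE &&
		    READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
			distance = node_distance(node, ns->ctrl->numa_node);
		else
			distance = LOCAL_DISTANCE;

		/* ... path selection based on distance continues ... */
	}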

Thanks,
--Nilay 


