[PATCHv2 2/4] nvme: extend show-topology command to add support for multipath

Nilay Shroff nilay at linux.ibm.com
Tue Aug 19 04:30:15 PDT 2025



On 8/19/25 4:35 PM, Hannes Reinecke wrote:
> On 8/19/25 12:31, Nilay Shroff wrote:
>>
>>
>> On 8/19/25 11:45 AM, Hannes Reinecke wrote:
>>> On 8/19/25 06:49, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 8/18/25 12:52 PM, Hannes Reinecke wrote:
>>>>> On 8/12/25 14:56, Nilay Shroff wrote:
>>>>>> This commit enhances the show-topology command by adding support for
>>>>>> NVMe multipath. With this change, users can now list all paths to a
>>>>>> namespace from its corresponding head node device. Each NVMe path
>>>>>> entry then also includes additional details such as ANA state, NUMA
>>>>>> node, and queue depth, improving visibility into multipath configs.
>>>>>> This information can be particularly helpful for debugging and
>>>>>> analyzing NVMe multipath setups.
>>>>>>
>>>>>> To support this functionality, the "--ranking" option of the nvme
>>>>>> show-topology command has been extended with a new sub-option:
>>>>>> "multipath".
>>>>>>
>>>>>> Since this enhancement is specific to NVMe multipath, the iopolicy
>>>>>> configured under each subsystem is now always displayed. Previously,
>>>>>> iopolicy was shown only with nvme show-topology verbose output, but
>>>>>> it is now included by default to improve usability and provide better
>>>>>> context when reviewing multipath configurations via show-topology.
>>>>>>
>>>>>> With this update, users can view the multipath topology of a
>>>>>> multi-controller/multi-port NVMe disk using:
>>>>>>
>>>>>> $ nvme show-topology -r multipath
>>>>>>
>>>>>> nvme-subsys2 - NQN=nvmet_subsystem
>>>>>>                   hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
>>>>>>                   iopolicy=numa
>>>>>>
>>>>>>              _ _ _<head-node>
>>>>>>             /              _ _ _ <ana-state>
>>>>>>            /              /      _ _ _ <numa-node-list>
>>>>>>           /              /      /  _ _ _<queue-depth>
>>>>>>          |              /      /  /
>>>>>>     +- nvme2n1 (ns 1)  /      /  /
>>>>>>     \                 |      |  |
>>>>>>      +- nvme2c2n1 optimized 1,2 0 nvme2 tcp traddr=127.0.0.2,trsvcid=4460,src_addr=127.0.0.1 live
>>>>>>      +- nvme2c3n1 optimized 3,4 0 nvme3 tcp traddr=127.0.0.3,trsvcid=4460,src_addr=127.0.0.1 live
>>>>>>
>>>>>> Please note that the annotations shown above (e.g., <numa-node-list>,
>>>>>> <ana-state>, <head-node>, and <queue-depth>) are included for clarity
>>>>>> only and are not part of the actual output.
>>>>>>
>>>>>
>>>>> Hmm. Why do we have the values for 'numa-node-list' and 'queue-depth'
>>>>> both in here? They are tied to the selected IO policy, and pretty
>>>>> meaningless if that IO policy is not selected.
>>>>> Please include only the values relevant for the selected IO policy;
>>>>> this will increase readability of the resulting status string.
>>>>>
>>>> Okay, makes sense: we'd print <numa-node> and exclude <queue-depth> if the
>>>> iopolicy is numa. For the 'queue-depth' iopolicy, we'd print <queue-depth> and
>>>> exclude <numa-node>, and for the 'round-robin' iopolicy, we'd print neither
>>>> <numa-node> nor <queue-depth>. I'll update this in the next patch.
>>>>
>>> Hmm. I'd rather have _some_ value for 'round-robin' too, as otherwise
>>> the number of fields will differ (making parsing harder).
>>>
>> Okay, then how about printing <numa-node> for the round-robin policy as well?
>>
>> I looked at the NVMe path selection code for the round-robin iopolicy, and it
>> appears the kernel uses the NUMA node ID of the I/O-submitting CPU as the
>> reference for path selection.
>> For example, on a system with two NUMA nodes (0 and 1) and two NVMe paths
>> (PA and PB):
>> - If an I/O from node 0 selects PA, that choice is cached.
>> - The next time the kernel receives an I/O from node 0, it retrieves the
>>    cached value, sees that the last path chosen was PA, and therefore
>>    selects the next available path, PB, to forward the I/O.
>>
>> This way the kernel alternates between PA and PB in a round-robin fashion,
>> so the selection is still tied to the submitting NUMA node, just with path
>> rotation layered on top. Given that, I think it makes sense to also print
>> <numa-node> for the round-robin iopolicy, to keep the field count consistent
>> and still provide meaningful context. Agreed?
>>
> Can't we print the current path? That would be more meaningful than just printing the NUMA node ...
> 
The current path is already printed in the output, as the example below shows:

              _ _ _<head-node>
             /              _ _ _ <ana-state>
            /              /     _ _ _ <numa-node>
           /              /     /  
          |              /     /  
     +- nvme2n1 (ns 1)  /     /  
     \                 |     |  
      +- nvme2c2n1 optimized 1 nvme2 tcp traddr=127.0.0.2,trsvcid=4460,src_addr=127.0.0.1 live
      +- nvme2c3n1 optimized 2 nvme3 tcp traddr=127.0.0.3,trsvcid=4460,src_addr=127.0.0.1 live


In the output above, nvme2c2n1 is the current path for NUMA node 1 and
nvme2c3n1 is the current path for NUMA node 2.

Thanks,
--Nilay


