[PATCHv2 2/4] nvme: extend show-topology command to add support for multipath

Hannes Reinecke hare at suse.de
Tue Aug 19 04:05:16 PDT 2025


On 8/19/25 12:31, Nilay Shroff wrote:
> 
> 
> On 8/19/25 11:45 AM, Hannes Reinecke wrote:
>> On 8/19/25 06:49, Nilay Shroff wrote:
>>>
>>>
>>> On 8/18/25 12:52 PM, Hannes Reinecke wrote:
>>>> On 8/12/25 14:56, Nilay Shroff wrote:
>>>>> This commit enhances the show-topology command by adding support for
>>>>> NVMe multipath. With this change, users can now list all paths to a
>>>>> namespace from its corresponding head node device. Each NVMe path
>>>>> entry then also includes additional details such as ANA state, NUMA
>>>>> node, and queue depth, improving visibility into multipath configs.
>>>>> This information can be particularly helpful for debugging and
>>>>> analyzing NVMe multipath setups.
>>>>>
>>>>> To support this functionality, the "--ranking" option of the nvme
>>>>> show-topology command has been extended with a new sub-option:
>>>>> "multipath".
>>>>>
>>>>> Since this enhancement is specific to NVMe multipath, the iopolicy
>>>>> configured under each subsystem is now always displayed. Previously,
>>>>> iopolicy was shown only with nvme show-topology verbose output, but
>>>>> it is now included by default to improve usability and provide better
>>>>> context when reviewing multipath configurations via show-topology.
>>>>>
>>>>> With this update, users can view the multipath topology of a
>>>>> multi-controller/port NVMe disk using:
>>>>>
>>>>> $ nvme show-topology -r multipath
>>>>>
>>>>> nvme-subsys2 - NQN=nvmet_subsystem
>>>>>                   hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
>>>>>                   iopolicy=numa
>>>>>
>>>>>              _ _ _<head-node>
>>>>>             /              _ _ _ <ana-state>
>>>>>            /              /      _ _ _ <numa-node-list>
>>>>>           /              /      /  _ _ _<queue-depth>
>>>>>          |              /      /  /
>>>>>     +- nvme2n1 (ns 1)  /      /  /
>>>>>     \                 |      |  |
>>>>>      +- nvme2c2n1 optimized 1,2 0 nvme2 tcp traddr=127.0.0.2,trsvcid=4460,src_addr=127.0.0.1 live
>>>>>      +- nvme2c3n1 optimized 3,4 0 nvme3 tcp traddr=127.0.0.3,trsvcid=4460,src_addr=127.0.0.1 live
>>>>>
>>>>> Please note that the annotations shown above (e.g., <numa-node-list>,
>>>>> <ana-state>, <head-node>, and <queue-depth>) are included for clarity
>>>>> only and are not part of the actual output.
>>>>>
>>>>
>>>> Hmm. Why do we have the values for 'numa-node-list' and 'queue-depth'
>>>> both in here? They are tied to the selected IO policy, and pretty
>>>> meaningless if that IO policy is not selected.
>>>> Please include only the values relevant for the selected IO policy;
>>>> this will increase readability of the resulting status string.
>>>>
>>> Okay makes sense, so we'd print <numa-node> and exclude <queue-depth> if iopolicy
>>> is numa. For 'queue-depth' iopolicy, we'd print <queue-depth> and exclude <numa-node>.
>>> And for 'round-robin' iopolicy, we'd neither print <numa-node> nor <queue-depth>.
>>> I'll update this in the next patch.
>>>
>> Hmm. I'd rather have _some_ value for 'round-robin', too, as otherwise
>> the number of fields will differ (making parsing harder).
>>
> Okay so then how about printing <numa-node> for round-robin policy as well?
> 
> I looked at the NVMe path selection code for the round-robin iopolicy, and it
> appears the kernel uses the NUMA node ID of the I/O submitting CPU as the
> reference for path selection.
> For example, on a system with two NUMA nodes (0 and 1) and two NVMe paths
> (PA and PB):
> - If an I/O from node 0 selects PA, that choice is cached.
> - The next time the kernel receives an I/O from node 0, it retrieves
>    the cached path and finds that the last path chosen was PA, so it
>    picks the next available path, PB, to forward this I/O.
> 
> This way the kernel alternates between PA and PB in round-robin fashion.
> So the selection is still tied to the submitting NUMA node, just with path
> rotation layered on top. Given that, I think it makes sense to also print
> <numa-node> for round-robin iopolicy, to keep field consistency and still
> provide meaningful context. Agreed?
> 
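The rotation described in the quoted text can be sketched as a toy Python
model. This is not kernel code: the class name RoundRobinHead, the select()
method and the current_path dictionary are hypothetical names chosen for
illustration, and the sketch ignores ANA state and path availability. It
only assumes what is stated above: the last path chosen is cached per
submitting NUMA node, and the next I/O from that node takes the following
path.

```python
from typing import Dict, List, Optional


class RoundRobinHead:
    """Toy model of a multipath head node (illustrative only)."""

    def __init__(self, paths: List[str]) -> None:
        self.paths = list(paths)
        # Last path chosen, keyed by the NUMA node that submitted the I/O.
        self.current_path: Dict[int, str] = {}

    def select(self, node: int) -> str:
        """Pick the path after the one cached for this node, then cache it."""
        last: Optional[str] = self.current_path.get(node)
        if last is None:
            # Nothing cached yet for this node: start with the first path.
            chosen = self.paths[0]
        else:
            nxt = (self.paths.index(last) + 1) % len(self.paths)
            chosen = self.paths[nxt]
        self.current_path[node] = chosen
        return chosen


head = RoundRobinHead(["PA", "PB"])
node0 = [head.select(0) for _ in range(4)]  # I/Os submitted from node 0
node1 = [head.select(1) for _ in range(1)]  # one I/O submitted from node 1
print(node0, node1, head.current_path)
```

In this model each NUMA node keeps its own cached path, so the rotation
seen from node 0 is independent of I/O submitted from node 1.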
Can't we print the current path? That would be more meaningful than just 
printing the NUMA node ...

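One way to keep the number of fields constant across I/O policies, sketched
in Python purely for discussion. Nothing here is nvme-cli code or its actual
output format: the function names policy_field() and format_path(), the
"current" marker and the "-" placeholder are all assumptions. The idea is
that every path line carries exactly one policy-specific field: the NUMA
node list for 'numa', the queue depth for 'queue-depth', and a current-path
marker for 'round-robin'.

```python
from typing import List


def policy_field(iopolicy: str, numa_nodes: List[int],
                 queue_depth: int, is_current: bool) -> str:
    """Return the single policy-specific field for one path (hypothetical)."""
    if iopolicy == "numa":
        # "-" keeps the field present even when no node maps to this path.
        return ",".join(str(n) for n in numa_nodes) or "-"
    if iopolicy == "queue-depth":
        return str(queue_depth)
    if iopolicy == "round-robin":
        return "current" if is_current else "-"
    raise ValueError(f"unknown iopolicy: {iopolicy}")


def format_path(name: str, ana_state: str, iopolicy: str,
                numa_nodes: List[int], queue_depth: int,
                is_current: bool) -> str:
    """Build one path line with a fixed number of space-separated fields."""
    field = policy_field(iopolicy, numa_nodes, queue_depth, is_current)
    return " ".join([name, ana_state, field])


lines = [
    format_path("nvme2c2n1", "optimized", "numa", [1, 2], 0, False),
    format_path("nvme2c2n1", "optimized", "queue-depth", [1, 2], 0, False),
    format_path("nvme2c2n1", "optimized", "round-robin", [1, 2], 0, True),
    format_path("nvme2c3n1", "optimized", "round-robin", [3, 4], 0, False),
]
for line in lines:
    print(line)
```

With this layout a parser can always split a path line into the same
number of fields, whichever I/O policy is selected.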
Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


