[PATCHv2 2/4] nvme: extend show-topology command to add support for multipath
Hannes Reinecke
hare at suse.de
Tue Aug 19 04:05:16 PDT 2025
On 8/19/25 12:31, Nilay Shroff wrote:
>
>
> On 8/19/25 11:45 AM, Hannes Reinecke wrote:
>> On 8/19/25 06:49, Nilay Shroff wrote:
>>>
>>>
>>> On 8/18/25 12:52 PM, Hannes Reinecke wrote:
>>>> On 8/12/25 14:56, Nilay Shroff wrote:
>>>>> This commit enhances the show-topology command by adding support for
>>>>> NVMe multipath. With this change, users can now list all paths to a
>>>>> namespace from its corresponding head node device. Each NVMe path
>>>>> entry also includes additional details such as ANA state, NUMA node,
>>>>> and queue depth, improving visibility into multipath configurations.
>>>>> This information can be particularly helpful for debugging and
>>>>> analyzing NVMe multipath setups.
>>>>>
>>>>> To support this functionality, the "--ranking" option of the nvme
>>>>> show-topology command has been extended with a new sub-option:
>>>>> "multipath".
>>>>>
>>>>> Since this enhancement is specific to NVMe multipath, the iopolicy
>>>>> configured under each subsystem is now always displayed. Previously,
>>>>> iopolicy was shown only with nvme show-topology verbose output, but
>>>>> it is now included by default to improve usability and provide better
>>>>> context when reviewing multipath configurations via show-topology.
>>>>>
>>>>> With this update, users can view the multipath topology of a multi-
>>>>> controller/port NVMe disk using:
>>>>>
>>>>> $ nvme show-topology -r multipath
>>>>>
>>>>> nvme-subsys2 - NQN=nvmet_subsystem
>>>>>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
>>>>>                iopolicy=numa
>>>>>
>>>>>          _ _ _ <head-node>
>>>>>         /              _ _ _ <ana-state>
>>>>>        /              /         _ _ _ <numa-node-list>
>>>>>       /              /         /   _ _ _ <queue-depth>
>>>>>      |              /         /   /
>>>>> +- nvme2n1 (ns 1)  /         /   /
>>>>>     \             |         |   |
>>>>>      +- nvme2c2n1 optimized 1,2 0 nvme2 tcp traddr=127.0.0.2,trsvcid=4460,src_addr=127.0.0.1 live
>>>>>      +- nvme2c3n1 optimized 3,4 0 nvme3 tcp traddr=127.0.0.3,trsvcid=4460,src_addr=127.0.0.1 live
>>>>>
>>>>> Please note that the annotations shown above (e.g., <numa-node-list>,
>>>>> <ana-state>, <head-node>, and <queue-depth>) are included for clarity
>>>>> only and are not part of the actual output.
>>>>>
>>>>
>>>> Hmm. Why do we have values for both 'numa-node-list' and 'queue-depth'
>>>> in here? They are tied to the selected IO policy, and are pretty
>>>> meaningless if that IO policy is not selected.
>>>> Please include only the values relevant to the selected IO policy;
>>>> this will improve the readability of the resulting status string.
>>>>
>>> Okay, makes sense. So we'd print <numa-node> and exclude <queue-depth>
>>> if the iopolicy is numa; for the 'queue-depth' iopolicy we'd print
>>> <queue-depth> and exclude <numa-node>; and for the 'round-robin'
>>> iopolicy we'd print neither <numa-node> nor <queue-depth>.
>>> I'll update this in the next patch.
>>>
>> Hmm. I'd rather have _some_ value for 'round-robin', too, as otherwise
>> the number of fields will differ (making parsing harder).
>>
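Just to make the field-count question concrete, here is a rough userspace
sketch of per-policy column printing (the function and field names are made
up for illustration; this is not the actual nvme-cli code):

  #include <stdio.h>
  #include <string.h>

  /* Illustrative only: print the policy-specific column, emitting a
   * placeholder when a policy has no natural value, so that every row
   * keeps the same number of fields. */
  static void print_policy_column(const char *iopolicy,
                                  const char *numa_nodes, int qdepth)
  {
          if (!strcmp(iopolicy, "numa"))
                  printf(" %-6s", numa_nodes);    /* e.g. "1,2" */
          else if (!strcmp(iopolicy, "queue-depth"))
                  printf(" %-6d", qdepth);
          else                                    /* round-robin */
                  printf(" %-6s", "-");           /* keeps columns aligned */
  }
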
> Okay, then how about printing <numa-node> for the round-robin policy as well?
>
> I looked at the NVMe path selection code for the round-robin iopolicy, and
> it appears the kernel uses the NUMA node ID of the I/O-submitting CPU as
> the reference for path selection.
> For example, on a system with two NUMA nodes (0 and 1) and two NVMe paths
> (PA and PB):
> - If an I/O from node 0 selects PA, that choice is cached.
> - The next time the kernel receives an I/O from node 0, it retrieves the
>   cached value, sees that the last path chosen was PA, and picks the next
>   available path, PB, to forward the I/O.
>
> This way the kernel alternates between PA and PB in round-robin fashion.
> So the selection is still tied to the submitting NUMA node, just with path
> rotation layered on top. Given that, I think it makes sense to also print
> <numa-node> for the round-robin iopolicy, to keep the number of fields
> consistent while still providing meaningful context. Agreed?
>
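For reference, a toy userspace model of that per-node rotation (the real
logic is nvme_round_robin_path() in drivers/nvme/host/multipath.c, which
additionally checks ANA state and path liveness):

  #include <stdio.h>

  #define NR_PATHS 2      /* PA = 0, PB = 1 */
  #define NR_NODES 2

  /* Toy model: each NUMA node caches the last path it used, and the
   * next I/O submitted from that node advances to the following path. */
  static int last_path[NR_NODES] = { -1, -1 };

  static int rr_select(int node)
  {
          last_path[node] = (last_path[node] + 1) % NR_PATHS;
          return last_path[node];
  }

  int main(void)
  {
          for (int i = 0; i < 4; i++)
                  printf("node 0 I/O %d -> P%c\n", i, 'A' + rr_select(0));
          /* node 1 rotates independently of node 0 */
          printf("node 1 I/O 0 -> P%c\n", 'A' + rr_select(1));
          return 0;
  }
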
Can't we print the current path? That would be more meaningful than just
printing the NUMA node ...
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich