[PATCH 03/13] libmultipath: Add path selection support

Nilay Shroff nilay at linux.ibm.com
Tue Mar 3 03:01:00 PST 2026


On 3/2/26 8:41 PM, John Garry wrote:
> On 02/03/2026 12:36, Nilay Shroff wrote:
>> On 2/25/26 9:02 PM, John Garry wrote:
>>> Add code for path selection.
>>>
>>> NVMe ANA is abstracted into enum mpath_access_state. The motivation here
>>> is so that SCSI ALUA can be used. Callbacks .is_disabled, .is_optimized,
>>> .get_access_state are added to get the path access state.
>>>
>>> Path selection modes round-robin, NUMA, and queue-depth are added, same
>>> as NVMe supports.
>>>
>>> NVMe has almost like-for-like equivalents here:
>>> - __mpath_find_path() -> __nvme_find_path()
>>> - mpath_find_path() -> nvme_find_path()
>>>
>>> and similar for all introduced callee functions.
>>>
>>> Functions mpath_set_iopolicy() and mpath_get_iopolicy() are added for
>>> setting default iopolicy.
>>>
>>> A separate mpath_iopolicy structure is introduced. There is no iopolicy
>>> member included in the mpath_head structure as it may not suit NVMe,
>>> where iopolicy is per-subsystem and not per namespace.
>>>
>>> Signed-off-by: John Garry <john.g.garry at oracle.com>
>>> ---
>>>   include/linux/multipath.h |  36 ++++++
>>>   lib/multipath.c           | 251 ++++++++++++++++++++++++++++++++++++++
>>>   2 files changed, 287 insertions(+)
>>>
>>> diff --git a/include/linux/multipath.h b/include/linux/multipath.h
>>> index be9dd9fb83345..c964a1aba9c42 100644
>>> --- a/include/linux/multipath.h
>>> +++ b/include/linux/multipath.h
>>> @@ -7,6 +7,22 @@
>>>   extern const struct block_device_operations mpath_ops;
>>> +enum mpath_iopolicy_e {
>>> +    MPATH_IOPOLICY_NUMA,
>>> +    MPATH_IOPOLICY_RR,
>>> +    MPATH_IOPOLICY_QD,
>>> +};
>>> +
>>> +struct mpath_iopolicy {
>>> +    enum mpath_iopolicy_e    iopolicy;
>>> +};
>>> +
>>> +enum mpath_access_state {
>>> +    MPATH_STATE_OPTIMIZED,
>>> +    MPATH_STATE_ACTIVE,
>>> +    MPATH_STATE_INVALID    = 0xFF
>>> +};
>> Hmm so here we don't have MPATH_STATE_NONOPTIMIZED.
>> We are morphing NVME_ANA_NONOPTIMIZED as MPATH_STATE_ACTIVE.
> 
> Yes, well it is treated the same (as NVME_ANA_NONOPTIMIZED) for path 
> selection.
> 
>> Is it because SCSI doesn't have (NONOPTIMIZED) state?
> 
> It does have an active (and optimal) state, but I think that keeping 
> NVMe terminology may be better for now.
> 
>>
>>> +
>>>   struct mpath_disk {
>>>       struct gendisk        *disk;
>>>       struct kref        ref;
>>> @@ -18,10 +34,16 @@ struct mpath_disk {
>>>   struct mpath_device {
>>>       struct list_head    siblings;
>>> +    atomic_t        nr_active;
>>>       struct gendisk        *disk;
>>> +    int            numa_node;
>>>   };
>> I haven't seen any API which help set nr_active or numa_node.
> 
> I missed setting numa_node for NVMe. About nr_active, that is set/read 
> by the NVMe code, like nvme_mpath_start_request(). I did try to abstract 
> that function into a common helper, but it just becomes a mess.
> 
nvme_mpath_start_request() increments ns->ctrl->nr_active, and
nvme_mpath_end_request() decrements it. This means that nr_active is
maintained per controller. If multiple NVMe namespaces are attached to
the same controller, their I/O activity is accumulated in the single
ctrl->nr_active counter.

In contrast, libmultipath defines nr_active in struct mpath_device, 
which is referenced from struct nvme_ns. Even if we add code to update 
mpath_device->nr_active, that accounting would effectively be per 
namespace, not per controller.

The nr_active value is used by the queue-depth policy. Currently, 
mpath_queue_depth_path() accesses mpath_device->nr_active to make 
forwarding decisions. However, if mpath_device->nr_active is maintained 
per namespace, it does not correctly reflect controller-wide load when 
multiple namespaces share the same controller.
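
To illustrate the mismatch, here is a minimal userspace sketch, not kernel
code. The names model_ctrl, model_ns, model_start_io() and model_end_io()
are hypothetical, and C11 atomics stand in for the kernel's atomic_t. Each
request is accounted against both a per-namespace and a per-controller
counter so that the two can be compared:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model; these are not the kernel structures. */
struct model_ctrl {
	atomic_int nr_active;	/* controller-wide count, like ctrl->nr_active */
};

struct model_ns {
	struct model_ctrl *ctrl;
	atomic_int nr_active;	/* per-namespace count, like mpath_device->nr_active */
};

/* Account one in-flight request against both counters. */
static void model_start_io(struct model_ns *ns)
{
	atomic_fetch_add(&ns->nr_active, 1);
	atomic_fetch_add(&ns->ctrl->nr_active, 1);
}

/* Drop one in-flight request from both counters. */
static void model_end_io(struct model_ns *ns)
{
	assert(atomic_load(&ns->nr_active) > 0);
	atomic_fetch_sub(&ns->nr_active, 1);
	atomic_fetch_sub(&ns->ctrl->nr_active, 1);
}
```

With two namespaces on one controller, three requests in flight on the
first and five on the second, each per-namespace counter reports only its
own share, while the controller counter reports all eight.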

Therefore, instead of maintaining a separate nr_active in struct 
mpath_device, it may be more appropriate for mpath_queue_depth_path() to 
reference ns->ctrl->nr_active directly. In that case, nr_active could be 
removed from struct mpath_device entirely.
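
One possible shape for this, again as a userspace sketch under my own
assumptions: the path device holds a pointer to the controller-wide counter
instead of its own copy. The names model_mpath_device and
model_queue_depth_path() are hypothetical, and the selection loop is
deliberately simplified. It only picks the lowest depth and ignores the
access state, which the real queue-depth policy also considers.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model: each path device borrows its controller's counter. */
struct model_mpath_device {
	atomic_int *nr_active;	/* points at the controller-wide counter */
};

/* Return the index of the path whose controller has the fewest requests in flight. */
static size_t model_queue_depth_path(const struct model_mpath_device *devs,
				     size_t n)
{
	size_t best = 0;
	int min_depth = INT_MAX;

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		int depth = atomic_load(devs[i].nr_active);

		/* Strict comparison: on a tie the earlier path is kept. */
		if (depth < min_depth) {
			min_depth = depth;
			best = i;
		}
	}
	return best;
}
```

Because every namespace on a controller would point at the same counter,
the selection sees the controller-wide load regardless of which namespace
issued the I/O.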

Thanks,
--Nilay



