[PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info
Nilay Shroff
nilay at linux.ibm.com
Mon Oct 7 08:33:54 PDT 2024
On 10/7/24 19:34, Hannes Reinecke wrote:
> On 10/7/24 15:47, Nilay Shroff wrote:
>>
>>
>> On 10/7/24 15:44, Hannes Reinecke wrote:
>>> On 9/11/24 08:26, Nilay Shroff wrote:
>>>> NVMe native multipath supports different I/O policies for selecting the
>>>> I/O path; however, we have no visibility into which path the multipath
>>>> code selects for forwarding I/O.
>>>> This patch adds that visibility through new sysfs attribute files named
>>>> "numa_nodes" and "queue_depth" under each namespace block device path
>>>> /sys/block/nvmeXcYnZ/. We also create a "multipath" sysfs directory under
>>>> the head disk node and, from this directory, add a link to each namespace
>>>> path device this head disk node points to.
>>>>
>>>> For instance, /sys/block/nvmeXnY/multipath/ would contain a soft link to
>>>> each path the head disk node <nvmeXnY> points to:
>>>>
>>>> $ ls -1 /sys/block/nvme1n1/multipath/
>>>> nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1
>>>> nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1
>>>>
>>>> For the round-robin I/O policy, we can easily infer from the above output
>>>> that I/O workload targeted at nvme1n1 would toggle across the paths
>>>> nvme1c1n1 and nvme1c3n1.
>>>>
>>>> For the numa I/O policy, the "numa_nodes" attribute file shows the numa
>>>> nodes preferred by the respective block device path. The numa nodes value
>>>> is a comma-delimited list of nodes or A-B ranges of nodes.
>>>>
>>>> For the queue-depth I/O policy, the "queue_depth" attribute file shows the
>>>> number of active/in-flight I/O requests currently queued for each path.
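
For illustration only (this is not part of the quoted patch): a minimal sketch
of what such per-path show callbacks could look like. The callback names, the
walk over head->current_path[] and the use of ctrl->nr_active are assumptions
on my side, with iopolicy checks and error handling omitted for brevity, so
this is not necessarily what the posted patch does.

static ssize_t numa_nodes_show(struct device *dev,
			       struct device_attribute *attr, char *buf)
{
	int node, srcu_idx;
	nodemask_t numa_nodes;
	struct nvme_ns *current_ns;
	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
	struct nvme_ns_head *head = ns->head;

	nodes_clear(numa_nodes);

	/* Collect the nodes for which this path is the current I/O path. */
	srcu_idx = srcu_read_lock(&head->srcu);
	for_each_node(node) {
		current_ns = srcu_dereference(head->current_path[node],
					      &head->srcu);
		if (ns == current_ns)
			node_set(node, numa_nodes);
	}
	srcu_read_unlock(&head->srcu, srcu_idx);

	/* Prints a list such as "0-1,4", matching the format described above. */
	return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&numa_nodes));
}

static ssize_t queue_depth_show(struct device *dev,
				struct device_attribute *attr, char *buf)
{
	struct nvme_ns *ns = nvme_get_ns_from_dev(dev);

	/* nr_active is the per-controller in-flight counter maintained by
	 * the queue-depth iopolicy. */
	return sysfs_emit(buf, "%d\n", atomic_read(&ns->ctrl->nr_active));
}

Both callbacks would then be registered as read-only device attributes (e.g.
via DEVICE_ATTR_RO()) on the nvmeXcYnZ path devices, so that
"cat /sys/block/nvme1c1n1/numa_nodes" and ".../queue_depth" report the
per-path state.
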
>>>>
>>>> Signed-off-by: Nilay Shroff <nilay at linux.ibm.com>
>>>> ---
>>>> drivers/nvme/host/core.c | 3 ++
>>>> drivers/nvme/host/multipath.c | 71 +++++++++++++++++++++++++++++++++++
>>>> drivers/nvme/host/nvme.h | 20 ++++++++--
>>>> drivers/nvme/host/sysfs.c | 20 ++++++++++
>>>> 4 files changed, 110 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index 983909a600ad..6be29fd64236 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -3951,6 +3951,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>>>> if (!nvme_ns_head_multipath(ns->head))
>>>> nvme_cdev_del(&ns->cdev, &ns->cdev_device);
>>>> +
>>>> + nvme_mpath_remove_sysfs_link(ns);
>>>> +
>>>> del_gendisk(ns->disk);
>>>> mutex_lock(&ns->ctrl->namespaces_lock);
>>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>>> index 518e22dd4f9b..7d9c36a7a261 100644
>>>> --- a/drivers/nvme/host/multipath.c
>>>> +++ b/drivers/nvme/host/multipath.c
>>>> @@ -654,6 +654,8 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
>>>> nvme_add_ns_head_cdev(head);
>>>> }
>>>> + nvme_mpath_add_sysfs_link(ns);
>>>> +
>>>> mutex_lock(&head->lock);
>>>> if (nvme_path_is_optimized(ns)) {
>>>> int node, srcu_idx;
>>> Nearly there.
>> Thank you for your review comments!
>>
>>>
>>> You can only call 'nvme_mpath_add_sysfs_link()' if the gendisk on the head has been created.
>>>
>>> And there is one branch in nvme_mpath_add_disk():
>>>
>>>	if (desc.state) {
>>>		/* found the group desc: update */
>>>		nvme_update_ns_ana_state(&desc, ns);
>>>
>>> which does not go via nvme_mpath_set_live(), yet a device link would need to be created here, too.
>>> But you can't call nvme_mpath_add_sysfs_link() from nvme_mpath_add_disk(), as the actual gendisk
>>> might only be created later on during ANA log parsing.
>>> It is a tangle, and I haven't found a good way out of this.
>>> But I am _very much_ in favour of having these links, so please
>>> update your patch.
>>>
>> In case the disk supports ANA groups, then yes, it would go through nvme_mpath_add_disk()->nvme_update_ns_ana_state();
>> and later nvme_update_ns_ana_state() would also fall through to nvme_mpath_set_live(), where we call
>> nvme_mpath_add_sysfs_link().
>>
>> So I think that in any case, while a multipath namespace is being created, it has to go through the
>> nvme_mpath_set_live() function. And as we see in nvme_mpath_set_live(), we only create the sysfs
>> link after the gendisk on the head has been created. Do you agree with this? Or please let
>> me know if you have any further questions.
>>
> But it doesn't:
>
>	if (nvme_state_is_live(ns->ana_state) &&
>	    nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE)
>		nvme_mpath_set_live(ns);
>
> so if a namespace is being created for which nvme_state_is_live() returns 'false', nvme_mpath_set_live() is not called, and no links are created.
> This can happen e.g. if the first namespace to be encountered is in any state other than 'optimized' or 'non-optimized'.
>
OK, I got what you're suggesting here. So in this particular case, when the ANA state of a shared
namespace is neither "optimized" nor "non-optimized", we would have the gendisk for the shared
namespace (i.e. nvmeXcYnZ) created, but we wouldn't yet have the gendisk for the corresponding head
node (i.e. nvmeXnY) created. So without the gendisk for the head node, how could we create a link
from it to the namespace node? The link from the head node gendisk would eventually be created when
the ANA state of the namespace transitions to the "optimized" or "non-optimized" state. I think it's
not possible anyway to have the multipathing function enabled until the gendisk for the head node is
created, is it? So I don't yet understand why we really need the device link created while the
gendisk for the head node isn't ready. Am I missing anything here?
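
To illustrate the point, a rough sketch only (the GD_ADDED check, the use of
sysfs_add_link_to_group() and the assumption that a "multipath" attribute group
has already been registered on the head disk are mine, not necessarily what the
next revision will do): nvme_mpath_add_sysfs_link() could simply bail out until
the head gendisk has actually been registered, and be called again from
nvme_mpath_set_live() once the ANA state transitions:

void nvme_mpath_add_sysfs_link(struct nvme_ns *ns)
{
	struct device *target;
	struct nvme_ns_head *head = ns->head;

	/* Nothing to link against until the head gendisk is registered. */
	if (!head->disk || !test_bit(GD_ADDED, &head->disk->state))
		return;

	/*
	 * Create /sys/block/nvmeXnY/multipath/nvmeXcYnZ pointing at the
	 * path device; assumes the "multipath" group already exists on
	 * the head disk.
	 */
	target = disk_to_dev(ns->disk);
	if (sysfs_add_link_to_group(&disk_to_dev(head->disk)->kobj,
				    "multipath", &target->kobj,
				    dev_name(target)))
		dev_warn(target, "failed to create multipath sysfs link\n");
}

The removal side (nvme_mpath_remove_sysfs_link() in the quoted nvme_ns_remove()
hunk) would then do the mirror operation, e.g. via sysfs_remove_link_from_group().
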
> Cheers,
>
> Hannes
Thanks,
--Nilay