[PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth"

Sagi Grimberg sagi at grimberg.me
Wed Sep 27 04:02:49 PDT 2023


>> The existing iopolicies are inefficient in some cases, such as
>> the presence of a path with high latency. The round-robin
>> policy would use that path equally with faster paths, which
>> results in sub-optimal performance.
>>
>> The queue-depth policy instead sends I/O requests down the path
>> with the least amount of requests in its request queue. Paths
>> with lower latency will clear requests more quickly and have less
>> requests in their queues compared to "bad" paths. The aim is to
>> use those paths the most to bring down overall latency.
>>
>> [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>>        and compilation warnings, updated MODULE_PARM description
>>        and changed to use new blk_mq_queue_nr_active() instead of adding
>>        additional atomic counters.]
>>
>> Co-developed-by: Thomas Song <tsong at purestorage.com>
>> Signed-off-by: Ewan D. Milne <emilne at redhat.com>
>> ---
>>   drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++--
>>   drivers/nvme/host/nvme.h      |  1 +
>>   2 files changed, 47 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c 
>> b/drivers/nvme/host/multipath.c
>> index 0a88d7bdc5e3..deea7fd4aa95 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>>   static const char *nvme_iopolicy_names[] = {
>>       [NVME_IOPOLICY_NUMA]    = "numa",
>>       [NVME_IOPOLICY_RR]    = "round-robin",
>> +    [NVME_IOPOLICY_QD]      = "queue-depth",
>>   };
>>   static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const 
>> struct kernel_param *kp)
>>           iopolicy = NVME_IOPOLICY_NUMA;
>>       else if (!strncmp(val, "round-robin", 11))
>>           iopolicy = NVME_IOPOLICY_RR;
>> +    else if (!strncmp(val, "queue-depth", 11))
>> +        iopolicy = NVME_IOPOLICY_QD;
>>       else
>>           return -EINVAL;
>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct 
>> kernel_param *kp)
>>   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>>       &iopolicy, 0644);
>>   MODULE_PARM_DESC(iopolicy,
>> -    "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>> +    "Default multipath I/O policy; 'numa' (default) , 'round-robin' 
>> or 'queue-depth'");
>>   void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>   {
>> @@ -329,6 +332,41 @@ static struct nvme_ns 
>> *nvme_round_robin_path(struct nvme_ns_head *head,
>>       return found;
>>   }
>> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
>> +        int node)
>> +{
>> +    struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
>> +    unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
>> +    unsigned int depth;
>> +
>> +    list_for_each_entry_rcu(ns, &head->list, siblings) {
>> +        if (nvme_path_is_disabled(ns))
>> +            continue;
>> +
>> +        depth = blk_mq_queue_nr_active(ns->queue);
> 
> That would require an 'atomic_read' on each path for each I/O.
> Which absolutely sucks for performance.

Actually in the current implementation its for every hctx in every
path...

> Considering that we really are only interested in a rough idea about 
> path usage,

I am not sure that a "rough idea" is sufficient here.

> wouldn't it be better to use READ_ONCE()/WRITE_ONCE()
> for accumulating the nr of requests per queue?

I'm assuming that WRITE_ONCE() is a slip up because the path
selector should absolutely not update the hctx active count.

I think its unnatural to access atomic variables non-atomically, and at 
the very least requires a code comment to why this is ok. But regardless
we need some testing to tell us if this is OK to do.

This only matters when there are less hctx than running cpus, but that
is a probable case for fabrics.



More information about the Linux-nvme mailing list