[PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

Sagi Grimberg sagi at grimberg.me
Tue May 21 02:45:44 PDT 2024



On 21/05/2024 11:48, Nilay Shroff wrote:
>
> On 5/21/24 01:50, John Meneghini wrote:
>> From: "Ewan D. Milne" <emilne at redhat.com>
>>
>> The round-robin path selector is inefficient in cases where there is a
>> difference in latency between multiple active optimized paths.  In the
>> presence of one or more high latency paths the round-robin selector
>> continues to the high latency path equally. This results in a bias
>> towards the highest latency path and can cause is significant decrease
>> in overall performance as IOs pile on the lowest latency path. This
>> problem is particularly accute with NVMe-oF controllers.
>>
>> The queue-depth policy instead sends I/O requests down the path with the
>> least amount of requests in its request queue. Paths with lower latency
>> will clear requests more quickly and have less requests in their queues
>> compared to higher latency paths. The goal of this path selector is to
>> make more use of lower latency paths, which will bring down overall IO
>> latency.
>>
>> Signed-off-by: Ewan D. Milne <emilne at redhat.com>
>> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>>        and compilation warnings, updated MODULE_PARM description, and
>>        fixed potential issue with ->current_path[] being used]
>> Signed-off-by: Thomas Song <tsong at purestorage.com>
>> [jmeneghi: vairious changes and improvements, addressed review comments]
>> Signed-off-by: John Meneghini <jmeneghi at redhat.com>
>> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/
>> Tested-by: Marco Patalano <mpatalan at redhat.com>
>> Reviewed-by: Randy Jennings <randyj at redhat.com>
>> Tested-by: Jyoti Rani <jani at purestorage.com>
>> ---
>>   drivers/nvme/host/core.c      |  2 +-
>>   drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
>>   drivers/nvme/host/nvme.h      |  9 ++++
>>   3 files changed, 92 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index a066429b790d..1dd7c52293ff 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
>>   EXPORT_SYMBOL_GPL(nvme_delete_wq);
>>   
>>   static LIST_HEAD(nvme_subsystems);
>> -static DEFINE_MUTEX(nvme_subsystems_lock);
>> +DEFINE_MUTEX(nvme_subsystems_lock);
>>   
>>   static DEFINE_IDA(nvme_instance_ida);
>>   static dev_t nvme_ctrl_base_chr_devt;
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 5397fb428b24..0e2b6e720e95 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>>   static const char *nvme_iopolicy_names[] = {
>>   	[NVME_IOPOLICY_NUMA]	= "numa",
>>   	[NVME_IOPOLICY_RR]	= "round-robin",
>> +	[NVME_IOPOLICY_QD]      = "queue-depth",
>>   };
>>   
>>   static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>   		iopolicy = NVME_IOPOLICY_NUMA;
>>   	else if (!strncmp(val, "round-robin", 11))
>>   		iopolicy = NVME_IOPOLICY_RR;
>> +	else if (!strncmp(val, "queue-depth", 11))
>> +		iopolicy = NVME_IOPOLICY_QD;
>>   	else
>>   		return -EINVAL;
>>   
>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>>   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>>   	&iopolicy, 0644);
>>   MODULE_PARM_DESC(iopolicy,
>> -	"Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>> +	"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>>   
>>   void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>   {
>> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
>>   	struct nvme_ns *ns = rq->q->queuedata;
>>   	struct gendisk *disk = ns->head->disk;
>>   
>> +	if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
>> +		atomic_inc(&ns->ctrl->nr_active);
>> +		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
>> +	}
>> +
>>   	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
>>   		return;
>>   
>> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
>>   {
>>   	struct nvme_ns *ns = rq->q->queuedata;
>>   
>> +	if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>> +		atomic_dec_if_positive(&ns->ctrl->nr_active);
>> +
>>   	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>   		return;
>> +
>>   	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>   			 blk_rq_bytes(rq) >> SECTOR_SHIFT,
>>   			 nvme_req(rq)->start_time);
>> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
>>   	return found;
>>   }
>>   
> I think you may also want to reset nr_active counter if in case, in-flight nvme request
> is cancelled. If the request is cancelled then nvme_mpath_end_request() wouldn't be invoked.
> So you may want to reset nr_active counter from nvme_cancel_request() as below:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index bf7615cb36ee..4fea7883ce8e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
>   
>   bool nvme_cancel_request(struct request *req, void *data)
>   {
> -       dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
> -                               "Cancelling I/O %d", req->tag);
> +       struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
> +
> +       dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
>   
>          /* don't abort one completed or idle request */
>          if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
> @@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
>   
>          nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
>          nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> +       if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
> +               atomic_dec(&ctrl->nr_active);

Don't think this matters because cancellation only happens when we
teardown the controller anyways...

btw, can we have a better name than nr_active? this is just for IO and only
for multipath. Maybe ctrl->nr_mpath_io_active ?

Also perhaps rename the flag to NVME_MPATH_CTRL_IO_ACCOUNTING ?



More information about the Linux-nvme mailing list