[PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"
Sagi Grimberg
sagi at grimberg.me
Tue May 21 02:45:44 PDT 2024
On 21/05/2024 11:48, Nilay Shroff wrote:
>
> On 5/21/24 01:50, John Meneghini wrote:
>> From: "Ewan D. Milne" <emilne at redhat.com>
>>
>> The round-robin path selector is inefficient in cases where there is a
>> difference in latency between multiple active optimized paths. In the
>> presence of one or more high latency paths the round-robin selector
>> continues to use the high latency path equally. This results in a bias
>> towards the highest latency path and can cause a significant decrease
>> in overall performance as IOs pile on the highest latency path. This
>> problem is particularly acute with NVMe-oF controllers.
>>
>> The queue-depth policy instead sends I/O requests down the path with the
>> fewest requests in its request queue. Paths with lower latency
>> will clear requests more quickly and have fewer requests in their queues
>> compared to higher latency paths. The goal of this path selector is to
>> make more use of lower latency paths, which will bring down overall IO
>> latency.
>>
>> Signed-off-by: Ewan D. Milne <emilne at redhat.com>
>> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>> and compilation warnings, updated MODULE_PARM description, and
>> fixed potential issue with ->current_path[] being used]
>> Signed-off-by: Thomas Song <tsong at purestorage.com>
>> [jmeneghi: various changes and improvements, addressed review comments]
>> Signed-off-by: John Meneghini <jmeneghi at redhat.com>
>> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/
>> Tested-by: Marco Patalano <mpatalan at redhat.com>
>> Reviewed-by: Randy Jennings <randyj at redhat.com>
>> Tested-by: Jyoti Rani <jani at purestorage.com>
>> ---
>> drivers/nvme/host/core.c | 2 +-
>> drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
>> drivers/nvme/host/nvme.h | 9 ++++
>> 3 files changed, 92 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index a066429b790d..1dd7c52293ff 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
>> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>>
>> static LIST_HEAD(nvme_subsystems);
>> -static DEFINE_MUTEX(nvme_subsystems_lock);
>> +DEFINE_MUTEX(nvme_subsystems_lock);
>>
>> static DEFINE_IDA(nvme_instance_ida);
>> static dev_t nvme_ctrl_base_chr_devt;
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 5397fb428b24..0e2b6e720e95 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>> static const char *nvme_iopolicy_names[] = {
>> [NVME_IOPOLICY_NUMA] = "numa",
>> [NVME_IOPOLICY_RR] = "round-robin",
>> + [NVME_IOPOLICY_QD] = "queue-depth",
>> };
>>
>> static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>> iopolicy = NVME_IOPOLICY_NUMA;
>> else if (!strncmp(val, "round-robin", 11))
>> iopolicy = NVME_IOPOLICY_RR;
>> + else if (!strncmp(val, "queue-depth", 11))
>> + iopolicy = NVME_IOPOLICY_QD;
>> else
>> return -EINVAL;
>>
>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>> &iopolicy, 0644);
>> MODULE_PARM_DESC(iopolicy,
>> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>> + "Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
>>
>> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>> {
>> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
>> struct nvme_ns *ns = rq->q->queuedata;
>> struct gendisk *disk = ns->head->disk;
>>
>> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
>> + atomic_inc(&ns->ctrl->nr_active);
>> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
>> + }
>> +
>> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
>> return;
>>
>> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
>> {
>> struct nvme_ns *ns = rq->q->queuedata;
>>
>> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>> + atomic_dec_if_positive(&ns->ctrl->nr_active);
>> +
>> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>> return;
>> +
>> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>> blk_rq_bytes(rq) >> SECTOR_SHIFT,
>> nvme_req(rq)->start_time);
>> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
>> return found;
>> }
>>
> I think you may also want to decrement the nr_active counter when an in-flight nvme request
> is cancelled. If the request is cancelled then nvme_mpath_end_request() wouldn't be invoked.
> So you may want to decrement nr_active from nvme_cancel_request() as below:
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index bf7615cb36ee..4fea7883ce8e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
>
> bool nvme_cancel_request(struct request *req, void *data)
> {
> - dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
> - "Cancelling I/O %d", req->tag);
> + struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
> +
> + dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
>
> /* don't abort one completed or idle request */
> if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
> @@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
>
> nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> + if ((nvme_req(req)->flags & NVME_MPATH_CNT_ACTIVE))
> + atomic_dec(&ctrl->nr_active);
I don't think this matters, because cancellation only happens when we
tear down the controller anyway.
Btw, can we have a better name than nr_active? It only counts I/O, and only
for multipath. Maybe ctrl->nr_mpath_io_active?
Also, perhaps rename the flag to NVME_MPATH_CTRL_IO_ACCOUNTING?
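
To make the accounting under discussion concrete, here is a minimal user-space sketch in C11, not kernel code. It assumes the selector simply picks the usable path with the fewest in-flight requests, as the commit message describes; the selector hunk itself is not quoted in this mail. struct path, its usable field, and the three function names are hypothetical stand-ins, and nr_mpath_io_active is only the name proposed in this review.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a controller/path; not struct nvme_ctrl. */
struct path {
	atomic_int nr_mpath_io_active;	/* name proposed in review, an assumption */
	int usable;			/* stand-in for "path is live and optimized" */
};

/* Count a request at submission, like the atomic_inc() added to
 * nvme_mpath_start_request() in the patch. */
static void path_start_io(struct path *p)
{
	atomic_fetch_add(&p->nr_mpath_io_active, 1);
}

/* Drop the count at completion without ever going below zero, which is
 * the behaviour of atomic_dec_if_positive() used in the patch. */
static void path_end_io(struct path *p)
{
	int cur = atomic_load(&p->nr_mpath_io_active);

	/* A failed compare-exchange reloads cur, so the loop just retries. */
	while (cur > 0 &&
	       !atomic_compare_exchange_weak(&p->nr_mpath_io_active,
					     &cur, cur - 1))
		;
}

/* Return the usable path with the fewest in-flight requests, or NULL
 * if no path is usable. Ties go to the first path found. */
static struct path *select_min_depth(struct path *paths, size_t npaths)
{
	struct path *best = NULL;
	int best_depth = INT_MAX;

	assert(paths != NULL || npaths == 0);

	for (size_t i = 0; i < npaths; i++) {
		int depth;

		if (!paths[i].usable)
			continue;
		depth = atomic_load(&paths[i].nr_mpath_io_active);
		if (depth < best_depth) {
			best_depth = depth;
			best = &paths[i];
		}
	}
	return best;
}
```

In this sketch path_end_io() refuses to go below zero, so an unbalanced completion cannot push the count negative and make that path look like the least loaded one. A request that is started but never ended would leave the count inflated instead, which is the cancellation case raised above.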