[PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"
Nilay Shroff
nilay at linux.ibm.com
Tue May 21 03:07:10 PDT 2024
On 5/21/24 15:15, Sagi Grimberg wrote:
>
>
> On 21/05/2024 11:48, Nilay Shroff wrote:
>>
>> On 5/21/24 01:50, John Meneghini wrote:
>>> From: "Ewan D. Milne" <emilne at redhat.com>
>>>
>>> The round-robin path selector is inefficient in cases where there is a
>>> difference in latency between multiple active optimized paths. In the
>>> presence of one or more high latency paths the round-robin selector
>>> continues to use the high latency path equally. This results in a bias
>>> towards the highest latency path and can cause a significant decrease
>>> in overall performance as IOs pile on the highest latency path. This
>>> problem is particularly acute with NVMe-oF controllers.
>>>
>>> The queue-depth policy instead sends I/O requests down the path with the
>>> fewest requests in its request queue. Paths with lower latency
>>> will clear requests more quickly and have fewer requests in their queues
>>> compared to higher latency paths. The goal of this path selector is to
>>> make more use of lower latency paths, which will bring down overall IO
>>> latency.
>>>
>>> Signed-off-by: Ewan D. Milne <emilne at redhat.com>
>>> [tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
>>> and compilation warnings, updated MODULE_PARM description, and
>>> fixed potential issue with ->current_path[] being used]
>>> Signed-off-by: Thomas Song <tsong at purestorage.com>
>>> [jmeneghi: various changes and improvements, addressed review comments]
>>> Signed-off-by: John Meneghini <jmeneghi at redhat.com>
>>> Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/
>>> Tested-by: Marco Patalano <mpatalan at redhat.com>
>>> Reviewed-by: Randy Jennings <randyj at redhat.com>
>>> Tested-by: Jyoti Rani <jani at purestorage.com>
>>> ---
>>> drivers/nvme/host/core.c | 2 +-
>>> drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
>>> drivers/nvme/host/nvme.h | 9 ++++
>>> 3 files changed, 92 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index a066429b790d..1dd7c52293ff 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
>>> EXPORT_SYMBOL_GPL(nvme_delete_wq);
>>> static LIST_HEAD(nvme_subsystems);
>>> -static DEFINE_MUTEX(nvme_subsystems_lock);
>>> +DEFINE_MUTEX(nvme_subsystems_lock);
>>> static DEFINE_IDA(nvme_instance_ida);
>>> static dev_t nvme_ctrl_base_chr_devt;
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>>> index 5397fb428b24..0e2b6e720e95 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
>>> static const char *nvme_iopolicy_names[] = {
>>> [NVME_IOPOLICY_NUMA] = "numa",
>>> [NVME_IOPOLICY_RR] = "round-robin",
>>> + [NVME_IOPOLICY_QD] = "queue-depth",
>>> };
>>> static int iopolicy = NVME_IOPOLICY_NUMA;
>>> @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>> iopolicy = NVME_IOPOLICY_NUMA;
>>> else if (!strncmp(val, "round-robin", 11))
>>> iopolicy = NVME_IOPOLICY_RR;
>>> + else if (!strncmp(val, "queue-depth", 11))
>>> + iopolicy = NVME_IOPOLICY_QD;
>>> else
>>> return -EINVAL;
>>> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>>> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>>> &iopolicy, 0644);
>>> MODULE_PARM_DESC(iopolicy,
>>> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
>>> + "Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
>>> void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>> {
>>> @@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
>>> struct nvme_ns *ns = rq->q->queuedata;
>>> struct gendisk *disk = ns->head->disk;
>>> + if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
>>> + atomic_inc(&ns->ctrl->nr_active);
>>> + nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
>>> + }
>>> +
>>> if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
>>> return;
>>> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
>>> {
>>> struct nvme_ns *ns = rq->q->queuedata;
>>> + if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
>>> + atomic_dec_if_positive(&ns->ctrl->nr_active);
>>> +
>>> if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
>>> return;
>>> +
>>> bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
>>> blk_rq_bytes(rq) >> SECTOR_SHIFT,
>>> nvme_req(rq)->start_time);
>>> @@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
>>> return found;
>>> }
>>>
>> I think you may also want to decrement the nr_active counter in case an in-flight
>> nvme request is cancelled. If the request is cancelled then nvme_mpath_end_request()
>> wouldn't be invoked, so you may want to drop the nr_active count from
>> nvme_cancel_request() as below:
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index bf7615cb36ee..4fea7883ce8e 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
>> bool nvme_cancel_request(struct request *req, void *data)
>> {
>> - dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
>> - "Cancelling I/O %d", req->tag);
>> + struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
>> +
>> + dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
>> /* don't abort one completed or idle request */
>> if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
>> @@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
>> nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
>> nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>> + if (nvme_req(req)->flags & NVME_MPATH_CNT_ACTIVE)
>> + atomic_dec(&ctrl->nr_active);
>
> Don't think this matters because cancellation only happens when we
> teardown the controller anyways...
>
I think when we reset the nvme controller we don't tear down the
controller, do we? In that case we cancel all pending requests, and
later restart the controller.
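For example, the fabrics transports cancel every in-flight request during
error recovery/reset through nvme_cancel_tagset() and then reconnect to the
same controller object. Roughly, paraphrasing from core.c (not the exact code):

/*
 * Called from the reset/error-recovery paths (e.g. nvme-tcp, nvme-rdma)
 * before reconnecting -- the controller itself is not deleted.
 */
void nvme_cancel_tagset(struct nvme_ctrl *ctrl)
{
	if (ctrl->tagset) {
		/* runs nvme_cancel_request() for every in-flight request */
		blk_mq_tagset_busy_iter(ctrl->tagset,
				nvme_cancel_request, ctrl);
		blk_mq_tagset_wait_completed_request(ctrl->tagset);
	}
}

So a request marked NVME_MPATH_CNT_ACTIVE that gets cancelled on this path
would leave nr_active elevated after the controller comes back online,
unless the count is dropped in nvme_cancel_request().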
Thanks,
--Nilay