[PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth"
Ewan Milne
emilne at redhat.com
Wed Sep 27 06:42:07 PDT 2023
Obviously, looking at the number of active requests on all paths
for each I/O has a cost, but the tradeoff is that you pay some CPU
time in order to get better path distribution/utilization and, hopefully,
lower I/O latency. And you only pay this cost when queue-depth is in use;
otherwise the patches barely touch the code path.
I am not an expert in modern CPU cache architecture, but the updates
to the atomic counters on request submission/completion are all per-hctx,
and the path selector only does a read, which I believe is a plain
(no LOCK prefix) load on x86. The remaining cost would be the CPU
touching the cacheline.
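
For reference, the read the path selector does is just a sum of the
per-hctx counters. A minimal sketch of the shape I mean (the real
blk_mq_queue_nr_active() is added in patch 1/3, and its details may
differ):

#include <linux/blk-mq.h>

/* Sketch: sum the per-hctx active-request counters for a queue. */
static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;
	unsigned int nr = 0;

	/* atomic_read() here is a plain load on x86 -- no LOCK prefix */
	queue_for_each_hw_ctx(q, hctx, i)
		nr += atomic_read(&hctx->nr_active);

	return nr;
}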
I tried some alternatives that did not read the most current value of
nr_active for *each* I/O, but didn't get suitable behavior. I/O requests
can be submitted very quickly, so a lazy computation may not adapt fast
enough.
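
For illustration, one lazy variant I tried looked roughly like this
(the struct and helper names here are made up for this sketch and are
not part of the patch):

#include <linux/blk-mq.h>
#include <linux/jiffies.h>

/* Illustrative only: a cached depth refreshed at most ~once per jiffy. */
struct lazy_depth {
	unsigned int cached;		/* last computed queue depth */
	unsigned long next_update;	/* jiffies time of the next refresh */
};

static unsigned int lazy_queue_depth(struct nvme_ns *ns,
				     struct lazy_depth *ld)
{
	if (time_after(jiffies, READ_ONCE(ld->next_update))) {
		/*
		 * Racy by design: concurrent refreshes can clobber each
		 * other, but we only want a rough value anyway.
		 */
		WRITE_ONCE(ld->cached, blk_mq_queue_nr_active(ns->queue));
		WRITE_ONCE(ld->next_update, jiffies + 1);
	}
	return READ_ONCE(ld->cached);
}

It avoids the per-I/O atomic_read()s, but under a heavy submission rate
the cached value goes stale faster than it refreshes, and the selector
keeps piling I/O onto a path whose depth has already grown.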
-Ewan
On Wed, Sep 27, 2023 at 3:38 AM Hannes Reinecke <hare at suse.de> wrote:
>
> On 9/25/23 18:31, Ewan D. Milne wrote:
> > The existing iopolicies are inefficient in some cases, such as
> > the presence of a path with high latency. The round-robin
> > policy would use that path equally with faster paths, which
> > results in sub-optimal performance.
> >
> > The queue-depth policy instead sends I/O requests down the path
> > with the fewest requests in its request queue. Paths
> > with lower latency will clear requests more quickly and have fewer
> > requests in their queues compared to "bad" paths. The aim is to
> > use those paths the most to bring down overall latency.
> >
> > [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
> > and compilation warnings, updated MODULE_PARM description
> > and changed to use new blk_mq_queue_nr_active() instead of adding
> > additional atomic counters.]
> >
> > Co-developed-by: Thomas Song <tsong at purestorage.com>
> > Signed-off-by: Ewan D. Milne <emilne at redhat.com>
> > ---
> > drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++--
> > drivers/nvme/host/nvme.h | 1 +
> > 2 files changed, 47 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > index 0a88d7bdc5e3..deea7fd4aa95 100644
> > --- a/drivers/nvme/host/multipath.c
> > +++ b/drivers/nvme/host/multipath.c
> > @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
> > static const char *nvme_iopolicy_names[] = {
> > [NVME_IOPOLICY_NUMA] = "numa",
> > [NVME_IOPOLICY_RR] = "round-robin",
> > + [NVME_IOPOLICY_QD] = "queue-depth",
> > };
> >
> > static int iopolicy = NVME_IOPOLICY_NUMA;
> > @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> > iopolicy = NVME_IOPOLICY_NUMA;
> > else if (!strncmp(val, "round-robin", 11))
> > iopolicy = NVME_IOPOLICY_RR;
> > + else if (!strncmp(val, "queue-depth", 11))
> > + iopolicy = NVME_IOPOLICY_QD;
> > else
> > return -EINVAL;
> >
> > @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> > module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> > &iopolicy, 0644);
> > MODULE_PARM_DESC(iopolicy,
> > - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> > + "Default multipath I/O policy; 'numa' (default), 'round-robin' or 'queue-depth'");
> >
> > void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
> > {
> > @@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> > return found;
> > }
> >
> > +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
> > + int node)
> > +{
> > + struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> > + unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> > + unsigned int depth;
> > +
> > + list_for_each_entry_rcu(ns, &head->list, siblings) {
> > + if (nvme_path_is_disabled(ns))
> > + continue;
> > +
> > + depth = blk_mq_queue_nr_active(ns->queue);
>
> That would require an 'atomic_read' on each path for each I/O.
> Which absolutely sucks for performance.
>
> Considering that we really are only interested in a rough idea of
> path usage, wouldn't it be better to use READ_ONCE()/WRITE_ONCE()
> for accumulating the number of requests per queue?
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> hare at suse.de +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>