[PATCH 2/3] nvme-multipath: Implement new iopolicy "queue-depth"

Ewan Milne emilne at redhat.com
Wed Sep 27 06:42:07 PDT 2023


Obviously, having to look at the number of active requests on all paths
for each I/O has a performance cost; the tradeoff is that you pay some
CPU time in exchange for better path distribution/utilization and,
hopefully, lower I/O latency.  And you only pay this cost if the
queue-depth policy is actually selected; otherwise the patches barely
touch the I/O path (see the sketch below).
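
To make that concrete, the path selection ends up shaped roughly like
this (a sketch only; the function names and signatures here are
simplified for illustration, not copied from the real nvme_find_path()
plumbing in the patch):

static struct nvme_ns *nvme_select_path_sketch(struct nvme_ns_head *head,
		int node)
{
	/*
	 * The queue-depth scan (and its atomic_read()s) only runs
	 * when that iopolicy is selected; numa and round-robin
	 * users never execute it.
	 */
	switch (READ_ONCE(head->subsys->iopolicy)) {
	case NVME_IOPOLICY_QD:
		return nvme_queue_depth_path(head, node);
	case NVME_IOPOLICY_RR:
		return nvme_round_robin_path(head, node);
	default:	/* NVME_IOPOLICY_NUMA */
		return nvme_numa_path(head, node);
	}
}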

I am not an expert in modern CPU cache architecture, but the updates
to the atomic counters on request submission and completion are all
per-hctx, and the path selector only does a read, which I believe is a
plain (no LOCK prefix) load on x86.  The remaining cost is the CPU
touching the cacheline.
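
Something like this is the shape of it (illustrative; not necessarily
the exact blk_mq_queue_nr_active() from patch 1/3):

static unsigned int queue_nr_active_sketch(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long i;
	unsigned int nr = 0;

	/*
	 * Plain atomic_read()s: ordinary loads on x86, no LOCK
	 * prefix.  The submission/completion side increments and
	 * decrements its own hctx counter, so writers stay on
	 * per-hctx cachelines.
	 */
	queue_for_each_hw_ctx(q, hctx, i)
		nr += atomic_read(&hctx->nr_active);
	return nr;
}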

I tried some alternatives that did not read the most current value of
nr_active for *each* I/O, but I didn't get suitable behavior.  I/O
requests can be submitted very quickly, so a lazy computation may not
adapt fast enough to keep a slow path from absorbing a burst.
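
For example, the kind of lazy variant I mean (illustrative only, not
the exact code I tried) caches the sum and refreshes it at most once
a jiffy:

static unsigned int queue_nr_active_lazy(struct request_queue *q,
		unsigned int *cached, unsigned long *next_jiffies)
{
	/*
	 * Every I/O submitted within the same jiffy sees the same
	 * stale value, so a sudden burst can keep piling onto a
	 * path whose queue depth has already blown up.
	 */
	if (time_after(jiffies, READ_ONCE(*next_jiffies))) {
		WRITE_ONCE(*cached, queue_nr_active_sketch(q));
		WRITE_ONCE(*next_jiffies, jiffies + 1);
	}
	return READ_ONCE(*cached);
}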

-Ewan

On Wed, Sep 27, 2023 at 3:38 AM Hannes Reinecke <hare at suse.de> wrote:
>
> On 9/25/23 18:31, Ewan D. Milne wrote:
> > The existing iopolicies are inefficient in some cases, such as
> > in the presence of a path with high latency. The round-robin
> > policy would use that path equally with faster paths, which
> > results in sub-optimal performance.
> >
> > The queue-depth policy instead sends I/O requests down the path
> > with the fewest requests in its request queue. Paths with lower
> > latency will clear requests more quickly and have fewer requests
> > in their queues compared to "bad" paths. The aim is to use those
> > paths the most to bring down overall latency.
> >
> > [edm: patch developed by Thomas Song @ Pure Storage, fixed whitespace
> >        and compilation warnings, updated MODULE_PARM description
> >        and changed to use new blk_mq_queue_nr_active() instead of adding
> >        additional atomic counters.]
> >
> > Co-developed-by: Thomas Song <tsong at purestorage.com>
> > Signed-off-by: Ewan D. Milne <emilne at redhat.com>
> > ---
> >   drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++++++++++--
> >   drivers/nvme/host/nvme.h      |  1 +
> >   2 files changed, 47 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > index 0a88d7bdc5e3..deea7fd4aa95 100644
> > --- a/drivers/nvme/host/multipath.c
> > +++ b/drivers/nvme/host/multipath.c
> > @@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
> >   static const char *nvme_iopolicy_names[] = {
> >       [NVME_IOPOLICY_NUMA]    = "numa",
> >       [NVME_IOPOLICY_RR]      = "round-robin",
> > +     [NVME_IOPOLICY_QD]      = "queue-depth",
> >   };
> >
> >   static int iopolicy = NVME_IOPOLICY_NUMA;
> > @@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
> >               iopolicy = NVME_IOPOLICY_NUMA;
> >       else if (!strncmp(val, "round-robin", 11))
> >               iopolicy = NVME_IOPOLICY_RR;
> > +     else if (!strncmp(val, "queue-depth", 11))
> > +             iopolicy = NVME_IOPOLICY_QD;
> >       else
> >               return -EINVAL;
> >
> > @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> >   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> >       &iopolicy, 0644);
> >   MODULE_PARM_DESC(iopolicy,
> > -     "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> > +     "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
> >
> >   void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
> >   {
> > @@ -329,6 +332,41 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
> >       return found;
> >   }
> >
> > +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head,
> > +             int node)
> > +{
> > +     struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> > +     unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> > +     unsigned int depth;
> > +
> > +     list_for_each_entry_rcu(ns, &head->list, siblings) {
> > +             if (nvme_path_is_disabled(ns))
> > +                     continue;
> > +
> > +             depth = blk_mq_queue_nr_active(ns->queue);
>
> That would require an 'atomic_read' on each path for each I/O,
> which absolutely sucks for performance.
>
> Considering that we really are only interested in a rough idea of
> path usage, wouldn't it be better to use READ_ONCE()/WRITE_ONCE()
> to accumulate the number of requests per queue?
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare at suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
> Myers, Andrew McDonald, Martje Boudien Moerman
>
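
(For anyone following along: the approximate counter Hannes is
describing would look roughly like the sketch below.  "qcnt" and the
helpers are made-up names for illustration; this is not what the
patch does.)

struct path_stats {
	unsigned int qcnt;	/* approximate in-flight count */
};

static inline void path_qcnt_inc(struct path_stats *ps)
{
	/*
	 * Plain load + store: the read-modify-write is racy and
	 * lost updates are tolerated, which is acceptable when we
	 * only want a rough idea of path usage.
	 */
	WRITE_ONCE(ps->qcnt, READ_ONCE(ps->qcnt) + 1);
}

static inline void path_qcnt_dec(struct path_stats *ps)
{
	WRITE_ONCE(ps->qcnt, READ_ONCE(ps->qcnt) - 1);
}

static inline unsigned int path_qcnt(struct path_stats *ps)
{
	return READ_ONCE(ps->qcnt);	/* plain load, no LOCK prefix */
}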



