[RFC PATCHv2 2/4] nvme-multipath: add support for adaptive I/O policy
Nilay Shroff
nilay at linux.ibm.com
Tue Oct 14 01:57:59 PDT 2025
>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>> index c212fa952c0f..69d2f72d0e86 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -711,7 +711,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>> flags |= NVME_IOCTL_PARTITION;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>> if (!ns)
>> goto out_unlock;
>> @@ -742,7 +742,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>> int srcu_idx, ret = -EWOULDBLOCK;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>> if (!ns)
>> goto out_unlock;
>>
> Are we sure that we should account for non-IO commands here, too?
> Thing is, the I/O policy should be just that, directing I/O to the
> various paths.
> But what about commands on the admin queue? Should they be influenced by
> the same policy? And can we even define what a 'read' command is?
> Wouldn't it be better to pass in a third option here (NONE?) to make it
> clear that this is a non-I/O command?
>
Yeah, it makes sense to differentiate read/write from other commands here.
I think we can tell them apart; I'll implement this in the next patchset.
>> @@ -762,7 +762,8 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>> struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>> struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>> int srcu_idx = srcu_read_lock(&head->srcu);
>> - struct nvme_ns *ns = nvme_find_path(head);
>> + struct nvme_ns *ns = nvme_find_path(head,
>> + ioucmd->file->f_mode & FMODE_WRITE ? WRITE : READ);
>> int ret = -EINVAL;
>> if (ns)
>
> See above. I really think that we should be able to pass in a third
> option for admin commands.
Yep, as mentioned, I'll do this in the next patchset.
>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 3da980dc60d9..9ecdaca5e9a0 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -6,6 +6,8 @@
>> #include <linux/backing-dev.h>
>> #include <linux/moduleparam.h>
>> #include <linux/vmalloc.h>
>> +#include <linux/blk-mq.h>
>> +#include <linux/math64.h>
>> #include <trace/events/block.h>
>> #include "nvme.h"
>>
> Ouch.
> #include <math64.h>
> always spells disaster :-)
This is included for div_u64(). We could replace it with the plain division
operator (dividend / divisor), but a bare 64-bit division doesn't work on
32-bit architectures, which is exactly the case div_u64() exists to handle.
>> +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>> +{
>> + int cpu, srcu_idx;
>> + u32 weight;
>> + struct nvme_ns *ns;
>> + struct nvme_path_stat *stat;
>> + struct nvme_path_work *work = container_of(weight_work,
>> + struct nvme_path_work, weight_work);
>> + struct nvme_ns_head *head = work->ns->head;
>> + int rw = work->rw;
>> + u64 total_score = 0;
>> +
>> + cpu = get_cpu();
>> +
>> + srcu_idx = srcu_read_lock(&head->srcu);
>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>> + srcu_read_lock_held(&head->srcu)) {
>> +
>> + stat = &this_cpu_ptr(ns->info)[rw].stat;
>> + if (!READ_ONCE(stat->slat_ns))
>> + continue;
>> + /*
>> + * Compute the path score as the inverse of smoothed
>> + * latency, scaled by NSEC_PER_SEC. Floating point
>> + * math is unavailable in the kernel, so fixed-point
>> + * scaling is used instead. NSEC_PER_SEC is chosen
>> + * because valid latencies are always < 1 second; longer
>> + * latencies are ignored.
>> + */
>> + stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>> +
>> + /* Compute total score. */
>> + total_score += stat->score;
>> + }
>> +
>> + if (!total_score)
>> + goto out;
>> +
>> + /*
>> + * After computing the total score, we derive the per-path weight
>> + * (normalized to the range 0–100). The weight represents the
>> + * relative share of I/O the path should receive.
>
> Why do we use the range 0-100?
> While this is nice for human consumption, it doesn't lend itself to a
> nice computation. Maybe we should select a different scale to allow us
> for more efficient computation (0-128? Maybe?)
Yes, we can normalize this to any range. Let's scale it to 0-128 as you
suggested; a power of two also lets the multiply become a shift.
>> +
>> +#define NVME_EWMA_SHIFT 3
>
> And we use this value why?
The reason is that the proposed code assigns ~87.5% weight to the existing
(smoothed) latency and ~12.5% to the new sample. If you look at
ewma_update() you'll see how the shift produces that split. I'll add a
comment above the macro to make the intent clear.
>
>> +static inline u64 ewma_update(u64 old, u64 new)
>> +{
>> + return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new) >> NVME_EWMA_SHIFT;
>> +}
>> +
>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>> +{
>> + int cpu;
>> + unsigned int rw;
>> + struct nvme_path_info *info;
>> + struct nvme_path_stat *stat;
>> + u64 now, latency, slat_ns, avg_lat_ns;
>> + struct nvme_ns_head *head = ns->head;
>> +
>> + if (list_is_singular(&head->list))
>> + return;
>> +
>> + now = ktime_get_ns();
>> + latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>> + if (!latency)
>> + return;
>> +
>> + /*
>> + * As the completion code path is serialized (i.e. the same completion
>> + * queue's update code cannot run simultaneously on multiple CPUs), we
>> + * can safely access the per-cpu nvme path stat here from another CPU
>> + * (in case the completion CPU differs from the submission CPU).
>> + * The only field which could be accessed simultaneously here is the
>> + * path ->weight which may be accessed by this function as well as I/O
>> + * submission path during path selection logic and we protect ->weight
>> + * using READ_ONCE/WRITE_ONCE. Yes this may not be 100% accurate but
>> + * we also don't need to be so accurate here as the path credit would
>> + * be anyways refilled, based on path weight, once path consumes all
>> + * its credits. And we limit path weight/credit max up to 100. Please
>> + * also refer nvme_adaptive_path().
>> + */
>> + cpu = blk_mq_rq_cpu(rq);
>> + rw = rq_data_dir(rq);
>> + info = &per_cpu_ptr(ns->info, cpu)[rw];
>> + stat = &info->stat;
>> +
> Hmm. While the 'SAME_CONP' attribute should help here, I remain
> sceptical.
> Wouldn't it be possible to look at the hwq map (eg by using something
> like blk_mq_map_queue_type()) to figure out the hw context, and then
> use the first cpu from that mask?
> That way we are guaranteed to always using the same per-cpu ptr.
> Might be overkill, though; but it would be good to have some
> checks in here in case we do run on the wrong CPU.
>
Okay, I think you're proposing that the submitting CPU should match the
CPU on which the completion I/O sample is processed. That makes sense,
but here we use blk_mq_rq_cpu(), which yields the submitting CPU for the
request, and then use that CPU number to get the per-cpu stat via
per_cpu_ptr(). Since we don't rely on the CPU currently processing the
completion, I think we're all good. Agreed?
>> + /*
>> + * If latency > ~1s then ignore this sample to prevent EWMA from being
>> + * skewed by pathological outliers (multi-second waits, controller
>> + * timeouts etc.). This keeps path scores representative of normal
>> + * performance and avoids instability from rare spikes. If such high
>> + * latency is real, ANA state reporting or keep-alive error counters
>> + * will mark the path unhealthy and remove it from the head node list,
>> + * so we safely skip such sample here.
>> + */
>> + if (unlikely(latency > NSEC_PER_SEC)) {
>> + stat->nr_ignored++;
>> + return;
>> + }
>
> I would even go so far as to indicate that EWMA is unusable here.
> Can we issue a warning / debug message here to indicate that the path
> selector is running suboptimal?
>
We can certainly do that, but as you can see above we increment
->nr_ignored here, and the user can then read this value through debugfs.
Is that sufficient, or do you want an explicit WARN or debug message
printed as well?
>> +
>> + /*
>> + * Accumulate latency samples and increment the batch count for each
>> + * ~15 second interval. When the interval expires, compute the simple
>> + * average latency over that window, then update the smoothed (EWMA)
>> + * latency. The path weight is recalculated based on this smoothed
>> + * latency.
>> + */
>
> Why 15 seconds? Can we make this changeable eg via debugfs?
Yes, we can certainly make this configurable. BTW, should it be done
through a new sysfs attribute instead of debugfs?
>
>> + stat->batch += latency;
>> + stat->batch_count++;
>> + stat->nr_samples++;
>> +
>> + if (now > stat->last_weight_ts &&
>> + (now - stat->last_weight_ts) >= 15 * NSEC_PER_SEC) {
>> +
>
> At the very least make this a #define. But ideally one should be able
> to modify that.
>
Yes, I'll define a macro for the 15-second interval and also make it
configurable. I'll implement this in the next patchset.
>> + stat->last_weight_ts = now;
>> +
>> + /*
>> + * Find simple average latency for the last epoch (~15 sec
>> + * interval).
>> + */
>> + avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>> +
>> + /*
>> + * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>> + * latency. EWMA is preferred over simple average latency
>> + * because it smooths naturally, reduces jitter from sudden
>> + * spikes, and adapts faster to changing conditions. It also
>> + * avoids storing historical samples, and works well for both
>> + * slow and fast I/O rates.
>> + * Formula:
>> + * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>> + * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>> + * existing latency and 1/8 (~12.5%) weight to the new latency.
>> + */
>
> Similar comment to the WEIGHT. While 8 might be a nice choice, there is nothing indicating it's always the right choice.
> Please make it a #define and see if we cannot modify it via debugfs.
>
Yeah, sure. As discussed earlier, this value is derived from
NVME_EWMA_SHIFT, and I'll make it configurable in the next patchset.
>> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>> +{
>> + int rw, cpu;
>> + struct nvme_path_work *work;
>> + gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>> +
>> + if (!ns->head->disk)
>> + return 0;
>> +
>> + ns->info = __alloc_percpu_gfp(2 * sizeof(struct nvme_path_info),
>> + __alignof__(struct nvme_path_info), gfp);
>> + if (!ns->info)
>> + return -ENOMEM;
>> +
>> + for_each_possible_cpu(cpu) {
>> + for (rw = 0; rw < 2; rw++) {
>> + work = &per_cpu_ptr(ns->info, cpu)[rw].work;
>> + work->ns = ns;
>> + work->rw = rw;
>> + INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>
> The more I think about it the more I like the idea of having a third
> direction (and not just READ and WRITE).
> The latency of I/O commands _will_ be different from the latency of
> admin commands, so the adaptive policy might come to different decisions
> for admin commands.
Yes, as I said above, I'll address this in the next patchset.
>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>> + unsigned int rw)
>> +{
>> + struct nvme_ns *ns, *found = NULL;
>> + struct nvme_path_stat *stat;
>> + u32 weight;
>> + int refill = 0;
>> +
>> + get_cpu();
>> +retry:
>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>> + srcu_read_lock_held(&head->srcu)) {
>> +
>> + if (nvme_path_is_disabled(ns) ||
>> + !nvme_state_is_live(ns->ana_state))
>> + continue;
>> +
>> + stat = &this_cpu_ptr(ns->info)[rw].stat;
>> +
>> + /*
>> + * When the head path-list is singular we skip calculating the
>> + * only path's weight as an optimization, since there is no need
>> + * to spread I/O across more than one path. The other possibility
>> + * is that the path is newly added and its weight is not yet
>> + * known. In both cases we go round-robin over each such path and
>> + * forward I/O to it. Once responses for those I/Os start coming
>> + * back, the path weight calculation kicks in and we start using
>> + * path credits for forwarding I/O.
>> + */
>> + weight = READ_ONCE(stat->weight);
>> + if (unlikely(!weight)) {
>> + found = ns;
>> + goto out;
>> + }
>> +
>> + /*
>> + * To keep path selection logic simple, we don't distinguish
>> + * between ANA optimized and non-optimized states. The non-
>> + * optimized path is expected to have a lower weight, and
>> + * therefore fewer credits. As a result, only a small number of
>> + * I/Os will be forwarded to paths in the non-optimized state.
>> + */
>> + if (stat->credit > 0) {
>> + --stat->credit;
>> + found = ns;
>> + goto out;
>> + }
>> + }
>> +
>> + if (!found && !list_empty(&head->list)) {
>> + /*
>> + * Refill credits and retry.
>> + */
>> + list_for_each_entry_srcu(ns, &head->list, siblings,
>> + srcu_read_lock_held(&head->srcu)) {
>> + if (nvme_path_is_disabled(ns) ||
>> + !nvme_state_is_live(ns->ana_state))
>> + continue;
>> +
>> + stat = &this_cpu_ptr(ns->info)[rw].stat;
>> + weight = READ_ONCE(stat->weight);
>> + stat->credit = weight;
>> + refill = 1;
>> + }
>> + if (refill)
>> + goto retry;
>> + }
>
> Hmm. Loop within a loop. Not pretty.
> Can't we make do with just one loop?
> After all, if we end up here the credits for this path are used up, and need to be refilled.
> What would happen if we were to _just_ refill and retry?
> IE drop the second iteration and just execute the branch
> under the loop?
>
Yes, good point. I think we should be able to address this with one loop;
I'll fix it in the next patchset.
>> @@ -581,7 +951,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>> int srcu_idx, ret = -EWOULDBLOCK;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, READ);
>> if (ns)
>> ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>> srcu_read_unlock(&head->srcu, srcu_idx);
>
> See the discussion about the third option and not just 'READ'/'WRITE'.
Yeah, noted!
>> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
>> index 29430949ce2f..4f9607e9698a 100644
>> --- a/drivers/nvme/host/sysfs.c
>> +++ b/drivers/nvme/host/sysfs.c
>> @@ -194,7 +194,7 @@ static int ns_head_update_nuse(struct nvme_ns_head *head)
>> return 0;
>> srcu_idx = srcu_read_lock(&head->srcu);
>> - ns = nvme_find_path(head);
>> + ns = nvme_find_path(head, READ);
>> if (!ns)
>> goto out_unlock;
>>
> Other than that it really looks good.
Thank you:)
--Nilay