[RFC PATCHv2 2/4] nvme-multipath: add support for adaptive I/O policy

Nilay Shroff nilay at linux.ibm.com
Tue Oct 14 01:57:59 PDT 2025


>> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
>> index c212fa952c0f..69d2f72d0e86 100644
>> --- a/drivers/nvme/host/ioctl.c
>> +++ b/drivers/nvme/host/ioctl.c
>> @@ -711,7 +711,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode,
>>           flags |= NVME_IOCTL_PARTITION;
>>         srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>>       if (!ns)
>>           goto out_unlock;
>>   @@ -742,7 +742,7 @@ long nvme_ns_head_chr_ioctl(struct file *file, unsigned int cmd,
>>       int srcu_idx, ret = -EWOULDBLOCK;
>>         srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, open_for_write ? WRITE : READ);
>>       if (!ns)
>>           goto out_unlock;
>>   
> Are we sure that we should account for non-IO commands here, too?
> Thing is, the I/O policy should be just that, directing I/O to the
> various paths.
> But what about commands on the admin queue? Should they be influenced by
> the same policy? And can we even define what a 'read' command is?
> Wouldn't it be better to pass in a third option here (NONE?) to make it
> clear that this is a non-I/O command?
> 
Yeah, makes sense to differentiate read/write commands from the rest here.
We should be able to tell them apart; I'll implement this in the next
patchset.

>> @@ -762,7 +762,8 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>       struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>       struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head, cdev);
>>       int srcu_idx = srcu_read_lock(&head->srcu);
>> -    struct nvme_ns *ns = nvme_find_path(head);
>> +    struct nvme_ns *ns = nvme_find_path(head,
>> +            ioucmd->file->f_mode & FMODE_WRITE ? WRITE : READ);
>>       int ret = -EINVAL;
>>         if (ns)
> 
> See above. I really think that we should be able to pass in a third
> option for admin commands.
Yep, as mentioned, I'll do this in the next patchset.

> 
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 3da980dc60d9..9ecdaca5e9a0 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -6,6 +6,8 @@
>>   #include <linux/backing-dev.h>
>>   #include <linux/moduleparam.h>
>>   #include <linux/vmalloc.h>
>> +#include <linux/blk-mq.h>
>> +#include <linux/math64.h>
>>   #include <trace/events/block.h>
>>   #include "nvme.h"
>>  
> Ouch.
> #include <math64.h>
> always spells disaster :-)

This is included for div_u64(). We could replace it with the plain division
operator (dividend / divisor), but 64-bit division with '/' doesn't work
well on 32-bit architectures, which is exactly what div_u64() is for.
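For reference, the score computation could be sketched in userspace like this (div_u64_sketch() is a stand-in for the kernel helper, and the function names here are hypothetical):

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Userspace stand-in for the kernel's div_u64(). In kernel code, a
 * plain '/' on a u64 dividend pulls in libgcc's 64-bit division on
 * 32-bit architectures, which is why div_u64() exists.
 */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}

/*
 * Path score as described in the patch: the inverse of the smoothed
 * latency, scaled by NSEC_PER_SEC (fixed-point, since the kernel has
 * no floating point). Latencies of zero or above ~1s are ignored.
 */
static uint64_t path_score(uint64_t slat_ns)
{
    if (!slat_ns || slat_ns > NSEC_PER_SEC)
        return 0;
    return div_u64_sketch(NSEC_PER_SEC, (uint32_t)slat_ns);
}
```

A 1 ms smoothed latency thus maps to a score of 1000, and a 2 ms latency to 500, so halving the latency doubles the score.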

>>   +static void nvme_mpath_weight_work(struct work_struct *weight_work)
>> +{
>> +    int cpu, srcu_idx;
>> +    u32 weight;
>> +    struct nvme_ns *ns;
>> +    struct nvme_path_stat *stat;
>> +    struct nvme_path_work *work = container_of(weight_work,
>> +            struct nvme_path_work, weight_work);
>> +    struct nvme_ns_head *head = work->ns->head;
>> +    int rw = work->rw;
>> +    u64 total_score = 0;
>> +
>> +    cpu = get_cpu();
>> +
>> +    srcu_idx = srcu_read_lock(&head->srcu);
>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>> +            srcu_read_lock_held(&head->srcu)) {
>> +
>> +        stat = &this_cpu_ptr(ns->info)[rw].stat;
>> +        if (!READ_ONCE(stat->slat_ns))
>> +            continue;
>> +        /*
>> +         * Compute the path score as the inverse of smoothed
>> +         * latency, scaled by NSEC_PER_SEC. Floating point
>> +         * math is unavailable in the kernel, so fixed-point
>> +         * scaling is used instead. NSEC_PER_SEC is chosen
>> +         * because valid latencies are always < 1 second; longer
>> +         * latencies are ignored.
>> +         */
>> +        stat->score = div_u64(NSEC_PER_SEC, READ_ONCE(stat->slat_ns));
>> +
>> +        /* Compute total score. */
>> +        total_score += stat->score;
>> +    }
>> +
>> +    if (!total_score)
>> +        goto out;
>> +
>> +    /*
>> +     * After computing the total score, we derive the per-path weight
>> +     * (normalized to the range 0–100). The weight represents the
>> +     * relative share of I/O the path should receive.
> 
> Why do we use the range 0-100?
> While this is nice for human consumption, it doesn't lend itself to a
> nice computation. Maybe we should select a different scale to allow us
> for more efficient computation (0-128? Maybe?)
Yes, we can normalize this to any range. Let's scale it to 0-128 as you
suggested; being a power of two, it also makes the computation cheaper.
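With a power-of-two scale the normalization reduces to a shift plus one division; a userspace sketch (the macro and parameter names are my assumptions, not the patch's):

```c
#include <stdint.h>

#define NVME_WEIGHT_SHIFT 7   /* weights scaled to 0..128 (1 << 7) */

/*
 * Normalize a path's score to a 0..128 weight: the path's relative
 * share of I/O. Because the scale is a power of two, the scaling is
 * a shift; only the division by the total remains.
 */
static uint32_t path_weight(uint64_t score, uint64_t total_score)
{
    if (!total_score)
        return 0;
    return (uint32_t)((score << NVME_WEIGHT_SHIFT) / total_score);
}
```

A path carrying half the total score gets weight 64, i.e. half the credits.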

>> +
>> +#define NVME_EWMA_SHIFT    3
> 
> And we use this value why?
Oh yeah, the reason is that the proposed code assigns ~87.5% weight to the
existing (smoothed) latency and ~12.5% to the new latency; ewma_update()
makes this apparent. I'll add a comment above this macro to make the
intent clear.
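A quick userspace check of the same formula confirms the 7/8 vs 1/8 split:

```c
#include <stdint.h>

#define NVME_EWMA_SHIFT 3   /* old value keeps 7/8, new sample gets 1/8 */

/* Same formula as the patch's ewma_update(): (old * 7 + new) / 8. */
static uint64_t ewma_update(uint64_t old, uint64_t new_sample)
{
    return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new_sample)
            >> NVME_EWMA_SHIFT;
}
```

Feeding a constant stream of 0 ns samples into an 800 ns EWMA decays it by 1/8 per step: 800 → 700 → 612 → … so one outlier moves the average only a little.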

> 
>> +static inline u64 ewma_update(u64 old, u64 new)
>> +{
>> +    return (old * ((1 << NVME_EWMA_SHIFT) - 1) + new) >> NVME_EWMA_SHIFT;
>> +}
>> +
>> +static void nvme_mpath_add_sample(struct request *rq, struct nvme_ns *ns)
>> +{
>> +    int cpu;
>> +    unsigned int rw;
>> +    struct nvme_path_info *info;
>> +    struct nvme_path_stat *stat;
>> +    u64 now, latency, slat_ns, avg_lat_ns;
>> +    struct nvme_ns_head *head = ns->head;
>> +
>> +    if (list_is_singular(&head->list))
>> +        return;
>> +
>> +    now = ktime_get_ns();
>> +    latency = now >= rq->io_start_time_ns ? now - rq->io_start_time_ns : 0;
>> +    if (!latency)
>> +        return;
>> +
>> +    /*
>> +     * As the completion code path is serialized (i.e. the same
>> +     * completion queue update code can't run simultaneously on
>> +     * multiple cpus), we can safely access the per-cpu nvme path
>> +     * stat here from another cpu (in case the completion cpu differs
>> +     * from the submission cpu).
>> +     * The only field which could be accessed concurrently here is
>> +     * the path ->weight, which may be accessed by this function as
>> +     * well as by the I/O submission path during path selection; we
>> +     * protect ->weight using READ_ONCE/WRITE_ONCE. This may not be
>> +     * 100% accurate, but we don't need full accuracy here as the
>> +     * path credit will be refilled anyway, based on path weight,
>> +     * once the path consumes all its credits. And we limit path
>> +     * weight/credit to a max of 100. Please also refer to
>> +     * nvme_adaptive_path().
>> +     */
>> +    cpu = blk_mq_rq_cpu(rq);
>> +    rw = rq_data_dir(rq);
>> +    info = &per_cpu_ptr(ns->info, cpu)[rw];
>> +    stat = &info->stat;
>> +
> Hmm. While the 'SAME_CONP' attribute should help here, I remain
> sceptical.
> Wouldn't it be possible to look at the hwq map (eg by using something
> like blk_mq_map_queue_type()) to figure out the hw context, and then
> use the first cpu from that mask?
> That way we are guaranteed to always using the same per-cpu ptr.
> Might be overkill, though; but it would be good to have some
> checks in here in case we do run on the wrong CPU.
> 
Okay, so I think you're proposing that the CPU used for the per-cpu stat
should match the CPU that submitted the I/O. That is what happens here:
blk_mq_rq_cpu() yields the submitting CPU for the request, and we use that
CPU number with per_cpu_ptr() to fetch the per-cpu stat. Since we never
rely on the CPU currently processing the completion, I think we're all
good. Agreed?

>> +    /*
>> +     * If latency > ~1s then ignore this sample to prevent EWMA from being
>> +     * skewed by pathological outliers (multi-second waits, controller
>> +     * timeouts etc.). This keeps path scores representative of normal
>> +     * performance and avoids instability from rare spikes. If such high
>> +     * latency is real, ANA state reporting or keep-alive error counters
>> +     * will mark the path unhealthy and remove it from the head node list,
>> +     * so we safely skip such sample here.
>> +     */
>> +    if (unlikely(latency > NSEC_PER_SEC)) {
>> +        stat->nr_ignored++;
>> +        return;
>> +    }
> 
> I would even go so far as to indicate that EWMA is unusable here.
> Can we issue a warning / debug message here to indicate that the path
> selector is running suboptimal?
> 
We can certainly do that, but as you can see above we already increment
->nr_ignored here, and the user can read that counter through debugfs.
Is that sufficient, or would you also like an explicit WARN or debug
message to be printed?

>> +
>> +    /*
>> +     * Accumulate latency samples and increment the batch count for each
>> +     * ~15 second interval. When the interval expires, compute the simple
>> +     * average latency over that window, then update the smoothed (EWMA)
>> +     * latency. The path weight is recalculated based on this smoothed
>> +     * latency.
>> +     */
> 
> Why 15 seconds? Can we make this changeable eg via debugfs?
Yes, we can certainly make this configurable. By the way, should it be a
new sysfs attribute instead of debugfs?

> 
>> +    stat->batch += latency;
>> +    stat->batch_count++;
>> +    stat->nr_samples++;
>> +
>> +    if (now > stat->last_weight_ts &&
>> +            (now - stat->last_weight_ts) >= 15 * NSEC_PER_SEC) {
>> +
> 
> At the very least make this a #define. But ideally one should be able
> to modify that.
> 
Yes, I'll define a macro for the 15-second interval, and we'll also make
it configurable. I'll implement this in the next patchset.

>> +        stat->last_weight_ts = now;
>> +
>> +        /*
>> +         * Find simple average latency for the last epoch (~15 sec
>> +         * interval).
>> +         */
>> +        avg_lat_ns = div_u64(stat->batch, stat->batch_count);
>> +
>> +        /*
>> +         * Calculate smooth/EWMA (Exponentially Weighted Moving Average)
>> +         * latency. EWMA is preferred over simple average latency
>> +         * because it smooths naturally, reduces jitter from sudden
>> +         * spikes, and adapts faster to changing conditions. It also
>> +         * avoids storing historical samples, and works well for both
>> +         * slow and fast I/O rates.
>> +         * Formula:
>> +         * slat_ns = (prev_slat_ns * (WEIGHT - 1) + (latency)) / WEIGHT
>> +         * With WEIGHT = 8, this assigns 7/8 (~87.5 %) weight to the
>> +         * existing latency and 1/8 (~12.5%) weight to the new latency.
>> +         */
> 
> Similar comment to the WEIGHT. While 8 might be a nice choice, there is nothing indicating it's always the right choice.
> Please make it a #define and see if we cannot modify it via debugfs.
> 
Yeah sure; as discussed earlier this value is derived from NVME_EWMA_SHIFT,
and I'll make it configurable in the next patchset.

>> +int nvme_alloc_ns_stat(struct nvme_ns *ns)
>> +{
>> +    int rw, cpu;
>> +    struct nvme_path_work *work;
>> +    gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
>> +
>> +    if (!ns->head->disk)
>> +        return 0;
>> +
>> +    ns->info = __alloc_percpu_gfp(2 * sizeof(struct nvme_path_info),
>> +            __alignof__(struct nvme_path_info), gfp);
>> +    if (!ns->info)
>> +        return -ENOMEM;
>> +
>> +    for_each_possible_cpu(cpu) {
>> +        for (rw = 0; rw < 2; rw++) {
>> +            work = &per_cpu_ptr(ns->info, cpu)[rw].work;
>> +            work->ns = ns;
>> +            work->rw = rw;
>> +            INIT_WORK(&work->weight_work, nvme_mpath_weight_work);
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
> 
> The more I think about it the more I like the idea of having a third
> direction (and not just READ and WRITE).
> The latency of I/O commands _will_ be different from the latency of
> admin commands, so the adaptive policy might come to different decisions
> for admin commands.
Yes, as I said above, I'll address this in the next patchset.

>> +static struct nvme_ns *nvme_adaptive_path(struct nvme_ns_head *head,
>> +        unsigned int rw)
>> +{
>> +    struct nvme_ns *ns, *found = NULL;
>> +    struct nvme_path_stat *stat;
>> +    u32 weight;
>> +    int refill = 0;
>> +
>> +    get_cpu();
>> +retry:
>> +    list_for_each_entry_srcu(ns, &head->list, siblings,
>> +            srcu_read_lock_held(&head->srcu)) {
>> +
>> +        if (nvme_path_is_disabled(ns) ||
>> +                !nvme_state_is_live(ns->ana_state))
>> +            continue;
>> +
>> +        stat = &this_cpu_ptr(ns->info)[rw].stat;
>> +
>> +        /*
>> +         * When the head path-list is singular we don't calculate the
>> +         * only path weight for optimization as we don't need to forward
>> +         * I/O to more than one path. Another possibility is when the
>> +         * path is newly added and we don't know its weight yet. So we
>> +         * go round-robin over each such path and forward I/O to it.
>> +         * Once we start getting responses for those I/Os, the path
>> +         * weight calculation kicks in and we start using path credits
>> +         * for forwarding I/O.
>> +         */
>> +        weight = READ_ONCE(stat->weight);
>> +        if (unlikely(!weight)) {
>> +            found = ns;
>> +            goto out;
>> +        }
>> +
>> +        /*
>> +         * To keep path selection logic simple, we don't distinguish
>> +         * between ANA optimized and non-optimized states. The non-
>> +         * optimized path is expected to have a lower weight, and
>> +         * therefore fewer credits. As a result, only a small number of
>> +         * I/Os will be forwarded to paths in the non-optimized state.
>> +         */
>> +        if (stat->credit > 0) {
>> +            --stat->credit;
>> +            found = ns;
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    if (!found && !list_empty(&head->list)) {
>> +        /*
>> +         * Refill credits and retry.
>> +         */
>> +        list_for_each_entry_srcu(ns, &head->list, siblings,
>> +                srcu_read_lock_held(&head->srcu)) {
>> +            if (nvme_path_is_disabled(ns) ||
>> +                    !nvme_state_is_live(ns->ana_state))
>> +                continue;
>> +
>> +            stat = &this_cpu_ptr(ns->info)[rw].stat;
>> +            weight = READ_ONCE(stat->weight);
>> +            stat->credit = weight;
>> +            refill = 1;
>> +        }
>> +        if (refill)
>> +            goto retry;
>> +    }
> 
> Hmm. Loop within a loop. Not pretty.
> Can't we make do with just one loop?
> After all, if we end up here the credits for this path are used up, and need to be refilled.
> What would happen if we were _just_ refill and retry?
> IE drop the second iteration and just execute the branch
> under the loop?
> 
Yes, good point. I think we should be able to address this with one loop;
I'll fix it in the next patchset.
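One possible shape for that, sketched in userspace over a plain array (the real code walks an SRCU-protected list; all names here are placeholders): the scan body stays single, and an exhausted credit pool triggers at most one refill-and-retry.

```c
#include <stddef.h>
#include <stdint.h>

struct path_stat {
    uint32_t weight;   /* relative share, recomputed periodically */
    uint32_t credit;   /* remaining picks before a refill */
};

/*
 * Weighted path selection with lazy refill: consume a credit if any
 * path still has one; otherwise refill every path's credit from its
 * weight and rescan once. A path with no weight yet (newly added) is
 * picked immediately, matching the round-robin bootstrap in the patch.
 */
static int pick_path(struct path_stat *paths, size_t n)
{
    int refilled = 0;
    size_t i;

retry:
    for (i = 0; i < n; i++) {
        if (!paths[i].weight)
            return (int)i;             /* unknown weight: pick it */
        if (paths[i].credit > 0) {
            paths[i].credit--;
            return (int)i;
        }
    }
    if (!refilled) {
        for (i = 0; i < n; i++)        /* all credits spent: refill */
            paths[i].credit = paths[i].weight;
        refilled = 1;
        goto retry;
    }
    return -1;                          /* no usable path */
}
```

With weights {3, 1}, successive picks come out 0, 0, 0, 1, 0, …, so the heavier path receives three quarters of the I/O.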

>> @@ -581,7 +951,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
>>       int srcu_idx, ret = -EWOULDBLOCK;
>>         srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, READ);
>>       if (ns)
>>           ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
>>       srcu_read_unlock(&head->srcu, srcu_idx);
> 
> See the discussion about the third option and not just 'READ'/'WRITE'.
Yeah, noted!

>> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
>> index 29430949ce2f..4f9607e9698a 100644
>> --- a/drivers/nvme/host/sysfs.c
>> +++ b/drivers/nvme/host/sysfs.c
>> @@ -194,7 +194,7 @@ static int ns_head_update_nuse(struct nvme_ns_head *head)
>>           return 0;
>>         srcu_idx = srcu_read_lock(&head->srcu);
>> -    ns = nvme_find_path(head);
>> +    ns = nvme_find_path(head, READ);
>>       if (!ns)
>>           goto out_unlock;
>>  
> Other than that it really looks good.
Thank you:)

--Nilay



