[PATCH 4/7] nvme: implement multipath access to nvme subsystems
Mike Snitzer
snitzer at redhat.com
Thu Nov 9 13:22:17 PST 2017
On Thu, Nov 09 2017 at 12:44pm -0500,
Christoph Hellwig <hch at lst.de> wrote:
> This patch adds native multipath support to the nvme driver. For each
> namespace we create only a single block device node, which can be used
> to access that namespace through any of the controllers that refer to it.
> The gendisk for each controller's path to the namespace still exists
> inside the kernel, but is hidden from userspace. The character device
> nodes are still available on a per-controller basis. A new link from
> the sysfs directory for the subsystem allows finding all controllers
> for a given subsystem.
>
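For context: the per-path gendisks are hidden from userspace via the
new GENHD_FL_HIDDEN flag introduced earlier in this series. A rough
sketch of the idea (not the literal patch code):

    /* the per-path disk can still be used for I/O by the driver,
     * but gets no /dev node and no visible /sys/block entry */
    ns->disk->flags |= GENHD_FL_HIDDEN;
    device_add_disk(ctrl->device, ns->disk);
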
> Currently we will always send I/O to the first available path; this will
> be changed once the NVMe Asymmetric Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace. Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future. There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
>
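The "first available path" policy above is essentially the patch's
path lookup, which (slightly simplified) does:

    static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head)
    {
            struct nvme_ns *ns;

            /* pick the first path whose controller is live */
            list_for_each_entry_rcu(ns, &head->list, siblings) {
                    if (ns->ctrl->state == NVME_CTRL_LIVE) {
                            rcu_assign_pointer(head->current_path, ns);
                            return ns;
                    }
            }
            return NULL;
    }
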
> The multipath device will go away once all paths to it disappear;
> any delay to keep it alive needs to be implemented at the controller
> level.
>
> Signed-off-by: Christoph Hellwig <hch at lst.de>
Your 0th header (the 0/7 cover letter) speaks to the NVMe multipath IO
path leveraging NVMe's lack of partial completion, but I think it'd be
useful to have this patch header (which actually gets committed) speak
to it as well.
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> new file mode 100644
> index 000000000000..062754ebebfd
> --- /dev/null
> +++ b/drivers/nvme/host/multipath.c
...
> +void nvme_failover_req(struct request *req)
> +{
> + struct nvme_ns *ns = req->q->queuedata;
> + unsigned long flags;
> +
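> + /* transfer every bio to the head's requeue list; NVMe never
> +  * partially completes a request, so all of them can simply be
> +  * resubmitted on another path */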
> + spin_lock_irqsave(&ns->head->requeue_lock, flags);
> + blk_steal_bios(&ns->head->requeue_list, req);
> + spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> + blk_mq_end_request(req, 0);
> +
> + nvme_reset_ctrl(ns->ctrl);
> + kblockd_schedule_work(&ns->head->requeue_work);
> +}
Also, the block core patch to introduce blk_steal_bios() already went
in, but should there be a QUEUE_FLAG that gets set by drivers like NVMe
that don't support partial completion?  (Stealing bios wholesale for
retry is only safe because no bio in the request can have partially
completed.)

This would make it easier for other future drivers to know whether they
can use a more optimized IO path.
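Something along these lines is what I have in mind; the flag name and
helper below are hypothetical, purely to illustrate:

    /* include/linux/blkdev.h -- hypothetical flag, bit number illustrative */
    #define QUEUE_FLAG_NO_PARTIAL  29  /* driver never partially completes */
    #define blk_queue_no_partial(q) \
            test_bit(QUEUE_FLAG_NO_PARTIAL, &(q)->queue_flags)

    /* a driver like NVMe would set it when creating the queue: */
    queue_flag_set_unlocked(QUEUE_FLAG_NO_PARTIAL, ns->queue);

    /* consumers could then key off of it: */
    if (blk_queue_no_partial(q))
            /* safe to retry by stealing all bios wholesale */
            optimized_failover(q);  /* hypothetical consumer */
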
Mike