[PATCH for-next 4/4] nvme-multipath: add multipathing for uring-passthrough commands

Kanchan Joshi joshi.k at samsung.com
Mon Jul 11 21:23:32 PDT 2022


On Mon, Jul 11, 2022 at 10:56:56PM +0300, Sagi Grimberg wrote:
>
>>>>@@ -448,6 +442,14 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>>>>     pdu->meta_buffer = nvme_to_user_ptr(d.metadata);
>>>>     pdu->meta_len = d.metadata_len;
>>>>+    if (issue_flags & IO_URING_F_MPATH) {
>>>>+        req->cmd_flags |= REQ_NVME_MPATH;
>>>>+        /*
>>>>+         * we would need the buffer address (d.addr field) if we have
>>>>+         * to retry the command. Store it by repurposing ioucmd->cmd
>>>>+         */
>>>>+        ioucmd->cmd = (void *)d.addr;
>>>
>>>What does repurposing mean here?
>>
>>This field (ioucmd->cmd) was pointing to passthrough command (which
>>is embedded in SQE of io_uring). At this point we have consumed
>>passthrough command, so this field can be reused if we have to. And we
>>have to because failover needs recreating passthrough command.
>>Please see nvme_uring_cmd_io_retry to see how this helps in recreating
>>the fields of passthrough command. And more on this below.
>
>so ioucmd->cmd starts as an nvme_uring_cmd pointer and continues as
>an address of buffer for a later processing that may or may not
>happen in its lifetime?

Yes. See this as a no-cost way to handle the fast/common path: we manage
with just this one assignment.
Otherwise we would have had to do high-cost work (allocate/copy/defer) up
front for later processing that may or may not happen.
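To spell the lifecycle out in one place (a recap of the relevant lines
from the patch, not new code):

    /* submission (fast path): the SQE has already been parsed into 'd' */
    if (issue_flags & IO_URING_F_MPATH) {
        req->cmd_flags |= REQ_NVME_MPATH;
        ioucmd->cmd = (void *)d.addr;   /* now carries the data buffer address */
    }

    /* retry (rare path), in nvme_uring_cmd_io_retry(): */
    d.addr = (__u64)ioucmd->cmd;        /* buffer address comes back from here */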

>Sounds a bit of a backwards design to me.
>
>>>>+    }
>>>>     blk_execute_rq_nowait(req, false);
>>>>     return -EIOCBQUEUED;
>>>> }
>>>>@@ -665,12 +667,135 @@ int nvme_ns_head_chr_uring_cmd(struct io_uring_cmd *ioucmd,
>>>>     int srcu_idx = srcu_read_lock(&head->srcu);
>>>>     struct nvme_ns *ns = nvme_find_path(head);
>>>>     int ret = -EINVAL;
>>>>+    struct device *dev = &head->cdev_device;
>>>>+
>>>>+    if (likely(ns)) {
>>>>+        ret = nvme_ns_uring_cmd(ns, ioucmd,
>>>>+                issue_flags | IO_URING_F_MPATH);
>>>>+    } else if (nvme_available_path(head)) {
>>>>+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>>>>+        struct nvme_uring_cmd *payload = NULL;
>>>>+
>>>>+        dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>>>>+        /*
>>>>+         * We may requeue two kinds of uring commands:
>>>>+         * 1. command failed post submission. pdu->req will be non-null
>>>>+         * for this
>>>>+         * 2. command that could not be submitted because path was not
>>>>+         * available. For this pdu->req is set to NULL.
>>>>+         */
>>>>+        pdu->req = NULL;
>>>
>>>Relying on a pointer does not sound like a good idea to me.
>>>But why do you care at all where did this command came from?
>>>This code should not concern itself what happened prior to this
>>>execution.
>>Required, please see nvme_uring_cmd_io_retry. And this should be more
>>clear as part of responses to your other questions.
>
>I think I understand. But it looks fragile to me.
>
>>
>>>>+        /*
>>>>+         * passthrough command will not be available during retry as it
>>>>+         * is embedded in io_uring's SQE. Hence we allocate/copy here
>>>>+         */
>>>
>>>OK, that is a nice solution.
>>Please note that the preferred way is to recreate the passthrough command,
>>and not to allocate it. We allocate it here because this happens very early
>>(i.e. before processing the passthrough command and setting it up inside
>>struct request). Recreating requires a populated 'struct request'.
>
>I think that once a driver consumes a command as queued, it needs a
>stable copy of the command (could be in a percpu pool).
>
>The current design of nvme multipathing is that the request is stripped
>from its bios and placed on a requeue_list, and if a request was not
>formed yet (i.e. nvme available path exist but no usable) the bio is
>placed on the requeue_list.
>
>So ideally we have something sufficient like a bio, that can be linked
>on a requeue list, and is sufficient to build a request, at any point in
>time...

We could be dealing with any passthrough command here. A bio is not
guaranteed to exist in the first place; it can very well be NULL.
As I mentioned in the cover letter, this was tested for such passthrough
commands too.
And bio is not a good fit for this interface. For the block path, the
entry function is nvme_ns_head_submit_bio(), which speaks bio; the
request is allocated afterwards. Since request formation can happen only
after discovering a valid path, it may have to queue the bio if no usable
path exists.
For passthrough, the interface speaks/understands struct io_uring_cmd.
The request is allocated afterwards, and a bio may (or may not) be
attached to that request. We may have to queue the command even before
the request/bio gets formed. So the cardinal structure, the one that is
really available all the time, is struct io_uring_cmd. Hence we use it as
the object around which requeuing/failover is built.
Conceptually, io_uring_cmd is the bio of this interface.
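For comparison, the two "no usable path" cases end up looking alike; the
bio half below is the existing nvme_ns_head_submit_bio() requeue, the
ioucmd half is from this patch (rough sketch, field names as in the code):

    /* block path: park the bio until a path appears */
    spin_lock_irq(&head->requeue_lock);
    bio_list_add(&head->requeue_list, bio);
    spin_unlock_irq(&head->requeue_lock);

    /* uring passthrough: park the io_uring_cmd instead */
    spin_lock_irq(&head->requeue_ioucmd_lock);
    ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
    spin_unlock_irq(&head->requeue_ioucmd_lock);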

>>>>+        payload = kmalloc(sizeof(struct nvme_uring_cmd), GFP_KERNEL);
>>>>+        if (!payload) {
>>>>+            ret = -ENOMEM;
>>>>+            goto out_unlock;
>>>>+        }
>>>>+        memcpy(payload, ioucmd->cmd, sizeof(struct nvme_uring_cmd));
>>>>+        ioucmd->cmd = payload;
>>>>-    if (ns)
>>>>-        ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
>>>>+        spin_lock_irq(&head->requeue_ioucmd_lock);
>>>>+        ioucmd_list_add(&head->requeue_ioucmd_list, ioucmd);
>>>>+        spin_unlock_irq(&head->requeue_ioucmd_lock);
>>>>+        ret = -EIOCBQUEUED;
>>>>+    } else {
>>>>+        dev_warn_ratelimited(dev, "no available path - failing I/O\n");
>>>
>>>ret=-EIO ?
>>Did not do as it was initialized to -EINVAL. Do you prefer -EIO instead?
>
>It is not an invalid argument error here, it is essentially an IO error.
>So yes, that would be my preference.

Sure, will change.

>>>>+    }
>>>>+out_unlock:
>>>>     srcu_read_unlock(&head->srcu, srcu_idx);
>>>>     return ret;
>>>> }
>>>>+
>>>>+int nvme_uring_cmd_io_retry(struct nvme_ns *ns, struct request *oreq,
>>>>+        struct io_uring_cmd *ioucmd, struct nvme_uring_cmd_pdu *pdu)
>>>>+{
>>>>+    struct nvme_ctrl *ctrl = ns->ctrl;
>>>>+    struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
>>>>+    struct nvme_uring_data d;
>>>>+    struct nvme_command c;
>>>>+    struct request *req;
>>>>+    struct bio *obio = oreq->bio;
>>>>+    void *meta = NULL;
>>>>+
>>>>+    memcpy(&c, nvme_req(oreq)->cmd, sizeof(struct nvme_command));
>>>>+    d.metadata = (__u64)pdu->meta_buffer;
>>>>+    d.metadata_len = pdu->meta_len;
>>>>+    d.timeout_ms = oreq->timeout;
>>>>+    d.addr = (__u64)ioucmd->cmd;
>>>>+    if (obio) {
>>>>+        d.data_len = obio->bi_iter.bi_size;
>>>>+        blk_rq_unmap_user(obio);
>>>>+    } else {
>>>>+        d.data_len = 0;
>>>>+    }
>>>>+    blk_mq_free_request(oreq);
>>>
>>>The way I would do this that in nvme_ioucmd_failover_req (or in the
>>>retry driven from command retriable failure) I would do the above,
>>>requeue it and kick the requeue work, to go over the requeue_list and
>>>just execute them again. Not sure why you even need an explicit retry
>>>code.
>>During retry we need passthrough command. But passthrough command is not
>>stable (i.e. valid only during first submission). We can make it stable
>>either by:
>>(a) allocating in nvme (b) return -EAGAIN to io_uring, and it will 
>>do allocate + deferral
>>Both add a cost. And since any command can potentially fail, that
>>means taking that cost for every IO that we issue on the mpath node, even
>>if no failure (initial or subsequent after IO) occurred.
>
>As mentioned, I think that if a driver consumes a command as queued,
>it needs a stable copy for a later reformation of the request for
>failover purposes.

So what do you propose to make that stable?
As I mentioned earlier, a stable copy requires allocating/copying in the
fast path, and for a condition (failover) that may not even occur.
I really think the current solution is better because it does not try to
make the command stable; rather, it reassembles the pieces of the
passthrough command if a retry (which is rare) happens.
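For contrast, a stable-copy scheme would look roughly like the below on
every mpath passthrough submission (hypothetical sketch, not part of the
patch), which is exactly the per-IO cost we are trying to avoid:

    /* hypothetical: pay allocation + copy up front, every time */
    struct nvme_uring_cmd *stable;

    stable = kmemdup(ioucmd->cmd, sizeof(*stable), GFP_KERNEL);
    if (!stable)
        return -ENOMEM;
    ioucmd->cmd = stable;
    /* ...and free it on completion, whether or not failover happened */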

>>So to avoid common-path cost, we go about doing nothing (no allocation,
>>no deferral) at the outset and choose to recreate the passthrough
>>command if a failure occurred. Hope this explains the purpose of
>>nvme_uring_cmd_io_retry?
>
>I think it does, but I'm still having a hard time with it...
>
Maybe I am reiterating, but here are a few main differences that should help:

- io_uring_cmd is at the forefront, with bio/request as secondary
  objects. The former is the persistent object across requeue attempts,
  and the only thing available when we discover the path, while the other
  two are created afresh on every retry.

- Receiving a bio from the upper layer is a luxury we do not have for
  passthrough. When we receive a bio, the pages are already mapped and we
  do not have to deal with user-specific fields, so there is more freedom
  to use an arbitrary context (workers etc.). But a passthrough command
  continues to point to user-space fields/buffers, so we need the
  submitting task's context (see the sketch right below).
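A rough sketch of the failover hook, to show why the task context matters
(the exact shape here is my assumption; the function name comes from the
discussion above, and it assumes io_uring's task-work helper is used to
get back to the submitting task before any user pointer is touched):

    static void nvme_ioucmd_failover_req(struct request *req)
    {
        struct io_uring_cmd *ioucmd = req->end_io_data;
        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);

        /* keep the old request so nvme_uring_cmd_io_retry() can mine it */
        pdu->req = req;
        /* rebuild + resubmit from the task's context, not from irq/worker */
        io_uring_cmd_complete_in_task(ioucmd, nvme_ioucmd_mpath_retry);
    }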

>>
>>>>+    req = nvme_alloc_user_request(q, &c, nvme_to_user_ptr(d.addr),
>>>>+            d.data_len, nvme_to_user_ptr(d.metadata),
>>>>+            d.metadata_len, 0, &meta, d.timeout_ms ?
>>>>+            msecs_to_jiffies(d.timeout_ms) : 0,
>>>>+            ioucmd->cmd_op == NVME_URING_CMD_IO_VEC, 0, 0);
>>>>+    if (IS_ERR(req))
>>>>+        return PTR_ERR(req);
>>>>+
>>>>+    req->end_io = nvme_uring_cmd_end_io;
>>>>+    req->end_io_data = ioucmd;
>>>>+    pdu->bio = req->bio;
>>>>+    pdu->meta = meta;
>>>>+    req->cmd_flags |= REQ_NVME_MPATH;
>>>>+    blk_execute_rq_nowait(req, false);
>>>>+    return -EIOCBQUEUED;
>>>>+}
>>>>+
>>>>+void nvme_ioucmd_mpath_retry(struct io_uring_cmd *ioucmd)
>>>>+{
>>>>+    struct cdev *cdev = file_inode(ioucmd->file)->i_cdev;
>>>>+    struct nvme_ns_head *head = container_of(cdev, struct nvme_ns_head,
>>>>+            cdev);
>>>>+    int srcu_idx = srcu_read_lock(&head->srcu);
>>>>+    struct nvme_ns *ns = nvme_find_path(head);
>>>>+    unsigned int issue_flags = IO_URING_F_SQE128 | IO_URING_F_CQE32 |
>>>>+        IO_URING_F_MPATH;
>>>>+    struct device *dev = &head->cdev_device;
>>>>+
>>>>+    if (likely(ns)) {
>>>>+        struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
>>>>+        struct request *oreq = pdu->req;
>>>>+        int ret;
>>>>+
>>>>+        if (oreq == NULL) {
>>>>+            /*
>>>>+             * this was not submitted (to device) earlier. For this
>>>>+             * ioucmd->cmd points to persistent memory. Free that
>>>>+             * up post submission
>>>>+             */
>>>>+            const void *cmd = ioucmd->cmd;
>>>>+
>>>>+            ret = nvme_ns_uring_cmd(ns, ioucmd, issue_flags);
>>>>+            kfree(cmd);
>>>>+        } else {
>>>>+            /*
>>>>+             * this was submitted (to device) earlier. Use old
>>>>+             * request, bio (if it exists) and nvme-pdu to recreate
>>>>+             * the command for the discovered path
>>>>+             */
>>>>+            ret = nvme_uring_cmd_io_retry(ns, oreq, ioucmd, pdu);
>>>
>>>Why is this needed? Why is reuse important here? Why not always call
>>>nvme_ns_uring_cmd?
>>
>>Please see the previous explanation.
>>If condition is for the case when we made the passthrough command stable
>>by allocating beforehand.
>>Else is for the case when we avoided taking that cost.
>
>The current design of the multipath failover code is clean:
>1. extract bio(s) from request
>2. link in requeue_list
>3. schedule requeue_work that,
> 3.1 takes bios 1-by-1, and submits them again (exactly the same way)

It is really clean, and it fits really well with a bio-based entry
interface. But as I said earlier, it does not map well onto the uring-cmd
based entry interface, and a bunch of other things need to be done
differently for a generic passthrough command.

>I'd like us to try to follow the same design where retry is
>literally "do the exact same thing, again". That would eliminate
>two submission paths that do the same thing, but slightly different.

The exact same thing is possible only if we make the common path slow,
i.e. allocate/copy the passthrough command and keep it alive until
completion. But that is really not the way to go, I suppose.


