[PATCH 4/4] nvme: add support for mq_ops->queue_rqs()
Max Gurtovoy
mgurtovoy at nvidia.com
Mon Dec 20 10:48:01 PST 2021
On 12/20/2021 6:34 PM, Jens Axboe wrote:
> On 12/20/21 8:29 AM, Max Gurtovoy wrote:
>> On 12/20/2021 4:19 PM, Jens Axboe wrote:
>>> On 12/20/21 3:11 AM, Max Gurtovoy wrote:
>>>> On 12/19/2021 4:48 PM, Jens Axboe wrote:
>>>>> On 12/19/21 5:14 AM, Max Gurtovoy wrote:
>>>>>> On 12/16/2021 7:16 PM, Jens Axboe wrote:
>>>>>>> On 12/16/21 9:57 AM, Max Gurtovoy wrote:
>>>>>>>> On 12/16/2021 6:36 PM, Jens Axboe wrote:
>>>>>>>>> On 12/16/21 9:34 AM, Max Gurtovoy wrote:
>>>>>>>>>> On 12/16/2021 6:25 PM, Jens Axboe wrote:
>>>>>>>>>>> On 12/16/21 9:19 AM, Max Gurtovoy wrote:
>>>>>>>>>>>> On 12/16/2021 6:05 PM, Jens Axboe wrote:
>>>>>>>>>>>>> On 12/16/21 9:00 AM, Max Gurtovoy wrote:
>>>>>>>>>>>>>> On 12/16/2021 5:48 PM, Jens Axboe wrote:
>>>>>>>>>>>>>>> On 12/16/21 6:06 AM, Max Gurtovoy wrote:
>>>>>>>>>>>>>>>> On 12/16/2021 11:08 AM, Christoph Hellwig wrote:
>>>>>>>>>>>>>>>>> On Wed, Dec 15, 2021 at 09:24:21AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>>>> + spin_lock(&nvmeq->sq_lock);
>>>>>>>>>>>>>>>>>> + while (!rq_list_empty(*rqlist)) {
>>>>>>>>>>>>>>>>>> + struct request *req = rq_list_pop(rqlist);
>>>>>>>>>>>>>>>>>> + struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
>>>>>>>>>>>>>>>>>> + absolute_pointer(&iod->cmd), sizeof(iod->cmd));
>>>>>>>>>>>>>>>>>> + if (++nvmeq->sq_tail == nvmeq->q_depth)
>>>>>>>>>>>>>>>>>> + nvmeq->sq_tail = 0;
>>>>>>>>>>>>>>>>> So this doesn't even use the new helper added in patch 2? I think this
>>>>>>>>>>>>>>>>> should call nvme_sq_copy_cmd().
>>>>>>>>>>>>>>>> I also noticed that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So need to decide if to open code it or use the helper function.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Inline helper sounds reasonable if you have 3 places that will use it.
>>>>>>>>>>>>>>> Yes agree, that's been my stance too :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The rest looks identical to the incremental patch I posted, so I guess
>>>>>>>>>>>>>>>>> the performance degration measured on the first try was a measurement
>>>>>>>>>>>>>>>>> error?
>>>>>>>>>>>>>>>> giving 1 dbr for a batch of N commands sounds good idea. Also for RDMA host.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But how do you moderate it ? what is the batch_sz <--> time_to_wait
>>>>>>>>>>>>>>>> algorithm ?
>>>>>>>>>>>>>>> The batching is naturally limited at BLK_MAX_REQUEST_COUNT, which is 32
>>>>>>>>>>>>>>> in total. I do agree that if we ever made it much larger, then we might
>>>>>>>>>>>>>>> want to cap it differently. But 32 seems like a pretty reasonable number
>>>>>>>>>>>>>>> to get enough gain from the batching done in various areas, while still
>>>>>>>>>>>>>>> not making it so large that we have a potential latency issue. That
>>>>>>>>>>>>>>> batch count is already used consistently for other items too (like tag
>>>>>>>>>>>>>>> allocation), so it's not specific to just this one case.
>>>>>>>>>>>>>> I'm saying that the you can wait to the batch_max_count too long and it
>>>>>>>>>>>>>> won't be efficient from latency POV.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So it's better to limit the block layar to wait for the first to come: x
>>>>>>>>>>>>>> usecs or batch_max_count before issue queue_rqs.
>>>>>>>>>>>>> There's no waiting specifically for this, it's just based on the plug.
>>>>>>>>>>>>> We just won't do more than 32 in that plug. This is really just an
>>>>>>>>>>>>> artifact of the plugging, and if that should be limited based on "max of
>>>>>>>>>>>>> 32 or xx time", then that should be done there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But in general I think it's saner and enough to just limit the total
>>>>>>>>>>>>> size. If we spend more than xx usec building up the plug list, we're
>>>>>>>>>>>>> doing something horribly wrong. That really should not happen with 32
>>>>>>>>>>>>> requests, and we'll never eg wait on requests if we're out of tags. That
>>>>>>>>>>>>> will result in a plug flush to begin with.
>>>>>>>>>>>> I'm not aware of the plug. I hope to get to it soon.
>>>>>>>>>>>>
>>>>>>>>>>>> My concern is if the user application submitted only 28 requests and
>>>>>>>>>>>> then you'll wait forever ? or for very long time.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess not, but I'm asking how do you know how to batch and when to
>>>>>>>>>>>> stop in case 32 commands won't arrive anytime soon.
>>>>>>>>>>> The plug is in the stack of the task, so that condition can never
>>>>>>>>>>> happen. If the application originally asks for 32 but then only submits
>>>>>>>>>>> 28, then once that last one is submitted the plug is flushed and
>>>>>>>>>>> requests are issued.
>>>>>>>>>> So if I'm running fio with --iodepth=28 what will plug do ? send batches
>>>>>>>>>> of 28 ? or 1 by 1 ?
>>>>>>>>> --iodepth just controls the overall depth, the batch submit count
>>>>>>>>> dictates what happens further down. If you run queue depth 28 and submit
>>>>>>>>> one at the time, then you'll get one at the time further down too. Hence
>>>>>>>>> the batching is directly driven by what the application is already
>>>>>>>>> doing.
>>>>>>>> I see. Thanks for the explanation.
>>>>>>>>
>>>>>>>> So it works only for io_uring based applications ?
>>>>>>> It's only enabled for io_uring right now, but it's generically available
>>>>>>> for anyone that wants to use it... Would be trivial to do for aio, and
>>>>>>> other spots that currently use blk_start_plug() and has an idea of how
>>>>>>> many IOs will be submitted
>>>>>> Can you please share an example application (or is it fio patches) that
>>>>>> can submit batches ? The same that was used to test this patchset is
>>>>>> fine too.
>>>>>>
>>>>>> I would like to test it with our NVMe SNAP controllers and also to
>>>>>> develop NVMe/RDMA queue_rqs code and test the perf with it.
>>>>> You should just be able to use iodepth_batch with fio. For my peak
>>>>> testing, I use t/io_uring from the fio repo. By default, it'll run QD of
>>>>> and do batches of 32 for complete and submit. You can just run:
>>>>>
>>>>> t/io_uring <dev or file>
>>>>>
>>>>> maybe adding -p0 for IRQ driven rather than polled IO.
>>>> I used your block/for-next branch and implemented queue_rqs in NVMe/RDMA
>>>> but it was never called using the t/io_uring test nor fio with
>>>> iodepth_batch=32 flag with io_uring engine.
>>>>
>>>> Any idea what might be the issue ?
>>>>
>>>> I installed fio from sources..
>>> The two main restrictions right now are a scheduler and shared tags, are
>>> you using any of those?
>> No.
>>
>> But maybe I'm missing the .commit_rqs callback. is it mandatory for this
>> feature ?
> I've only tested with nvme pci which does have it, but I don't think so.
> Unless there's some check somewhere that makes it necessary. Can you
> share the patch you're currently using on top?
The attached POC patches apply cleanly on block/for-next branch
commit 7925bb75e8effa5de85b1cf8425cd5c21f212b1d (block/for-next)
Merge: eb12bde9eba8 3427f2b2c533
Author: Jens Axboe <axboe at kernel.dk>
Date: Fri Dec 17 09:51:05 2021 -0700
Merge branch 'for-5.17/drivers' into for-next
* for-5.17/drivers:
block: remove the rsxx driver
rsxx: Drop PCI legacy power management
mtip32xx: convert to generic power management
mtip32xx: remove pointless drvdata lookups
mtip32xx: remove pointless drvdata checking
drbd: Use struct_group() to zero algs
loop: make autoclear operation asynchronous
null_blk: cast command status to integer
pktdvd: stop using bdi congestion framework.
-------------- next part --------------
From 0de2836ce21df6801db580e154296544a741b6c4 Mon Sep 17 00:00:00 2001
From: Max Gurtovoy <mgurtovoy at nvidia.com>
Date: Thu, 2 Dec 2021 19:59:00 +0200
Subject: [PATCH 1/2] nvme-rdma: prepare for queue_rqs implementation
Signed-off-by: Max Gurtovoy <mgurtovoy at nvidia.com>
---
drivers/nvme/host/rdma.c | 127 ++++++++++++++++++++++++++-------------
1 file changed, 84 insertions(+), 43 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 850f84d204d0..2d608cb48392 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -70,6 +70,7 @@ struct nvme_rdma_request {
struct ib_sge sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
u32 num_sge;
struct ib_reg_wr reg_wr;
+ struct ib_send_wr send_wr;
struct ib_cqe reg_cqe;
struct nvme_rdma_queue *queue;
struct nvme_rdma_sgl data_sgl;
@@ -1635,33 +1636,31 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
nvme_rdma_end_request(req);
}
-static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
+static void nvme_rdma_prep_send(struct nvme_rdma_queue *queue,
struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
- struct ib_send_wr *first)
+ struct ib_send_wr *wr)
{
- struct ib_send_wr wr;
- int ret;
-
- sge->addr = qe->dma;
+ sge->addr = qe->dma;
sge->length = sizeof(struct nvme_command);
- sge->lkey = queue->device->pd->local_dma_lkey;
+ sge->lkey = queue->device->pd->local_dma_lkey;
- wr.next = NULL;
- wr.wr_cqe = &qe->cqe;
- wr.sg_list = sge;
- wr.num_sge = num_sge;
- wr.opcode = IB_WR_SEND;
- wr.send_flags = IB_SEND_SIGNALED;
+ wr->next = NULL;
+ wr->wr_cqe = &qe->cqe;
+ wr->sg_list = sge;
+ wr->num_sge = num_sge;
+ wr->opcode = IB_WR_SEND;
+ wr->send_flags = IB_SEND_SIGNALED;
+}
- if (first)
- first->next = ≀
- else
- first = ≀
+static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
+ struct ib_send_wr *wr)
+{
+ int ret;
- ret = ib_post_send(queue->qp, first, NULL);
+ ret = ib_post_send(queue->qp, wr, NULL);
if (unlikely(ret)) {
dev_err(queue->ctrl->ctrl.device,
- "%s failed with error code %d\n", __func__, ret);
+ "%s failed with error code %d\n", __func__, ret);
}
return ret;
}
@@ -1715,6 +1714,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
struct nvme_rdma_qe *sqe = &ctrl->async_event_sqe;
struct nvme_command *cmd = sqe->data;
struct ib_sge sge;
+ struct ib_send_wr wr;
int ret;
ib_dma_sync_single_for_cpu(dev, sqe->dma, sizeof(*cmd), DMA_TO_DEVICE);
@@ -1730,7 +1730,8 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
DMA_TO_DEVICE);
- ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL);
+ nvme_rdma_prep_send(queue, sqe, &sge, 1, &wr);
+ ret = nvme_rdma_post_send(queue, &wr);
WARN_ON_ONCE(ret);
}
@@ -2034,27 +2035,35 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
return BLK_EH_RESET_TIMER;
}
-static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
- const struct blk_mq_queue_data *bd)
+static blk_status_t nvme_rdma_cleanup_rq(struct nvme_rdma_queue *queue,
+ struct request *rq, int err)
{
- struct nvme_ns *ns = hctx->queue->queuedata;
- struct nvme_rdma_queue *queue = hctx->driver_data;
- struct request *rq = bd->rq;
+ struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+ blk_status_t ret;
+
+ nvme_rdma_unmap_data(queue, rq);
+ if (err == -EIO)
+ ret = nvme_host_path_error(rq);
+ else if (err == -ENOMEM || err == -EAGAIN)
+ ret = BLK_STS_RESOURCE;
+ else
+ ret = BLK_STS_IOERR;
+ nvme_cleanup_cmd(rq);
+ ib_dma_unmap_single(queue->device->dev, req->sqe.dma,
+ sizeof(struct nvme_command), DMA_TO_DEVICE);
+ return ret;
+}
+
+static blk_status_t nvme_rdma_prep_rq(struct nvme_rdma_queue *queue,
+ struct request *rq, struct nvme_ns *ns)
+{
+ struct ib_device *dev = queue->device->dev;
struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
struct nvme_rdma_qe *sqe = &req->sqe;
struct nvme_command *c = nvme_req(rq)->cmd;
- struct ib_device *dev;
- bool queue_ready = test_bit(NVME_RDMA_Q_LIVE, &queue->flags);
blk_status_t ret;
int err;
- WARN_ON_ONCE(rq->tag < 0);
-
- if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
- return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
-
- dev = queue->device->dev;
-
req->sqe.dma = ib_dma_map_single(dev, req->sqe.data,
sizeof(struct nvme_command),
DMA_TO_DEVICE);
@@ -2083,8 +2092,8 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
err = nvme_rdma_map_data(queue, rq, c);
if (unlikely(err < 0)) {
dev_err(queue->ctrl->ctrl.device,
- "Failed to map data (%d)\n", err);
- goto err;
+ "Failed to map data (%d)\n", err);
+ goto out_err;
}
sqe->cqe.done = nvme_rdma_send_done;
@@ -2092,16 +2101,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
ib_dma_sync_single_for_device(dev, sqe->dma,
sizeof(struct nvme_command), DMA_TO_DEVICE);
- err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
- req->mr ? &req->reg_wr.wr : NULL);
- if (unlikely(err))
- goto err_unmap;
+ nvme_rdma_prep_send(queue, sqe, req->sge, req->num_sge, &req->send_wr);
+ if (req->mr)
+ req->reg_wr.wr.next = &req->send_wr;
return BLK_STS_OK;
-err_unmap:
- nvme_rdma_unmap_data(queue, rq);
-err:
+out_err:
if (err == -EIO)
ret = nvme_host_path_error(rq);
else if (err == -ENOMEM || err == -EAGAIN)
@@ -2115,6 +2121,41 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
return ret;
}
+static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
+ const struct blk_mq_queue_data *bd)
+{
+ struct nvme_ns *ns = hctx->queue->queuedata;
+ struct nvme_rdma_queue *queue = hctx->driver_data;
+ struct request *rq = bd->rq;
+ struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+ bool queue_ready = test_bit(NVME_RDMA_Q_LIVE, &queue->flags);
+ struct ib_send_wr *wr;
+ int err;
+
+ WARN_ON_ONCE(rq->tag < 0);
+
+ if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
+ return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
+
+ err = nvme_rdma_prep_rq(queue, rq, ns);
+ if (unlikely(err))
+ return err;
+
+ if (req->mr)
+ wr = &req->reg_wr.wr;
+ else
+ wr = &req->send_wr;
+
+ err = nvme_rdma_post_send(queue, wr);
+ if (unlikely(err))
+ goto out_cleanup_rq;
+
+ return BLK_STS_OK;
+
+out_cleanup_rq:
+ return nvme_rdma_cleanup_rq(queue, rq, err);
+}
+
static int nvme_rdma_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
{
struct nvme_rdma_queue *queue = hctx->driver_data;
--
2.18.1
-------------- next part --------------
From 851a1f35420206f7b631d5d12b135e5a7c84b912 Mon Sep 17 00:00:00 2001
From: Max Gurtovoy <mgurtovoy at nvidia.com>
Date: Mon, 20 Dec 2021 20:42:49 +0200
Subject: [PATCH 2/2] nvme-rdma: add support for mq_ops->queue_rqs()
Signed-off-by: Max Gurtovoy <mgurtovoy at nvidia.com>
---
drivers/nvme/host/rdma.c | 75 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 2d608cb48392..765bb57f0a55 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2121,6 +2121,80 @@ static blk_status_t nvme_rdma_prep_rq(struct nvme_rdma_queue *queue,
return ret;
}
+static bool nvme_rdma_prep_rq_batch(struct nvme_rdma_queue *queue,
+ struct request *rq)
+{
+ bool queue_ready = test_bit(NVME_RDMA_Q_LIVE, &queue->flags);
+
+ if (unlikely(!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready)))
+ return false;
+
+ rq->mq_hctx->tags->rqs[rq->tag] = rq;
+ return nvme_rdma_prep_rq(queue, rq, rq->q->queuedata) == BLK_STS_OK;
+}
+
+static void nvme_rdma_submit_cmds(struct nvme_rdma_queue *queue,
+ struct request **rqlist)
+{
+ struct request *first_rq = rq_list_peek(rqlist);
+ struct nvme_rdma_request *nreq = blk_mq_rq_to_pdu(first_rq);
+ struct ib_send_wr *first, *last = NULL;
+ int ret;
+
+ if (nreq->mr)
+ first = &nreq->reg_wr.wr;
+ else
+ first = &nreq->send_wr;
+
+ while (!rq_list_empty(*rqlist)) {
+ struct request *rq = rq_list_pop(rqlist);
+ struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+ struct ib_send_wr *tmp;
+
+ tmp = last;
+ last = &req->send_wr;
+ if (tmp) {
+ if (req->mr)
+ tmp->next = &req->reg_wr.wr;
+ else
+ tmp->next = &req->send_wr;
+ }
+ }
+
+ ret = nvme_rdma_post_send(queue, first);
+ WARN_ON_ONCE(ret);
+}
+
+static void nvme_rdma_queue_rqs(struct request **rqlist)
+{
+ struct request *req = rq_list_peek(rqlist), *prev = NULL;
+ struct request *requeue_list = NULL;
+
+ do {
+ struct nvme_rdma_queue *queue = req->mq_hctx->driver_data;
+
+ if (!nvme_rdma_prep_rq_batch(queue, req)) {
+ /* detach 'req' and add to remainder list */
+ if (prev)
+ prev->rq_next = req->rq_next;
+ rq_list_add(&requeue_list, req);
+ } else {
+ prev = req;
+ }
+
+ req = rq_list_next(req);
+ if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
+ /* detach rest of list, and submit */
+ if (prev)
+ prev->rq_next = NULL;
+ nvme_rdma_submit_cmds(queue, rqlist);
+ *rqlist = req;
+ }
+ } while (req);
+
+ *rqlist = requeue_list;
+}
+
static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -2258,6 +2332,7 @@ static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
static const struct blk_mq_ops nvme_rdma_mq_ops = {
.queue_rq = nvme_rdma_queue_rq,
+ .queue_rqs = nvme_rdma_queue_rqs,
.complete = nvme_rdma_complete_rq,
.init_request = nvme_rdma_init_request,
.exit_request = nvme_rdma_exit_request,
--
2.18.1
More information about the Linux-nvme
mailing list