[PATCH 4/4] nvme: add support for mq_ops->queue_rqs()

Mon Dec 20 10:48:01 PST 2021

On 12/20/2021 6:34 PM, Jens Axboe wrote:
> On 12/20/21 8:29 AM, Max Gurtovoy wrote:
>> On 12/20/2021 4:19 PM, Jens Axboe wrote:
>>> On 12/20/21 3:11 AM, Max Gurtovoy wrote:
>>>> On 12/19/2021 4:48 PM, Jens Axboe wrote:
>>>>> On 12/19/21 5:14 AM, Max Gurtovoy wrote:
>>>>>> On 12/16/2021 7:16 PM, Jens Axboe wrote:
>>>>>>> On 12/16/21 9:57 AM, Max Gurtovoy wrote:
>>>>>>>> On 12/16/2021 6:36 PM, Jens Axboe wrote:
>>>>>>>>> On 12/16/21 9:34 AM, Max Gurtovoy wrote:
>>>>>>>>>> On 12/16/2021 6:25 PM, Jens Axboe wrote:
>>>>>>>>>>> On 12/16/21 9:19 AM, Max Gurtovoy wrote:
>>>>>>>>>>>> On 12/16/2021 6:05 PM, Jens Axboe wrote:
>>>>>>>>>>>>> On 12/16/21 9:00 AM, Max Gurtovoy wrote:
>>>>>>>>>>>>>> On 12/16/2021 5:48 PM, Jens Axboe wrote:
>>>>>>>>>>>>>>> On 12/16/21 6:06 AM, Max Gurtovoy wrote:
>>>>>>>>>>>>>>>> On 12/16/2021 11:08 AM, Christoph Hellwig wrote:
>>>>>>>>>>>>>>>>> On Wed, Dec 15, 2021 at 09:24:21AM -0700, Jens Axboe wrote:
>>>>>>>>>>>>>>>>>> +	spin_lock(&nvmeq->sq_lock);
>>>>>>>>>>>>>>>>>> +	while (!rq_list_empty(*rqlist)) {
>>>>>>>>>>>>>>>>>> +		struct request *req = rq_list_pop(rqlist);
>>>>>>>>>>>>>>>>>> +		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +		memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
>>>>>>>>>>>>>>>>>> +				absolute_pointer(&iod->cmd), sizeof(iod->cmd));
>>>>>>>>>>>>>>>>>> +		if (++nvmeq->sq_tail == nvmeq->q_depth)
>>>>>>>>>>>>>>>>>> +			nvmeq->sq_tail = 0;
>>>>>>>>>>>>>>>>> So this doesn't even use the new helper added in patch 2?  I think this
>>>>>>>>>>>>>>>>> should call nvme_sq_copy_cmd().
>>>>>>>>>>>>>>>> I also noticed that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So need to decide if to open code it or use the helper function.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Inline helper sounds reasonable if you have 3 places that will use it.
>>>>>>>>>>>>>>> Yes agree, that's been my stance too :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The rest looks identical to the incremental patch I posted, so I guess
>>>>>>>>>>>>>>>>> the performance degration measured on the first try was a measurement
>>>>>>>>>>>>>>>>> error?
>>>>>>>>>>>>>>>> giving 1 dbr for a batch of N commands sounds good idea. Also for RDMA host.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But how do you moderate it ? what is the batch_sz <--> time_to_wait
>>>>>>>>>>>>>>>> algorithm ?
>>>>>>>>>>>>>>> The batching is naturally limited at BLK_MAX_REQUEST_COUNT, which is 32
>>>>>>>>>>>>>>> in total. I do agree that if we ever made it much larger, then we might
>>>>>>>>>>>>>>> want to cap it differently. But 32 seems like a pretty reasonable number
>>>>>>>>>>>>>>> to get enough gain from the batching done in various areas, while still
>>>>>>>>>>>>>>> not making it so large that we have a potential latency issue. That
>>>>>>>>>>>>>>> batch count is already used consistently for other items too (like tag
>>>>>>>>>>>>>>> allocation), so it's not specific to just this one case.
>>>>>>>>>>>>>> I'm saying that the you can wait to the batch_max_count too long and it
>>>>>>>>>>>>>> won't be efficient from latency POV.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So it's better to limit the block layar to wait for the first to come: x
>>>>>>>>>>>>>> usecs or batch_max_count before issue queue_rqs.
>>>>>>>>>>>>> There's no waiting specifically for this, it's just based on the plug.
>>>>>>>>>>>>> We just won't do more than 32 in that plug. This is really just an
>>>>>>>>>>>>> artifact of the plugging, and if that should be limited based on "max of
>>>>>>>>>>>>> 32 or xx time", then that should be done there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But in general I think it's saner and enough to just limit the total
>>>>>>>>>>>>> size. If we spend more than xx usec building up the plug list, we're
>>>>>>>>>>>>> doing something horribly wrong. That really should not happen with 32
>>>>>>>>>>>>> requests, and we'll never eg wait on requests if we're out of tags. That
>>>>>>>>>>>>> will result in a plug flush to begin with.
>>>>>>>>>>>> I'm not aware of the plug. I hope to get to it soon.
>>>>>>>>>>>>
>>>>>>>>>>>> My concern is if the user application submitted only 28 requests and
>>>>>>>>>>>> then you'll wait forever ? or for very long time.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess not, but I'm asking how do you know how to batch and when to
>>>>>>>>>>>> stop in case 32 commands won't arrive anytime soon.
>>>>>>>>>>> The plug is in the stack of the task, so that condition can never
>>>>>>>>>>> happen. If the application originally asks for 32 but then only submits
>>>>>>>>>>> 28, then once that last one is submitted the plug is flushed and
>>>>>>>>>>> requests are issued.
>>>>>>>>>> So if I'm running fio with --iodepth=28 what will plug do ? send batches
>>>>>>>>>> of 28 ? or 1 by 1 ?
>>>>>>>>> --iodepth just controls the overall depth, the batch submit count
>>>>>>>>> dictates what happens further down. If you run queue depth 28 and submit
>>>>>>>>> one at the time, then you'll get one at the time further down too. Hence
>>>>>>>>> the batching is directly driven by what the application is already
>>>>>>>>> doing.
>>>>>>>> I see. Thanks for the explanation.
>>>>>>>>
>>>>>>>> So it works only for io_uring based applications ?
>>>>>>> It's only enabled for io_uring right now, but it's generically available
>>>>>>> for anyone that wants to use it... Would be trivial to do for aio, and
>>>>>>> other spots that currently use blk_start_plug() and has an idea of how
>>>>>>> many IOs will be submitted
>>>>>> Can you please share an example application (or is it fio patches) that
>>>>>> can submit batches ? The same that was used to test this patchset is
>>>>>> fine too.
>>>>>>
>>>>>> I would like to test it with our NVMe SNAP controllers and also to
>>>>>> develop NVMe/RDMA queue_rqs code and test the perf with it.
>>>>> You should just be able to use iodepth_batch with fio. For my peak
>>>>> testing, I use t/io_uring from the fio repo. By default, it'll run QD of
>>>>> and do batches of 32 for complete and submit. You can just run:
>>>>>
>>>>> t/io_uring <dev or file>
>>>>>
>>>>> maybe adding -p0 for IRQ driven rather than polled IO.
>>>> I used your block/for-next branch and implemented queue_rqs in NVMe/RDMA
>>>> but it was never called using the t/io_uring test nor fio with
>>>> iodepth_batch=32 flag with io_uring engine.
>>>>
>>>> Any idea what might be the issue ?
>>>>
>>>> I installed fio from sources..
>>> The two main restrictions right now are a scheduler and shared tags, are
>>> you using any of those?
>> No.
>>
>> But maybe I'm missing the .commit_rqs callback. is it mandatory for this
>> feature ?
> I've only tested with nvme pci which does have it, but I don't think so.
> Unless there's some check somewhere that makes it necessary. Can you
> share the patch you're currently using on top?

The attached POC patches apply cleanly on block/for-next branch

commit 7925bb75e8effa5de85b1cf8425cd5c21f212b1d (block/for-next)
Merge: eb12bde9eba8 3427f2b2c533
Author: Jens Axboe <axboe at kernel.dk>
Date:   Fri Dec 17 09:51:05 2021 -0700

     Merge branch 'for-5.17/drivers' into for-next

     * for-5.17/drivers:
       block: remove the rsxx driver
       rsxx: Drop PCI legacy power management
       mtip32xx: convert to generic power management
       mtip32xx: remove pointless drvdata lookups
       mtip32xx: remove pointless drvdata checking
       drbd: Use struct_group() to zero algs
       loop: make autoclear operation asynchronous
       null_blk: cast command status to integer
       pktdvd: stop using bdi congestion framework.

-------------- next part --------------
From 0de2836ce21df6801db580e154296544a741b6c4 Mon Sep 17 00:00:00 2001
From: Max Gurtovoy <mgurtovoy at nvidia.com>
Date: Thu, 2 Dec 2021 19:59:00 +0200
Subject: [PATCH 1/2] nvme-rdma: prepare for queue_rqs implementation

Signed-off-by: Max Gurtovoy <mgurtovoy at nvidia.com>
---
 drivers/nvme/host/rdma.c | 127 ++++++++++++++++++++++++++-------------
 1 file changed, 84 insertions(+), 43 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 850f84d204d0..2d608cb48392 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -70,6 +70,7 @@ struct nvme_rdma_request {
 	struct ib_sge		sge[1 + NVME_RDMA_MAX_INLINE_SEGMENTS];
 	u32			num_sge;
 	struct ib_reg_wr	reg_wr;
+	struct ib_send_wr	send_wr;
 	struct ib_cqe		reg_cqe;
 	struct nvme_rdma_queue  *queue;
 	struct nvme_rdma_sgl	data_sgl;
@@ -1635,33 +1636,31 @@ static void nvme_rdma_send_done(struct ib_cq *cq, struct ib_wc *wc)
 		nvme_rdma_end_request(req);
 }
 
-static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
+static void nvme_rdma_prep_send(struct nvme_rdma_queue *queue,
 		struct nvme_rdma_qe *qe, struct ib_sge *sge, u32 num_sge,
-		struct ib_send_wr *first)
+		struct ib_send_wr *wr)
 {
-	struct ib_send_wr wr;
-	int ret;
-
-	sge->addr   = qe->dma;
+	sge->addr = qe->dma;
 	sge->length = sizeof(struct nvme_command);
-	sge->lkey   = queue->device->pd->local_dma_lkey;
+	sge->lkey = queue->device->pd->local_dma_lkey;
 
-	wr.next       = NULL;
-	wr.wr_cqe     = &qe->cqe;
-	wr.sg_list    = sge;
-	wr.num_sge    = num_sge;
-	wr.opcode     = IB_WR_SEND;
-	wr.send_flags = IB_SEND_SIGNALED;
+	wr->next = NULL;
+	wr->wr_cqe = &qe->cqe;
+	wr->sg_list = sge;
+	wr->num_sge = num_sge;
+	wr->opcode = IB_WR_SEND;
+	wr->send_flags = IB_SEND_SIGNALED;
+}
 
-	if (first)
-		first->next = ≀
-	else
-		first = ≀
+static int nvme_rdma_post_send(struct nvme_rdma_queue *queue,
+		struct ib_send_wr *wr)
+{
+	int ret;
 
-	ret = ib_post_send(queue->qp, first, NULL);
+	ret = ib_post_send(queue->qp, wr, NULL);
 	if (unlikely(ret)) {
 		dev_err(queue->ctrl->ctrl.device,
-			     "%s failed with error code %d\n", __func__, ret);
+			"%s failed with error code %d\n", __func__, ret);
 	}
 	return ret;
 }
@@ -1715,6 +1714,7 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
 	struct nvme_rdma_qe *sqe = &ctrl->async_event_sqe;
 	struct nvme_command *cmd = sqe->data;
 	struct ib_sge sge;
+	struct ib_send_wr wr;
 	int ret;
 
 	ib_dma_sync_single_for_cpu(dev, sqe->dma, sizeof(*cmd), DMA_TO_DEVICE);
@@ -1730,7 +1730,8 @@ static void nvme_rdma_submit_async_event(struct nvme_ctrl *arg)
 	ib_dma_sync_single_for_device(dev, sqe->dma, sizeof(*cmd),
 			DMA_TO_DEVICE);
 
-	ret = nvme_rdma_post_send(queue, sqe, &sge, 1, NULL);
+	nvme_rdma_prep_send(queue, sqe, &sge, 1, &wr);
+	ret = nvme_rdma_post_send(queue, &wr);
 	WARN_ON_ONCE(ret);
 }
 
@@ -2034,27 +2035,35 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
 	return BLK_EH_RESET_TIMER;
 }
 
-static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
-		const struct blk_mq_queue_data *bd)
+static blk_status_t nvme_rdma_cleanup_rq(struct nvme_rdma_queue *queue,
+		struct request *rq, int err)
 {
-	struct nvme_ns *ns = hctx->queue->queuedata;
-	struct nvme_rdma_queue *queue = hctx->driver_data;
-	struct request *rq = bd->rq;
+	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+	blk_status_t ret;
+
+	nvme_rdma_unmap_data(queue, rq);
+	if (err == -EIO)
+		ret = nvme_host_path_error(rq);
+	else if (err == -ENOMEM || err == -EAGAIN)
+		ret = BLK_STS_RESOURCE;
+	else
+		ret = BLK_STS_IOERR;
+	nvme_cleanup_cmd(rq);
+	ib_dma_unmap_single(queue->device->dev, req->sqe.dma,
+			    sizeof(struct nvme_command), DMA_TO_DEVICE);
+	return ret;
+}
+
+static blk_status_t nvme_rdma_prep_rq(struct nvme_rdma_queue *queue,
+		struct request *rq, struct nvme_ns *ns)
+{
+	struct ib_device *dev = queue->device->dev;
 	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
 	struct nvme_rdma_qe *sqe = &req->sqe;
 	struct nvme_command *c = nvme_req(rq)->cmd;
-	struct ib_device *dev;
-	bool queue_ready = test_bit(NVME_RDMA_Q_LIVE, &queue->flags);
 	blk_status_t ret;
 	int err;
 
-	WARN_ON_ONCE(rq->tag < 0);
-
-	if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
-		return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
-
-	dev = queue->device->dev;
-
 	req->sqe.dma = ib_dma_map_single(dev, req->sqe.data,
 					 sizeof(struct nvme_command),
 					 DMA_TO_DEVICE);
@@ -2083,8 +2092,8 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	err = nvme_rdma_map_data(queue, rq, c);
 	if (unlikely(err < 0)) {
 		dev_err(queue->ctrl->ctrl.device,
-			     "Failed to map data (%d)\n", err);
-		goto err;
+			"Failed to map data (%d)\n", err);
+		goto out_err;
 	}
 
 	sqe->cqe.done = nvme_rdma_send_done;
@@ -2092,16 +2101,13 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	ib_dma_sync_single_for_device(dev, sqe->dma,
 			sizeof(struct nvme_command), DMA_TO_DEVICE);
 
-	err = nvme_rdma_post_send(queue, sqe, req->sge, req->num_sge,
-			req->mr ? &req->reg_wr.wr : NULL);
-	if (unlikely(err))
-		goto err_unmap;
+	nvme_rdma_prep_send(queue, sqe, req->sge, req->num_sge, &req->send_wr);
+	if (req->mr)
+		req->reg_wr.wr.next = &req->send_wr;
 
 	return BLK_STS_OK;
 
-err_unmap:
-	nvme_rdma_unmap_data(queue, rq);
-err:
+out_err:
 	if (err == -EIO)
 		ret = nvme_host_path_error(rq);
 	else if (err == -ENOMEM || err == -EAGAIN)
@@ -2115,6 +2121,41 @@ static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
+		const struct blk_mq_queue_data *bd)
+{
+	struct nvme_ns *ns = hctx->queue->queuedata;
+	struct nvme_rdma_queue *queue = hctx->driver_data;
+	struct request *rq = bd->rq;
+	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+	bool queue_ready = test_bit(NVME_RDMA_Q_LIVE, &queue->flags);
+	struct ib_send_wr *wr;
+	int err;
+
+	WARN_ON_ONCE(rq->tag < 0);
+
+	if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
+		return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
+
+	err = nvme_rdma_prep_rq(queue, rq, ns);
+	if (unlikely(err))
+		return err;
+
+	if (req->mr)
+		wr = &req->reg_wr.wr;
+	else
+		wr = &req->send_wr;
+
+	err = nvme_rdma_post_send(queue, wr);
+	if (unlikely(err))
+		goto out_cleanup_rq;
+
+	return BLK_STS_OK;
+
+out_cleanup_rq:
+	return nvme_rdma_cleanup_rq(queue, rq, err);
+}
+
 static int nvme_rdma_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
 {
 	struct nvme_rdma_queue *queue = hctx->driver_data;
-- 
2.18.1

-------------- next part --------------
From 851a1f35420206f7b631d5d12b135e5a7c84b912 Mon Sep 17 00:00:00 2001
From: Max Gurtovoy <mgurtovoy at nvidia.com>
Date: Mon, 20 Dec 2021 20:42:49 +0200
Subject: [PATCH 2/2] nvme-rdma: add support for mq_ops->queue_rqs()

Signed-off-by: Max Gurtovoy <mgurtovoy at nvidia.com>
---
 drivers/nvme/host/rdma.c | 75 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 2d608cb48392..765bb57f0a55 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2121,6 +2121,80 @@ static blk_status_t nvme_rdma_prep_rq(struct nvme_rdma_queue *queue,
 	return ret;
 }
 
+static bool nvme_rdma_prep_rq_batch(struct nvme_rdma_queue *queue,
+		struct request *rq)
+{
+	bool queue_ready = test_bit(NVME_RDMA_Q_LIVE, &queue->flags);
+
+	if (unlikely(!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready)))
+		return false;
+
+	rq->mq_hctx->tags->rqs[rq->tag] = rq;
+	return nvme_rdma_prep_rq(queue, rq, rq->q->queuedata) == BLK_STS_OK;
+}
+
+static void nvme_rdma_submit_cmds(struct nvme_rdma_queue *queue,
+		struct request **rqlist)
+{
+	struct request *first_rq = rq_list_peek(rqlist);
+	struct nvme_rdma_request *nreq = blk_mq_rq_to_pdu(first_rq);
+	struct ib_send_wr *first, *last = NULL;
+	int ret;
+
+	if (nreq->mr)
+		first = &nreq->reg_wr.wr;
+	else
+		first = &nreq->send_wr;
+
+	while (!rq_list_empty(*rqlist)) {
+		struct request *rq = rq_list_pop(rqlist);
+		struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
+		struct ib_send_wr *tmp;
+
+		tmp = last;
+		last = &req->send_wr;
+		if (tmp) {
+			if (req->mr)
+				tmp->next = &req->reg_wr.wr;
+			else
+				tmp->next = &req->send_wr;
+		}
+	}
+
+	ret = nvme_rdma_post_send(queue, first);
+	WARN_ON_ONCE(ret);
+}
+
+static void nvme_rdma_queue_rqs(struct request **rqlist)
+{
+	struct request *req = rq_list_peek(rqlist), *prev = NULL;
+	struct request *requeue_list = NULL;
+
+	do {
+		struct nvme_rdma_queue *queue = req->mq_hctx->driver_data;
+
+		if (!nvme_rdma_prep_rq_batch(queue, req)) {
+			/* detach 'req' and add to remainder list */
+			if (prev)
+				prev->rq_next = req->rq_next;
+			rq_list_add(&requeue_list, req);
+		} else {
+			prev = req;
+		}
+
+		req = rq_list_next(req);
+		if (!req || (prev && req->mq_hctx != prev->mq_hctx)) {
+			/* detach rest of list, and submit */
+			if (prev)
+				prev->rq_next = NULL;
+			nvme_rdma_submit_cmds(queue, rqlist);
+			*rqlist = req;
+		}
+	} while (req);
+
+	*rqlist = requeue_list;
+}
+
 static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
@@ -2258,6 +2332,7 @@ static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
 
 static const struct blk_mq_ops nvme_rdma_mq_ops = {
 	.queue_rq	= nvme_rdma_queue_rq,
+	.queue_rqs	= nvme_rdma_queue_rqs,
 	.complete	= nvme_rdma_complete_rq,
 	.init_request	= nvme_rdma_init_request,
 	.exit_request	= nvme_rdma_exit_request,
-- 
2.18.1