[PATCH 1/4] blk-mq: introduce BLK_MQ_F_SCHED_USE_HW_TAG
Ming Lei
ming.lei at redhat.com
Fri May 5 16:33:56 PDT 2017
On Sat, May 06, 2017 at 06:54:09AM +0800, Ming Lei wrote:
> On Thu, May 04, 2017 at 08:06:15AM -0600, Jens Axboe wrote:
> > On 05/03/2017 08:51 PM, Ming Lei wrote:
> > > On Wed, May 03, 2017 at 08:13:03PM -0600, Jens Axboe wrote:
> > >> On 05/03/2017 08:01 PM, Ming Lei wrote:
> > >>> On Thu, May 4, 2017 at 5:40 AM, Omar Sandoval <osandov at osandov.com> wrote:
> > >>>> On Thu, May 04, 2017 at 04:13:51AM +0800, Ming Lei wrote:
> > >>>>> On Thu, May 4, 2017 at 12:46 AM, Omar Sandoval <osandov at osandov.com> wrote:
> > >>>>>> On Fri, Apr 28, 2017 at 11:15:36PM +0800, Ming Lei wrote:
> > >>>>>>> When a blk-mq I/O scheduler is used, we need two tags for
> > >>>>>>> submitting one request. One is the scheduler tag, used for
> > >>>>>>> allocating the request and scheduling I/O; the other is the
> > >>>>>>> driver tag, used for dispatching IO to the hardware/driver.
> > >>>>>>> This approach introduces one extra per-queue allocation for
> > >>>>>>> both tags and the request pool, and may not be as efficient
> > >>>>>>> as the case of the 'none' scheduler.
> > >>>>>>>
> > >>>>>>> Also, currently we put a default per-hctx limit on schedulable
> > >>>>>>> requests, and this limit may become a bottleneck for some devices,
> > >>>>>>> especially when these devices have quite a big tag space.
> > >>>>>>>
> > >>>>>>> This patch introduces BLK_MQ_F_SCHED_USE_HW_TAG so that we can
> > >>>>>>> use hardware/driver tags directly for IO scheduling if the
> > >>>>>>> device's hardware tag space is big enough. Then we can avoid
> > >>>>>>> the extra resource allocation and make IO submission more
> > >>>>>>> efficient.
> > >>>>>>>
> > >>>>>>> Signed-off-by: Ming Lei <ming.lei at redhat.com>
> > >>>>>>> ---
> > >>>>>>>  block/blk-mq-sched.c   | 10 +++++++++-
> > >>>>>>>  block/blk-mq.c         | 35 +++++++++++++++++++++++++++++------
> > >>>>>>>  include/linux/blk-mq.h |  1 +
> > >>>>>>>  3 files changed, 39 insertions(+), 7 deletions(-)
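
[ For illustration only, not part of the patch: a driver that knows its
  hardware tag space is large would opt in from its tag set setup,
  roughly like below.  Whether the flag ends up being set by drivers or
  chosen by blk-mq itself is still open, so treat this as a sketch; the
  driver name and the numbers are made up. ]

	static int example_driver_init_tag_set(struct example_dev *dev)
	{
		struct blk_mq_tag_set *set = &dev->tag_set;

		set->ops = &example_mq_ops;
		set->nr_hw_queues = 1;
		set->queue_depth = 1024;	/* big enough hw tag space */
		set->numa_node = NUMA_NO_NODE;
		set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SCHED_USE_HW_TAG;

		return blk_mq_alloc_tag_set(set);
	}
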
> > >>>>>>
> > >>>>>> One more note on this: if we're using the hardware tags directly, then
> > >>>>>> we are no longer limited to q->nr_requests requests in-flight. Instead,
> > >>>>>> we're limited to the hw queue depth. We probably want to maintain the
> > >>>>>> original behavior,
> > >>>>>
> > >>>>> That needs further investigation, and generally the scheduler should
> > >>>>> be happy with more requests that can be scheduled.
> > >>>>>
> > >>>>> We can do that as a follow-up.
> > >>>>
> > >>>> If we say nr_requests is 256, then we should honor that. So either
> > >>>> update nr_requests to reflect the actual depth we're using or resize the
> > >>>> hardware tags.
> > >>>
> > >>> Firstly, nr_requests is set to 256 by blk-mq internally rather than by
> > >>> user space, so it shouldn't be a big deal to violate that.
> > >>
> > >> The legacy scheduling layer used 2*128 by default; that's why I used the
> > >> "magic" 256 internally. FWIW, I agree with Omar here. If it's set to
> > >> 256, we must honor that. Users will tweak this value down to trade peak
> > >> performance for latency, so it's important that it does what it advertises.
> > >
> > > In the case of scheduling with hw tags, we share tags between the scheduler
> > > and dispatching. If we resize (actually only decrease) the tags, the
> > > dispatching space (hw tags) is decreased too. That means the actual usable
> > > device tag space has to shrink a lot as well.
> >
> > I think the solution here is to handle it differently. Previously, we had
> > requests and tags independent. That meant that we could have an
> > independent set of requests for scheduling, then assign tags as we need
> > to dispatch them to hardware. This is how the old schedulers worked, and
> > with the scheduler tags, this is how the new blk-mq scheduling works as
> > well.
> >
> > Once you start treating them as one space again, we run into this issue.
> > I can think of two solutions:
> >
> > 1) Keep our current split, so we can schedule independently of hardware
> > tags.
>
> Actually the hw tag depends on the scheduler tag as I explained, so the two
> aren't independent, even though they look split.
>
> Also I am not sure how we would do that if we need to support scheduling
> with hw tags; could you explain it a bit?
>
> >
> > 2) Throttle the queue depth independently. If the user asks for a depth
> > of, e.g., 32, retain a larger set of requests but limit the queue depth
> > on the device side to 32.
>
> If I understand correctly, we can support scheduling with hw tags via this
> patchset plus setting q->nr_requests to the size of the hw tag space (see
> the sketch below). Otherwise it isn't easy to throttle the queue depth
> independently, because the hw tag actually depends on the scheduler tag.
>
> The 3rd option is to follow Omar's suggestion and simply resize the queue
> depth to q->nr_requests.
>
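
Something like the following is what I mean by keeping nr_requests in sync
(untested sketch; the helper name and where it hooks in, e.g. the elevator
switch path, are just for illustration):

	static void blk_mq_sched_sync_nr_requests(struct request_queue *q,
						  struct blk_mq_hw_ctx *hctx)
	{
		/* scheduling on hw tags: report the depth we really have */
		if (hctx->flags & BLK_MQ_F_SCHED_USE_HW_TAG)
			q->nr_requests = hctx->tags->nr_tags;
	}
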
> >
> > This is much easier to support with split hardware and scheduler tags...
> >
> > >>> Secondly, when there are enough tags available, it might hurt
> > >>> performance if we don't use them all.
> > >>
> > >> That's mostly bogus. Crazy large tag depths have only one use case -
> > >> synthetic peak performance benchmarks from manufacturers. We don't want
> > >> to allow really deep queues. Nothing good comes from that, just a lot of
> > >> pain and latency issues.
> > >
> > > Given that the device provides such a high queue depth, it might be
> > > reasonable to just allow using it all. For example with NVMe, once the mq
> > > scheduler is enabled, the actual size of the device tag space is just 256
> > > by default, even though the hardware provides a very big tag space (>= 10K).
> >
> > Correct.
> >
> > > The problem is that the lifetime of the sched tag is the same as the
> > > request's lifetime (from submission to completion), and it covers the
> > > lifetime of the device tag. In theory the sched tag could be freed just
> > > after the rq is dispatched to the driver. Unfortunately we can't do that,
> > > because the request is allocated from the sched tag set.
> >
> > Yep
>
> Request copying, like what the first post of your mq scheduler patches did,
> may fix this issue, and that way we can make the two tag spaces independent,
> but at extra cost. What do you think about this approach?
Or something like the following (I am on a trip now, and it is totally
untested)?
---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8f72f16b498a..ce6bb849a395 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -855,6 +855,17 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
+static void blk_mq_replace_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
+{
+	int sched_tag = rq->internal_tag;
+	struct request *rq_for_replacement = hctx->tags->static_rqs[rq->tag];
+
+	hctx->sched_tags->static_rqs[sched_tag] = rq_for_replacement;
+
+	blk_mq_put_tag(hctx, hctx->sched_tags, rq->mq_ctx, sched_tag);
+	rq->internal_tag = -1;
+}
+
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			   bool wait)
 {
@@ -878,7 +889,9 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
+
+		blk_mq_replace_rq(data.hctx, rq);
 	}
 
 done:
@@ -887,38 +900,6 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 	return rq->tag != -1;
 }
 
-static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
-				    struct request *rq)
-{
-	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
-	rq->tag = -1;
-
-	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
-		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
-		atomic_dec(&hctx->nr_active);
-	}
-}
-
-static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx,
-				       struct request *rq)
-{
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	__blk_mq_put_driver_tag(hctx, rq);
-}
-
-static void blk_mq_put_driver_tag(struct request *rq)
-{
-	struct blk_mq_hw_ctx *hctx;
-
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
-	__blk_mq_put_driver_tag(hctx, rq);
-}
-
 /*
  * If we fail getting a driver tag because all the driver tags are already
  * assigned and on the dispatch list, BUT the first entry does not have a
@@ -1041,7 +1022,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 			queued++;
 			break;
 		case BLK_MQ_RQ_QUEUE_BUSY:
-			blk_mq_put_driver_tag_hctx(hctx, rq);
 			list_add(&rq->queuelist, list);
 			__blk_mq_requeue_request(rq);
 			break;
@@ -1069,7 +1049,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 		 * tag for the next request already, free it again.
 		 */
 		rq = list_first_entry(list, struct request, queuelist);
-		blk_mq_put_driver_tag(rq);
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
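
The idea, in case it isn't clear from the diff: once a driver tag is
assigned, the pre-allocated request backing that driver tag is unused (the
in-flight request came from sched_tags->static_rqs), so blk_mq_replace_rq()
donates it to the sched tag slot we are about to free. The in-flight request
then lives on under the driver tag, and the sched tag is recycled at
dispatch time instead of at completion time, which is what makes the two
tag spaces effectively independent.

If I read the completion path right, __blk_mq_finish_request() should
already cope with rq->internal_tag being -1 at that point, since it only
frees whichever tags are still valid, roughly (from memory, please
double-check):

	if (rq->tag != -1)
		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
	if (sched_tag != -1)
		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);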
Thanks,
Ming