[PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

Mon Jan 29 19:16:33 PST 2018

On Tue, Jan 30, 2018 at 5:22 AM, Bart Van Assche <Bart.VanAssche at wdc.com> wrote:
> On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
>> +              * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
>> +              * bit is set, run queue after 10ms to avoid IO stalls
>> +              * that could otherwise occur if the queue is idle.
>>                */
>> -             if (!blk_mq_sched_needs_restart(hctx) ||
>> +             needs_restart = blk_mq_sched_needs_restart(hctx);
>> +             if (!needs_restart ||
>>                   (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>                       blk_mq_run_hw_queue(hctx, true);
>> +             else if (needs_restart && (ret == BLK_STS_RESOURCE))
>> +                     blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>>       }
>
> The above code only calls blk_mq_delay_run_hw_queue() if the following condition
> is met: !(!needs_restart || (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> && (needs_restart && (ret == BLK_STS_RESOURCE)). That boolean expression can be
> simplified into the following: needs_restart && ret == BLK_STS_RESOURCE &&
> !(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)). From blk-mq-sched.h:

No, the expression of  (needs_restart && (ret == BLK_STS_RESOURCE)) clearly
expresses what we want to do.

When RESTART is working, and meantime BLK_STS_RESOURCE is returned
from driver, we need to avoid IO hang by blk_mq_delay_run_hw_queue().

>
> static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
> {
>         return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> }
>
> In other words, the above code will not call blk_mq_delay_run_hw_queue() if
> BLK_MQ_S_SCHED_RESTART is cleared after it got set and before the above code is
> reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request completes
> concurrently with the above code. From blk-mq-sched.c:

That won't be an issue, given any time the SCHED_RESTART is cleared, the queue
will be run again, so we needn't to worry about that.

>
> static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
> {
>         if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>                 return false;
>
>         if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
>                 struct request_queue *q = hctx->queue;
>
>                 if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
>                         atomic_dec(&q->shared_hctx_restart);
>         } else
>                 clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>
>         return blk_mq_run_hw_queue(hctx, true);
> }
>
> The above blk_mq_run_hw_queue() call may finish either before or after

If the above blk_mq_run_hw_queue() finishes before blk_mq_dispatch_rq_list()
examines the flag, blk_mq_dispatch_rq_list() will see the flag cleared, and run
queue again by the following code branch:

                if (!blk_mq_sched_needs_restart(hctx) ||
                    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                        blk_mq_run_hw_queue(hctx, true);

If blk_mq_run_hw_queue() finishes after blk_mq_dispatch_rq_list() examines
the flag, this blk_mq_run_hw_queue() will see the new added request, and
still everything is fine. Even blk_mq_delay_run_hw_queue() can be started
too.

> blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.
>
> That's why I wrote in previous e-mails that this patch and also the previous
> versions of this patch change a systematic call of blk_mq_delay_run_hw_queue()
> into a call that may or may not happen, depending on whether or not a request
> completes concurrently with request queueing. Sorry but I think that means
> that the above change combined with changing BLK_STS_RESOURCE into
> BLK_STS_DEV_RESOURCE is wrong and why I expect that this will result in
> sporadic queue stalls. As you know sporadic queue stalls are annoying and hard
> to debug.

Now there is debugfs, it isn't difficult to deal with that if
reporter'd like to cowork.

>
> Plenty of e-mails have already been exchanged about versions one to four of this
> patch but so far nobody has even tried to contradict the above.

No, I don't see the issue, let's revisit the main change again:

+       needs_restart = blk_mq_sched_needs_restart(hctx);
+       if (!needs_restart ||
            (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
            blk_mq_run_hw_queue(hctx, true);
+       else if (needs_restart && (ret == BLK_STS_RESOURCE))
+           blk_mq_delay_run_hw_queue(hctx, 10);

If SCHED_RESTART isn't set, queue is run immediately, otherwise when
BLK_STS_RESOURCE is returned, we avoid IO hang by
blk_mq_delay_run_hw_queue(hctx, XXX).

And we don't cover  BLK_STS_DEV_RESOURCE above because it is documented
clearly that  BLK_STS_DEV_RESOURCE is only returned by driver iff queue will be
rerun in future if resource is available.

Is there  any hole in above code?

Thanks,
Ming Lei