[PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

Bart Van Assche Bart.VanAssche at wdc.com
Mon Jan 29 13:22:13 PST 2018


On Mon, 2018-01-29 at 15:33 -0500, Mike Snitzer wrote:
> +		 * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> +		 * bit is set, run queue after 10ms to avoid IO stalls
> +		 * that could otherwise occur if the queue is idle.
>  		 */
> -		if (!blk_mq_sched_needs_restart(hctx) ||
> +		needs_restart = blk_mq_sched_needs_restart(hctx);
> +		if (!needs_restart ||
>  		    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>  			blk_mq_run_hw_queue(hctx, true);
> +		else if (needs_restart && (ret == BLK_STS_RESOURCE))
> +			blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
>  	}

The above code only calls blk_mq_delay_run_hw_queue() if the following condition
is met:

	!(!needs_restart ||
	  (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) &&
	(needs_restart && (ret == BLK_STS_RESOURCE))

That boolean expression can be simplified into the following:

	needs_restart && ret == BLK_STS_RESOURCE &&
	!(no_tag && list_empty_careful(&hctx->dispatch_wait.entry))

From blk-mq-sched.h:

static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
{
	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}
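
For completeness, the simplification is just De Morgan's law. Writing E for
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)), the condition reads

	!(!needs_restart || E) && (needs_restart && ret == BLK_STS_RESOURCE)
	== needs_restart && !E && needs_restart && ret == BLK_STS_RESOURCE
	== needs_restart && ret == BLK_STS_RESOURCE && !E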

In other words, the above code will not call blk_mq_delay_run_hw_queue() if
BLK_MQ_S_SCHED_RESTART is cleared after it was set and before the needs_restart
check is reached. The BLK_MQ_S_SCHED_RESTART bit gets cleared if a request
completes concurrently with the above code. From blk-mq-sched.c:

static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return false;

	if (hctx->flags & BLK_MQ_F_TAG_SHARED) {
		struct request_queue *q = hctx->queue;

		if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
			atomic_dec(&q->shared_hctx_restart);
	} else
		clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

	return blk_mq_run_hw_queue(hctx, true);
}

The above blk_mq_run_hw_queue() call may finish either before or after
blk_mq_dispatch_rq_list() examines the BLK_MQ_S_SCHED_RESTART flag.
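
To make that concrete, here is one interleaving (a sketch of how I read the
code, not an observed trace) in which the delayed run is skipped:

	blk_mq_dispatch_rq_list()               request completion path
	-------------------------               -----------------------
	.queue_rq() returns BLK_STS_RESOURCE
	                                        blk_mq_sched_restart_hctx()
	                                          clears BLK_MQ_S_SCHED_RESTART
	                                          blk_mq_run_hw_queue(hctx, true)
	needs_restart = blk_mq_sched_needs_restart(hctx)
	  -> false, so the plain blk_mq_run_hw_queue() branch is taken and
	     blk_mq_delay_run_hw_queue() is never called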

That's why I wrote in previous e-mails that this patch, like its previous
versions, changes a systematic call of blk_mq_delay_run_hw_queue() into a call
that may or may not happen, depending on whether or not a request completes
concurrently with request queueing. Sorry, but I think that means that the above
change, combined with changing BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE, is
wrong, and it is why I expect that it will result in sporadic queue stalls. As
you know, sporadic queue stalls are annoying and hard to debug.
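
For reference, the driver-side pattern I am calling "systematic" looks roughly
like this (an illustrative sketch only, not any in-tree driver's code;
my_device_busy() and my_submit() are made-up helpers):

static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	struct request *rq = bd->rq;

	if (my_device_busy(hctx)) {
		/*
		 * The delayed re-run is scheduled unconditionally, so it does
		 * not depend on BLK_MQ_S_SCHED_RESTART or on a request
		 * completing at just the right time.
		 */
		blk_mq_delay_run_hw_queue(hctx, 100 /*ms*/);
		return BLK_STS_RESOURCE;
	}

	blk_mq_start_request(rq);
	return my_submit(rq);
}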

Plenty of e-mails have already been exchanged about versions one to four of this
patch, but so far nobody has even tried to contradict the above.

Bart.

