[PATCH] blk-mq: avoid to hang in the cpuhp offline handler

John Garry john.garry at huawei.com
Thu Sep 22 01:47:09 PDT 2022


On 20/09/2022 03:17, Ming Lei wrote:
> To avoid triggering IO timeouts when one hctx becomes inactive, we
> drain IOs when all CPUs of one hctx are offline. However, a driver's
> timeout handler may require cpus_read_lock; for example, in nvme-pci,
> pci_alloc_irq_vectors_affinity() is called in the reset context, and
> irq_build_affinity_masks() needs cpus_read_lock().
> 
> Meanwhile, when blk-mq's cpuhp offline handler is called, cpus_write_lock
> is held, so a deadlock results.
> 
> Fix the issue by breaking out of the wait loop once enough time has
> elapsed; any in-flight IOs that were not drained can still be handled
> by the timeout handler.
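
[For context, the deadlock described in the commit message can be
sketched as follows - a pseudocode sketch only, with the call chains
taken from the text above:]

```
/*
 * CPU hotplug (offline) path:          nvme-pci reset path:
 *   cpus_write_lock()                    timeout handler runs
 *   blk_mq_hctx_notify_offline()           pci_alloc_irq_vectors_affinity()
 *     while (blk_mq_hctx_has_requests())     irq_build_affinity_masks()
 *       msleep(5);  // waits for drain         cpus_read_lock()
 *                                              // blocks on the held
 *                                              // cpus_write_lock
 *
 * The drain loop cannot finish until the timed-out IO completes, and
 * the timeout handler cannot complete while cpus_write_lock is held:
 * a lock-ordering deadlock.
 */
```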

I don't think that this is a good idea - that is because often 
drivers cannot safely handle the scenario of a timeout for an IO which 
has actually completed. The NVMe timeout handler may poll for 
completion, but SCSI does not.

Indeed, if we were to allow the timeout handler to handle these 
in-flight IOs then there would be no point in having this hotplug 
handler in the first place.

> 
> Cc: linux-nvme at lists.infradead.org
> Reported-by: Yi Zhang <yi.zhang at redhat.com>
> Fixes: bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are offline")
> Signed-off-by: Ming Lei <ming.lei at redhat.com>
> ---
>   block/blk-mq.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c96c8c4f751b..4585985b8537 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3301,6 +3301,7 @@ static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
>   	return true;
>   }
>   
> +#define BLK_MQ_MAX_OFFLINE_WAIT_MSECS 3000

That's very low compared to the default SCSI sd timeout of 30 seconds.

>   static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>   {
>   	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> @@ -3326,8 +3327,13 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>   	 * frozen and there are no requests.
>   	 */
>   	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
> -		while (blk_mq_hctx_has_requests(hctx))
> +		unsigned int wait_ms = 0;
> +
> +		while (blk_mq_hctx_has_requests(hctx) && wait_ms <
> +				BLK_MQ_MAX_OFFLINE_WAIT_MSECS) {
>   			msleep(5);
> +			wait_ms += 5;
> +		}
>   		percpu_ref_put(&hctx->queue->q_usage_counter);
>   	}
>   

Thanks,
John
