[PATCH V4 resend] lib/group_cpus.c: avoid to acquire cpu hotplug lock in group_cpus_evenly

Wed Dec 6 16:41:44 PST 2023

Hi Ming,

On Mon, Nov 20, 2023 at 04:35:59PM +0800, Ming Lei wrote:
> group_cpus_evenly() could be part of storage driver's error handler,
> such as nvme driver, when may happen during CPU hotplug, in which
> storage queue has to drain its pending IOs because all CPUs associated
> with the queue are offline and the queue is becoming inactive. And
> handling IO needs error handler to provide forward progress.
> 
> Then dead lock is caused:
> 
> 1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
> handler is waiting for inflight IO
> 
> 2) error handler is waiting for CPU hotplug lock
> 
> 3) inflight IO can't be completed in blk-mq's CPU hotplug handler because
> error handling can't provide forward progress.
> 
> Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
> in which two stage spreads are taken: 1) the 1st stage is over all present
> CPUs; 2) the end stage is over all other CPUs.
> 
> Turns out the two stage spread just needs consistent 'cpu_present_mask', and
> remove the CPU hotplug lock by storing it into one local cache. This way
> doesn't change correctness, because all CPUs are still covered.
> 
> Cc: Keith Busch <kbusch at kernel.org>
> Cc: linux-nvme at lists.infradead.org
> Cc: linux-block at vger.kernel.org
> Reported-by: Yi Zhang <yi.zhang at redhat.com>
> Reported-by: Guangwu Zhang <guazhang at redhat.com>
> Tested-by: Guangwu Zhang <guazhang at redhat.com>
> Reviewed-by: Chengming Zhou <zhouchengming at bytedance.com>
> Reviewed-by: Jens Axboe <axboe at kernel.dk>
> Signed-off-by: Ming Lei <ming.lei at redhat.com>
> ---
>  lib/group_cpus.c | 22 ++++++++++++++++------
>  1 file changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/group_cpus.c b/lib/group_cpus.c
> index aa3f6815bb12..ee272c4cefcc 100644
> --- a/lib/group_cpus.c
> +++ b/lib/group_cpus.c
> @@ -366,13 +366,25 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  	if (!masks)
>  		goto fail_node_to_cpumask;
>  
> -	/* Stabilize the cpumasks */
> -	cpus_read_lock();
>  	build_node_to_cpumask(node_to_cpumask);
>  
> +	/*
> +	 * Make a local cache of 'cpu_present_mask', so the two stages
> +	 * spread can observe consistent 'cpu_present_mask' without holding
> +	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
> +	 * hotplug code.
> +	 *
> +	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
> +	 * we can live with the case because it only affects that hotplug
> +	 * CPU is handled in the 1st or 2nd stage, and either way is correct
> +	 * from API user viewpoint since 2-stage spread is sort of
> +	 * optimization.
> +	 */
> +	cpumask_copy(npresmsk, data_race(cpu_present_mask));

Now that you initialize the npresmsk explicitly, you can allocate it
using alloc_cpumask_var().

The same actually holds for nmsk too, and even before this patch. Maybe
fix it in a separate prepending patch?

> +
>  	/* grouping present CPUs first */
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
> -				  cpu_present_mask, nmsk, masks);
> +				  npresmsk, nmsk, masks);
>  	if (ret < 0)
>  		goto fail_build_affinity;
>  	nr_present = ret;
> @@ -387,15 +399,13 @@ struct cpumask *group_cpus_evenly(unsigned int numgrps)
>  		curgrp = 0;
>  	else
>  		curgrp = nr_present;
> -	cpumask_andnot(npresmsk, cpu_possible_mask, cpu_present_mask);
> +	cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk);
>  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
>  				  npresmsk, nmsk, masks);

The first thing the helper does is checking if nprepmask is empty.
cpumask_andnot() returns false in that case. So, assuming that present
cpumask in the previous call can't be empty, we can save few cycles if
drop corresponding check in the helper and do like this:

	if (cpumask_andnot(npresmsk, cpu_possible_mask, npresmsk) == 0) {
                nr_others = 0;
                goto fail_build_affinity;
        }

  	ret = __group_cpus_evenly(curgrp, numgrps, node_to_cpumask,
  				  npresmsk, nmsk, masks);

Although, it's not related to this patch directly. So, if you fix
zalloc_cpumask_var(), the patch looks good to me.

Reviewed-by: Yury Norov <yury.norov at gmail.com>

>  	if (ret >= 0)
>  		nr_others = ret;
>  
>   fail_build_affinity:
> -	cpus_read_unlock();
> -
>  	if (ret >= 0)
>  		WARN_ON(nr_present + nr_others < numgrps);
>  
> -- 
> 2.41.0
>