[PATCH v5 1/2] blk-mq: add tagset quiesce interface

Mon Jul 27 22:23:15 EDT 2020

On 7/27/20 8:17 PM, Ming Lei wrote:
> On Mon, Jul 27, 2020 at 07:51:16PM -0600, Jens Axboe wrote:
>> On 7/27/20 7:40 PM, Ming Lei wrote:
>>> On Mon, Jul 27, 2020 at 04:10:21PM -0700, Sagi Grimberg wrote:
>>>> Drivers that have shared tagsets may need to quiesce potentially a lot
>>>> of request queues that all share a single tagset (e.g. nvme). Add an interface
>>>> to quiesce all the queues on a given tagset. This interface is useful because
>>>> it can speed up the quiesce by doing it in parallel.
>>>>
>>>> For tagsets that have BLK_MQ_F_BLOCKING set, we issue call_srcu on all hctxs
>>>> in parallel such that all of them wait for the same rcu elapsed period with
>>>> a per-hctx heap-allocated rcu_synchronize. For tagsets that don't have
>>>> BLK_MQ_F_BLOCKING set, we simply call a single synchronize_rcu as this is
>>>> sufficient.
>>>>
>>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>>> ---
>>>>  block/blk-mq.c         | 66 ++++++++++++++++++++++++++++++++++++++++++
>>>>  include/linux/blk-mq.h |  4 +++
>>>>  2 files changed, 70 insertions(+)
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index abcf590f6238..c37e37354330 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -209,6 +209,42 @@ void blk_mq_quiesce_queue_nowait(struct request_queue *q)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue_nowait);
>>>>  
>>>> +static void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
>>>> +{
>>>> +	struct blk_mq_hw_ctx *hctx;
>>>> +	unsigned int i;
>>>> +
>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>> +
>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>> +		WARN_ON_ONCE(!(hctx->flags & BLK_MQ_F_BLOCKING));
>>>> +		hctx->rcu_sync = kmalloc(sizeof(*hctx->rcu_sync), GFP_KERNEL);
>>>> +		if (!hctx->rcu_sync)
>>>> +			continue;
>>>
>>> This approach of quiesce/unquiesce tagset is a good abstraction.
>>>
>>> Just one more thing, please allocate a rcu_sync array because hctx is
>>> not supposed to store scratch stuff.
>>
>> I'd be all for not stuffing this in the hctx, but how would that work?
>> The only thing I can think of that would work reliably is batching the
>> queue+wait into units of N. We could potentially have many thousands of
>> queues, and it could get iffy (and/or unreliable) in terms of allocation
>> size. Looks like rcu_synchronize is 48 bytes on my local install, and it
>> doesn't take a lot of devices at current CPU counts to make an alloc
>> covering all of it huge. Let's say 64 threads, and 32 devices, then
>> we're already at 64*32*48 bytes which is an order 5 allocation. Not
>> friendly, and not going to be reliable when you need it. And if we start
>> batching in reasonable counts, then we're _almost_ back to doing a queue
>> or two at a time... 32 * 48 is 1536 bytes, so we could only do two at
>> a time for single page allocations.
> 
> We can convert to order 0 allocations with one extra indirect array.

I guess that could work, and would just be one extra alloc + free if we
still retain the batch. That'd take it to 16 devices (at 32 CPUs) per
round, potentially way less of course if we have more CPUs. So still
somewhat limiting, rather than doing them all at once.
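
For reference, the sizing arithmetic in this thread can be sketched roughly
as below. This is only a back-of-the-envelope userspace model, not kernel
code: the helper names are made up for illustration, and it assumes 4 KiB
pages, 8-byte pointers, one hctx per CPU, and the 48-byte rcu_synchronize
size quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) covers 'bytes'. */
static unsigned int alloc_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;

	assert(page_size > 0);
	while ((page_size << order) < bytes)
		order++;
	return order;
}

/*
 * Queues handled per round if the per-hctx objects themselves are
 * packed into a single page (hypothetical "direct" layout).
 */
static size_t queues_per_page_direct(size_t page_size, size_t nr_hctx,
				     size_t obj_size)
{
	assert(nr_hctx > 0 && obj_size > 0);
	return page_size / (nr_hctx * obj_size);
}

/*
 * Queues handled per round if a single page only holds pointers to
 * separately allocated per-hctx objects (hypothetical "indirect" layout).
 */
static size_t queues_per_page_indirect(size_t page_size, size_t nr_hctx,
				       size_t ptr_size)
{
	assert(nr_hctx > 0 && ptr_size > 0);
	return (page_size / ptr_size) / nr_hctx;
}
```

Under those assumptions, alloc_order(64 * 32 * 48, 4096) comes out at 5,
queues_per_page_direct(4096, 32, 48) at 2, and
queues_per_page_indirect(4096, 32, 8) at 16, which lines up with the
numbers above.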

-- 
Jens Axboe