[PATCH v5 1/2] blk-mq: add tagset quiesce interface

Sagi Grimberg sagi at grimberg.me
Mon Jul 27 23:29:43 EDT 2020


>>>>>>> +static void blk_mq_quiesce_blocking_queue_async(struct request_queue *q)
>>>>>>> +{
>>>>>>> +	struct blk_mq_hw_ctx *hctx;
>>>>>>> +	unsigned int i;
>>>>>>> +
>>>>>>> +	blk_mq_quiesce_queue_nowait(q);
>>>>>>> +
>>>>>>> +	queue_for_each_hw_ctx(q, hctx, i) {
>>>>>>> +		WARN_ON_ONCE(!(hctx->flags & BLK_MQ_F_BLOCKING));
>>>>>>> +		hctx->rcu_sync = kmalloc(sizeof(*hctx->rcu_sync), GFP_KERNEL);
>>>>>>> +		if (!hctx->rcu_sync)
>>>>>>> +			continue;
>>>>>>
>>>>>> This approach of quiesce/unquiesce tagset is good abstraction.
>>>>>>
>>>>>> Just one more thing: please allocate an rcu_sync array, because the
>>>>>> hctx is not supposed to store scratch data.
>>>>>
>>>>> I'd be all for not stuffing this in the hctx, but how would that work?
>>>>> The only thing I can think of that would work reliably is batching the
>>>>> queue+wait into units of N. We could potentially have many thousands of
>>>>> queues, and it could get iffy (and/or unreliable) in terms of allocation
>>>>> size. Looks like struct rcu_synchronize is 48 bytes on my local
>>>>> install, and it doesn't take a lot of devices at current CPU counts to
>>>>> make an allocation covering all of it huge. Let's say 64 threads and
>>>>> 32 devices: then we're already at 64*32*48 bytes, which is an order-5
>>>>> allocation. Not friendly, and not going to be reliable when you need
>>>>> it. And if we start batching in reasonable counts, then we're _almost_
>>>>> back to doing a queue or two at a time... 32 * 48 is 1536 bytes, so we
>>>>> could only do two at a time with single-page allocations.
>>>>
>>>> We can convert this to an order-0 allocation with one extra indirect
>>>> array.
>>>
>>> I guess that could work, and it would just be one extra alloc + free if
>>> we still retain the batch. That'd take it to 16 devices (at 32 CPUs) per
>>> round, potentially way less of course if we have more CPUs. So still
>>> somewhat limiting, rather than doing it all at once.
>>
>> With the approach used in blk_mq_alloc_rqs(), each allocated page can be
>> added to a list, so the indirect array can be avoided. Then it is
>> possible to allocate for any number of queues/devices, since each
>> allocation is just a single page, made only when one is needed, and no
>> pre-calculation is required.
> 
> As long as we watch the complexity; I don't think we need to go overboard
> here at the risk of adding issues in the failure path.

No, we don't, and I'd prefer not to do it. If this turns out to be that
bad, we can later convert it to a more complicated page vector.

I'll move forward with this approach.



More information about the Linux-nvme mailing list