[PATCH 0/3] improve nvme quiesce time for large amount of namespaces

Chao Leng lengchao at huawei.com
Wed Oct 12 18:37:35 PDT 2022



On 2022/10/12 19:13, Sagi Grimberg wrote:
> 
> 
> On 10/12/22 11:43, Chao Leng wrote:
>> Add Ming Lei.
>>
>> On 2022/10/12 14:37, Sagi Grimberg wrote:
>>>
>>>>> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>>>>>> But maybe we can avoid that: because we allocate
>>>>>> the connect_q ourselves, and fully know that it should
>>>>>> not be a part of the tagset quiesce, perhaps we can introduce
>>>>>> a new interface like:
>>>>>> -- 
>>>>>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>>>>>> {
>>>>>>     ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>>>>>>     if (IS_ERR(ctrl->connect_q))
>>>>>>         return PTR_ERR(ctrl->connect_q);
>>>>>>     return 0;
>>>>>> }
>>>>>> -- 
>>>>>>
>>>>>> And then blk_mq_quiesce_tagset can simply look into a per request-queue
>>>>>> self_quiesce flag and skip as needed.
>>>>>
>>>>> I'd just make that a queue flag set after allocation to keep the
>>>>> interface simple, but otherwise this seems like the right thing
>>>>> to do.
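>>>>>
>>>>> IOW, keep blk_mq_init_queue as-is and just do something like
>>>>> (flag name made up for illustration):
>>>>> -- 
>>>>> ctrl->connect_q = blk_mq_init_queue(ctrl->tagset);
>>>>> if (IS_ERR(ctrl->connect_q))
>>>>>     return PTR_ERR(ctrl->connect_q);
>>>>> blk_queue_flag_set(QUEUE_FLAG_SELF_QUIESCE, ctrl->connect_q);
>>>>> -- 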
>>>> The current code uses NVME_NS_STOPPED to avoid unpaired stop/start.
>>>> If we switch to blk_mq_quiesce_tagset, that mechanism no longer works.
>>>> I reviewed the code; only nvme-pci cannot guarantee that stop/start
>>>> are safely paired. So one option is to use blk_mq_quiesce_tagset only
>>>> for fabrics, not for PCI. Do you think that's acceptable?
>>>> If so, I will try to send a patch set.
>>>
>>> I don't think that this is acceptable. But I don't understand how
>>> NVME_NS_STOPPED would change anything in the behavior of tagset-wide
>>> quiesce?
>> If we use blk_mq_quiesce_tagset, it quiesces all queues of all
>> namespaces, but it cannot set NVME_NS_STOPPED on each namespace, so the
>> NVME_NS_STOPPED mechanism is bypassed.
>> nvme-pci has a very complicated quiesce/unquiesce usage pattern;
>> quiesce/unquiesce may be called unpaired.
>> This would be a regression; there may be bugs in a scenario like this:
>> A thread: quiesces the queue.
>> B thread: quiesces the queue.
>> A thread ends, and does not unquiesce the queue.
>> B thread: unquiesces the queue, and does something that requires the
>> queue to be unquiesced.
>>
>> Of course, I don't think guaranteeing paired access through
>> NVME_NS_STOPPED is a good design either; it allows an unexpected
>> unquiesce to start the queue too early.
>> But now that the code depends on it, a regression would be
>> unacceptable. For example, in this scenario:
>> A thread: quiesces the queue.
>> B thread: wants to quiesce the queue, but does nothing because
>> NVME_NS_STOPPED is already set.
>> A thread: unquiesces the queue.
>> Now the queue is unquiesced too early for B thread.
>> B thread: does something that requires the queue to be quiesced.
>>
>> NVME_NS_STOPPED was introduced here:
>> https://lore.kernel.org/all/20211014081710.1871747-5-ming.lei@redhat.com/
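>>
>> For reference, that patch makes the per-ns stop/start idempotent
>> roughly like this (quoting from memory, details may differ):
>> -- 
>> static void nvme_stop_ns_queue(struct nvme_ns *ns)
>> {
>>     if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
>>         blk_mq_quiesce_queue(ns->queue);
>> }
>>
>> static void nvme_start_ns_queue(struct nvme_ns *ns)
>> {
>>     if (test_and_clear_bit(NVME_NS_STOPPED, &ns->flags))
>>         blk_mq_unquiesce_queue(ns->queue);
>> }
>> -- 
>> The test_and_set_bit() is exactly what turns B thread's quiesce into
>> a no-op in the second scenario above.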
> 
> I think we can just change the ns flag to a controller flag, e.g.
> NVME_CTRL_STOPPED, and then do:
> 
> void nvme_stop_queues(struct nvme_ctrl *ctrl)
> {
>      if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>          blk_mq_quiesce_tagset(ctrl->tagset);
> }
> EXPORT_SYMBOL_GPL(nvme_stop_queues);
> 
> void nvme_start_queues(struct nvme_ctrl *ctrl)
> {
>      if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>          blk_mq_unquiesce_tagset(ctrl->tagset);
> }
> EXPORT_SYMBOL_GPL(nvme_start_queues);
> 
> Won't that achieve the same result?
No, nvme_set_queue_dying calls nvme_start_ns_queue for a single ns;
a controller-level flag cannot cover that scenario.
It still results in unpaired quiesce/unquiesce.
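
Concretely, the dying path looks roughly like this (simplified from
the nvme core of this time frame; details may differ):
-- 
static void nvme_set_queue_dying(struct nvme_ns *ns)
{
    if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
        return;

    blk_mark_disk_dead(ns->disk);
    /* per-ns unquiesce: a controller-wide flag cannot track this */
    nvme_start_ns_queue(ns);

    set_capacity_and_notify(ns->disk, 0);
}
-- 
A single NVME_CTRL_STOPPED bit cannot express "this one ns is
unquiesced while the others stay quiesced".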

Maybe it is necessary to first clean up the confusing quiesce/unquiesce
usage in nvme-pci, so that blk_mq_quiesce_tagset becomes easy to use
there as well. Until then, we can only optimize batched
quiesce/unquiesce per namespace, although I also think that using
blk_mq_quiesce_tagset is the better choice.
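
The ns-based batching I have in mind is roughly (untested sketch;
assumes all nvme queues are non-blocking, so a single RCU grace
period can replace the per-queue synchronous wait):
-- 
void nvme_stop_queues(struct nvme_ctrl *ctrl)
{
    struct nvme_ns *ns;

    down_read(&ctrl->namespaces_rwsem);
    list_for_each_entry(ns, &ctrl->namespaces, list)
        if (!test_and_set_bit(NVME_NS_STOPPED, &ns->flags))
            blk_mq_quiesce_queue_nowait(ns->queue);
    up_read(&ctrl->namespaces_rwsem);

    /* one grace period for all namespaces instead of one each */
    synchronize_rcu();
}
-- 
This keeps the NVME_NS_STOPPED pairing intact while cutting the
quiesce time from one grace period per namespace down to one total.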


