[PATCH 0/3] improve nvme quiesce time for large amount of namespaces

Wed Oct 12 19:06:56 PDT 2022

On 2022/10/13 9:37, Chao Leng wrote:
> 
> 
> On 2022/10/12 19:13, Sagi Grimberg wrote:
>>
>>
>> On 10/12/22 11:43, Chao Leng wrote:
>>> Add Ming Lei.
>>>
>>> On 2022/10/12 14:37, Sagi Grimberg wrote:
>>>>
>>>>>> On Sun, Jul 31, 2022 at 01:23:36PM +0300, Sagi Grimberg wrote:
>>>>>>> But maybe we can avoid that, and because we allocate
>>>>>>> the connect_q ourselves, and fully know that it should
>>>>>>> not be apart of the tagset quiesce, perhaps we can introduce
>>>>>>> a new interface like:
>>>>>>> -- 
>>>>>>> static inline int nvme_ctrl_init_connect_q(struct nvme_ctrl *ctrl)
>>>>>>> {
>>>>>>>     ctrl->connect_q = blk_mq_init_queue_self_quiesce(ctrl->tagset);
>>>>>>>     if (IS_ERR(ctrl->connect_q))
>>>>>>>         return PTR_ERR(ctrl->connect_q);
>>>>>>>     return 0;
>>>>>>> }
>>>>>>> -- 
>>>>>>>
>>>>>>> And then blk_mq_quiesce_tagset can simply look into a per request-queue
>>>>>>> self_quiesce flag and skip as needed.
>>>>>>
>>>>>> I'd just make that a queue flag set after allocation to keep the
>>>>>> interface simple, but otherwise this seems like the right thing
>>>>>> to do.
>>>>> Now the code used NVME_NS_STOPPED to avoid unpaired stop/start.
>>>>> If we use blk_mq_quiesce_tagset, It will cause the above mechanism to fail.
>>>>> I review the code, only pci can not ensure secure stop/start pairing.
>>>>> So there is a choice, We only use blk_mq_quiesce_tagset on fabrics, not PCI.
>>>>> Do you think that's acceptable?
>>>>> If that's acceptable, I will try to send a patch set.
>>>>
>>>> I don't think that this is acceptable. But I don't understand how
>>>> NVME_NS_STOPPED would change anything in the behavior of tagset-wide
>>>> quiesce?
>>> If use blk_mq_quiesce_tagset, it will quiesce all queues of all ns,
>>> but can not set NVME_NS_STOPPED of all ns. The mechanism of NVME_NS_STOPPED
>>> will be invalidated.
>>> NVMe-pci has very complicated quiesce/unquiesce use pattern, quiesce/unquiesce
>>> may be called unpaired.
>>> It will cause some backward. There may be some bugs in this scenario:
>>> A thread: quiesce the queue
>>> B thread: quiesce the queue
>>> A thread end, and does not unquiesce the queue.
>>> B thread: unquiesce the queue, and do something which need the queue must be unquiesed.
>>>
>>> Of course, I don't think it is a good choice to guarantee paired access through NVME_NS_STOPPED,
>>> there exist unexpected unquiesce and start queue too early.
>>> But now that the code has done so, the backward should be unacceptable.
>>> such as this scenario:
>>> A thread: quiesce the queue
>>> B thread: want to quiesce the queue but do nothing because NVME_NS_STOPPED is already set.
>>> A thread: unquiesce the queue
>>> Now the queue is unquiesced too early for B thread.
>>> B thread: do something which need the queue must be quiesced.
>>>
>>> Introduce NVME_NS_STOPPED link:
>>> https://lore.kernel.org/all/20211014081710.1871747-5-ming.lei@redhat.com/
>>
>> I think we can just change a ns flag to a controller flag ala:
>> NVME_CTRL_STOPPED, and then do:
>>
>> void nvme_stop_queues(struct nvme_ctrl *ctrl)
>> {
>>      if (!test_and_set_bit(NVME_CTRL_STOPPED, &ns->flags))
>>          blk_mq_quiesce_tagset(ctrl->tagset);
>> }
>> EXPORT_SYMBOL_GPL(nvme_stop_queues);
>>
>> void nvme_start_queues(struct nvme_ctrl *ctrl)
>> {
>>      if (test_and_clear_bit(NVME_CTRL_STOPPED, &ns->flags))
>>          blk_mq_unquiesce_tagset(ctrl->tagset);
>> }
>> EXPORT_SYMBOL_GPL(nvme_start_queues);
>>
>> Won't that achieve the same result?
> No, nvme_set_queue_dying call nvme_start_ns_queue for one ns.
> ctrl->flags can not cover this scenario.
> This still results in unpaired quiesce/unquiesce.
> 
> Maybe it is necessary to clean up the confused quiesce/unquiesce of nvme-pci,
> Thus blk_mq_quiesce_tagset can be easy to use for nvme-pci.
> Before that, we could only optimize batched quiesce/unquiesce based on ns,
> Although I also think using blk_mq_quiesce_tagset is a better choice.
We can do some something special for nvme_set_queue_dying.
Thus can achieve the same effect with NVME_NS_STOPPED.
This should look acceptable.

What do you think about this?

Show the code:

--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5100,13 +5100,14 @@ static void nvme_stop_ns_queue(struct nvme_ns *ns)
   * holding bd_butex.  This will end buffered writers dirtying pages that can't
   * be synced.
   */
-static void nvme_set_queue_dying(struct nvme_ns *ns)
+static void nvme_set_queue_dying(struct nvme_ns *ns, bool start_queue)
  {
         if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
                 return;

         blk_mark_disk_dead(ns->disk);
-       nvme_start_ns_queue(ns);
+       if (start_queue)
+               blk_mq_unquiesce_queue(ns->queue);

         set_capacity_and_notify(ns->disk, 0);
  }
@@ -5121,15 +5122,18 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
  void nvme_kill_queues(struct nvme_ctrl *ctrl)
  {
         struct nvme_ns *ns;
+       bool start_queue = false;

         down_read(&ctrl->namespaces_rwsem);

         /* Forcibly unquiesce queues to avoid blocking dispatch */
         if (ctrl->admin_q && !blk_queue_dying(ctrl->admin_q))
                 nvme_start_admin_queue(ctrl);
+        if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
+               start_queue = true;

         list_for_each_entry(ns, &ctrl->namespaces, list)
-               nvme_set_queue_dying(ns);
+               nvme_set_queue_dying(ns, start_queue);

         up_read(&ctrl->namespaces_rwsem);
  }

> 
> .