[PATCH v1 0/5] Fix keep-alive mechanism for fabrics

Sagi Grimberg sagi at grimberg.me
Thu Apr 12 05:29:07 PDT 2018



On 04/11/2018 07:48 PM, James Smart wrote:
> On 4/11/2018 7:40 AM, Max Gurtovoy wrote:
>>
>>
>> On 4/11/2018 5:07 PM, Sagi Grimberg wrote:
>>>
>>>>> I don't understand what you mean by a "property of the admin queue",
>>>>> there is no such definition in the spec. In any event, the keep alive
>>>>> mechanism was defined to detect host/controller health on a periodic
>>>>> basis.
>>>>>
>>>>> The spec clearly states that the host needs to send keep alive at
>>>>> a faster rate than the Keep alive timeout accounting things such
>>>>> as transport roundtrip times, transport delays, keep alive timer
>>>>> granularity and so on.
>>>>>
>>>>> To me, this is a clear case where your use-case requires a larger
>>>>> keep alive timeout. I'm not sure why sprinkling the keep alive 
>>>>> execution
>>>>> around solve/help anything. To me this looks completely redundant
>>>>> (sorry).
>>>>
>>>> I think that starting keep-alive timer at the target side after 
>>>> admin connect and starting keep-alive timer at the host side after 
>>>> all io-queues connect is wrong.
>>>> In my solution I start keep-alive mechanism in the host side also 
>>>> after admin connect (exactly as the target side).
> 
> well - true, but I believe it should be started after the controller is 
> enabled - not just that the admin queue has been created.  I don't think 
> even the admin queue can be serviced till the adapter is enabled. Thus, 
> I would agree with moving nvme_start_keep_alive() from nvme_start_ctrl() 
> to nvme_enable_ctrl().

I'm fine with that.

>  We would also want the "enablement" to detect 
> the keepalive style (today's keepalive or TBKA) supported by the time 
> that call is made.   Given that needs to be known, I believe there will 
> always be a small window where a couple of things have to be accessed 
> before the KA starts.

I honestly don't understand why this should even be an issue. If someone
is running a heavy load where the keep alive timeout is not sufficient,
then they should use a larger keep alive timeout.
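
To put rough numbers on that argument, here is a minimal sketch (plain
userspace C, not kernel code; the helper name pick_kato and the 2x margin
are just illustrative assumptions) of sizing the keep alive timeout
against the worst-case delay between queueing the keep alive work and the
command completing at the controller:
--
/* Illustrative sketch only, not kernel code.  Units are seconds. */
#include <stdio.h>

/*
 * Hypothetical helper: pick a keep alive timeout (kato) with headroom
 * over the worst-case delay between scheduling the keep alive work and
 * the command completing at the controller (transport round trips,
 * congestion, timer granularity).
 */
static unsigned int pick_kato(unsigned int worst_case_delay)
{
        unsigned int kato = 2 * worst_case_delay;       /* 2x margin */

        return kato < 5 ? 5 : kato;                     /* keep a sane floor */
}

int main(void)
{
        /* e.g. a loaded fabric where a keep alive can be delayed ~8 seconds */
        printf("kato = %us\n", pick_kato(8));
        return 0;
}
--
With a worst-case delay of ~8 seconds the sketch lands on a 16 second
kato, i.e. the "use a larger keep alive timeout" knob rather than
rescheduling the keep alive execution more aggressively.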

> I strongly support Sagi's statements on trying to keep the KA start/stop 
> in common code rather than sprinkling it around.   I agree with the 
> current placement of the stop code in nvme_stop_ctrl().   The stop ctrl 
> should always be called prior to the admin queue being torn down in the 
> transport (as part of kicking off the reset of the controller).

Yeah, we need to keep it where it is. The start can move to controller
enable, and in fact the target should do the same and update the timer
in nvmet_start_ctrl:
--
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index fe151d672241..a81bf4d5e60c 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -643,6 +643,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
         }

         ctrl->csts = NVME_CSTS_RDY;
+       mod_delayed_work(system_wq, &ctrl->ka_work, ctrl->kato * HZ);
  }

  static void nvmet_clear_ctrl(struct nvmet_ctrl *ctrl)
--

This would give the correct behavior and still be able to detect host
death before the controller is enabled.

This, together with moving the keep alive timer start into
nvme_enable_ctrl, should be enough without propagating the handling to
every transport driver; we have enough of that already:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e1b708ee6783..4b5c3f7addeb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1739,7 +1739,14 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap)
         ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
         if (ret)
                 return ret;
-       return nvme_wait_ready(ctrl, cap, true);
+
+       ret = nvme_wait_ready(ctrl, cap, true);
+       if (ret)
+               return ret;
+
+       nvme_start_keep_alive(ctrl);
+
+       return 0;
  }
  EXPORT_SYMBOL_GPL(nvme_enable_ctrl);

@@ -3393,9 +3400,6 @@ EXPORT_SYMBOL_GPL(nvme_stop_ctrl);

  void nvme_start_ctrl(struct nvme_ctrl *ctrl)
  {
-       if (ctrl->kato)
-               nvme_start_keep_alive(ctrl);
-
         if (ctrl->queue_count > 1) {
                 nvme_queue_scan(ctrl);
                 queue_work(nvme_wq, &ctrl->async_event_work);
--

>>> That does not guarantee anything either. as you were simply able to
>>> minimize the effect.
>>>
>>> An alternative patch would also be able to minimize the effect:
>>> -- 
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index e1b708ee6783..a84198b1e0d8 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -847,7 +847,7 @@ static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
>>>          INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
>>>          memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
>>>          ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
>>> -       schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
>>> +       schedule_delayed_work(&ctrl->ka_work, 0);
>>>   }
>>>
>>>   void nvme_stop_keep_alive(struct nvme_ctrl *ctrl)
>>> -- 
>>>
>>> This way the first keep alive will execute immediately the first time
>>> and not kato after...
>>
>> This is a good patch, and I can add it to the series.
> 
> Uh... I wouldn't want to do this.  You can easily burn lots of cycles 
> sending a KA.

It's only for the first invocation, but I think it's useless anyway...

> I would lean toward making people get used to setting KATO to a large 
> enough value that it can be used equally well with TBKA - thus it should 
> be something like this instead:
>         schedule_delayed_work(&ctrl->ka_work, (ctrl->kato * HZ) / 2);
> 
> This is somewhat independent from the grace value - as grace is a 
> scaling factor to accommodate congestion and delayed processing.

This is fine too.
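
For what it's worth, a small timeline sketch (plain userspace C, not
kernel code; kato = 15 is just an example value, and it assumes the
periodic re-arm stays at a full kato after each completion) of what the
kato / 2 initial delay buys: the first keep alive goes out with half a
window of headroom, and the steady-state spacing stays at kato:
--
/* Illustrative timeline only, not kernel code.  Times in seconds. */
#include <stdio.h>

int main(void)
{
        unsigned int kato = 15;         /* example keep alive timeout */
        double t = kato / 2.0;          /* first keep alive fires at kato / 2 */
        int i;

        for (i = 0; i < 4; i++) {
                printf("keep alive #%d sent at t = %.1fs\n", i + 1, t);
                t += kato;              /* subsequent ones re-arm a full kato later */
        }
        return 0;
}
--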


