[PATCH v1 0/5] Fix keep-alive mechanism for fabrics
Max Gurtovoy
maxg at mellanox.com
Thu Apr 12 06:32:33 PDT 2018
On 4/12/2018 3:29 PM, Sagi Grimberg wrote:
>
>
> On 04/11/2018 07:48 PM, James Smart wrote:
>> On 4/11/2018 7:40 AM, Max Gurtovoy wrote:
>>>
>>>
>>> On 4/11/2018 5:07 PM, Sagi Grimberg wrote:
>>>>
>>>>>> I don't understand what you mean by a "property of the admin queue";
>>>>>> there is no such definition in the spec. In any event, the keep alive
>>>>>> mechanism was defined to detect host/controller health on a periodic
>>>>>> basis.
>>>>>>
>>>>>> The spec clearly states that the host needs to send keep alive at
>>>>>> a faster rate than the Keep Alive timeout, accounting for things such
>>>>>> as transport roundtrip times, transport delays, keep alive timer
>>>>>> granularity and so on.
>>>>>>
>>>>>> To me, this is a clear case where your use-case requires a larger
>>>>>> keep alive timeout. I'm not sure why sprinkling the keep alive
>>>>>> execution around would solve or help anything. To me this looks
>>>>>> completely redundant (sorry).
>>>>>
>>>>> I think that starting the keep-alive timer on the target side after
>>>>> admin connect while starting it on the host side only after
>>>>> all io-queues connect is wrong.
>>>>> In my solution I start the keep-alive mechanism on the host side
>>>>> after admin connect as well (exactly as the target does).
>>
>> well - true, but I believe it should be started after the controller
>> is enabled - not just after the admin queue has been created. I don't
>> think even the admin queue can be serviced until the adapter is
>> enabled. Thus, I would agree with moving nvme_start_keep_alive() from
>> nvme_start_ctrl() to nvme_enable_ctrl().
>
> I'm fine with that.
>
>> We would also want the "enablement" path to detect which keep-alive
>> style (today's keep-alive or TBKA) is supported by the time that call
>> is made. Given that this needs to be known, I believe there will always
>> be a small window where a couple of things have to be read before the
>> KA starts.
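For the TBKA detection, once the Identify Controller data is available
something like the below might work (untested sketch, not against any
tree: I'm assuming CTRATT is cached on the ctrl, here as a hypothetical
ctrl->ctratt field with bit 6 being TBKAS per NVMe 1.3, plus a
hypothetical ctrl->tbkas flag):
--
/*
 * Untested sketch: pick the keep-alive style before arming the timer.
 * ctrl->ctratt and ctrl->tbkas are hypothetical fields, used here for
 * illustration only.
 */
static void nvme_setup_keep_alive(struct nvme_ctrl *ctrl)
{
	if (!ctrl->kato)
		return;

	/* CTRATT bit 6: Traffic Based Keep Alive Support (TBKAS) */
	ctrl->tbkas = !!(ctrl->ctratt & (1 << 6));

	nvme_start_keep_alive(ctrl);
}
--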
>
> I honestly don't understand why this should even be an issue. If someone
> is running a heavy load where the keep alive timeout is not sufficient,
> then it should use a larger keep alive.
>
>> I strongly support Sagi's statements on trying to keep the KA
>> start/stop in common code rather than sprinkling it around. I agree
>> with the current placement of the stop code in nvme_stop_ctrl(). The
>> stop ctrl should always be called prior to the admin queue being torn
>> down in the transport (as part of kicking off the reset of the
>> controller).
>
> Yea, we need to keep it where it is. The start can move to controller
> enable, and in fact the target should do the same and update the timer
> in nvmet_start_ctrl.
> --
> diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
> index fe151d672241..a81bf4d5e60c 100644
> --- a/drivers/nvme/target/core.c
> +++ b/drivers/nvme/target/core.c
> @@ -643,6 +643,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl)
> }
>
> ctrl->csts = NVME_CSTS_RDY;
> + mod_delayed_work(system_wq, &ctrl->ka_work, ctrl->kato * HZ);
> }
>
> static void nvmet_clear_ctrl(struct nvmet_ctrl *ctrl)
> --
>
> This would get the correct behavior and still be able to
> detect host death before the controller is enabled.
>
> This, together with moving the keep alive timer start into
> nvme_enable_ctrl, should be enough without propagating the handling to
> every transport driver; we have enough of that already.
Great, this is what I wanted to do in the first place, but I was
concerned about stopping the keep-alive in nvme_disable_ctrl.
Are you sending a new series, or should I?
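To make sure we're reading it the same way: the stop side stays in
common code, so nvme_disable_ctrl itself should need no matching call
(paraphrased from memory, not a patch):
--
void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
{
	/* keep-alive stop stays here, before the admin queue is torn down */
	nvme_stop_keep_alive(ctrl);
	/* ... rest of the common stop path (flush scan/AEN work, etc.) ... */
}
--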
> --
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index e1b708ee6783..4b5c3f7addeb 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1739,7 +1739,14 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap)
> ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
> if (ret)
> return ret;
> - return nvme_wait_ready(ctrl, cap, true);
> +
> + ret = nvme_wait_ready(ctrl, cap, true);
> + if (ret)
> + return ret;
> +
> + nvme_start_keep_alive(ctrl);
> +
> + return 0;
> }
> EXPORT_SYMBOL_GPL(nvme_enable_ctrl);
>
> @@ -3393,9 +3400,6 @@ EXPORT_SYMBOL_GPL(nvme_stop_ctrl);
>
> void nvme_start_ctrl(struct nvme_ctrl *ctrl)
> {
> - if (ctrl->kato)
> - nvme_start_keep_alive(ctrl);
> -
> if (ctrl->queue_count > 1) {
> nvme_queue_scan(ctrl);
> queue_work(nvme_wq, &ctrl->async_event_work);
> --
>
>>>> That does not guarantee anything either, as you were simply able to
>>>> minimize the effect.
>>>>
>>>> An alternative patch would also be able to minimize the effect:
>>>> --
>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>> index e1b708ee6783..a84198b1e0d8 100644
>>>> --- a/drivers/nvme/host/core.c
>>>> +++ b/drivers/nvme/host/core.c
>>>> @@ -847,7 +847,7 @@ static void nvme_start_keep_alive(struct
>>>> nvme_ctrl *ctrl)
>>>> INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
>>>> memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
>>>> ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
>>>> - schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ);
>>>> + schedule_delayed_work(&ctrl->ka_work, 0);
>>>> }
>>>>
>>>> void nvme_stop_keep_alive(struct nvme_ctrl *ctrl)
>>>> --
>>>>
>>>> This way the first keep alive will execute immediately rather than a
>>>> full kato after connect...
>>>
>>> This is a good patch, and I can add it to the series.
>>
>> Uh... I wouldn't want to do this. You can easily burn lots of cycles
>> sending a KA.
>
> It's only for the first invocation, but I think it's useless anyway...
So this one is out.
>
>> I would lean toward making people get used to setting KATO to a large
>> enough value that it can be used equally well with TBKA - thus it
>> should be something like this instead:
>> schedule_delayed_work(&ctrl->ka_work, (ctrl->kato * HZ) / 2);
>>
>> This is somewhat independent of the grace value - as grace is a
>> scaling factor to accommodate congestion and delayed processing.
>
> This is fine too.
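On top of the nvme_enable_ctrl move, that suggestion would look roughly
like the below (untested sketch; I'm also assuming the re-arm after a
keep-alive completes would want the same kato/2 interval, otherwise only
the first window is halved):
--
/*
 * Untested sketch of the kato/2 idea: arm the first keep-alive at half
 * of KATO so there is headroom for transport delays (and TBKA later).
 */
static void nvme_start_keep_alive(struct nvme_ctrl *ctrl)
{
	INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);
	memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
	ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
	schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ / 2);
}
--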