[PATCH 3/3] nvme: start keep-alive after admin queue setup

Sagi Grimberg sagi at grimberg.me
Tue Nov 21 01:16:22 PST 2023



On 11/21/23 09:29, Hannes Reinecke wrote:
> On 11/20/23 20:05, Sagi Grimberg wrote:
>>
>>
>> On 11/20/23 18:01, Hannes Reinecke wrote:
>>> On 11/20/23 14:39, Sagi Grimberg wrote:
>>>>
>>>>> Setting up I/O queues might take quite some time on larger and/or
>>>>> busy setups, so KATO might expire before all I/O queues could be
>>>>> set up.
>>>>> Fix this by starting keep-alive from the ->init_ctrl_finish() callback,
>>>>> and stopping it when calling nvme_cancel_admin_tagset().
>>>>
>>>> If this is a fix, the title should describe the issue it is fixing, and
>>>> the body should say how it is fixing it.
>>>>
>>>>> Signed-off-by: Hannes Reinecke <hare at suse.de>
>>>>> ---
>>>>>   drivers/nvme/host/core.c | 6 +++---
>>>>>   drivers/nvme/host/fc.c   | 6 ++++++
>>>>>   2 files changed, 9 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index 62612f87aafa..f48b4f735d2d 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -483,6 +483,7 @@ EXPORT_SYMBOL_GPL(nvme_cancel_tagset);
>>>>>   void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
>>>>>   {
>>>>> +    nvme_stop_keep_alive(ctrl);
>>>>>       if (ctrl->admin_tagset) {
>>>>>           blk_mq_tagset_busy_iter(ctrl->admin_tagset,
>>>>>                   nvme_cancel_request, ctrl);
>>>>
>>>> There is a cross-dependency here: nvme_cancel_admin_tagset() now needs
>>>> to have keep-alive stopped first, but keep-alive may be waiting on I/O,
>>>> which is exactly what this function is supposed to cancel...
>>>>
>>>> Keep in mind that kato can be arbitrarily long, and now this function
>>>> may be blocked on this kato period.
>>>>
>>>> I also think the function is now doing more than simply cancelling the
>>>> inflight admin tagset, which is what its name suggests.
>>>>
>>> I am having a hard time following this reasoning.
>>> While I do accept that nvme_stop_keep_alive() might trigger I/O
>>> (i.e. if the work item has only just started running when calling
>>> cancel_delayed_work()), nvme_tcp_error_recovery_work() has this:
>>>
>>>      nvme_stop_keep_alive(ctrl);
>>>      flush_work(&ctrl->async_event_work);
>>>      nvme_tcp_teardown_io_queues(ctrl, false);
>>>      /* unquiesce to fail fast pending requests */
>>>      nvme_unquiesce_io_queues(ctrl);
>>>      nvme_tcp_teardown_admin_queue(ctrl, false);
>>>
>>> and nvme_tcp_teardown_admin_queue() calls nvme_cancel_admin_tagset().
>>> So by your above reasoning this code should be wrong, too.
>>> What am I missing?
>>
>> I need to dig through the history, but it can most definitely move to
>> after the teardown. It could be from the earlier days when the
>> transport fencing was not as reliable, particularly for admin requests.
> 
> Ah, no. We are both wrong. From the kerneldoc of cancel_delayed_work():
> 
>   * Note:
>   * The work callback function may still be running on return, unless
>   * it returns %true and the work doesn't re-arm itself.  Explicitly flush or
>   * use cancel_delayed_work_sync() to wait on it.
> 
> Hence we won't be blocked by running I/O when calling 
> nvme_stop_keep_alive().
> So the correct order would indeed be to call nvme_stop_keep_alive()
> first, and then disable/cancel the admin tagset (which will terminate
> any outstanding keep-alive requests).

Yes, you're correct. Because keep-alive is not a sync command, it won't
block on I/O. However, I still don't see why we should change where it
exists today (i.e. the choice to move it to the odd location of
nvme_cancel_admin_tagset()).
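
For the archives, roughly why that is (paraphrasing the keep-alive path in
core.c from memory, so treat this as a sketch rather than the exact code):
the work item only fires the command off with blk_execute_rq_nowait() and
returns, it never waits for the completion, so cancelling ka_work never has
to wait for admin I/O:

	static void nvme_keep_alive_work(struct work_struct *work)
	{
		struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
				struct nvme_ctrl, ka_work);
		struct request *rq;

		rq = blk_mq_alloc_request(ctrl->admin_q, nvme_req_op(&ctrl->ka_cmd),
				BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
		if (IS_ERR(rq))
			return;	/* error handling elided in this sketch */
		nvme_init_request(rq, &ctrl->ka_cmd);

		rq->timeout = ctrl->kato * HZ;
		rq->end_io = nvme_keep_alive_end_io;
		rq->end_io_data = ctrl;
		/*
		 * Fire and forget: the work item returns right here and the
		 * end_io handler re-arms ka_work once the command completes.
		 * cancel_delayed_work() on ka_work therefore never blocks on
		 * an outstanding admin command.
		 */
		blk_execute_rq_nowait(rq, false);
	}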

I agree that we can move start_keep_alive earlier.
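
Something along these lines is what I'd expect (a sketch of the direction
only; the placement inside nvme_init_ctrl_finish(), the exact signature, and
the matching removal from nvme_start_ctrl() are my assumptions, not the hunks
from this patch):

	int nvme_init_ctrl_finish(struct nvme_ctrl *ctrl, bool was_suspended)
	{
		/* ... existing identify/configure steps unchanged ... */

		/*
		 * The admin queue is live by now and I/O queue setup has not
		 * started yet, so arming keep-alive here means KATO cannot
		 * expire while a large/busy setup is still creating its I/O
		 * queues.
		 */
		nvme_start_keep_alive(ctrl);
		return 0;
	}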


