[PATCH 3/3] nvme: start keep-alive after admin queue setup

Sagi Grimberg sagi at grimberg.me
Mon Nov 20 05:39:16 PST 2023


> Setting up I/O queues might take quite some time on larger and/or
> busy setups, so KATO might expire before all I/O queues could be
> set up.
> Fix this by start keep alive from the ->init_ctrl_finish() callback,
> and stopping it when calling nvme_cancel_admin_tagset().

If this is a fix, the title should describe the issue it is fixing, and
the body should say how it is fixing it.

> Signed-off-by: Hannes Reinecke <hare at suse.de>
> ---
>   drivers/nvme/host/core.c | 6 +++---
>   drivers/nvme/host/fc.c   | 6 ++++++
>   2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 62612f87aafa..f48b4f735d2d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -483,6 +483,7 @@ EXPORT_SYMBOL_GPL(nvme_cancel_tagset);
>   
>   void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
>   {
> +	nvme_stop_keep_alive(ctrl);
>   	if (ctrl->admin_tagset) {
>   		blk_mq_tagset_busy_iter(ctrl->admin_tagset,
>   				nvme_cancel_request, ctrl);

There is a cross dependency here: nvme_cancel_admin_tagset() now needs
the keep-alive to be stopped first, but stopping it may wait on I/O,
which in turn needs to be cancelled...

Keep in mind that KATO can be arbitrarily long, and this function may
now block for up to that KATO period.

I also think that the function now does more than simply cancel the
inflight admin tagset, which is all its name implies.

> @@ -3200,6 +3201,8 @@ int nvme_init_ctrl_finish(struct nvme_ctrl *ctrl, bool was_suspended)
>   	clear_bit(NVME_CTRL_DIRTY_CAPABILITY, &ctrl->flags);
>   	ctrl->identified = true;
>   
> +	nvme_start_keep_alive(ctrl);
> +

I'm fine with moving it here. But instead, maybe just change
nvme_start_keep_alive() to use a zero delay and keep it where it
is? Would that help?
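
To put rough numbers on that question, here is a small userspace model
(not kernel code). It assumes the controller starts counting KATO once
the admin queue is connected, and that the host sends its first
keep-alive one "delay" after keep-alive is started. struct ka_model and
both helper names are made up for illustration only.

```c
#include <stdbool.h>

/*
 * Hypothetical model, all times in seconds, measured from the moment
 * the admin queue is connected (assumed start of the KATO countdown).
 */
struct ka_model {
	unsigned int kato;     /* keep-alive timeout */
	unsigned int io_setup; /* time spent setting up the I/O queues */
};

/*
 * Keep-alive is started only after the I/O queues are set up, and the
 * first command goes out "delay" seconds after that.
 */
static bool ka_ok_after_io_setup(const struct ka_model *m, unsigned int delay)
{
	return m->io_setup + delay < m->kato;
}

/*
 * Keep-alive is started right after admin queue setup, so the I/O
 * queue setup time no longer counts against KATO.
 */
static bool ka_ok_after_admin_setup(const struct ka_model *m, unsigned int delay)
{
	return delay < m->kato;
}
```

In this model, with kato = 10 and io_setup = 8, a delay of 5 misses the
deadline while a zero delay makes it. With io_setup = 12, a zero delay
is not enough and only starting after admin queue setup helps.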

>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(nvme_init_ctrl_finish);
> @@ -4333,7 +4336,6 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>   {
>   	nvme_mpath_stop(ctrl);
>   	nvme_auth_stop(ctrl);
> -	nvme_stop_keep_alive(ctrl);
>   	nvme_stop_failfast_work(ctrl);
>   	flush_work(&ctrl->async_event_work);
>   	cancel_work_sync(&ctrl->fw_act_work);
> @@ -4344,8 +4346,6 @@ EXPORT_SYMBOL_GPL(nvme_stop_ctrl);
>   
>   void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>   {
> -	nvme_start_keep_alive(ctrl);
> -
>   	nvme_enable_aen(ctrl);
>   
>   	/*
> diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
> index a15b37750d6e..a9affc8b755b 100644
> --- a/drivers/nvme/host/fc.c
> +++ b/drivers/nvme/host/fc.c
> @@ -2530,6 +2530,12 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues)
>   	 * clean up the admin queue. Same thing as above.
>   	 */
>   	nvme_quiesce_admin_queue(&ctrl->ctrl);
> +
> +	/*
> +	 * Open-coding nvme_cancel_admin_tagset() as fc
> +	 * is not using nvme_cancel_request().
> +	 */
> +	nvme_stop_keep_alive(ctrl);
>   	blk_sync_queue(ctrl->ctrl.admin_q);
>   	blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
>   				nvme_fc_terminate_exchange, &ctrl->ctrl);

What does this fix? This should really be split out of the patch.


