[PATCH v6 5/5] nvme-fabrics: handle transient auth failures
Sagi Grimberg
sagi at grimberg.me
Mon Apr 15 13:51:19 PDT 2024
On 15/04/2024 15:42, Daniel Wagner wrote:
> Retry authentication a couple of times to avoid error out on transient
> errors. E.g. if a new key is deployed on the host and the target. At the
> same time the connection drops. The host might use the wrong key to
> reconnect.
>
> Suggested-by: Sagi Grimberg <sagi at grimberg.me>
> Signed-off-by: Daniel Wagner <dwagner at suse.de>
> ---
> drivers/nvme/host/auth.c | 4 ++--
> drivers/nvme/host/fabrics.c | 10 +++++++++-
> drivers/nvme/host/fc.c | 2 ++
> drivers/nvme/host/nvme.h | 1 +
> drivers/nvme/host/rdma.c | 1 +
> drivers/nvme/host/tcp.c | 1 +
> 6 files changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/nvme/host/auth.c b/drivers/nvme/host/auth.c
> index a264b3ae078b..368dcc6ee11b 100644
> --- a/drivers/nvme/host/auth.c
> +++ b/drivers/nvme/host/auth.c
> @@ -797,7 +797,7 @@ static void nvme_queue_auth_work(struct work_struct *work)
> NVME_AUTH_DHCHAP_MESSAGE_SUCCESS1);
> if (ret) {
> chap->status = ret;
> - chap->error = -ECONNREFUSED;
> + chap->error = -EKEYREJECTED;
> return;
> }
>
> @@ -818,7 +818,7 @@ static void nvme_queue_auth_work(struct work_struct *work)
> ret = nvme_auth_process_dhchap_success1(ctrl, chap);
> if (ret) {
> /* Controller authentication failed */
> - chap->error = -ECONNREFUSED;
> + chap->error = -EKEYREJECTED;
> goto fail2;
> }
>
> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
> index d9a73b1b41c4..6dfa45dce232 100644
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -573,8 +573,16 @@ EXPORT_SYMBOL_GPL(nvmf_connect_io_queue);
> */
> bool nvmf_should_reconnect(struct nvme_ctrl *ctrl, int status)
> {
> - if (status < 0)
> + if (status < 0) {
> + /*
> + * authentication errors can be transient, thus retry a couple
> + * of times before giving up.
> + */
> + if (status == -EKEYREJECTED &&
> + ++ctrl->nr_auth_retries < 3)
> + return true;
I did not suggest nr_auth_retries. Where is the 3 coming from? The
controller already
has a number of reconnects before it gives up, no reason to add another one.
Just don't return false based on the status if it is a transient
authentication error.
The patch just needs to be modified from:
if (status < 0)
to
if (status < 0 && status != -EKEYREJECTED Plus a comment that
explains it.
More information about the Linux-nvme
mailing list