[PATCH 1/3] nvme-core: improve avoiding false remove namespace

Thu Aug 20 00:33:22 EDT 2020

> nvme_revalidate_disk translates a non-fatal error return to 0, which
> avoids falsely removing the namespace. For negative return values,
> only -ENOMEM is currently translated to 0, but other errors except
> -ENODEV, such as -EAGAIN or -EBUSY, also need to be translated to 0.
> Another reason for improving the error translation: if a request
> times out during connect, __nvme_submit_sync_cmd returns
> NVME_SC_HOST_ABORTED_CMD (> 0). The connect process should terminate
> at that point, but it falsely continues, which may cause a deadlock.
> Many functions that call __nvme_submit_sync_cmd treat an error code
> > 0 as "target does not support this" and continue, but
> NVME_SC_HOST_ABORTED_CMD and NVME_SC_HOST_PATH_ERROR both mean the I/O
> was cancelled by the host. To fix this bug we need to set the
> NVME_REQ_CANCELLED flag, so that __nvme_submit_sync_cmd translates the
> return value to -EINTR. This conflicts with the error translation in
> nvme_revalidate_disk and may cause false namespace removal.
> 
> Signed-off-by: Chao Leng <lengchao at huawei.com>
> ---
>   drivers/nvme/host/core.c | 6 +++---
>   1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 88cff309d8e4..43ac8a1ad65d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2130,10 +2130,10 @@ static int _nvme_revalidate_disk(struct gendisk *disk)
>   	 * Only fail the function if we got a fatal error back from the
>   	 * device, otherwise ignore the error and just move on.
>   	 */
> -	if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
> -		ret = 0;
> -	else if (ret > 0)
> +	if (ret > 0 && (ret & NVME_SC_DNR))
>   		ret = blk_status_to_errno(nvme_error_status(ret));
> +	else if (ret != -ENODEV)
> +		ret = 0;
>   	return ret;

We really need to take a step back here. I really don't like how
we keep growing implicit assumptions about how statuses are interpreted.

Why don't we stop propagating the -ENODEV error back and instead
handle it, with proper quirks, at the specific call sites where we
want to ignore errors?
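
For reference, here is a minimal userspace sketch of the two translation
rules in the hunk above, so the behavioural difference is easy to see.
The status constants are copied from include/linux/nvme.h.
fatal_status_to_errno() is a hypothetical stand-in for
blk_status_to_errno(nvme_error_status(ret)) and always returns -EIO
here. translate_old() and translate_new() are illustrative names, not
kernel functions.

```c
#include <errno.h>

/* Status values as defined in include/linux/nvme.h. */
#define NVME_SC_HOST_PATH_ERROR		0x370
#define NVME_SC_HOST_ABORTED_CMD	0x371
#define NVME_SC_DNR			0x4000

/*
 * Hypothetical stand-in for blk_status_to_errno(nvme_error_status(ret)).
 * The real mapping depends on the status code; this sketch always
 * reports -EIO.
 */
static int fatal_status_to_errno(int status)
{
	(void)status;
	return -EIO;
}

/* Translation before the patch: only -ENOMEM and non-DNR statuses are ignored. */
static int translate_old(int ret)
{
	if (ret == -ENOMEM || (ret > 0 && !(ret & NVME_SC_DNR)))
		ret = 0;
	else if (ret > 0)
		ret = fatal_status_to_errno(ret);
	return ret;
}

/* Translation after the patch: everything except -ENODEV and DNR statuses is ignored. */
static int translate_new(int ret)
{
	if (ret > 0 && (ret & NVME_SC_DNR))
		ret = fatal_status_to_errno(ret);
	else if (ret != -ENODEV)
		ret = 0;
	return ret;
}
```

With these helpers, -EINTR (what __nvme_submit_sync_cmd returns for a
cancelled request) passes through translate_old() unchanged but becomes
0 in translate_new(), while -ENODEV and statuses with NVME_SC_DNR set
are handled the same way by both.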