[bug report] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests

Tue Apr 5 13:51:40 PDT 2022

> On Apr 5, 2022, at 11:21 AM, Jonathan Derrick <jonathan.derrick at linux.dev> wrote:
> 
> 
> 
> On 4/5/2022 12:14 AM, Christoph Hellwig wrote:
>> On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote:
>>>> Eg, nvme0: blah blah command set not supported
>>> 
>>> The new print in the completion handler is pretty generic. I don't think it can
>>> readily tell the difference from a harmless error. Maybe pr_err is too high?
>>> 
>>> Or maybe since enough people have been concerned about *this* specific
>>> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
>>> was reluctant to do that at first since the initial device I tested was 1.4,
>>> but it was a prototype and we should be fine without the non-mdts limits
>>> anyway.
>> 
>> What SCSI does is to add RQF_QUIET to all internal passthrough commands,
>> and then skips printing the SCSI specific error messages in addition
>> if that flag is set.
>> 
>> This would be the nvme version of that:
>> 
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 7e07dd69262a7..9346cd4cf5820 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req)
>> {
>> 	blk_status_t status = nvme_error_status(nvme_req(req)->status);
>> 
>> -	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS))
>> +	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS &&
>> +		     !(req->rq_flags & RQF_QUIET)))
>> 		nvme_log_error(req);
>> 	nvme_end_req_zoned(req);
>> 	nvme_trace_bio_complete(req);
>> @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>> 	cmd->common.flags &= ~NVME_CMD_SGL_ALL;
>> 
>> 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
>> +	req->rq_flags |= RQF_QUIET;
>> 	if (req->mq_hctx->type == HCTX_TYPE_POLL)
>> 		req->cmd_flags |= REQ_POLLED;
>> 	nvme_clear_nvme_request(req);
> 
> 
> That's good too.
> How about this so it's limited to debug loglevels:

I don’t think we want to limit it to debug loglevels. The main purpose of the patch was to allow debugging of issues on live customer systems, where debug loglevels are normally not enabled.

Alan

> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f204c6f78b5b..871ad2421284 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -303,9 +303,10 @@ static void nvme_log_error(struct request *req)
> {
>        struct nvme_ns *ns = req->q->queuedata;
>        struct nvme_request *nr = nvme_req(req);
> +       int level = req->rq_flags & RQF_QUIET ? KERN_DEBUG : KERN_ERR;
> 
>        if (ns) {
> -               pr_err_ratelimited("%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
> +               printk_ratelimited(level "%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
>                       ns->disk ? ns->disk->disk_name : "?",
>                       nvme_get_opcode_str(nr->cmd->common.opcode),
>                       nr->cmd->common.opcode,
> @@ -319,7 +320,7 @@ static void nvme_log_error(struct request *req)
>                return;
>        }
> 
> -       pr_err_ratelimited("%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
> +       printk_ratelimited(level "%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
>                           dev_name(nr->ctrl->device),
>                           nvme_get_admin_opcode_str(nr->cmd->common.opcode),
>                           nr->cmd->common.opcode,
> @@ -651,6 +652,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>        cmd->common.flags &= ~NVME_CMD_SGL_ALL;
> 
>        req->cmd_flags |= REQ_FAILFAST_DRIVER;
> +       req->rq_flags |= RQF_QUIET;
>        if (req->mq_hctx->type == HCTX_TYPE_POLL)
>                req->cmd_flags |= REQ_POLLED;
>        nvme_clear_nvme_request(req);
>
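
For comparison, the two proposals quoted above differ only in what happens to a failed request that carries the quiet flag: the first skips the log call entirely, the second demotes the message to debug level. The following is a rough userspace sketch of that decision, not kernel code. Every name in it (demo_req, DEMO_RQF_QUIET, demo_level and the two helper functions) is a hypothetical stand-in invented for illustration. Log levels are modelled as an enum because the kernel's KERN_DEBUG and KERN_ERR are string-literal prefixes rather than integers, so the quoted "int level" line would probably need a different form to build.

```c
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not the kernel definitions. */
#define DEMO_RQF_QUIET  (1u << 0)  /* models RQF_QUIET on a request */
#define DEMO_SC_SUCCESS 0u         /* models NVME_SC_SUCCESS */

enum demo_level {
	DEMO_LEVEL_NONE,   /* nothing is printed */
	DEMO_LEVEL_DEBUG,  /* printed only at debug loglevel */
	DEMO_LEVEL_ERR     /* printed as an error */
};

struct demo_req {
	unsigned int rq_flags;
	unsigned int status;
};

/* Models the first proposal: quiet requests are never logged. */
static enum demo_level demo_level_skip_quiet(const struct demo_req *req)
{
	if (req->status == DEMO_SC_SUCCESS)
		return DEMO_LEVEL_NONE;
	if (req->rq_flags & DEMO_RQF_QUIET)
		return DEMO_LEVEL_NONE;
	return DEMO_LEVEL_ERR;
}

/* Models the second proposal: quiet requests are logged at debug level. */
static enum demo_level demo_level_demote_quiet(const struct demo_req *req)
{
	if (req->status == DEMO_SC_SUCCESS)
		return DEMO_LEVEL_NONE;
	if (req->rq_flags & DEMO_RQF_QUIET)
		return DEMO_LEVEL_DEBUG;
	return DEMO_LEVEL_ERR;
}

/* Small helper to show the outcome of both policies for one request. */
static void demo_report(const char *what, const struct demo_req *req)
{
	static const char *const names[] = { "none", "debug", "err" };

	printf("%s: skip-quiet=%s demote-quiet=%s\n", what,
	       names[demo_level_skip_quiet(req)],
	       names[demo_level_demote_quiet(req)]);
}
```

Calling demo_report() on a failed request with the quiet flag set shows where the two policies diverge. With either policy, a default-loglevel system would show nothing for such a request, which is the concern raised in the reply above about debugging live customer systems.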