[PATCH 1/1] nvme: Add verbose error logging

Sun Dec 12 01:52:26 PST 2021

On 12/10/21 6:56 PM, Keith Busch wrote:
> On Fri, Dec 10, 2021 at 10:54:45AM -0500, Martin K. Petersen wrote:
>>
>> Hi Chaitanya!
>>
>>> Although I always prefer descriptive error messages, I've not seen even
>>> half of these errors in the field. This will :-
>>>
>>> 1. Increase the size of the driver code.
>>> 2. Require constant patching of the driver code when spec is updated,
>>>     I'm afraid this will turn into churn.
>>
>> Updating the corresponding list in SCSI hasn't been a big problem.
>>
>>> Also, we can easily track down the error in the NVMe spec from the
>>> error code that driver is printing currently, is that not sufficient ?
>>
>> Only the block layer status is captured.
>>
>> In addition, customers are not really armed with the NVMe error decoder
>> ring and are left with no explanation as to why an I/O failed. We have
>> had good results in the field with the verbose SCSI error logging and
>> that's why this patch was modeled to work the same way.
> 
> If they are not familiar with the decoder, does seeing something like
> "Conflicting Attributes" really provide the user a better sense than
> "0x280"? Either way, it sounds like they'd just open a bug with someone
> who is familiar with the spec, in which case numeric codes should be
> sufficient.
> 
> Are the existing nvme trace events not providing enough information? The
> existing ones appear to have even greater detail than what this patch
> provides. Or is there some other reason why the kernel log is preferred
> over the event tracer for capturing these types of errors?

Maybe the string info be added to the trace events? If so, we may
need a verbosity level for the trace to record "only errors".

Although the one advantage for the log would be the first time error
status (or for generally rare errors).

> On the patch itself:
> 
> I'm not sure having it be a new Kconfig option is a good idea since it
> requires a recompile to enable/disable. Could this be toggled with a
> static_key instead so we may disable/enable verbose logging at runtime?
> 
> The new "struct nvme_string_table" is a bit inefficient. Unknown errors
> will scan the entire table, which will slow down our hot completion path
> on any error. The codes are small and mostly sequential, so I think you
> could create a sparse "static const char * const[]", then the "code"
> becomes the index that can be looked up in constant time. The total
> space of the array shouldn't be much different than the proposal.

100% agree.

> A significant amount of the new strings are specific to admin commands,
> yet errors for admin commands skip the logging. Why define unreachable
> strings? Or perhaps a better question: why skip logging admin errors?
> 
> IO errors can happen in large batches, so perhaps pr_err() should be
> ratelimted.

Agree.