[PATCH 1/1] nvme: Add verbose error logging

Mon Dec 13 00:09:41 PST 2021

On 12/12/21 1:52 AM, Sagi Grimberg wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 12/10/21 6:56 PM, Keith Busch wrote:
>> On Fri, Dec 10, 2021 at 10:54:45AM -0500, Martin K. Petersen wrote:
>>>
>>> Hi Chaitanya!
>>>
>>>> Although I always prefer descriptive error messages, I've not seen even
>>>> half of these errors in the field. This will :-
>>>>
>>>> 1. Increase the size of the driver code.
>>>> 2. Require constant patching of the driver code when spec is updated,
>>>>     I'm afraid this will turn into churn.
>>>
>>> Updating the corresponding list in SCSI hasn't been a big problem.
>>>
>>>> Also, we can easily track down the error in the NVMe spec from the
>>>> error code that driver is printing currently, is that not sufficient ?
>>>
>>> Only the block layer status is captured.
>>>
>>> In addition, customers are not really armed with the NVMe error decoder
>>> ring and are left with no explanation as to why an I/O failed. We have
>>> had good results in the field with the verbose SCSI error logging and
>>> that's why this patch was modeled to work the same way.
>>
>> If they are not familiar with the decoder, does seeing something like
>> "Conflicting Attributes" really provide the user a better sense than
>> "0x280"? Either way, it sounds like they'd just open a bug with someone
>> who is familiar with the spec, in which case numeric codes should be
>> sufficient.
>>
>> Are the existing nvme trace events not providing enough information? The
>> existing ones appear to have even greater detail than what this patch
>> provides. Or is there some other reason why the kernel log is preferred
>> over the event tracer for capturing these types of errors?
> 
> Maybe the string info be added to the trace events? If so, we may
> need a verbosity level for the trace to record "only errors".
> 
> Although the one advantage for the log would be the first time error
> status (or for generally rare errors).
> 
>> On the patch itself:
>>
>> I'm not sure having it be a new Kconfig option is a good idea since it
>> requires a recompile to enable/disable. Could this be toggled with a
>> static_key instead so we may disable/enable verbose logging at runtime?
>>
>> The new "struct nvme_string_table" is a bit inefficient. Unknown errors
>> will scan the entire table, which will slow down our hot completion path
>> on any error. The codes are small and mostly sequential, so I think you
>> could create a sparse "static const char * const[]", then the "code"
>> becomes the index that can be looked up in constant time. The total
>> space of the array shouldn't be much different than the proposal.
> 
> 100% agree.
> 
>

Ideally we should document the performance numbers if we are going
to add anything in the fast path.