[PATCH 1/1] nvme: Add verbose error logging

Martin K. Petersen martin.petersen at oracle.com
Mon Dec 13 19:01:30 PST 2021


Keith,

> If they are not familiar with the decoder, does seeing something like
> "Conflicting Attributes" really provide the user a better sense than
> "0x280"? Either way, it sounds like they'd just open a bug with someone
> who is familiar with the spec, in which case numeric codes should be
> sufficient.

An English error string is very valuable in a support situation. It is
much easier to discuss the implications of "Unrecoverable Read Error"
than some random hex number.

> Are the existing nvme trace events not providing enough information?
> The existing ones appear to have even greater detail than what this
> patch provides. Or is there some other reason why the kernel log is
> preferred over the event tracer for capturing these types of errors?

Logging is imperative. We have to be able to root cause an issue after
the fact.

> I'm not sure having it be a new Kconfig option is a good idea since it
> requires a recompile to enable/disable. Could this be toggled with a
> static_key instead so we may disable/enable verbose logging at
> runtime?

In SCSI it's a Kconfig option because the embedded folks do not want to
waste 8K. We did it the same way here. Enabling verbose logging probably
doesn't make much sense unless you're an enterprise distro.

> The new "struct nvme_string_table" is a bit inefficient. Unknown errors
> will scan the entire table, which will slow down our hot completion path
> on any error. The codes are small and mostly sequential, so I think you
> could create a sparse "static const char * const[]", then the "code"
> becomes the index that can be looked up in constant time. The total
> space of the array shouldn't be much different than the proposal.

We can look into that.

> A significant amount of the new strings are specific to admin
> commands, yet errors for admin commands skip the logging. Why define
> unreachable strings? Or perhaps a better question: why skip logging
> admin errors?

Good point.

> IO errors can happen in large batches, so perhaps pr_err() should be
> ratelimted.

Sure.

-- 
Martin K. Petersen	Oracle Linux Engineering



More information about the Linux-nvme mailing list