nvme smart-log intermittently corrupt from family of SSDs

Keith Busch kbusch at kernel.org
Thu Sep 22 08:43:26 PDT 2022


On Thu, Sep 22, 2022 at 10:37:29AM -0500, Nick Neumann wrote:
> I noticed something odd with the smart data from various sized HP
> FX900 Pro SSDs. Intermittently, after hours of use, host_writes would
> be 0 (but data_units_written would not). After experimenting, I'm
> seeing that intermittently, when running either "smartctl -x" or "nvme
> smart-log", the data coming back is, uh, junk?

I'd recommend taking this sighting to the device vendor. Either the drive is
returning garbage, or it is possibly returning nothing at all and the tool is
reporting stale data from wherever the buffer was allocated. The drive did
respond to the command with success; otherwise the tooling wouldn't attempt to
print the data.
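
One way to tell those two cases apart is to pre-fill the log buffer with a
known pattern before issuing the command and check whether the pattern
survives. The following is only a minimal sketch of that idea, not what
nvme-cli or smartctl actually do: `fetch` is a hypothetical callback standing
in for whatever issues the Get Log Page command, and the 0xA5 poison byte is
an arbitrary choice. The 512-byte size is that of the SMART / Health
Information log page.

```python
from typing import Callable

LOG_PAGE_SIZE = 512   # size of the SMART / Health Information log page in bytes
POISON = 0xA5         # arbitrary sentinel byte, chosen only for illustration


def read_log_with_poison(fetch: Callable[[bytearray], None]) -> bytes:
    """Fill the buffer with a sentinel, fetch the log, and detect a no-op.

    `fetch` is a hypothetical function that issues the command and writes
    the device's response into the buffer it is given.
    """
    buf = bytearray([POISON]) * LOG_PAGE_SIZE
    fetch(buf)
    # If every byte is still the sentinel, the device reported success
    # but transferred no data into the buffer.
    if all(b == POISON for b in buf):
        raise RuntimeError("command succeeded but the buffer was not written")
    return bytes(buf)
```

If the sentinel is intact after a successful completion, the drive transferred
nothing; if the buffer holds other junk, the drive itself produced it.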
 
> sudo nvme smart-log /dev/nvme0n1
> Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
> critical_warning                    : 0
> temperature                         : 48 C
> available_spare                     : 100%
> available_spare_threshold           : 25%
> percentage_used                     : 0%
> data_units_read                     : 793353
> data_units_written                  : 4938926
> host_read_commands                  : 18675956
> host_write_commands                 : 40381344
> controller_busy_time                : 0
> power_cycles                        : 9
> power_on_hours                      : 15
> unsafe_shutdowns                    : 1
> media_errors                        : 0
> num_err_log_entries                 : 217026295083295649698117473405925064704
> Warning Temperature Time            : 1323299686
> Critical Composite Temperature Time : 4108885430
> Temperature Sensor 1                : 38041 C
> Temperature Sensor 2                : 55225 C
> Temperature Sensor 3                : 23326 C
> Temperature Sensor 4                : 38685 C
> Temperature Sensor 5                : 57125 C
> Temperature Sensor 6                : 45486 C
> Temperature Sensor 7                : 59856 C
> Temperature Sensor 8                : 2107 C
> Thermal Management T1 Trans Count   : 1702871148
> Thermal Management T2 Trans Count   : 312071619
> Thermal Management T1 Total Time    : 406823074
> Thermal Management T2 Total Time    : 468107429
> 
> Even odder, when the host_write_commands was 0, the rest of the data
> was sane. So the incorrectness is not always obvious...
> 
> Any recommendations or thoughts on dealing with this? Or am I just out
> of luck when it comes to relying on anything from these drives' smart
> logs?
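
Until the vendor weighs in, a monitoring script could at least reject readings
that are clearly implausible and retry. The sketch below is an assumption-laden
heuristic, not a definitive check: the dictionary keys mirror the field names
printed in the output above, the temperature bounds are made up, and the time
comparisons rely on the warning/critical temperature times being counted in
minutes and the thermal management total times in seconds. The second helper
flags lifetime counters that moved backwards between two reads, which would
catch a host_write_commands of 0 after a nonzero reading.

```python
MIN_PLAUSIBLE_CELSIUS = -50   # heuristic bounds, not taken from any spec
MAX_PLAUSIBLE_CELSIUS = 150

# Lifetime counters that should never decrease between two reads.
MONOTONIC = ("data_units_read", "data_units_written",
             "host_read_commands", "host_write_commands")


def implausible_fields(log):
    """Return the names of fields that fail simple sanity checks."""
    bad = []
    # Allow one extra hour of slack because power_on_hours is coarse.
    power_on_minutes = (log.get("power_on_hours", 0) + 1) * 60
    for name, value in log.items():
        lower = name.lower()
        if lower.startswith("temperature"):
            if not MIN_PLAUSIBLE_CELSIUS <= value <= MAX_PLAUSIBLE_CELSIUS:
                bad.append(name)
        elif lower.endswith("temperature time"):      # reported in minutes
            if value > power_on_minutes:
                bad.append(name)
        elif lower.endswith("total time"):            # reported in seconds
            if value > power_on_minutes * 60:
                bad.append(name)
    return bad


def counters_went_backwards(previous, current):
    """Return monotonic counters whose value dropped since the last read."""
    return [k for k in MONOTONIC
            if k in previous and k in current and current[k] < previous[k]]
```

Neither check proves a reading is good. A corrupt value that happens to fall
within range passes unnoticed, so this only filters the obvious cases.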


