NVMe SMART Log critical warning during benchmark - is it really critical?
Nick Neumann
nick at pcpartpicker.com
Fri Sep 30 17:43:57 PDT 2022
I was running fio to fill an NVMe SSD with a sequential write. During
the run, the drive gets hot, eventually climbing to 84C or higher.
The drive lists 84C for Warning Comp. Temp. Threshold, and 88C for
Critical Comp. Temp. Threshold.
When it hits 84C, it returns Critical Warning 0x02 if I query its
SMART log page. The NVMe docs say "when this is enabled the drive has
a problem". But the docs also say that Critical Warning is set when
"exceeding the temperature threshold and/or throttling".
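For reference, here is a minimal sketch of how the Critical Warning value
breaks down into individual bits. It assumes the bit layout I understand
the NVMe base specification to give for byte 0 of the SMART / Health
Information log page. The names CRITICAL_WARNING_BITS and
decode_critical_warning are my own, for illustration only, and are not
part of nvme-cli or any library.

```python
# Bit meanings of the Critical Warning byte (byte 0 of the SMART / Health
# Information log page). Assumed from the NVMe base specification; the
# descriptions are paraphrased, not quoted.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature over or under a threshold",
    2: "reliability degraded (media or internal errors)",
    3: "media in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value):
    """Return descriptions of the bits set in a Critical Warning byte."""
    return [
        desc
        for bit, desc in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


# The value reported by the drive in this post.
print(decode_critical_warning(0x02))
```

Under that layout, 0x02 has only the temperature bit set; none of the
media, read-only, or hardware failure bits are set.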
These two things in the NVMe docs seem contradictory. Throttling
happens, especially with some higher end consumer drives under
sustained load. It seems odd to consider the drive as having "a problem"
when throttling is a normal result of heavy but unremarkable use.
If using Critical Warning like this is appropriate, it feels out of
place to have it bundled in with the other reasons for Critical
Warning. Those reasons are degraded or read-only mode due to media
errors, and hardware failure.
I've had other drives run the same fio test and thermal throttle, and
*not* set Critical Warning. I'd expect Critical Warning for
temperature only if a drive got so hot that data loss could occur, and
would expect thermal throttling to happen first.
Any insight on this? How Critical is Critical Warning 0x02? Thanks in advance.