smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why?

Thu Sep 15 13:48:01 PDT 2022

On Thu, Sep 15, 2022 at 3:03 PM Keith Busch <kbusch at kernel.org> wrote:

> Not sure what MDTS has to do with this. The error log was originally defined to
> be a max 4k size which is below the smallest possible MDTS.
>
> My guess is smartctl tricked the driver into allocating a PRP List, but the
> controller instead accessed it as a PRP entry, which could corrupt memory or
> fail the transaction if data direction is enforced by the memory controller.
> Why that causes the nvme controller to fail as you've described is weird,
> though.

I definitely don't know this stuff very well - the smartctl bug
commentary was referencing the nvme-cli commit where log pages are
transferred in 4k chunks to avoid having to worry about exceeding the
MDTS value. The problematic drives have error logs larger than 4K.

I believe the logic in the smartctl commentary was along the lines of
"well, the MDTS is large enough that we should be able to transfer
more than 4k at a time, but we're currently crashing. And nvme-cli
does it 4k at a time always, and if we change to that, the crash goes
away, so let's do that."

As to the allocation, smartctl calls into nvme with
nvme_admin_get_log_page and passes a buffer (that smartctl allocates)
of size n * sizeof(nvme_error_log_page), where n is the number of
error log entries it is trying to read. The fix in smartmontools moved
from trying to read all of the error log entries at once via a single
call to nvme_admin_get_log_page, to reading 4K bytes at a time.
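
To make the before/after concrete, here is a rough sketch of the chunking
idea as I understand it. This is not the actual smartmontools code: the
64-byte entry size is my assumption based on the NVMe error log entry
layout, and chunk_plan, read_log_chunked and read_chunk are hypothetical
names standing in for whatever issues the real Get Log Page command.

```python
ENTRY_SIZE = 64      # assumed bytes per NVMe error log entry
CHUNK_BYTES = 4096   # 4K per call, mirroring the nvme-cli approach


def chunk_plan(num_entries, entry_size=ENTRY_SIZE, chunk_bytes=CHUNK_BYTES):
    """Return (offset, length) pairs that cover the whole error log."""
    total = num_entries * entry_size
    return [(off, min(chunk_bytes, total - off))
            for off in range(0, total, chunk_bytes)]


def read_log_chunked(read_chunk, num_entries):
    """Read the log in pieces of at most 4K instead of one large transfer.

    read_chunk(offset, length) -> bytes is a hypothetical callback that
    would wrap a single Get Log Page admin command.
    """
    buf = bytearray()
    for offset, length in chunk_plan(num_entries):
        data = read_chunk(offset, length)
        if len(data) != length:
            raise IOError("short read at offset %d" % offset)
        buf += data
    return bytes(buf)
```

With these assumptions, a 64-entry log (4096 bytes) is still one call, while
a 256-entry log (16384 bytes) becomes four calls of 4096 bytes each rather
than one 16K transfer.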

Not sure how helpful any of that is, but that is my current understanding.

Thanks,
Nick