smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why?

Thu Sep 15 13:03:14 PDT 2022

On Thu, Sep 15, 2022 at 12:50:54PM -0500, Nick Neumann wrote:
> On Wed, Sep 14, 2022 at 2:44 PM Nick Neumann <nick at pcpartpicker.com> wrote:
> >
> > I'm running Ubuntu 20.04 LTS with HWE, which reports kernel 5.13.0-51
> > generic. Both a Crucial P5 1TB and a Crucial P5 2TB behave rather
> > poorly. With one drive installed, running
> >
> > sudo smartctl -x /dev/nvme0
> >
> > will output some info, then hang for a while, and then print
> > "NVME_IOCTL_ADMIN_CMD: Interrupted system call"
> >
> > From that point on, the drives are gone from the system until I cut
> > and restore power (reboot is not enough).
> >
> > Running smartctl against the drives works fine in Windows and in
> > Ubuntu 22.04 LTS, which reports kernel 5.15.0-43.
> >
> > I thought for sure I'd find that a quirk for the drives had been added
> > between kernels 5.13 and 5.15, but alas, I don't see one. The PCI
> > Vendor/Device ID is 1344:5405 for the 1TB model, and while the Crucial
> > P2 has a quirk in drivers/nvme/host/pci.c, it has a different vendor
> > ID altogether (c0a9).
> >
> > Any thoughts on where I can look or what I might compare to try to
> > figure out what changed to get the Crucial P5 drives behaving? I was
> > hoping there was some setting I could tweak to get them going without
> > having to move to 22.04 LTS. (I've tried
> > "nvme_core.default_ps_max_latency_us=0" and various values for
> > "pci_aspm" with no luck.)
> 
> Figured this out. It isn't a Linux kernel change, but rather a
> smartctl change. (In hindsight I should have started digging there
> first.)
> 
> The issue was https://www.smartmontools.org/ticket/1404, fixed by the
> 7.2 release (and Ubuntu 20.04 LTS is on 7.1). The fix in smartmontools
> was to change to reading logs 4KB at a time, just like nvme-cli did in
> https://github.com/linux-nvme/nvme-cli/commit/465a4d. (The device
> advertises that it has an MDTS of 9 so, as far as I understand,
> reading in 4KB chunks should not be necessary; the smartmontools
> author was not certain where the blame for the issue really belonged,
> but changing to work like nvme-cli avoids it.)
> 
> For now I'll avoid reading the error log via smartctl on problematic
> drives until I can move to a later smartmontools version.

I'm not sure what MDTS has to do with this. The error log was originally
defined with a maximum size of 4k, which is below the smallest possible MDTS.
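
For reference, here is a rough sketch of how an MDTS value translates into
bytes. MDTS is a power-of-two exponent in units of the controller's minimum
memory page size (CAP.MPSMIN), and 0 means no limit. The sketch assumes
MPSMIN is 0 (4 KiB pages), which this thread does not confirm for the P5, and
the function name is made up for illustration.

```python
def max_transfer_bytes(mdts, mpsmin=0):
    """Return the maximum data transfer size in bytes for a given MDTS.

    mdts   -- MDTS field from Identify Controller (power-of-two exponent)
    mpsmin -- CAP.MPSMIN; the minimum page size is 2^(12 + mpsmin) bytes.
              Defaulting to 0 (4 KiB) is an assumption, not from the thread.
    Returns None when mdts == 0, which means "no limit reported".
    """
    if mdts == 0:
        return None
    min_page_size = 1 << (12 + mpsmin)
    return min_page_size << mdts


if __name__ == "__main__":
    limit = max_transfer_bytes(9)
    print(f"MDTS 9 -> {limit} bytes ({limit // (1024 * 1024)} MiB)")
    print(f"smallest non-zero MDTS -> {max_transfer_bytes(1)} bytes")
```

Under that assumption an MDTS of 9 allows 2 MiB per command, and even the
smallest non-zero MDTS allows 8 KiB, so a 4k log read fits either way.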

My guess is smartctl tricked the driver into allocating a PRP List, but the
controller instead accessed it as a PRP entry, which could corrupt memory or
fail the transaction if data direction is enforced by the memory controller.
Why that causes the nvme controller to fail as you've described is weird,
though.
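
To make the PRP entry vs. PRP List distinction concrete, here is a small
sketch of the general rule as I understand it from the NVMe spec: PRP2 is
unused when the transfer fits in one memory page, is a plain PRP entry when
the transfer touches exactly two pages, and is a pointer to a PRP List when
it touches more than two. The 4 KiB page size and the function name are
assumptions for illustration. This is not the Linux driver's code, and it
says nothing about what the P5 firmware actually does.

```python
def prp2_usage(offset, length, page_size=4096):
    """Classify how PRP2 is used for one transfer.

    offset    -- byte offset of the buffer start within its first page
    length    -- transfer length in bytes
    page_size -- memory page size; 4096 is an assumption
    Returns "unused", "entry", or "list".
    """
    # Number of memory pages the buffer touches, counting a partial first page.
    pages = (offset % page_size + length + page_size - 1) // page_size
    if pages <= 1:
        return "unused"
    if pages == 2:
        return "entry"
    return "list"


if __name__ == "__main__":
    for off, length in [(0, 4096), (512, 4096), (0, 8192), (0, 16384)]:
        print(f"offset={off:4d} length={length:5d} -> PRP2 {prp2_usage(off, length)}")
```

A page-aligned 4k read never needs PRP2 at all, while a larger single read
of the error log can cross into PRP List territory, which is where a
disagreement between driver and controller would matter.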