Intermittent init failure with KLEVV CRAS C910 4TB device
David Gibson
david at gibson.dropbear.id.au
Thu Feb 20 02:51:08 PST 2025
Hi,
I'm trying to debug an intermittent failure to initialize the NVMe
driver with 4TB Klevv CRAS C910 M.2 NVMe SSDs. These were very cheap
devices, so I'm guessing it's a hardware or firmware flaw, but I'm
hoping it will be possible to find a driver workaround.
I'm a kernel developer, but unfamiliar with the NVMe and storage
subsystems, so I'm hoping someone here can help me debug this.
Approximately every second boot, the nvme driver fails to probe these
SSDs with the errors:
[ 36.175032] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[ 36.175032] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
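For reference, the CSTS value in these messages can be split into the fields of the NVMe Controller Status register. The following is a minimal user-space sketch, assuming the standard register layout (RDY in bit 0, CFS in bit 1, SHST in bits 3:2, NSSRO in bit 4, PP in bit 5). The macro, struct, and function names are illustrative and are not the kernel's own definitions.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed CSTS bit layout, per the NVMe base specification. */
#define CSTS_RDY        (1u << 0)   /* controller ready */
#define CSTS_CFS        (1u << 1)   /* controller fatal status */
#define CSTS_SHST_SHIFT 2           /* shutdown status, bits 3:2 */
#define CSTS_SHST_MASK  (3u << CSTS_SHST_SHIFT)
#define CSTS_NSSRO      (1u << 4)   /* NVM subsystem reset occurred */
#define CSTS_PP         (1u << 5)   /* processing paused */

/* Illustrative container for the decoded fields (hypothetical name). */
struct csts_fields {
    unsigned rdy;
    unsigned cfs;
    unsigned shst;
    unsigned nssro;
    unsigned pp;
};

/* Split a raw CSTS value into its individual fields. */
static struct csts_fields csts_decode(uint32_t csts)
{
    struct csts_fields f;

    f.rdy   = (csts & CSTS_RDY) ? 1 : 0;
    f.cfs   = (csts & CSTS_CFS) ? 1 : 0;
    f.shst  = (csts & CSTS_SHST_MASK) >> CSTS_SHST_SHIFT;
    f.nssro = (csts & CSTS_NSSRO) ? 1 : 0;
    f.pp    = (csts & CSTS_PP) ? 1 : 0;
    return f;
}

/* Print the decoded fields on one line. */
static void csts_print(uint32_t csts)
{
    struct csts_fields f = csts_decode(csts);

    printf("CSTS=0x%x: RDY=%u CFS=%u SHST=%u NSSRO=%u PP=%u\n",
           (unsigned)csts, f.rdy, f.cfs, f.shst, f.nssro, f.pp);
}
```

Calling csts_print(0x1) reports RDY set and every other field clear. Under the assumed layout, that means the controller still reports itself ready and is not signalling a fatal status.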
Possibly salient observations:
* Occurs both with a Debian distro kernel and a locally built
v6.14-rc3-60-g6537cfb395f3 kernel
* I have two identical devices (consecutive serial numbers)
* I've only ever seen the failure at first probe, never a failure
once the device is already probed and in use. Once successfully
booted, I was able to copy ~2.5T onto the devices without
additional problems.
* I'm currently booting from a different NVMe drive of another brand
& model, and I see the failures around one boot out of two
* I was previously attempting to boot from these devices, and I hit
the error nearly every boot.
* When booting from these devices, GRUB didn't appear to have any
trouble, leading to the odd situation of the kernel failing to see
the disk it was loaded from.
* The devices appear in lspci as:
01:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. RTS5772DL NVMe SSD Controller (DRAM-less) (rev 01)
02:00.0 Non-Volatile memory controller: Realtek Semiconductor Co., Ltd. RTS5772DL NVMe SSD Controller (DRAM-less) (rev 01)
I've made some instrumentation in my kernel and have discovered:
* The failure always seems to occur in nvme_disable_ctrl() called
from nvme_pci_configure_admin_queue().
* We appear to be writing 0 to CC[EN] ok, but CSTS[RDY] is never
cleared.
* Increasing the timeout tenfold does not help
* Enabling NVME_QUIRK_DELAY_BEFORE_CHK_RDY doesn't appear to change
anything, even if I also vastly increase NVME_QUIRK_DELAY_AMOUNT
* Enabling NVME_QUIRK_QDEPTH_ONE doesn't appear to change anything
* Adding a large msleep() before writing CC[EN]=0 doesn't seem to help
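The handshake described in the list above can be modelled in user space as follows. This is only a sketch under stated assumptions, not the kernel's nvme_disable_ctrl(): struct fake_ctrl, fake_read_csts() and fake_disable_ctrl() are hypothetical names, the "device" is a plain struct rather than MMIO registers, and the timeout is counted in polls rather than derived from the controller's capabilities.

```c
#include <stdint.h>

#define CC_EN    (1u << 0)   /* Controller Configuration: enable */
#define CSTS_RDY (1u << 0)   /* Controller Status: ready */

/* Hypothetical stand-in for a controller; not a real kernel structure. */
struct fake_ctrl {
    uint32_t cc;
    uint32_t csts;
    int polls_until_not_ready;   /* negative: RDY never clears */
};

/*
 * Simulated register read. Once CC.EN has been cleared, a well-behaved
 * device drops CSTS.RDY after a few polls; a device configured with a
 * negative count keeps RDY set forever, like the failing case.
 */
static uint32_t fake_read_csts(struct fake_ctrl *c)
{
    if (!(c->cc & CC_EN) && c->polls_until_not_ready >= 0) {
        if (c->polls_until_not_ready == 0)
            c->csts &= ~CSTS_RDY;
        else
            c->polls_until_not_ready--;
    }
    return c->csts;
}

/*
 * Clear CC.EN, then poll CSTS.RDY until it reads 0.
 * Returns 0 on success, -1 if max_polls is exhausted.
 */
static int fake_disable_ctrl(struct fake_ctrl *c, int max_polls)
{
    c->cc &= ~CC_EN;

    for (int i = 0; i < max_polls; i++) {
        if (!(fake_read_csts(c) & CSTS_RDY))
            return 0;
        /* a real driver would sleep between polls */
    }
    return -1;
}
```

With polls_until_not_ready set to a small non-negative value the call returns 0. With a negative value it returns -1 for any max_polls, and csts is left at 0x1, which matches the observation that lengthening the timeout makes no difference.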
The devices also appear to have a bad NID (although only one of the 3
ids is non-unique, AFAICT). Enabling NVME_QUIRK_BOGUS_NID works
around that problem, but doesn't appear to affect the probe problem.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson