2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

R, Monish Kumar monish.kumar.r at intel.com
Thu Jun 9 02:32:02 PDT 2022


Hi Jason,

I would like to provide the justification for the Samsung X5 SSD fix that was added.
We were facing an SSD enumeration issue: after a cold or warm reboot with the device
connected, the device ended up with probe failures.

When I debugged this issue, I found that the device was not enumerating
once the system had booted. Moreover, the enumeration issue was
specific to this device.

Based on our analysis, the device fails to enumerate because of its deep power state.
So we added the quirks as a workaround, and they help the device enumerate after a cold/warm reboot. If new Samsung X5 SSDs are working as expected, we can remove the
fix.

Regarding the PCI IDs, I have confirmed from the logs that they show vendor ID 0x144d
and device ID 0xa808. I am not sure why the Samsung 970 EVO Plus has the same PCI IDs.

Logs for reference:
After connecting the Samsung X5 SSD:

lspci
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

dmesg
Line 1478: <6>[  112.838998] pci 0000:04:00.0: [144d:a808] type 00 class 0x010802
Line 1479: <6>[  112.845765] pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
Line 1480: <6>[  112.853715] pci 0000:04:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x4 link at 0000:00:07.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
Line 1481: <6>[  112.870536] pci 0000:04:00.0: Adding to iommu group 22
Line 1498: <6>[  113.019698] pci 0000:04:00.0: BAR 0: assigned [mem 0x83000000-0x83003fff 64bit]
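
As a cross-check, the vendor and device IDs can be pulled out of a dmesg line like the ones above with a few lines of Python. This is only a sketch: the function name parse_pci_ids and the regular expression are illustrative assumptions, not part of any kernel tooling.

```python
import re

# Matches the "pci <address>: [vvvv:dddd]" vendor:device pair in a dmesg line.
# The pattern is an assumption based on the log lines quoted in this mail.
PCI_ID_RE = re.compile(r"pci (\S+): \[([0-9a-f]{4}):([0-9a-f]{4})\]")


def parse_pci_ids(line):
    """Return (address, vendor, device) from a dmesg line, or None if absent."""
    m = PCI_ID_RE.search(line)
    if m is None:
        return None
    address, vendor, device = m.groups()
    return address, int(vendor, 16), int(device, 16)


line = "<6>[  112.838998] pci 0000:04:00.0: [144d:a808] type 00 class 0x010802"
result = parse_pci_ids(line)
if result is not None:
    address, vendor, device = result
    print(f"{address}: vendor 0x{vendor:04x}, device 0x{device:04x}")
```

Lines that carry no [vendor:device] pair, such as the "reg 0x10" and "BAR 0" lines, simply return None.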

Regards,
Monish Kumar R

-----Original Message-----
From: Jason A. Donenfeld <Jason at zx2c4.com> 
Sent: 09 June 2022 14:04
To: R, Monish Kumar <monish.kumar.r at intel.com>
Cc: open list:NVM EXPRESS DRIVER <linux-nvme at lists.infradead.org>; Sagi Grimberg <sagi at grimberg.me>; alan.adamson at oracle.com; LKML <linux-kernel at vger.kernel.org>; Yi Zhang <yi.zhang at redhat.com>; Keith Busch <kbusch at kernel.org>; axboe at fb.com; Christoph Hellwig <hch at lst.de>; Rao, Abhijeet <abhijeet.rao at intel.com>
Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hey again,

Figured it out. 2.3 seconds to be exact... It looks like this is caused by:

bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs") https://lore.kernel.org/all/20220316075449.18906-1-monish.kumar.r@intel.com/

This commit doesn't have any justification and got applied without much discussion. Perhaps Monish could supply some more info about why this is needed here? FTR, I have no issues on my system when reverting that. Perhaps it should be reverted. (I can send a revert commit for that if necessary.)

Looking further, however, the PCIe ID is said to be for a "Samsung X5", which Google says is a portable thunderbolt drive. Is the PCIe ID correct? On my system, this is the PCIe ID of a Samsung 970 EVO Plus.
Is it possible that Monish copied and pasted the wrong PCIe ID? Or has Samsung *reused* the same PCIe ID on both devices? In which case, we'd need some additional data for that quirk to avoid the delay.
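
To illustrate the concern, here is a rough Python sketch (not the kernel's actual C code) of a quirk table keyed only on vendor and device ID. The quirk labels and the is_external discriminator are hypothetical placeholders. The point is only that an ID-only match applies the same quirks to both drives, while one extra piece of data could tell them apart.

```python
# Hypothetical quirk table keyed on (vendor, device).
# The quirk labels are placeholders for illustration only.
QUIRKS = {
    (0x144D, 0xA808): {"DELAY_BEFORE_CHK_RDY", "NO_DEEPEST_PS"},
}


def quirks_by_id(vendor, device):
    """ID-only lookup: every device sharing the ID pair gets the quirks."""
    return set(QUIRKS.get((vendor, device), set()))


def quirks_with_hint(vendor, device, is_external):
    """Lookup gated on an extra, assumed discriminator.

    Assumption: only the externally attached (portable) drive needs the
    workaround, so internal drives with the same IDs skip the quirks.
    """
    if not is_external:
        return set()
    return quirks_by_id(vendor, device)


# Both drives report 144d:a808, so the ID-only lookup cannot tell them apart.
print(sorted(quirks_by_id(0x144D, 0xA808)))
# With the extra hint, the internal drive avoids the quirks and their delay.
print(sorted(quirks_with_hint(0x144D, 0xA808, is_external=False)))
print(sorted(quirks_with_hint(0x144D, 0xA808, is_external=True)))
```

Which discriminator is actually reliable for these two devices is an open question here. The flag above only stands in for "some additional data".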

Also note that this (potentially errant) commit has been backported to stable.

Jason

