nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed

Mark Ruijter mruijter at primelogic.nl
Wed Jan 18 02:27:41 PST 2023


For what it's worth, I see the exact same problem while running SUSE Linux Enterprise Server 15 SP3.

lithium:~ # dmesg | grep nvme4
[    3.371400] nvme nvme4: pci function 0000:21:00.0
[   41.333886] nvme nvme4: Device not ready; aborting reset, CSTS=0x9
[   41.334802] nvme nvme4: Removing after probe failure status: -19
[  759.291672] nvme nvme4: pci function 0000:21:00.0
[  797.300033] nvme nvme4: Device not ready; aborting reset, CSTS=0x9
[  797.300038] nvme nvme4: Removing after probe failure status: -19
lithium:~ #
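
For anyone reading the CSTS values in these logs: a minimal sketch that decodes the Controller Status register, assuming the bit layout in the NVMe base specification (RDY bit 0, CFS bit 1, SHST bits 3:2, NSSRO bit 4, PP bit 5). The helper name decode_csts is made up for illustration and is not part of the driver or nvme-cli.

```python
# Hypothetical helper: decode an NVMe CSTS (Controller Status) value.
# Bit layout assumed from the NVMe base specification.
SHST_STATES = {
    0: "normal operation",
    1: "shutdown in progress",
    2: "shutdown complete",
    3: "reserved",
}


def decode_csts(value):
    """Split a raw CSTS value into its named fields."""
    return {
        "RDY": bool(value & 0x1),                 # controller ready
        "CFS": bool(value & 0x2),                 # controller fatal status
        "SHST": SHST_STATES[(value >> 2) & 0x3],  # shutdown status
        "NSSRO": bool(value & 0x10),              # NVM subsystem reset occurred
        "PP": bool(value & 0x20),                 # processing paused
    }


# The two values that appear in this thread.
for raw in (0x9, 0x1):
    print(hex(raw), decode_csts(raw))
```

Under that layout, 0x9 reads as RDY set with SHST reporting "shutdown complete", and 0x1 as RDY set with nothing else. In both cases the controller still reports ready while the driver is giving up on the reset.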

Attempts to recover from this state by removing the drives from the PCI space and rescanning the PCI bus also fail.
Rebooting the system does solve it.
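
For reference, the remove/rescan attempt can be sketched as below, assuming the standard PCI sysfs nodes (/sys/bus/pci/devices/<address>/remove and /sys/bus/pci/rescan). The function name remove_and_rescan and its sysfs_root parameter are made up for illustration. It needs root on a real system and, as noted above, does not bring the drive back in this failure state.

```python
from pathlib import Path


def remove_and_rescan(bdf, sysfs_root="/sys"):
    """Detach one PCI function and ask the kernel to rescan the bus.

    bdf is the PCI address as printed in dmesg, e.g. "0000:21:00.0".
    sysfs_root is only a parameter so the sketch can be tried against
    a scratch directory instead of the live /sys tree.
    """
    root = Path(sysfs_root)
    remove_node = root / "bus" / "pci" / "devices" / bdf / "remove"
    rescan_node = root / "bus" / "pci" / "rescan"

    # The device directory is absent if the driver already removed
    # the function after the probe failure; skip the removal then.
    if remove_node.exists():
        remove_node.write_text("1")

    # Writing 1 to the bus-level rescan node triggers re-enumeration.
    rescan_node.write_text("1")
    return remove_node, rescan_node


# Example (as root): remove_and_rescan("0000:21:00.0")
```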

It's fairly easy to reproduce the problem on systems with eight or more drives.

Thanks,

Mark Ruijter

On 15/12/2022, 23:31, "Linux-nvme on behalf of J. Hart" <linux-nvme-bounces at lists.infradead.org on behalf of jfhart085 at gmail.com> wrote:

    I've tried the obvious ones, and that didn't help either.  I guess I'll
    have to give up on it and return it as defective.  I'll go back to
    normal operation and try to find a controller/device combination
    that works with the Linux driver, if there is one.

    In any case, thanks again very much for your kind assistance.

    J. Hart

    On 12/16/22 2:34 AM, Keith Busch wrote:
    > On Thu, Dec 15, 2022 at 10:33:30PM +0900, J. Hart wrote:
    >> [ +26.890018] nvme nvme0: I/O 0 (Write) QID 1 timeout, aborting
    >> [Dec15 21:35] nvme nvme0: I/O 0 QID 1 timeout, reset controller
    >> [ +30.719998] nvme nvme0: I/O 13 QID 0 timeout, reset controller
    >> [Dec15 21:38] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    >> [  +0.014796] nvme nvme0: Abort status: 0x371
    >> [Dec15 21:40] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    >> [  +0.000024] nvme nvme0: Removing after probe failure status: -19
    >> [Dec15 21:42] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
    >> [  +0.000324] nvme0n1: detected capacity change from 976773168 to 0
    > 
    > This looks like your device is completely unresponsive: no ack to IO
    > commands, admin commands, or reset sequences. Unfortunately these are
    > typically firmware bugs. Without additional guidance from the vendor,
    > we don't really have many options to try from the driver: just disabling
    > some optional power and performance capabilities, though that often
    > doesn't help either.




