nvme-pci timeout issue

Chaitanya Kulkarni chaitanyak at nvidia.com
Wed Jan 10 19:54:19 PST 2024


Hi all,

After running the test that triggers the nvme timeout for nvme-pci, the
device under test is lingering in an inconsistent state. Here are steps :-

1. Load the driver.
2. Trigger nvme_timeout.
3. After timeout handler gets triggered it fails the application
    with I/O error.
4. lsblk and nvme listdoesn't show the device anymore.
5. ls does show the device.
6. Any write to it fails (e.g. dd) since device has 0 capacity.

Is this accepted behavior? if it is then a malfunctioning device is
lingering in the system and applications end up accessing it as if
it is functioning properly. Can we avoid this scenario?

How about we remove the device from the system? If you're all okay with
it, I'll send a patch for the nvme-pci timeout to remove the device that
has zero capacity. Otherwise please suggest how to deal with this scenario.

With this confusing behavior I'm not entirely sure what is the expected
scenario to pass timeout testcase.

Please have a look at detailed test log below.

-ck

Test Log :-

blktests (master) # lsblk | grep nvme
nvme0n1 259:0    0    1G  0 disk
blktests (master) # nvme list
Node                  Generic               SN 
Model                                    Namespace 
Usage                      Format           FW Rev
--------------------- --------------------- -------------------- 
---------------------------------------- --------- 
-------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            foo QEMU NVMe 
Ctrl                           1           1.07  GB / 1.07  GB    512   
B +  0 B   1.0
blktests (master) # dd if=/dev/zero of=/dev/nvme0n1 bs=4k count=1 
oflag=direct
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000555306 s, 7.4 MB/s
blktests (master) #
blktests (master) #
blktests (master) #
blktests (master) #  nvme_trtype=pci ./check nvme/050
nvme/050 (test nvme-pci timeout with fio jobs.) [passed]
     runtime  0.093s  ...  90.901s
blktests (master) # dmesg  -c
[  263.714133] run blktests nvme/050 at 2024-01-10 19:16:57
[  269.104834] nvme 0000:00:04.0: vgaarb: pci_notify
[  269.126931] pci 0000:00:04.0: vgaarb: pci_notify
[  274.111859] nvme 0000:00:04.0: vgaarb: pci_notify
[  274.111878] nvme 0000:00:04.0: runtime IRQ mapping not provided by arch
[  274.112099] nvme nvme0: pci function 0000:00:04.0
[  274.145615] nvme 0000:00:04.0: enabling bus mastering
[  274.146125] nvme 0000:00:04.0: save config 0x00: 0x00101b36
[  274.146132] nvme 0000:00:04.0: save config 0x04: 0x00100507
[  274.146138] nvme 0000:00:04.0: save config 0x08: 0x01080202
[  274.146143] nvme 0000:00:04.0: save config 0x0c: 0x00000000
[  274.146149] nvme 0000:00:04.0: save config 0x10: 0xfebd0004
[  274.146154] nvme 0000:00:04.0: save config 0x14: 0x00000000
[  274.146167] nvme 0000:00:04.0: save config 0x18: 0x00000000
[  274.146173] nvme 0000:00:04.0: save config 0x1c: 0x00000000
[  274.146180] nvme 0000:00:04.0: save config 0x20: 0x00000000
[  274.146185] nvme 0000:00:04.0: save config 0x24: 0x00000000
[  274.146191] nvme 0000:00:04.0: save config 0x28: 0x00000000
[  274.146196] nvme 0000:00:04.0: save config 0x2c: 0x11001af4
[  274.146201] nvme 0000:00:04.0: save config 0x30: 0x00000000
[  274.146206] nvme 0000:00:04.0: save config 0x34: 0x00000040
[  274.146212] nvme 0000:00:04.0: save config 0x38: 0x00000000
[  274.146217] nvme 0000:00:04.0: save config 0x3c: 0x0000010b
[  274.158737] nvme nvme0: 48/0/0 default/read/poll queues
[  274.165837] nvme nvme0: Ignoring bogus Namespace Identifiers
[  274.169187] nvme 0000:00:04.0: vgaarb: pci_notify
[  302.150515] run blktests nvme/050 at 2024-01-10 19:17:35
[  302.391621] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 10, probability 99, space 
0, times -1

[...]

[  302.402142] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 10, probability 99, space 
0, times -1
[  302.402649] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 10, probability 99, space 
0, times -1
[  332.561584] nvme nvme0: I/O tag 0 (e000) opcode 0x2 (Read) QID 1 
timeout, aborting req_op:READ(0) size:4096
[  332.561631] nvme nvme0: I/O tag 832 (4340) opcode 0x2 (Read) QID 2 
timeout, aborting req_op:READ(0) size:4096
[  332.561685] nvme nvme0: I/O tag 896 (f380) opcode 0x2 (Read) QID 3 
timeout, aborting req_op:READ(0) size:4096
[  332.561703] nvme nvme0: Abort status: 0x0
[  332.561707] nvme nvme0: Abort status: 0x0
[  332.561720] nvme nvme0: I/O tag 768 (d300) opcode 0x2 (Read) QID 4 
timeout, aborting req_op:READ(0) size:4096
[  332.561739] nvme nvme0: Abort status: 0x0
[  332.561744] nvme nvme0: I/O tag 64 (5040) opcode 0x2 (Read) QID 5 
timeout, aborting req_op:READ(0) size:4096

[...]

[  332.563064] nvme nvme0: Abort status: 0x0
[  332.563067] nvme nvme0: I/O tag 0 (1000) opcode 0x2 (Read) QID 45 
timeout, aborting req_op:READ(0) size:4096
[  332.563088] nvme nvme0: I/O tag 704 (32c0) opcode 0x2 (Read) QID 46 
timeout, aborting req_op:READ(0) size:4096
[  332.563117] nvme nvme0: Abort status: 0x0
[  332.563124] nvme nvme0: Abort status: 0x0
[  332.563127] nvme nvme0: I/O tag 384 (b180) opcode 0x2 (Read) QID 47 
timeout, aborting req_op:READ(0) size:4096
[  332.563147] nvme nvme0: I/O tag 320 (8140) opcode 0x2 (Read) QID 48 
timeout, aborting req_op:READ(0) size:4096
[  332.563170] nvme nvme0: Abort status: 0x0
[  332.563179] nvme nvme0: Abort status: 0x0
[  362.769789] nvme nvme0: I/O tag 0 (e000) opcode 0x2 (Read) QID 1 
timeout, reset controller
[  362.871902] nvme 0000:00:04.0: enabling bus mastering
[  362.872350] nvme 0000:00:04.0: save config 0x00: 0x00101b36
[  362.872356] nvme 0000:00:04.0: save config 0x04: 0x00100507
[  362.872362] nvme 0000:00:04.0: save config 0x08: 0x01080202
[  362.872367] nvme 0000:00:04.0: save config 0x0c: 0x00000000
[  362.872372] nvme 0000:00:04.0: save config 0x10: 0xfebd0004
[  362.872378] nvme 0000:00:04.0: save config 0x14: 0x00000000
[  362.872383] nvme 0000:00:04.0: save config 0x18: 0x00000000
[  362.872388] nvme 0000:00:04.0: save config 0x1c: 0x00000000
[  362.872394] nvme 0000:00:04.0: save config 0x20: 0x00000000
[  362.872399] nvme 0000:00:04.0: save config 0x24: 0x00000000
[  362.872404] nvme 0000:00:04.0: save config 0x28: 0x00000000
[  362.872417] nvme 0000:00:04.0: save config 0x2c: 0x11001af4
[  362.872423] nvme 0000:00:04.0: save config 0x30: 0x00000000
[  362.872428] nvme 0000:00:04.0: save config 0x34: 0x00000040
[  362.872433] nvme 0000:00:04.0: save config 0x38: 0x00000000
[  362.872438] nvme 0000:00:04.0: save config 0x3c: 0x0000010b
[  362.883064] nvme nvme0: 48/0/0 default/read/poll queues
[  362.884880] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 10, probability 99, space 
0, times -1
[  362.884907] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 10, probability 99, space 
0, times -1
[  362.884970] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 10, probability 99, space 
0, times -1
[  362.885972] FAULT_INJECTION: forcing a failure.
                name fail_io_timeout, interval 10, probability 99, space 
0, times -1
[  392.973015] nvme nvme0: I/O tag 960 (a3c0) QID 32 timeout, disable 
controller
[  393.038995] I/O error, dev nvme0n1, sector 84856 op 0x0:(READ) flags 
0x800 phys_seg 1 prio class 2
[  393.038995] I/O error, dev nvme0n1, sector 1345216 op 0x0:(READ) 
flags 0x800 phys_seg 1 prio class 2
[  393.039000] I/O error, dev nvme0n1, sector 1104936 op 0x0:(READ) 
flags 0x800 phys_seg 1 prio class 2
[  393.039017] I/O error, dev nvme0n1, sector 1270088 op 0x0:(READ) 
flags 0x800 phys_seg 1 prio class 2
[  393.045604] nvme nvme0: failed to mark controller live state
[  393.045611] nvme nvme0: Disabling device after reset failure: -19
blktests (master) #
blktests (master) #
blktests (master) # lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda      8:0    0   50G  0 disk
├─sda1   8:1    0    1G  0 part /boot
└─sda2   8:2    0   49G  0 part /home
sdb      8:16   0  100G  0 disk /mnt/data
sr0     11:0    1 1024M  0 rom
vda    252:0    0    5G  0 disk
blktests (master) #
blktests (master) # nvme list
Node                  Generic               SN 
Model                                    Namespace 
Usage                      Format           FW Rev
--------------------- --------------------- -------------------- 
---------------------------------------- --------- 
-------------------------- ---------------- --------
blktests (master) #
blktests (master) # ls /dev/nvme0
/dev/nvme0
blktests (master) # ls /dev/nvme0n1
/dev/nvme0n1
blktests (master) # dd if=/dev/zero of=/dev/nvme0n1 bs=4k count=1 
oflag=direct
dd: error writing '/dev/nvme0n1': No space left on device
1+0 records in
0+0 records out
0 bytes copied, 0.000204128 s, 0.0 kB/s




More information about the Linux-nvme mailing list