Problem with SPCC 256GB NVMe 1.3 drive - refcount_t: underflow; use-after-free.
Bradley Chapman
chapman6235 at comcast.net
Sun Jan 17 13:58:34 EST 2021
All,
I recently plugged a 256GB SPCC NVMe 1.3 drive into the secondary slot
on my Asus X570-P motherboard, running a Ryzen 5 3600 CPU. After
partitioning and formatting the drive, it is detected thusly by the
5.9.15 and 5.10.6 kernels:
[ 1.653074] nvme nvme1: pci function 0000:04:00.0
[ 1.657181] nvme nvme1: missing or invalid SUBNQN field.
[ 1.662294] nvme nvme1: allocated 64 MiB host memory buffer.
[ 1.663105] nvme nvme1: 15/0/0 default/read/poll queues
[ 1.665815] nvme1n1: p1
However, any I/O to the drive (including mounting its filesystem) causes
the following errors to appear in the dmesg. These errors occur with
both the 5.9.15 kernel and the 5.10.6 kernel, and with X570-P BIOS
version 1406 and version 3001. I have modified the BIOS settings to
specify that a GEN 3 device is plugged into the M.2_2 slot instead of
allowing the BIOS to auto-detect the drive.
[ 2745.659502] refcount_t: underflow; use-after-free.
[ 2745.659510] WARNING: CPU: 2 PID: 0 at lib/refcount.c:28
refcount_warn_saturate+0xab/0xf0
[ 2745.659510] Modules linked in: rfcomm(E) cmac(E) bnep(E)
binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) uas(E)
usb_storage(E) btusb(E) btrtl(E) crct10dif_pclmul(E) btbcm(E)
crc32_pclmul(E) btintel(E) ghash_clmulni_intel(E) bluetooth(E) rfkill(E)
aesni_intel(E) crypto_simd(E) cryptd(E) glue_helper(E) efi_pstore(E)
jitterentropy_rng(E) drbg(E) ccp(E) ansi_cprng(E) ecdh_generic(E) ecc(E)
acpi_cpufreq(E) nft_counter(E) efivarfs(E) crc32c_intel(E)
[ 2745.659527] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G E
5.10.6-BET #1
[ 2745.659528] Hardware name: System manufacturer System Product
Name/PRIME X570-P, BIOS 3001 12/04/2020
[ 2745.659529] RIP: 0010:refcount_warn_saturate+0xab/0xf0
[ 2745.659530] Code: 05 af d2 72 01 01 e8 7a 06 87 00 0f 0b c3 80 3d 9d
d2 72 01 00 75 90 48 c7 c7 78 60 44 af c6 05 8d d2 72 01 01 e8 5b 06 87
00 <0f> 0b c3 80 3d 7c d2 72 01 00 0f 85 6d ff ff ff 48 c7 c7 d0 60 44
[ 2745.659531] RSP: 0018:ffffaf1880298f30 EFLAGS: 00010086
[ 2745.659532] RAX: 0000000000000000 RBX: ffff9873cf3bc300 RCX:
0000000000000027
[ 2745.659533] RDX: 0000000000000027 RSI: ffff987acea92e80 RDI:
ffff987acea92e88
[ 2745.659533] RBP: ffff9873d0e661f0 R08: 0000000000000000 R09:
c0000000ffffdfff
[ 2745.659534] R10: ffffaf1880298d50 R11: ffffaf1880298d48 R12:
0000000000000001
[ 2745.659534] R13: ffff9873d0f98580 R14: ffff9873cdf8ac00 R15:
0000000000000000
[ 2745.659535] FS: 0000000000000000(0000) GS:ffff987acea80000(0000)
knlGS:0000000000000000
[ 2745.659536] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2745.659536] CR2: 00005588763902c8 CR3: 0000000107248000 CR4:
0000000000350ee0
[ 2745.659537] Call Trace:
[ 2745.659538] <IRQ>
[ 2745.659541] nvme_irq+0x104/0x190
[ 2745.659543] __handle_irq_event_percpu+0x2e/0xd0
[ 2745.659545] handle_irq_event_percpu+0x33/0x80
[ 2745.659545] handle_irq_event+0x39/0x70
[ 2745.659547] handle_edge_irq+0x7c/0x1a0
[ 2745.659549] asm_call_irq_on_stack+0x12/0x20
[ 2745.659549] </IRQ>
[ 2745.659551] common_interrupt+0xd7/0x160
[ 2745.659552] asm_common_interrupt+0x1e/0x40
[ 2745.659554] RIP: 0010:cpuidle_enter_state+0xd2/0x2e0
[ 2745.659555] Code: e8 93 22 6a ff 31 ff 49 89 c5 e8 29 2c 6a ff 45 84
ff 74 12 9c 58 f6 c4 02 0f 85 c4 01 00 00 31 ff e8 a2 d8 6f ff fb 45 85
f6 <0f> 88 c9 00 00 00 49 63 ce be 68 00 00 00 4c 2b 2c 24 48 89 ca 48
[ 2745.659556] RSP: 0018:ffffaf188014fe80 EFLAGS: 00000202
[ 2745.659557] RAX: ffff987acea9ce00 RBX: 0000000000000002 RCX:
000000000000001f
[ 2745.659557] RDX: 0000027f460f1f90 RSI: 00000000239f5229 RDI:
0000000000000000
[ 2745.659558] RBP: ffff9873c1a4e800 R08: 0000000000000002 R09:
000000000001c600
[ 2745.659558] R10: 0000090da145abf0 R11: ffff987acea9be24 R12:
ffffffffaf6d38e0
[ 2745.659559] R13: 0000027f460f1f90 R14: 0000000000000002 R15:
0000000000000000
[ 2745.659561] cpuidle_enter+0x30/0x50
[ 2745.659562] do_idle+0x24f/0x290
[ 2745.659564] cpu_startup_entry+0x1b/0x20
[ 2745.659566] start_secondary+0x10b/0x150
[ 2745.659567] secondary_startup_64_no_verify+0xb0/0xbb
[ 2745.659569] ---[ end trace be84281f034198f3 ]---
[ 2776.138874] nvme nvme1: I/O 414 QID 3 timeout, aborting
[ 2776.138886] nvme nvme1: I/O 415 QID 3 timeout, aborting
[ 2776.138891] nvme nvme1: I/O 416 QID 3 timeout, aborting
[ 2776.138895] nvme nvme1: I/O 417 QID 3 timeout, aborting
[ 2776.138912] nvme nvme1: Abort status: 0x0
[ 2776.138921] nvme nvme1: I/O 428 QID 3 timeout, aborting
[ 2776.138922] nvme nvme1: Abort status: 0x0
[ 2776.138925] nvme nvme1: Abort status: 0x0
[ 2776.138974] nvme nvme1: Abort status: 0x0
[ 2776.138977] nvme nvme1: Abort status: 0x0
[ 2806.346792] nvme nvme1: I/O 414 QID 3 timeout, reset controller
[ 2806.363566] nvme nvme1: 15/0/0 default/read/poll queues
[ 2836.554298] nvme nvme1: I/O 415 QID 3 timeout, disable controller
[ 2836.672064] blk_update_request: I/O error, dev nvme1n1, sector 16350
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672072] blk_update_request: I/O error, dev nvme1n1, sector 16093
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672074] blk_update_request: I/O error, dev nvme1n1, sector 15836
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672076] blk_update_request: I/O error, dev nvme1n1, sector 15579
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672078] blk_update_request: I/O error, dev nvme1n1, sector 15322
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672080] blk_update_request: I/O error, dev nvme1n1, sector 15065
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672082] blk_update_request: I/O error, dev nvme1n1, sector 14808
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672083] blk_update_request: I/O error, dev nvme1n1, sector 14551
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672085] blk_update_request: I/O error, dev nvme1n1, sector 14294
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672087] blk_update_request: I/O error, dev nvme1n1, sector 14037
op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672121] nvme nvme1: failed to mark controller live state
[ 2836.672123] nvme nvme1: Removing after probe failure status: -19
[ 2836.689016] Aborting journal on device dm-0-8.
[ 2836.689024] Buffer I/O error on dev dm-0, logical block 25198592,
lost sync page write
[ 2836.689027] JBD2: Error -5 detected when updating journal superblock
for dm-0-8.
[ 2836.723821] percpu ref (hd_struct_free) <= 0 (-28) after switching to
atomic
[ 2836.723828] WARNING: CPU: 8 PID: 0 at lib/percpu-refcount.c:196
percpu_ref_switch_to_atomic_rcu+0x139/0x140
[ 2836.723828] Modules linked in: rfcomm(E) cmac(E) bnep(E)
binfmt_misc(E) nls_ascii(E) nls_cp437(E) vfat(E) fat(E) uas(E)
usb_storage(E) btusb(E) btrtl(E) crct10dif_pclmul(E) btbcm(E)
crc32_pclmul(E) btintel(E) ghash_clmulni_intel(E) bluetooth(E) rfkill(E)
aesni_intel(E) crypto_simd(E) cryptd(E) glue_helper(E) efi_pstore(E)
jitterentropy_rng(E) drbg(E) ccp(E) ansi_cprng(E) ecdh_generic(E) ecc(E)
acpi_cpufreq(E) nft_counter(E) efivarfs(E) crc32c_intel(E)
[ 2836.723844] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G W E
5.10.6-BET #1
[ 2836.723845] Hardware name: System manufacturer System Product
Name/PRIME X570-P, BIOS 3001 12/04/2020
[ 2836.723847] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x139/0x140
[ 2836.723848] Code: 80 3d f9 f0 72 01 00 0f 85 52 ff ff ff 49 8b 54 24
e0 49 8b 74 24 e8 48 c7 c7 88 5f 44 af c6 05 db f0 72 01 01 e8 ad 24 87
00 <0f> 0b e9 2e ff ff ff 41 55 49 89 f5 41 54 55 48 89 fd 53 48 83 ec
[ 2836.723849] RSP: 0018:ffffaf18803a0f20 EFLAGS: 00010282
[ 2836.723850] RAX: 0000000000000000 RBX: 7fffffffffffffe3 RCX:
0000000000000027
[ 2836.723850] RDX: 0000000000000027 RSI: ffff987acec12e80 RDI:
ffff987acec12e88
[ 2836.723851] RBP: 0000369db0c0e3c8 R08: 0000000000000000 R09:
c0000000ffffdfff
[ 2836.723851] R10: ffffaf18803a0d40 R11: ffffaf18803a0d38 R12:
ffff9873c0bbbda0
[ 2836.723852] R13: ffffffffaf765f10 R14: 0000000000000202 R15:
ffffffffaf6060c0
[ 2836.723853] FS: 0000000000000000(0000) GS:ffff987acec00000(0000)
knlGS:0000000000000000
[ 2836.723853] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2836.723854] CR2: 0000558899b414c0 CR3: 0000000101cd2000 CR4:
0000000000350ee0
[ 2836.723854] Call Trace:
[ 2836.723855] <IRQ>
[ 2836.723859] rcu_core+0x196/0x420
[ 2836.723862] __do_softirq+0xc9/0x214
[ 2836.723863] asm_call_irq_on_stack+0x12/0x20
[ 2836.723864] </IRQ>
[ 2836.723866] do_softirq_own_stack+0x31/0x40
[ 2836.723867] irq_exit_rcu+0x9a/0xa0
[ 2836.723869] sysvec_apic_timer_interrupt+0x2c/0x80
[ 2836.723870] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2836.723872] RIP: 0010:cpuidle_enter_state+0xd2/0x2e0
[ 2836.723873] Code: e8 93 22 6a ff 31 ff 49 89 c5 e8 29 2c 6a ff 45 84
ff 74 12 9c 58 f6 c4 02 0f 85 c4 01 00 00 31 ff e8 a2 d8 6f ff fb 45 85
f6 <0f> 88 c9 00 00 00 49 63 ce be 68 00 00 00 4c 2b 2c 24 48 89 ca 48
[ 2836.723874] RSP: 0018:ffffaf188017fe80 EFLAGS: 00000202
[ 2836.723874] RAX: ffff987acec1ce00 RBX: 0000000000000002 RCX:
000000000000001f
[ 2836.723875] RDX: 0000029479ea3f98 RSI: 00000000239f5229 RDI:
0000000000000000
[ 2836.723875] RBP: ffff9873c1a4ec00 R08: 0000000000000002 R09:
000000000001c600
[ 2836.723876] R10: 00000959d0ea6498 R11: ffff987acec1be24 R12:
ffffffffaf6d38e0
[ 2836.723876] R13: 0000029479ea3f98 R14: 0000000000000002 R15:
0000000000000000
[ 2836.723878] cpuidle_enter+0x30/0x50
[ 2836.723880] do_idle+0x24f/0x290
[ 2836.723882] cpu_startup_entry+0x1b/0x20
[ 2836.723884] start_secondary+0x10b/0x150
[ 2836.723885] secondary_startup_64_no_verify+0xb0/0xbb
[ 2836.723887] ---[ end trace be84281f034198f4 ]---
After these errors are generated, the device becomes inaccessible and
unmounting its filesystem (which does not hang in D state) generates
additional errors:
[ 2868.181018] Buffer I/O error on dev dm-0, logical block 0, lost sync
page write
[ 2868.181022] EXT4-fs (dm-0): I/O error while writing superblock
After the filesystem is unmounted the device no longer appears in the
output of lsblk(8) and its device node(s) disappear after the kernel
removes the device. Prior to the I/O failures, the nvme error-log
command returns no error entries for any of the 64 log entries present.
nvme fw-log and nvme smart-log return the following output:
Firmware Log for device:nvme1
afi : 0x20
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning : 0
temperature : 48 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
data_units_read : 234
data_units_written : 2,149
host_read_commands : 4,202
host_write_commands : 421,917
controller_busy_time : 0
power_cycles : 7
power_on_hours : 11
unsafe_shutdowns : 0
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 48 C
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
I've checked the kernel change logs and I know that the refcount_t error
has been occurring in other kernel subsystems and was subsequently fixed
in recent kernel point releases, so I will be trying to reproduce this
error with the most recent 5.10 and 5.11-rc kernels.
Any suggestions on what else to try next?
Thanks!
Brad
More information about the Linux-nvme
mailing list