nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot

Nilay Shroff nilay at linux.ibm.com
Thu Sep 26 23:10:05 PDT 2024



On 9/27/24 02:41, Laurence Oberman wrote:
> Hi Keith
> Hope all is well
> 
> Quick question: is this expected or not?
> 
> It was reported to Red Hat that users are seeing issues when using the
> "nvme subsystem-reset /dev/nvme0" command to test resets.
> 
> I tested on multiple servers with two types of NVMe-attached devices;
> these are not the rootfs devices.
> 
> 1. The front-slot (hotplug) devices in 2.5in format
> reset and, after some time, recover (as expected).
> 
> Example of one working; it does not trap and end up as a machine check:
> 
> [ 2215.440468] pcieport 0000:10:01.1: AER: Multiple Uncorrected (Non-
> Fatal) error received: 0000:12:13.0
> [ 2215.440532] pcieport 0000:12:13.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester
> ID)
> [ 2215.440536] pcieport 0000:12:13.0:   device [10b5:8748] error
> status/mask=00100000/00000000
> [ 2215.440540] pcieport 0000:12:13.0:    [20] UnsupReq              
> (First)
> [ 2215.440544] pcieport 0000:12:13.0: AER:   TLP Header: 40009001
> 1000000f e9211000 12000000
> [ 2215.441813] systemd-journald[2173]: Sent WATCHDOG=1 notification.
> [ 2216.937498] {1}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 4
> [ 2216.937505] {1}[Hardware Error]: event severity: info
> [ 2216.937508] {1}[Hardware Error]:  Error 0, type: fatal
> [ 2216.937511] {1}[Hardware Error]:  fru_text: PcieError
> [ 2216.937514] {1}[Hardware Error]:   section_type: PCIe error
> [ 2216.937515] {1}[Hardware Error]:   port_type: 4, root port
> [ 2216.937517] {1}[Hardware Error]:   version: 0.2
> [ 2216.937519] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
> [ 2216.937522] {1}[Hardware Error]:   device_id: 0000:10:01.1
> [ 2216.937524] {1}[Hardware Error]:   slot: 3
> [ 2216.937525] {1}[Hardware Error]:   secondary_bus: 0x11
> [ 2216.937526] {1}[Hardware Error]:   vendor_id: 0x1022, device_id:
> 0x1453
> [ 2216.937528] {1}[Hardware Error]:   class_code: 060400
> [ 2216.937529] {1}[Hardware Error]:   bridge: secondary_status: 0x2000,
> control: 0x0012
> [ 2216.937530] {1}[Hardware Error]:   aer_uncor_status: 0x00000000,
> aer_uncor_mask: 0x04500000
> [ 2216.937532] {1}[Hardware Error]:   aer_uncor_severity: 0x004e2030
> [ 2216.937532] {1}[Hardware Error]:   TLP Header: 00000000 00000000
> 00000000 00000000
> [ 2216.937629] pcieport 0000:10:01.1: AER: aer_status: 0x00000000,
> aer_mask: 0x04500000
> [ 2216.937634] pcieport 0000:10:01.1: AER: aer_layer=Transaction Layer,
> aer_agent=Receiver ID
> [ 2216.937638] pcieport 0000:10:01.1: AER: aer_uncor_severity:
> 0x004e2030
> [ 2216.937645] nvme nvme4: frozen state error detected, reset
> controller
> [ 2217.071095] nvme nvme10: frozen state error detected, reset
> controller
> [ 2217.096928] nvme nvme0: frozen state error detected, reset
> controller
> [ 2217.118947] nvme nvme18: frozen state error detected, reset
> controller
> [ 2217.138945] nvme nvme6: frozen state error detected, reset
> controller
> [ 2217.164918] nvme nvme14: frozen state error detected, reset
> controller
> [ 2217.186902] nvme nvme20: frozen state error detected, reset
> controller
> [ 2279.420266] nvme 0000:1a:00.0: Unable to change power state from
> D3cold to D0, device inaccessible
> [ 2279.420329] nvme nvme22: Disabling device after reset failure: -19
> [ 2279.464727] pcieport 0000:12:13.0: AER: device recovery failed
> [ 2279.464823] pcieport 0000:12:13.0: pciehp: pcie_do_write_cmd: no
> response from device
> 
> Port resets and recovers
> 
> [ 2279.593196] pcieport 0000:10:01.1: AER: Root Port link has been
> reset (0)
> [ 2279.593699] nvme nvme4: restart after slot reset
> [ 2279.593949] nvme nvme10: restart after slot reset
> [ 2279.594222] nvme nvme0: restart after slot reset
> [ 2279.594453] nvme nvme18: restart after slot reset
> [ 2279.594728] nvme nvme6: restart after slot reset
> [ 2279.594984] nvme nvme14: restart after slot reset
> [ 2279.595226] nvme nvme20: restart after slot reset
> [ 2279.595435] pcieport 0000:12:13.0: pciehp: Slot(19): Card present
> [ 2279.595441] pcieport 0000:12:13.0: pciehp: Slot(19): Link Up
> [ 2279.609081] nvme nvme4: Shutdown timeout set to 8 seconds
> [ 2279.617532] nvme nvme0: Shutdown timeout set to 8 seconds
> [ 2279.617533] nvme nvme14: Shutdown timeout set to 8 seconds
> [ 2279.618028] nvme nvme6: Shutdown timeout set to 8 seconds
> [ 2279.618207] nvme nvme18: Shutdown timeout set to 8 seconds
> [ 2279.618290] nvme nvme10: Shutdown timeout set to 8 seconds
> [ 2279.618308] nvme nvme20: Shutdown timeout set to 8 seconds
> [ 2279.631961] nvme nvme4: 32/0/0 default/read/poll queues
> [ 2279.643293] nvme nvme14: 32/0/0 default/read/poll queues
> [ 2279.643372] nvme nvme0: 32/0/0 default/read/poll queues
> [ 2279.644881] nvme nvme6: 32/0/0 default/read/poll queues
> [ 2279.644966] nvme nvme10: 32/0/0 default/read/poll queues
> [ 2279.645030] nvme nvme18: 32/0/0 default/read/poll queues
> [ 2279.645132] nvme nvme20: 32/0/0 default/read/poll queues
> [ 2279.645202] pcieport 0000:10:01.1: AER: device recovery successful
> 
> 2. On any kernel (latest upstream 6.11, RHEL8, or RHEL9), the same
> command causes a machine check and panics the box when it is run
> against an NVMe device in a PCIe slot:
> 
> [  263.862919] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5
> Bank 6: ba00000000000e0b
> [  263.862924] mce: [Hardware Error]: RIP !INEXACT!
> 10:<ffffffff8571dce4> {intel_idle+0x54/0x90}
> [  263.862931] mce: [Hardware Error]: TSC 7a47d8d62ba6dd MISC 83100000 
> [  263.862933] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1727384194
> SOCKET 1 APIC 40 microcode d0003a5
> [  263.862936] mce: [Hardware Error]: Run the above through 'mcelog --
> ascii'
> [  263.885254] mce: [Hardware Error]: Machine check: Processor context
> corrupt
> [  263.885259] Kernel panic - not syncing: Fatal machine check
> 
> Hardware event. This is not a software error.
> CPU 0 BANK 0 TSC 7a47d8d62ba6dd 
> RIP !INEXACT! 10:ffffffff8571dce4
> TIME 1727384194 Thu Sep 26 16:56:34 2024
> MCG status:
> MCi status:
> Machine check not valid
> Corrected error
> MCA: No Error
> STATUS 0 MCGSTATUS 0
> CPUID Vendor Intel Family 6 Model 106 Step 6
> RIP: intel_idle+0x54/0x90
> SOCKET 1 APIC 40 microcode d0003a5
> Run the above through 'mcelog --ascii'
> Machine check: Processor context corrupt
> 
> Regards
> Laurence
> 
> 
> 
I think Keith's email address is not correct; adding his correct email address here.

BTW, Keith recently helped fix an issue in kernel v6.11 with the nvme subsystem-reset command to
ensure that we recover the NVMe disk on PPC. On the PPC architecture we use EEH to recover the disk
after a subsystem reset, but yours is an Intel machine, which uses AER for recovery. So I wonder
whether that same commit, 210b1f6576e8 ("nvme-pci: do not directly handle subsys reset fallout"),
which was merged in kernel v6.11, is causing a side effect on the Intel machine.
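For background, the subsystem-reset command itself is just a thin wrapper over a controller ioctl; the driver then writes the NSSR register, and it is the recovery of the resulting link loss (EEH on PPC, AER on x86) that the commit above changed. A minimal sketch of the user-space side, with the ioctl number taken from linux/nvme_ioctl.h (the device path is only an example):

```python
import fcntl
import os

# NVME_IOCTL_SUBSYS_RESET is _IO('N', 0x45) in linux/nvme_ioctl.h:
# dir = none, size = 0, type = 'N' (0x4e), nr = 0x45.
NVME_IOCTL_SUBSYS_RESET = (ord('N') << 8) | 0x45  # == 0x4e45

def subsystem_reset(ctrl_path="/dev/nvme0"):
    """Ask the driver to perform an NVM subsystem reset.

    The kernel side writes the magic value 0x4E564D65 ("NVMe") to the
    controller's NSSR register; how the driver handles the fallout of
    that write is what differs between the old and new behavior.
    """
    fd = os.open(ctrl_path, os.O_RDWR)
    try:
        fcntl.ioctl(fd, NVME_IOCTL_SUBSYS_RESET)
    finally:
        os.close(fd)
```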

Would you please revert the above commit and see if that helps fix the observed symptom on your
Intel machine?
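If it helps compare the two cases while you test, the AER fields in your logs decode directly from the PCIe Uncorrectable Error Status register layout; a small decoder sketch, with bit names abbreviated the same way the kernel's AER strings abbreviate them:

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register
# (PCIe Base Specification); names match the kernel's AER log strings.
UNCOR_BITS = {
    4:  "DLP",        # Data Link Protocol Error
    5:  "SDES",       # Surprise Down Error
    12: "TLP",        # Poisoned TLP
    13: "FCP",        # Flow Control Protocol Error
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    17: "RxOF",       # Receiver Overflow
    18: "MalfTLP",    # Malformed TLP
    19: "ECRC",       # ECRC Error
    20: "UnsupReq",   # Unsupported Request Error
    21: "ACSViol",    # ACS Violation
}

def decode_uncor(status):
    """Return the names of the bits set in an aer_uncor_status value."""
    return [name for bit, name in sorted(UNCOR_BITS.items())
            if status & (1 << bit)]
```

For example, the working-case log's error status/mask=00100000/00000000 decodes to just UnsupReq (bit 20), matching the "[20] UnsupReq" line the kernel printed.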

Thanks,
--Nilay







