nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIe slot

Nilay Shroff nilay at linux.ibm.com
Fri Sep 27 06:06:51 PDT 2024



On 9/27/24 17:48, Laurence Oberman wrote:
> On Fri, 2024-09-27 at 11:40 +0530, Nilay Shroff wrote:
>>
>>
>> On 9/27/24 02:41, Laurence Oberman wrote:
>>> Hi Keith
>>> Hope all is well
>>>
>>> Quick question (expected or not)
>>>
>>> It was reported to Red Hat that users are seeing issues when using the
>>> "nvme subsystem-reset /dev/nvme0" command to test resets.
>>>
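For reference, what that command ultimately does is write the ASCII value
"NVMe" (0x4E564D65) to the NVM Subsystem Reset (NSSR) register at offset
0x20 of the controller's BAR0 (honored only when CAP.NSSRS is set);
nvme-cli drives it through the NVME_IOCTL_SUBSYS_RESET ioctl. A minimal
user-space sketch of the same register write, against a hypothetical BDF,
just to show the mechanism:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NVME_REG_NSSR   0x20         /* NVM Subsystem Reset register  */
#define NVME_NSSR_MAGIC 0x4E564D65u  /* "NVMe" in ASCII, per the spec */

int main(void)
{
	/* Hypothetical BDF; substitute the controller under test. */
	const char *bar0 = "/sys/bus/pci/devices/0000:1a:00.0/resource0";
	int fd = open(bar0, O_RDWR | O_SYNC);
	if (fd < 0) { perror("open"); return 1; }

	volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				       MAP_SHARED, fd, 0);
	if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

	/* The controller drops off the bus right after this write. */
	regs[NVME_REG_NSSR / 4] = NVME_NSSR_MAGIC;

	munmap((void *)regs, 4096);
	close(fd);
	return 0;
}

From that point on, recovery is entirely up to the platform's PCIe error
handling, which is where the two cases below diverge.
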
>>> On multiple servers I tested two types of NVMe-attached devices.
>>> These are not the rootfs devices.
>>>
>>> 1. The front-slot (hotplug) devices in a 2.5in format
>>> reset and, after some time, recover (which is expected)
>>>
>>> Example of one working
>>>
>>> It does not trap and end up as a machine check.
>>>
>>> [ 2215.440468] pcieport 0000:10:01.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:12:13.0
>>> [ 2215.440532] pcieport 0000:12:13.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
>>> [ 2215.440536] pcieport 0000:12:13.0:   device [10b5:8748] error status/mask=00100000/00000000
>>> (First)
>>> [ 2215.440544] pcieport 0000:12:13.0: AER:   TLP Header: 40009001 1000000f e9211000 12000000
>>> [ 2215.441813] systemd-journald[2173]: Sent WATCHDOG=1 notification.
>>> [ 2216.937498] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
>>> [ 2216.937505] {1}[Hardware Error]: event severity: info
>>> [ 2216.937508] {1}[Hardware Error]:  Error 0, type: fatal
>>> [ 2216.937511] {1}[Hardware Error]:  fru_text: PcieError
>>> [ 2216.937514] {1}[Hardware Error]:   section_type: PCIe error
>>> [ 2216.937515] {1}[Hardware Error]:   port_type: 4, root port
>>> [ 2216.937517] {1}[Hardware Error]:   version: 0.2
>>> [ 2216.937519] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
>>> [ 2216.937522] {1}[Hardware Error]:   device_id: 0000:10:01.1
>>> [ 2216.937524] {1}[Hardware Error]:   slot: 3
>>> [ 2216.937525] {1}[Hardware Error]:   secondary_bus: 0x11
>>> [ 2216.937526] {1}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1453
>>> [ 2216.937528] {1}[Hardware Error]:   class_code: 060400
>>> [ 2216.937529] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0012
>>> [ 2216.937530] {1}[Hardware Error]:   aer_uncor_status: 0x00000000, aer_uncor_mask: 0x04500000
>>> [ 2216.937532] {1}[Hardware Error]:   aer_uncor_severity: 0x004e2030
>>> [ 2216.937532] {1}[Hardware Error]:   TLP Header: 00000000 00000000 00000000 00000000
>>> [ 2216.937629] pcieport 0000:10:01.1: AER: aer_status: 0x00000000, aer_mask: 0x04500000
>>> [ 2216.937634] pcieport 0000:10:01.1: AER: aer_layer=Transaction Layer, aer_agent=Receiver ID
>>> [ 2216.937638] pcieport 0000:10:01.1: AER: aer_uncor_severity: 0x004e2030
>>> [ 2216.937645] nvme nvme4: frozen state error detected, reset controller
>>> [ 2217.071095] nvme nvme10: frozen state error detected, reset controller
>>> [ 2217.096928] nvme nvme0: frozen state error detected, reset controller
>>> [ 2217.118947] nvme nvme18: frozen state error detected, reset controller
>>> [ 2217.138945] nvme nvme6: frozen state error detected, reset controller
>>> [ 2217.164918] nvme nvme14: frozen state error detected, reset controller
>>> [ 2217.186902] nvme nvme20: frozen state error detected, reset controller
>>> [ 2279.420266] nvme 0000:1a:00.0: Unable to change power state from D3cold to D0, device inaccessible
>>> [ 2279.420329] nvme nvme22: Disabling device after reset failure: -19
>>> [ 2279.464727] pcieport 0000:12:13.0: AER: device recovery failed
>>> [ 2279.464823] pcieport 0000:12:13.0: pciehp: pcie_do_write_cmd: no response from device
>>> Port resets and recovers
>>>
>>> [ 2279.593196] pcieport 0000:10:01.1: AER: Root Port link has been reset (0)
>>> [ 2279.593699] nvme nvme4: restart after slot reset
>>> [ 2279.593949] nvme nvme10: restart after slot reset
>>> [ 2279.594222] nvme nvme0: restart after slot reset
>>> [ 2279.594453] nvme nvme18: restart after slot reset
>>> [ 2279.594728] nvme nvme6: restart after slot reset
>>> [ 2279.594984] nvme nvme14: restart after slot reset
>>> [ 2279.595226] nvme nvme20: restart after slot reset
>>> [ 2279.595435] pcieport 0000:12:13.0: pciehp: Slot(19): Card present
>>> [ 2279.595441] pcieport 0000:12:13.0: pciehp: Slot(19): Link Up
>>> [ 2279.609081] nvme nvme4: Shutdown timeout set to 8 seconds
>>> [ 2279.617532] nvme nvme0: Shutdown timeout set to 8 seconds
>>> [ 2279.617533] nvme nvme14: Shutdown timeout set to 8 seconds
>>> [ 2279.618028] nvme nvme6: Shutdown timeout set to 8 seconds
>>> [ 2279.618207] nvme nvme18: Shutdown timeout set to 8 seconds
>>> [ 2279.618290] nvme nvme10: Shutdown timeout set to 8 seconds
>>> [ 2279.618308] nvme nvme20: Shutdown timeout set to 8 seconds
>>> [ 2279.631961] nvme nvme4: 32/0/0 default/read/poll queues
>>> [ 2279.643293] nvme nvme14: 32/0/0 default/read/poll queues
>>> [ 2279.643372] nvme nvme0: 32/0/0 default/read/poll queues
>>> [ 2279.644881] nvme nvme6: 32/0/0 default/read/poll queues
>>> [ 2279.644966] nvme nvme10: 32/0/0 default/read/poll queues
>>> [ 2279.645030] nvme nvme18: 32/0/0 default/read/poll queues
>>> [ 2279.645132] nvme nvme20: 32/0/0 default/read/poll queues
>>> [ 2279.645202] pcieport 0000:10:01.1: AER: device recovery successful
>>>
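What you're seeing in case 1 is the generic PCI error-recovery flow: AER
freezes the channel, each affected driver's error_detected() hook quiesces
I/O and asks for a reset, and slot_reset()/resume() bring the devices back
(the "frozen state error detected" and "restart after slot reset" lines are
printed from nvme-pci's handlers). A minimal sketch of that callback
wiring; the demo_* names are illustrative stand-ins, not the literal
nvme-pci symbols:

#include <linux/pci.h>

static pci_ers_result_t demo_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	if (state == pci_channel_io_frozen) {
		/* AER froze the link: quiesce I/O and request a reset. */
		dev_warn(&pdev->dev, "frozen state error detected\n");
		return PCI_ERS_RESULT_NEED_RESET;
	}
	/* Permanent failure: give the device up. */
	return PCI_ERS_RESULT_DISCONNECT;
}

static pci_ers_result_t demo_slot_reset(struct pci_dev *pdev)
{
	/* Root-port reset is done and the link is back: restart it. */
	dev_info(&pdev->dev, "restart after slot reset\n");
	return PCI_ERS_RESULT_RECOVERED;
}

static void demo_resume(struct pci_dev *pdev)
{
	/* Recovery finished; unfreeze queues and resume normal I/O. */
}

static const struct pci_error_handlers demo_err_handler = {
	.error_detected	= demo_error_detected,
	.slot_reset	= demo_slot_reset,
	.resume		= demo_resume,
};

In case 2 the error never reaches this path: the platform escalates it to a
fatal machine check before software recovery can run.
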
>>> 2. On any kernel (latest upstream 6.11, RHEL8, or RHEL9) it causes
>>> a machine check and panics the box when run against an NVMe in a
>>> PCIe slot.
>>>
>>> [  263.862919] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 6: ba00000000000e0b
>>> [  263.862924] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8571dce4> {intel_idle+0x54/0x90}
>>> [  263.862931] mce: [Hardware Error]: TSC 7a47d8d62ba6dd MISC 83100000
>>> [  263.862933] mce: [Hardware Error]: PROCESSOR 0:606a6 TIME 1727384194 SOCKET 1 APIC 40 microcode d0003a5
>>> [  263.862936] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>>> [  263.885254] mce: [Hardware Error]: Machine check: Processor context corrupt
>>> [  263.885259] Kernel panic - not syncing: Fatal machine check
>>>
>>> Hardware event. This is not a software error.
>>> CPU 0 BANK 0 TSC 7a47d8d62ba6dd 
>>> RIP !INEXACT! 10:ffffffff8571dce4
>>> TIME 1727384194 Thu Sep 26 16:56:34 2024
>>> MCG status:
>>> MCi status:
>>> Machine check not valid
>>> Corrected error
>>> MCA: No Error
>>> STATUS 0 MCGSTATUS 0
>>> CPUID Vendor Intel Family 6 Model 106 Step 6
>>> RIP: intel_idle+0x54/0x90
>>> SOCKET 1 APIC 40 microcode d0003a5
>>> Run the above through 'mcelog --ascii'
>>> Machine check: Processor context corrupt
>>>
>>> Regards
>>> Laurence
>>>
>>>
>>>
>> I think Keith's email address is not correct. Adding the correct
>> email address for Keith here.
>>
>> BTW, Keith recently helped fix an issue in kernel v6.11 with the nvme
>> subsystem-reset command to ensure that we recover the NVMe disk on PPC.
>> On the PPC architecture we use EEH to recover the disk after a
>> subsystem-reset, but yours is an Intel machine, which uses AER for
>> recovery. So I'm not sure whether that same commit,
>> 210b1f6576e8 ("nvme-pci: do not directly handle subsys reset fallout"),
>> merged in kernel v6.11, is causing a side effect on the Intel machine.
>>
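Conceptually, what that commit changes looks like this; an illustrative
sketch of the idea, not the literal diff:

#include <linux/io.h>

static void demo_nvme_subsystem_reset(void __iomem *bar)
{
	/* NSSR lives at offset 0x20; "NVMe" in ASCII triggers the reset */
	writel(0x4E564D65, bar + 0x20);

	/* Flush the posted write; the link is expected to drop here. */
	readl(bar + 0x20);

	/*
	 * Deliberately no teardown here: the dead link is left for the
	 * platform error handler (EEH on powerpc, AER on Intel) to
	 * notice and drive recovery through the driver's
	 * pci_error_handlers callbacks.
	 */
}

As I understand it, before the change the driver handled the fallout
itself and disabled the controller, which got in the way of the EEH
recovery we rely on.
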
>> Would you please revert the above commit and see if that helps fix the
>> observed symptom on your Intel machine?
>>
>> Thanks,
>> --Nilay
>>
>>
>>
>>
>>
> Hello Nilay
> Thanks, will try that.
> Was your IBM PPC issue also only with directly attached, PCIe slot-based
> NVMe?
> Will report back after testing with the revert.
> 
On PPC, it doesn't matter whether the NVMe disk is attached directly to the
PHB or through another PCIe bridge. On PPC we saw that when the nvme
subsystem-reset command is executed against an NVMe disk, EEH couldn't
recover the disk, and that's where the above commit (from Keith) helped: it
allows the disk to be recovered through EEH after the subsystem-reset
command.

Thanks,
--Nilay




