nvme: machine check when running nvme subsystem-reset /dev/nvme0 against direct attach via PCIE slot
Keith Busch
kbusch at kernel.org
Thu Oct 3 14:04:50 PDT 2024
On Thu, Sep 26, 2024 at 05:11:05PM -0400, Laurence Oberman wrote:
> It was reported to Red Hat, seeing issues with using a
> "nvme subsystem-reset /dev/nvme0" command to test resets.
I really dislike that command. The side effects are overkill for the pci
transport...
> On multiple servers I tested on two types of nvme attached devices
> These are not the rootfs devices
>
> 1. The front slot (hotplug) devices in a 2.5in format
> reset and after some time recover (what is expected)
>
> Example of one working
>
> Does not trap and land up as a machine-check
<snip>
> 2. Any kernel (latest upstream 6.11, RHEL8 or RHEL9) causes
> a machine check and panics the box when it's run against an nvme in a
> PCIE slot
>
> [ 263.862919] mce: [Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 6: ba00000000000e0b
> [ 263.862924] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8571dce4> {intel_idle+0x54/0x90}
So this wasn't failing before 6.11? As Nilay mentioned, there are some
changes to how nvme subsystem reset is handled. The main one is that
this ioctl no longer automatically triggers an nvme reset. I expected
that delayed recovery might happen, but machine checks are not expected.
If this was working before, I can only guess right now that the previous
behavior was accessing MMIO and config space sooner and triggered a
different error path. If you're successful with the PPC patch reverted,
I would be interested to hear about it.