[PATCH] nvme-pci: Harden drive presence detect in nvme_dev_disable()

Keith Busch kbusch at kernel.org
Thu May 12 07:33:00 PDT 2022


On Thu, May 12, 2022 at 04:03:39PM +0300, Max Gurtovoy wrote:
> On 5/6/2022 1:15 PM, Stefan Roese wrote:
> > On our ZynqMP system we observe, that a NVMe drive that resets itself
> > while doing a firmware update causes a Kernel crash like this:
> > 
> > [ 67.720772] pcieport 0000:02:02.0: pciehp: Slot(2): Link Down
> > [ 67.720783] pcieport 0000:02:02.0: pciehp: Slot(2): Card not present
> > [ 67.720795] nvme 0000:04:00.0: PME# disabled
> > [ 67.720849] Internal error: synchronous external abort: 96000010 [#1] PREEMPT SMP
> > [ 67.720853] nwl-pcie fd0e0000.pcie: Slave error
> > 
> > Analysis: When nvme_dev_disable() is called because of this PCIe hotplug
> > event, pci_is_enabled() is still true. And accessing the NVMe drive
> > which is currently not available as it's in reboot process causes this
> > "synchronous external abort" on this ARM64 platform.
> > 
> > This patch adds the pci_device_is_present() check as well, which returns
> > false in this "Card not present" hot-plug case. With this change, the
> > NVMe driver does not try to access the NVMe registers any more and the
> > FW update finishes without any problems.
> > 
> > Signed-off-by: Stefan Roese <sr at denx.de>
> > Cc: Keith Busch <kbusch at kernel.org>
> > Cc: Jens Axboe <axboe at fb.com>
> > Cc: Christoph Hellwig <hch at lst.de>
> > ---
> >   drivers/nvme/host/pci.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 97afeb898b25..c764a2a1d7f1 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2473,7 +2473,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
> >   	struct pci_dev *pdev = to_pci_dev(dev->dev);
> >   	mutex_lock(&dev->shutdown_lock);
> > -	if (pci_is_enabled(pdev)) {
> > +	if (pci_device_is_present(pdev) && pci_is_enabled(pdev)) {
> 
> sound weird to me that pci device can be enabled but not present.

pci_is_enabled() just checks a driver reference count. It doesn't actually
query the physical status of the device's link it references, but
pci_device_is_present() will do that.
 
> Also why we get hot unplug event for FW upgrade ?

It might happen if the device requires a Subsystem Reset to activate the new
firmware.
 
> is this the correct behavior ?

I don't know if the "synchronous external abort" is normal, or how severe of a
problem it indicates, but we normally expect accessing registers on an
inaccessible device to be mostly harmless.

> >   		u32 csts = readl(dev->bar + NVME_REG_CSTS);
> >   		if (dev->ctrl.state == NVME_CTRL_LIVE ||



More information about the Linux-nvme mailing list