way to unbind a bad nvme device/controller without powering off system

Keith Busch kbusch at kernel.org
Mon Oct 24 15:36:28 PDT 2022


On Mon, Oct 24, 2022 at 05:40:30PM -0400, James Puthukattukaran wrote:
> Hi -
> 
> I'm seeing a scenario with what seems to be a non-functioning nvme controller/drive: the IO transactions are timing out and the controller is not responding to any controller commands. The controller seems to be disabled (nvme_dev_disable called via nvme_timeout), but the nvme_reset_work thread is still blocked and not making progress. I tried to remove the controller via the hotplug (HP) sysfs interface, and that also hangs behind the reset thread, waiting for it to complete.

If it's in a hotplug slot, then just pull it out.
 
> I thought the disable controller path does not talk to the controller and simply unblocks the queues and cleans them out before unbinding the controller from the device. I'm not sure why the reset thread is still stuck, then. Does the reset thread have to run its course even though the controller has been disabled? I'm trying to understand the flow here.
> 
> I guess what I'm really looking for is a way to simply unbind the device from the driver, kill any threads, and allow the device to be powered off via the hotplug interface (trying to avoid rebooting the system to remove the device).

What kernel are you using?

Generally, the default timeout is really long. If you have a broken
controller, it could take several minutes before the driver unblocks
forward progress to unbind.
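A minimal sketch for checking how long the driver may wait, assuming a kernel whose nvme_core module exposes its io_timeout, admin_timeout and max_retries parameters under /sys/module/nvme_core/parameters (recent mainline kernels do; older kernels may lack some of them, and the function name read_nvme_core_params is hypothetical). The timeouts are in seconds. How these values combine into the total wait before unbind can proceed is not spelled out in the thread, so the sketch only reports them.

```python
from pathlib import Path

# Parameter names as exposed by mainline nvme_core (assumption: availability
# varies by kernel version, so missing files are simply skipped).
PARAMS = ("io_timeout", "admin_timeout", "max_retries")


def read_nvme_core_params(sysfs_root="/sys"):
    """Return a dict mapping nvme_core parameter names to integer values.

    Parameters that are absent (module not loaded, older kernel) or that
    cannot be parsed as integers are left out of the result.
    """
    base = Path(sysfs_root) / "module" / "nvme_core" / "parameters"
    values = {}
    for name in PARAMS:
        try:
            values[name] = int((base / name).read_text().strip())
        except (FileNotFoundError, PermissionError, ValueError):
            continue
    return values


if __name__ == "__main__":
    params = read_nvme_core_params()
    if not params:
        print("nvme_core parameters not found (module not loaded?)")
    for name, value in params.items():
        print(f"{name} = {value}")
```

If the values turn out to be long for a given setup, module parameters like these can usually be set at boot on the kernel command line (for example nvme_core.io_timeout=N), though that only shortens future waits and does not rescue a reset that is already stuck.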



More information about the Linux-nvme mailing list