[External] : Re: way to unbind a bad nvme device/controller without powering off system
James Puthukattukaran
james.puthukattukaran at oracle.com
Mon Oct 24 17:02:33 PDT 2022
On 10/24/22 18:36, Keith Busch wrote:
> On Mon, Oct 24, 2022 at 05:40:30PM -0400, James Puthukattukaran wrote:
>> Hi -
>>
>> I'm seeing a scenario with what seems to be a non-functioning nvme controller/drive: the IO transactions are timing out and the controller is not responding to any controller commands. The controller seems to be disabled (nvme_dev_disable called via nvme_timeout), but we're still seeing the nvme_reset_work thread blocked and not making progress. I tried to remove the controller via the hotplug (HP) sysfs interface, and that also hangs behind the reset thread, waiting for it to complete.
>
> If it's in a hotplug slot, then just pull it out.
I'm looking for a programmatic (remote) way to do it. Also, pulling the drive will cause a surprise removal; won't that leave the nvme controller data structure in a bad state and not unbound from the driver?
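
For reference, the generic PCI way to request an unbind from software is to write the device's PCI address to the driver's sysfs "unbind" file. A minimal Python sketch of that follows. The address 0000:3b:00.0 in the usage note is made up, and the sysfs_root parameter is only there so the function can be tried against a scratch directory. This goes through the same driver remove path, so with a wedged reset thread the write may simply block as well.

```python
import os
import re

# PCI address in domain:bus:device.function form, e.g. 0000:3b:00.0
BDF_RE = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def unbind(bdf, driver="nvme", sysfs_root="/sys"):
    """Ask the PCI core to unbind `bdf` from `driver`.

    Writes the address to <sysfs_root>/bus/pci/drivers/<driver>/unbind and
    returns the path written. On a real system this needs root, and the
    write does not return until the driver's remove callback finishes.
    """
    if not BDF_RE.match(bdf):
        raise ValueError("not a PCI address: %r" % (bdf,))
    path = os.path.join(sysfs_root, "bus", "pci", "drivers", driver, "unbind")
    with open(path, "w") as f:
        f.write(bdf)
    return path
```

Usage would be something like unbind("0000:3b:00.0") run as root from a remote shell, ideally from a process that can be abandoned if the write never returns.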
>
>> I thought the disable controller path does not talk to the controller and simply unblocks the queues and cleans them out before unbinding the controller from the device. Not sure why the reset thread is still stuck then? Does the reset thread have to finish its course even though the controller has been disabled? Trying to understand the flow here.
>>
>> I guess what I'm really looking for is a way to simply unbind the device from the driver, kill any threads and allow the device to be powered off via the hotplug interface (trying to avoid rebooting the system to remove the device).
>
> What kernel are you using?
5.14 based kernel
>
> Generally, the default timeout is really long. If you have a broken
> controller, it could take several minutes before the driver unblocks
> forward progress to unbind.
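
For anyone wanting to check what timeouts are in effect, the nvme_core module exposes io_timeout and admin_timeout (in seconds) as module parameters under sysfs. A small Python sketch that reads them is below. The sysfs_root argument is just an assumption for testing against a scratch directory, and whether these two values fully account for the stall discussed here is not established in this thread.

```python
import os


def read_nvme_timeouts(sysfs_root="/sys"):
    """Return the nvme_core io_timeout and admin_timeout values in seconds.

    A parameter that is missing or unparsable is reported as None, for
    example when nvme_core is not loaded as a module.
    """
    params = os.path.join(sysfs_root, "module", "nvme_core", "parameters")
    result = {}
    for name in ("io_timeout", "admin_timeout"):
        try:
            with open(os.path.join(params, name)) as f:
                result[name] = int(f.read().strip())
        except (FileNotFoundError, ValueError):
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_nvme_timeouts().items():
        print(name, value)
```

Several timed-out commands in a row, each waiting its full timeout, is one way a dead controller could add up to minutes before the driver gives up.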
One concern is that the reset controller flow attempts to reinitialize the controller, and this will cause problems if the controller is bad. Would it make sense to have a sysfs "remove_controller" interface that simply goes through and does an nvme_dev_disable() with the assumption that the controller is dead? Will the nvme_kill_queues() in nvme_dev_disable() unwedge any potential nvme reset thread that is blocked and thus allow the nvme_remove() flow to complete?
thanks