[External] : Re: way to unbind a bad nvme device/controller without powering off system

Mon Oct 24 19:26:54 PDT 2022

On Mon, Oct 24, 2022 at 08:02:33PM -0400, James Puthukattukaran wrote:
> On 10/24/22 18:36, Keith Busch wrote:
> 
> > 
> > Generally, the default timeout is really long. If you have a broken
> > controller, it could take several minutes before the driver unblocks
> > forward progress to unbind.
> One concern is that the reset controller flow attempts to reinitialize
> the controller, and this will cause problems if the controller is bad.
> Would it make sense to have a sysfs "remove_controller" interface that
> simply goes through and does a nvme_dev_disable() with the assumption
> that the controller is dead? Will the nvme_kill_queues() in
> nvme_dev_disable() unwedge any potential nvme reset thread that is
> blocked and thus allow the nvme_remove() flow to complete?
> thanks
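
For reference on the timeouts mentioned above: the values the driver is
currently using can usually be read back from the nvme_core module
parameters. A minimal sketch, assuming the parameters are exposed under
/sys/module/nvme_core/parameters as io_timeout and admin_timeout (in
seconds); the helper name read_nvme_timeouts is made up for illustration:

```python
from pathlib import Path

# Assumed location of the nvme_core module parameters; this may differ
# if the driver is not loaded or sysfs is mounted elsewhere.
DEFAULT_PARAMS_DIR = "/sys/module/nvme_core/parameters"


def read_nvme_timeouts(params_dir=DEFAULT_PARAMS_DIR):
    """Return {parameter name: seconds, or None if it cannot be read}."""
    timeouts = {}
    for name in ("io_timeout", "admin_timeout"):
        try:
            text = (Path(params_dir) / name).read_text()
            timeouts[name] = int(text.strip())
        except (OSError, ValueError):
            # Missing file, no permission, or unexpected content.
            timeouts[name] = None
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_nvme_timeouts().items():
        shown = seconds if seconds is not None else "unavailable"
        print(f"{name}: {shown}")
```

A broken controller can stall for a multiple of these values before the
driver gives up, so they only give a lower bound on how long unbinding
may appear stuck.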

In your log snippet, there's this line:

  kernel:warning: [10416608.580157] nvme nvme3: I/O 209 QID 1 timeout, disable controller

The next action the driver takes after logging that is to drain any
outstanding IO through a forced reset, and all subsequent tasks *should*
be unblocked after that completes to allow the unbinding, so I don't
think adding any new sysfs knobs is going to help if it's not already
succeeding.
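
If it helps to find every occurrence of that message in a long log, here
is a rough sketch that pulls the fields out of lines shaped like the one
quoted above. The regular expression is based only on that single line,
so treat the format as an assumption; parse_timeout_line and the
dictionary keys are names invented for this example:

```python
import re

# Pattern derived from the one log line quoted in this thread; other
# kernel versions may word the message differently.
TIMEOUT_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<io>\d+) QID (?P<qid>\d+) "
    r"timeout, (?P<action>.+)$"
)


def parse_timeout_line(line):
    """Return the fields of an nvme timeout log line, or None if no match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "controller": match["ctrl"],
        "io_id": int(match["io"]),   # the number printed after "I/O"
        "qid": int(match["qid"]),    # queue ID the command was issued on
        "action": match["action"].strip(),
    }


if __name__ == "__main__":
    sample = ("kernel:warning: [10416608.580157] nvme nvme3: "
              "I/O 209 QID 1 timeout, disable controller")
    print(parse_timeout_line(sample))
```

Lines where the action is "disable controller" mark the point after
which the forced reset described above should drain outstanding I/O.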

The only other thing that looks odd is that one of your stuck tasks is a
user passthrough command, but that should have also been cleared out by
the reset. Do you know what command that process is sending? I'll need
to double check your kernel version to see if there's anything missing
in that driver to ensure the unbinding succeeds.