[PATCH V3 7/8] nvme: pci: recover controller reliably
Ming Lei
ming.lei at redhat.com
Thu May 3 21:24:52 PDT 2018
On Thu, May 03, 2018 at 11:46:56PM +0800, jianchao.wang wrote:
> Hi Ming
>
> Thanks for your kindly response.
>
> On 05/03/2018 06:08 PM, Ming Lei wrote:
> > nvme_eh_reset() can move on, if controller state is either CONNECTING or
> > RESETTING, nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING) won't
> > be called in nvme_eh_reset(), and nvme_pre_reset_dev() will be called
> > directly, then follows scheduling of the eh_reset_work.
>
> We have to consider another context, nvme_reset_work.
> There are two contexts that will invoke nvme_pre_reset_dev.
> nvme_eh_reset could invoke nvme_pre_reset_dev even if the state is RESETTING or CONNECTING.
>
> nvme_error_handler nvme_reset_ctrl
> -> change state to RESETTING
> -> queue reset work
> nvme_reset_work
> -> nvme_dev_disable
> -> nvme_eh_reset
> -> nvme_pre_reset_dev -> nvme_pre_reset_dev
> -> change state to CONNECTING -> change state fails
> -> nvme_remove_dead_ctrl
> There seems to be other race scenarios between them.
IMO it is fine to change the change state in nvme_pre_reset_dev()
as the following way:
if (dev->ctrl.state != NVME_CTRL_CONNECTING)
if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING))
goto out;
Also dev->reset_lock has been introduced, and we may hold it for
nvme_pre_reset_dev() in the path of nvme_reset_ctrl() too, then
there isn't the above race any more.
>
> Just invoke nvme_dev_disable in nvme_error_handler context and hand over the other things
> to nvme_reset_work as the v2 patch series seems clearer.
That way may not fix the current race: nvme_dev_disable() will
quiesce/freeze queue again when resetting is in-progress, then
nothing can move on.
One big change in V3 is to run nvme_dev_disable() & nvme_pre_reset_dev()
in the EH thread, and make sure queues are started once
nvme_pre_reset_dev() returns. But as you pointed, it may not work well
enough if admin requests are timed-out in nvme_pre_reset_dev().
> The primary target of nvme_error_handler is to drain IOs and disable controller.
> reset_work could take over the left things of the recovery work
Draining IOs isn't necessary, we can remove freezing queues & wait_freeze()
from nvme_dev_disable() & nvme_reset_work() for non-shutdown, that can be
easy to fix all these races since the .eh_reset_work isn't needed any more,
because quiescing is enough to prevent IOs from entering driver/device.
The only issue is that more requests may be failed when reset failed,
so I don't want to try that, and just follows the current behaviour.
IMO the primary goal of nvme EH is to recover controller, that is
what nvme_dev_disable() and resetting should do, if IO isn't drained,
timeout still may be triggered.
The big trouble is about percpu_ref, once the q_usage_counter is killed
to start freezing for preventing IO from entering block layer, we have
to run blk_mq_freeze_queue_wait() before unfreezing the usage couner. If
blk_mq_freeze_queue_wait() isn't needed, things can be much easier for us.
>
> IMO, modify the nvme_reset_ctrl_sync in nvme_error_handler to nvme_reset_ctrl should be ok.
> If the admin ios in reset_work timeout, the nvme_error_handler could also provide help to complete them.
>
Right, that is one goal of this patchset.
> > There are actually two issues here, both are covered in this patchset:
> >
> > 1) when nvme_wait_freeze() is for draining IO, controller error may
> > happen again, this patch can shutdown & reset controller again to
> > recover it.
> >
>
> We could split the nvme_reset_work into two contexts instead of introduce another nvme_eh_reset.
> Otherwise, the code looks really complicated.
Yeah, that is what I plan to do in V4, introduce .eh_pre_reset_work and
.eh_post_reset_work, and run the following things in .eh_pre_reset_work:
- nvme_pre_reset_dev()
- schedule .eh_post_reset_work
and run draining IO & updating controller state in .eh_post_reset_work.
.eh_pre_reset_work will be scheduled in nvme_error_handler().
This way may address the issue of admin req timeout in nvme_pre_reset_dev(),
what do you think of this approach?
Thanks,
Ming
More information about the Linux-nvme
mailing list