[PATCH V6 11/11] nvme: pci: support nested EH

Ming Lei ming.lei at redhat.com
Wed May 16 16:10:59 PDT 2018


Hi Keith,

On Wed, May 16, 2018 at 08:12:42AM -0600, Keith Busch wrote:
> Hi Ming,
> 
> I'm sorry, but I am frankly not on board with introducing yet
> another work-queue into this driver for handling this situation. The
> fact you missed syncing with this queue in the surprise remove case,
> introducing various use-after-free conditions, just demonstrates exactly

blk_cleanup_queue() drains all requests, and all timeout activity is
quiesced, so could you explain what the use-after-free conditions are?
Also, we can simply wait on 'eh_wq' until dev->nested_eh drops to zero
before removing 'nvme_dev'.
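
Something along these lines should do it (a minimal sketch only; the
eh_wq/nested_eh fields come from this patchset, the helper name is made
up for illustration):

	#include <linux/wait.h>

	/*
	 * Sketch: called from the remove path before 'nvme_dev' is freed,
	 * so no EH context can touch the device after it is gone.
	 */
	static void nvme_wait_for_nested_eh(struct nvme_dev *dev)
	{
		wait_event(dev->eh_wq, READ_ONCE(dev->nested_eh) == 0);
	}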

> how over-complicated this approach is. That, and the forced controller

Right now controller shutdown and reset run in different contexts, and
that has caused real trouble, namely several kinds of IO hang:

	https://marc.info/?l=linux-block&m=152644353006033&w=2

This patch moves the two into the same context. That is a real
simplification, and it avoids the IO hang caused by the race between
starting the freeze and nvme_wait_freeze(), which currently run from
different contexts.

The synchronization between all the EH contexts may look a bit
involved, but eh_wq is only introduced to wait until recovery is done
for all EHs, so the new EH model isn't really complicated.
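
To make that concrete, the sync is basically just reference counting of
the EH contexts, roughly as below (illustration only; eh_wq/nested_eh
are the fields from the patch, the helper names and the atomic_t type
are simplified here):

	#include <linux/atomic.h>
	#include <linux/wait.h>

	/* each EH context announces itself on entry... */
	static void nvme_eh_enter(struct nvme_dev *dev)
	{
		atomic_inc(&dev->nested_eh);
	}

	/* ...and the last one to finish wakes up whoever waits for recovery */
	static void nvme_eh_done(struct nvme_dev *dev)
	{
		if (atomic_dec_and_test(&dev->nested_eh))
			wake_up_all(&dev->eh_wq);
	}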

I hope all these issues can be fixed, and I am open to any solution,
but unfortunately I haven't seen any workable way yet other than this
patchset.

> state transitions is yet another way surprise removal will break as it's
> depending on the state machine to prevent certain transitions.

We can restrict the transitions allowed in
nvme_force_change_ctrl_state() too, and that can be documented well.
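
For example (a sketch only; nvme_force_change_ctrl_state() is the
helper from this series, but the particular whitelist below is just an
illustration of how the limit could be enforced and documented):

	/*
	 * Sketch: only allow forcing a small, documented set of transitions;
	 * everything else keeps going through nvme_change_ctrl_state().
	 */
	static bool nvme_force_change_ctrl_state(struct nvme_ctrl *ctrl,
						 enum nvme_ctrl_state new_state)
	{
		bool changed = false;
		unsigned long flags;

		spin_lock_irqsave(&ctrl->lock, flags);
		switch (new_state) {
		case NVME_CTRL_RESETTING:
			/* forcing RESETTING only makes sense from these states */
			changed = ctrl->state == NVME_CTRL_LIVE ||
				  ctrl->state == NVME_CTRL_CONNECTING;
			break;
		default:
			break;
		}
		if (changed)
			ctrl->state = new_state;
		spin_unlock_irqrestore(&ctrl->lock, flags);

		return changed;
	}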

> 
> The driver is already in a work queue context when it becomes aware of
> corrective action being necessary. Seriously, simply syncing these in the

Yes, it is in the reset work function, which is part of recovery, but
admin or normal IO can time out in that context too; that is why I
introduced the new work item to cover this case.

I think we have to handle IO timeouts during reset.
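
Roughly what the extra work item buys us (sketch only; 'eh_work' stands
for the nested EH work added by this patch, and the helper name is made
up):

	/*
	 * Sketch: a timeout that fires while the reset work is running
	 * cannot synchronously wait for reset to finish (reset may be
	 * blocked on this very request), so it hands recovery over to
	 * the EH work instead.
	 */
	static void nvme_eh_schedule(struct nvme_dev *dev)
	{
		if (dev->ctrl.state == NVME_CTRL_RESETTING)
			queue_work(nvme_reset_wq, &dev->eh_work);
	}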

In the block/011 test it is easy to observe that a 3rd EH is started to
recover from the whole failure, and it finally succeeds.

> reset covers nearly all conditions you're concerned with, most of which

I believe I have already explained to you why that isn't enough:

https://marc.info/?l=linux-block&m=152599770314638&w=2
https://marc.info/?l=linux-block&m=152598984210852&w=2
https://marc.info/?l=linux-block&m=152598664409170&w=2

> will be obviated if Bart's blk-mq timeout rework is added. The only case
> that isn't covered is if IO stops when renumbering the hardware contexts
> (unlikely as that is), and that's easily fixable just moving that into
> the scan_work.

How can Bart's patch cover the multiple-namespace case? Timeouts from
multiple namespaces can still arrive at the same time.

> 
> As far as blktests block/011 is concerned, I think this needs to be
> rethought considering what it's actually showing us. The fact the
> pci driver provides such an easy way to not only muck with PCI config
> register *and* internal kernel structures out from under a driver that's
> bound to it is insane.  If PCI really wants to provide this sysfs entry,
> it really ought to notify bound drivers that this is occurring, similar
> to the 'reset' sysfs.

Everything that block/011 simulates can happen in reality.

When one IO times out, IO dispatched during the reset can time out too.

Unfortunately the current NVMe driver can't handle that, and the reset
work will hang forever. I don't believe that is acceptable behaviour
for EH.

> 
> Anyway, there is merit to some of your earlier patches. In particular,
> specifically patches 2, 4, and 5.  On the timing out the host memory
> releasing (patch 2), I would just rather see this as a generic API,
> though:
> 
>   http://lists.infradead.org/pipermail/linux-nvme/2018-January/015313.html
>   http://lists.infradead.org/pipermail/linux-nvme/2018-January/015314.html

The generic API may not be strictly necessary, since it is only needed
when called from nvme_dev_disable(), but it is still better to have it.

Apart from the above, you ignore one big source of IO hang: the race
between starting the freeze and blk_mq_freeze_queue_wait().

- nvme_dev_disable() and resetting the controller are both required for
    recovering the controller, but the two run from different contexts.
    nvme_start_freeze() is called from nvme_dev_disable(), which runs in
    timeout work context, while nvme_unfreeze() runs from the reset work
    context. Unfortunately a timeout may be triggered while the controller
    is being reset, so nvme_start_freeze() may be run several times.
    
- Also, two reset works may run one after the other, and this can cause
    nvme_wait_freeze() to hang forever too (see the sketch below).
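
The pairing problem can be seen with the plain block layer helpers
(illustration only, not patch code):

	#include <linux/blk-mq.h>

	/*
	 * Two EHs each run nvme_dev_disable() and start a freeze, but only
	 * one reset work completes and unfreezes: the freeze depth never
	 * drops back to zero, the queue stays frozen, and a later
	 * blk_mq_freeze_queue_wait()/nvme_wait_freeze() cannot make
	 * progress while in-flight IO can't be completed.
	 */
	static void demo_unbalanced_freeze(struct request_queue *q)
	{
		blk_freeze_queue_start(q);   /* 1st nvme_dev_disable(): depth = 1 */
		blk_freeze_queue_start(q);   /* 2nd nvme_dev_disable(): depth = 2 */

		blk_mq_unfreeze_queue(q);    /* only one nvme_unfreeze(): depth = 1 */

		blk_mq_freeze_queue_wait(q); /* may wait forever */
	}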

In short, I'd like to see all the issues described in the following
commit log fixed:

	https://marc.info/?l=linux-block&m=152644353006033&w=2

If you think there are better ways, let's talk in code. Again, I am
open to any solution.

Thanks,
Ming


