[PATCH] blk-mq: avoid hanging in the cpuhp offline handler

Ming Lei ming.lei at redhat.com
Thu Sep 22 00:41:47 PDT 2022


On Thu, Sep 22, 2022 at 08:25:17AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 20, 2022 at 10:17:24AM +0800, Ming Lei wrote:
> > To avoid triggering an io timeout when one hctx becomes inactive, we
> > drain IOs once all CPUs of that hctx are offline. However, a driver's
> > timeout handler may require cpus_read_lock(): in nvme-pci, for example,
> > pci_alloc_irq_vectors_affinity() is called in the reset context, and
> > irq_build_affinity_masks() needs cpus_read_lock().
> > 
> > Meanwhile, when blk-mq's cpuhp offline handler is called, cpus_write_lock
> > is held, so a deadlock is caused.
> > 
> > Fix the issue by breaking the wait loop once a long enough time has
> > elapsed; the in-flight IOs that were not drained can still be handled
> > by the timeout handler.
> 
> I'm not sure that this actually is a good idea on its own, and it kinda
> defeats the cpu hotplug processing.
> 
> So if I understand your log above correctly, the problem is that
> we have commands that would time out, and we escalate to a
> controller reset that is racing with the CPU unplug.

Yes. 

blk_mq_hctx_notify_offline() waits for in-flight requests while
cpus_write_lock() is held, since it runs in the cpuhp code path.

Meanwhile, nvme reset grabs dev->shutdown_lock, then calls
pci_alloc_irq_vectors_affinity() -> irq_build_affinity_masks(), which
waits for cpus_read_lock().

At the same time, nvme_dev_disable() can't make progress handling any io
timeout because dev->shutdown_lock is held by nvme reset, so the in-flight
IO can't be drained by blk_mq_hctx_notify_offline().

So this is a real IO deadlock between cpuhp and nvme reset.
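
For reference, a minimal sketch of what the bounded drain described in the
changelog could look like inside blk_mq_hctx_notify_offline(); the
BLK_MQ_OFFLINE_DRAIN_MS constant is hypothetical and only illustrates the
idea of giving up and leaving the rest to the timeout handler:

	/*
	 * Hypothetical sketch only (not the actual patch): bound the drain
	 * so the cpuhp handler cannot wait forever while cpus_write_lock
	 * is held.
	 */
	unsigned long end = jiffies + msecs_to_jiffies(BLK_MQ_OFFLINE_DRAIN_MS);

	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
		while (blk_mq_hctx_has_requests(hctx)) {
			if (time_after(jiffies, end)) {
				/* leave remaining in-flight IOs to the timeout handler */
				break;
			}
			msleep(5);
		}
		percpu_ref_put(&hctx->queue->q_usage_counter);
	}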


thanks,
Ming
