[PATCH] blk-mq: avoid to hang in the cpuhp offline handler

Ming Lei ming.lei at redhat.com
Thu Sep 22 02:13:28 PDT 2022


On Thu, Sep 22, 2022 at 09:47:09AM +0100, John Garry wrote:
> On 20/09/2022 03:17, Ming Lei wrote:
> > To avoid triggering an IO timeout when one hctx becomes inactive, we
> > drain in-flight IOs when all CPUs of that hctx are offline. However, a
> > driver's timeout handler may require cpus_read_lock. nvme-pci is one
> > example: pci_alloc_irq_vectors_affinity() is called in the nvme-pci
> > reset context, and irq_build_affinity_masks() needs cpus_read_lock().
> > 
> > Meanwhile, when blk-mq's cpuhp offline handler is called,
> > cpus_write_lock is held, so a deadlock is caused.
> > 
> > Fix the issue by breaking out of the wait loop after a long enough
> > time has elapsed; the in-flight IOs that were not drained can still be
> > handled by the timeout handler.
> 
> I don't think that this is a good idea - that is because drivers often
> cannot safely handle the scenario of a timeout for an IO which has
> actually completed. The NVMe timeout handler may poll for completion, but
> SCSI's does not.
> 
> Indeed, if we were going to allow the timeout handler to handle these
> in-flight IOs then there would be no point in having this hotplug handler
> in the first place.

What you say is true, and has been from the beginning - we did know that
point. I remember Hannes asking this question at LSF/MM, and there are many
drivers which don't implement a timeout handler.

This particular issue looks more nvme specific, since the nvme timeout
handler can't make progress while an nvme reset is in progress. Let's see
if it can be fixed in the nvme driver.
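
To spell out the dependency as described in the patch log (a simplified
picture; the exact call chains inside the reset path are elided):

	cpuhp offline context                   nvme-pci reset context
	---------------------                   ----------------------
	cpus_write_lock() held                  pci_alloc_irq_vectors_affinity()
	blk_mq_hctx_notify_offline()              irq_build_affinity_masks()
	  while (blk_mq_hctx_has_requests())        cpus_read_lock()
	    msleep(5);                                -> blocked on cpus_write_lock

The offline handler waits for in-flight IO that can only be dealt with once
the blocked reset (and the timeout handling stuck behind it) makes progress,
so neither side can move forward.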

BTW, nvme error handling is really fragile, and not only in this case - for
example, any timeout during reset causes the device to be removed.


Thanks.
Ming
