[PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx

Thu Jul 1 02:13:49 PDT 2021

On Thu, Jul 01, 2021 at 10:00:27AM +0200, Hannes Reinecke wrote:
> On 7/1/21 1:59 AM, Ming Lei wrote:
> > On Wed, Jun 30, 2021 at 09:46:35PM +0200, Hannes Reinecke wrote:
> > > On 6/30/21 8:59 PM, Sagi Grimberg wrote:
> > > > 
> > > > > > > > Shouldn't we rather modify the tagset to only refer to
> > > > > > > > the current online
> > > > > > > > CPUs _only_, thereby never submit a connect request for hctx with only
> > > > > > > > offline CPUs?
> > > > > > > 
> > > > > > > Then you may setup very less io queues, and performance may suffer even
> > > > > > > though lots of CPUs become online later.
> > > > > > > ;
> > > > > > Only if we stay with the reduced number of I/O queues. Which is
> > > > > > not what I'm
> > > > > > proposing; I'd rather prefer to connect and disconnect queues
> > > > > > from the cpu
> > > > > > hotplug handler. For starters we could even trigger a reset once
> > > > > > the first
> > > > > > cpu within a hctx is onlined.
> > > > > 
> > > > > Yeah, that need one big/complicated patchset, but not see any advantages
> > > > > over this simple approach.
> > > > 
> > > > I tend to agree with Ming here.
> > > 
> > > Actually, Daniel and me came to a slightly different idea: use cpu hotplug
> > > notifier.
> > > Thing is, blk-mq already has cpu hotplug notifier, which should ensure that
> > > no I/O is pending during cpu hotplug.
> > 
> > Why should we ensure that for non-managed irq?
> > 
> 
> While not strictly necessary, it does align the hctx layout with the current
> CPU topology.
> As such we won't have any 'stale' CPUs or queues in the hctx layout, and
> with that avoid any issues we'll be having due to inactive CPUs in the
> cpumask.

We know the exact theory for non-managed interrupt, can you share us
any issue you want to avoid?

> 
> > > If we now add a nvme cpu hotplug notifier which essentially kicks off a
> > > reset once all cpu in a hctx are offline the reset logic will rearrange the
> > > queues to match the current cpu layout.
> > > And when the cpus are getting onlined we'll do another reset.
> > > 
> > > Daniel is currently preparing a patch; let's see how it goes.
> > 
> > What is the advantage of that big change over this simple way?
> > 
> 
> With the simple way we might (and, as the results show, do) run the
> nvme_ctrl_reset() in parallel to CPU hotplug.
> This leads to quite some complexity, and as we've seen is triggering quite
> some race conditions.

CPU hotplug isn't related with nvme reset at all, which is just for
recovering nvme controller to make it working again. How can cpu offline
affect nvme controller?

For NVMe PCI, blk-mq have drained any inflight requests before the hctx
is becoming inactive.

For other NVMe controller which doesn't use managed irq, the inflight
requests can still be completed via the original interrupt, or polling
task, both can been migrated to new online cpu.

I don't know what exact race conditions you are talking about. Please
explain it in details, otherwise it just wastes our time.

> 
> Hence I do think we need to synchronize nvme_ctrl_reset() with CPU hotplug,
> to ensure that the reset handler is completed before the cpu is completely
> torn down.

No, you didn't provide any proof about the necessity of adding the sync. 

Thanks, 
Ming