[PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx

Ming Lei ming.lei at redhat.com
Wed Jun 30 01:42:43 PDT 2021


On Wed, Jun 30, 2021 at 10:18:37AM +0200, Hannes Reinecke wrote:
> On 6/29/21 9:49 AM, Ming Lei wrote:
> > Hi,
> > 
> > blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
> > io queues, and the sw ctx is chosen as the first online CPU in
> > hctx->cpumask. However, all CPUs in hctx->cpumask may be offline.
> > 
> > This usage model isn't well supported by blk-mq, which assumes that
> > allocation is always done on an online CPU in hctx->cpumask. This
> > assumption comes from managed irq, which also requires blk-mq to drain
> > in-flight requests in the hctx when the last CPU in hctx->cpumask is
> > about to go offline.
> > 
> > However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
> > them to allocate requests even when the specified hctx is inactive
> > (all CPUs in hctx->cpumask are offline).
> > 
> > Fix blk_mq_alloc_request_hctx() by adding and passing a new flag,
> > BLK_MQ_F_NOT_USE_MANAGED_IRQ.
> > 
> > 
> > Ming Lei (2):
> >    blk-mq: not deactivate hctx if the device doesn't use managed irq
> >    nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
> > 
> >   block/blk-mq.c             | 6 +++++-
> >   drivers/nvme/host/fc.c     | 3 ++-
> >   drivers/nvme/host/rdma.c   | 3 ++-
> >   drivers/nvme/host/tcp.c    | 3 ++-
> >   drivers/nvme/target/loop.c | 3 ++-
> >   include/linux/blk-mq.h     | 1 +
> >   6 files changed, 14 insertions(+), 5 deletions(-)
> > 
> > Cc: Sagi Grimberg <sagi at grimberg.me>
> > Cc: Daniel Wagner <dwagner at suse.de>
> > Cc: Wen Xiong <wenxiong at us.ibm.com>
> > Cc: John Garry <john.garry at huawei.com>
> > 
> > 
> I have my misgivings about this patchset.
> To my understanding, only CPUs present in the hctx cpumask are eligible to
> submit I/O to that hctx.

That is only true for managed irq, and even then it is the online CPUs in
hctx->cpumask that are eligible.

However, there is no such constraint for non-managed irq, since the irq may
migrate to other online CPUs if all CPUs in its current affinity mask go
offline.
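
For context, the allocation path in question picks the sw ctx roughly like
this (a simplified sketch of blk_mq_alloc_request_hctx(), not the exact
upstream code):

	data.hctx = q->queue_hw_ctx[hctx_idx];
	if (!blk_mq_hw_queue_mapped(data.hctx))
		goto out_queue_exit;

	/*
	 * The sw ctx is taken from the first *online* CPU in hctx->cpumask.
	 * If every CPU in that mask is offline there is no valid ctx to
	 * pick here, which is exactly the case this series is about.
	 */
	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
	data.ctx = __blk_mq_get_ctx(q, cpu);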

> Consequently, if all CPUs in that mask are offline, what is the point of
> even transmitting a 'connect' request?

nvmef requires the connect request to be submitted via one specific hctx
whose index has to match the io queue's index.
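
For example, connecting io queue 'qid' has to allocate the connect request
from the hctx with the same index, along these lines (a sketch of the
nvme_alloc_request_qid() -> blk_mq_alloc_request_hctx() path; the names and
flags reflect my reading of the code, not a verbatim quote):

	/* connect for io queue 'qid' must come from hctx 'qid - 1' */
	req = blk_mq_alloc_request_hctx(ctrl->connect_q, REQ_OP_DRV_OUT,
					BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED,
					qid - 1);
	if (IS_ERR(req))
		return PTR_ERR(req);

So the request cannot simply be moved to some other hw queue whose CPUs
happen to be online.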

Almost all nvmef drivers fail to set up the controller if connecting an io
queue fails.

Also, CPUs can go offline and come back online; this is done a lot in
sanity testing.

So we should allow the connect request to be allocated successfully and
submitted to the driver, given that this is allowed for non-managed irq.
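
In sketch form, that is what patch 1 does on the blk-mq side (the flag name
is from this series; the exact hunk may differ):

	static int blk_mq_hctx_notify_offline(unsigned int cpu,
					      struct hlist_node *node)
	{
		struct blk_mq_hw_ctx *hctx =
			hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

		/*
		 * Without managed irq there is no need to mark the hctx
		 * inactive and drain it: the irq stays usable because it can
		 * migrate to any remaining online CPU.
		 */
		if (hctx->flags & BLK_MQ_F_NOT_USE_MANAGED_IRQ)
			return 0;

		/* existing deactivate/drain logic is unchanged from here on */
		...
	}

and patch 2 then just ORs BLK_MQ_F_NOT_USE_MANAGED_IRQ into the tag_set
flags of the fc/rdma/tcp/loop drivers.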

> Shouldn't we rather modify the tagset to refer to the currently online
> CPUs _only_, thereby never submitting a connect request for a hctx with
> only offline CPUs?

Then you may end up setting up very few io queues, and performance may
suffer even though lots of CPUs come online later.


Thanks,
Ming



