[PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx

Hannes Reinecke hare at suse.de
Wed Jun 30 02:43:41 PDT 2021


On 6/30/21 10:42 AM, Ming Lei wrote:
> On Wed, Jun 30, 2021 at 10:18:37AM +0200, Hannes Reinecke wrote:
>> On 6/29/21 9:49 AM, Ming Lei wrote:
>>> Hi,
>>>
>>> blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
>>> io queues. The sw ctx is chosen as the 1st online cpu in hctx->cpumask.
>>> However, all cpus in hctx->cpumask may be offline.
>>>
>>> This usage model isn't well supported by blk-mq, which assumes that the
>>> allocation is always done on an online CPU in hctx->cpumask. This
>>> assumption is tied to managed irq, which also requires blk-mq to drain
>>> inflight requests in this hctx when the last cpu in hctx->cpumask is
>>> about to go offline.
>>>
>>> However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
>>> them to ask for request allocation when the specified hctx is inactive
>>> (all cpus in hctx->cpumask are offline).
>>>
>>> Fix blk_mq_alloc_request_hctx() by adding and passing the new flag
>>> BLK_MQ_F_NOT_USE_MANAGED_IRQ.
>>>
>>>
>>> Ming Lei (2):
>>>     blk-mq: not deactivate hctx if the device doesn't use managed irq
>>>     nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
>>>
>>>    block/blk-mq.c             | 6 +++++-
>>>    drivers/nvme/host/fc.c     | 3 ++-
>>>    drivers/nvme/host/rdma.c   | 3 ++-
>>>    drivers/nvme/host/tcp.c    | 3 ++-
>>>    drivers/nvme/target/loop.c | 3 ++-
>>>    include/linux/blk-mq.h     | 1 +
>>>    6 files changed, 14 insertions(+), 5 deletions(-)
>>>
>>> Cc: Sagi Grimberg <sagi at grimberg.me>
>>> Cc: Daniel Wagner <dwagner at suse.de>
>>> Cc: Wen Xiong <wenxiong at us.ibm.com>
>>> Cc: John Garry <john.garry at huawei.com>
>>>
>>>
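
For context, the shape of the proposal is roughly as follows. This is a
paraphrased sketch based on the cover letter and diffstat above, not the
literal patches; the flag's bit value and the exact hook are assumptions:

#include <linux/blk-mq.h>

/* include/linux/blk-mq.h: proposed tagset flag (bit value assumed) */
#define BLK_MQ_F_NOT_USE_MANAGED_IRQ	(1 << 7)

/* block/blk-mq.c: skip hctx deactivation for such devices */
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	/*
	 * Without managed irqs the queue's interrupt simply migrates to
	 * another online CPU, so the hctx may stay active even when all
	 * CPUs in hctx->cpumask are offline and there is nothing to drain.
	 */
	if (hctx->flags & BLK_MQ_F_NOT_USE_MANAGED_IRQ)
		return 0;

	/* ... existing "last online CPU in hctx->cpumask" drain logic ... */
	return 0;
}

/*
 * drivers/nvme/host/{fc,rdma,tcp}.c and drivers/nvme/target/loop.c would
 * then OR the new flag into the I/O tag_set's ->flags when setting it up.
 */
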
>> I have my misgivings about this patchset.
>> To my understanding, only CPUs present in the hctx cpumask are eligible to
>> submit I/O to that hctx.
> 
> That is only true for managed irq, and it should be the online CPUs in
> hctx->cpumask.
> 
> However, there is no such constraint for non-managed irq, since the irq may
> migrate to other online CPUs if all CPUs in its current affinity become
> offline.
> 

But there shouldn't be any I/O pending during CPU offline (cf 
blk_mq_hctx_notify_offline()), so no interrupts should be triggered, 
either, no?
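
For readers not familiar with that path: blk_mq_hctx_notify_offline() only
acts when the CPU going down is the last online CPU mapped to the hctx, and
then waits for inflight requests to drain before returning. A simplified
illustration of the "last online CPU" test; last_online_cpu_in_hctx() is an
illustrative helper, not the verbatim kernel code:

#include <linux/cpumask.h>
#include <linux/blk-mq.h>

static bool last_online_cpu_in_hctx(unsigned int cpu,
				    struct blk_mq_hw_ctx *hctx)
{
	cpumask_var_t mask;
	bool last;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return false;

	/* Online CPUs mapped to this hctx, minus the one going down */
	cpumask_and(mask, hctx->cpumask, cpu_online_mask);
	cpumask_clear_cpu(cpu, mask);
	last = cpumask_empty(mask);

	free_cpumask_var(mask);
	return last;
}

Only when that holds does the notifier mark the hctx inactive and drain it,
so by the time the CPU is gone no request -- and hence no completion
interrupt -- should be outstanding on that hctx.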

>> Consequently, if all cpus in that mask are offline, what is the point of
>> even transmitting a 'connect' request?
> 
> nvmef requires the connect request to be submitted via one specific hctx
> whose index has to match the io queue's index.
> 
> Almost all nvmef drivers fail to set up the controller if connecting an
> io queue fails.
> 

And I would prefer to fix that, namely by allowing blk-mq to run on a 
sparse set of io queues.
The remaining io queues can be connected once the first cpu in the hctx 
cpumask is onlined; we already have blk_mq_hctx_notify_online(), which 
could easily be expanded to connect the relevant I/O queue...
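
Something along these lines, purely as a hypothetical sketch of that idea:
nvmef_reconnect_io_queue() is an invented driver hook, not an existing
blk-mq or nvme interface, and the "first CPU to come online" test is left
out for brevity:

#include <linux/blk-mq.h>

/* Hypothetical driver hook, not an existing interface */
int nvmef_reconnect_io_queue(struct blk_mq_hw_ctx *hctx);

static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	/*
	 * If this is the first CPU of this hctx to come back online
	 * (test omitted here), let the driver (re)connect the matching
	 * I/O queue, e.g. by sending the fabrics 'connect' command for
	 * queue hctx->queue_num + 1.
	 */
	return nvmef_reconnect_io_queue(hctx);
}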

> Also CPU can become offline & online, especially it is done in
> lots of sanity test.
> 

True, but then again all I/O on the hctx should be quiesced during cpu 
offline.

> So we should allow to allocate the connect request successful, and
> submit it to drivers given it is allowed in this way for non-managed
> irq.
> 

I'd rather not do this, as the 'connect' command runs on the 'normal' 
I/O tagset, and hence runs the risk of being issued against 
non-existing CPUs.
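
To make the conflict concrete: the fabrics connect for I/O queue qid is
deliberately pinned to hctx qid - 1 of that tagset. Roughly like this;
alloc_connect_rq() is a made-up helper for illustration, not the actual
nvme fabrics code path:

#include <linux/blk-mq.h>

static struct request *alloc_connect_rq(struct request_queue *q,
					unsigned int qid)
{
	/*
	 * blk_mq_alloc_request_hctx() picks the first online CPU in
	 * hctx->cpumask as the software context; if every CPU in that
	 * mask is offline, the allocation fails and controller setup
	 * aborts -- which is what the series tries to avoid.
	 */
	return blk_mq_alloc_request_hctx(q, REQ_OP_DRV_OUT,
					 BLK_MQ_REQ_NOWAIT, qid - 1);
}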

>> Shouldn't we rather modify the tagset to refer to the currently online
>> CPUs _only_, thereby never submitting a connect request for a hctx with
>> only offline CPUs?
> 
> Then you may set up very few io queues, and performance may suffer even
> though lots of CPUs become online later.
> 

Only if we stay with the reduced number of I/O queues, which is not what 
I'm proposing; I'd rather connect and disconnect queues from the cpu 
hotplug handler. For starters we could even trigger a reset once the 
first cpu within a hctx is onlined.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


