[bug report] nvme/rdma: nvme connect failed after offline one cpu on host side

Sagi Grimberg sagi at grimberg.me
Tue Jul 26 01:56:19 PDT 2022



On 7/26/22 05:05, Ming Lei wrote:
> On Thu, Jul 07, 2022 at 10:28:22AM +0300, Sagi Grimberg wrote:
>>
>>>>>>> update the subject to better describe the issue:
>>>>>>>
>>>>>>> So I tried this issue on one nvme/rdma environment, and it was also
>>>>>>> reproducible, here are the steps:
>>>>>>>
>>>>>>> # echo 0 >/sys/devices/system/cpu/cpu0/online
>>>>>>> # dmesg | tail -10
>>>>>>> [  781.577235] smpboot: CPU 0 is now offline
>>>>>>> # nvme connect -t rdma -a 172.31.45.202 -s 4420 -n testnqn
>>>>>>> Failed to write to /dev/nvme-fabrics: Invalid cross-device link
>>>>>>> no controller found: failed to write to nvme-fabrics device
>>>>>>>
>>>>>>> # dmesg
>>>>>>> [  781.577235] smpboot: CPU 0 is now offline
>>>>>>> [  799.471627] nvme nvme0: creating 39 I/O queues.
>>>>>>> [  801.053782] nvme nvme0: mapped 39/0/0 default/read/poll queues.
>>>>>>> [  801.064149] nvme nvme0: Connect command failed, error wo/DNR bit: -16402
>>>>>>> [  801.073059] nvme nvme0: failed to connect queue: 1 ret=-18
>>>>>>
>>>>>> This is because of blk_mq_alloc_request_hctx() and was raised before.
>>>>>>
>>>>>> IIRC there was reluctance to make it allocate a request for an hctx even
>>>>>> if its associated mapped cpu is offline.
>>>>>>
>>>>>> The latest attempt was from Ming:
>>>>>> [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
>>>>>>
>>>>>> Don't know where that went tho...
>>>>>
>>>>> The attempt relies on the queue used for connecting io queues having a
>>>>> non-managed irq; unfortunately that can't be true for all drivers,
>>>>> so that approach can't go in.
>>>>
>>>> The only consumer is nvme-fabrics, so others don't matter.
>>>> Maybe we need a different interface that allows this relaxation.
>>>>
>>>>> So far, I'd suggest fixing nvme_*_connect_io_queues() to ignore a failed
>>>>> io queue, so the nvme host can still be set up with fewer io queues.
>>>>
>>>> What happens when the CPU comes back? Not sure we can simply ignore it.
>>>
>>> Anyway, it is not a good choice to fail the whole controller if only one
>>> queue can't be connected.
>>
>> That is irrelevant.
>>
>>> I meant the queue can be kept as non-LIVE, and
>>> it should work since no io can be issued to this queue when it is
>>> non-LIVE.
>>
>> The way that nvme-pci behaves is to create all the queues, have them
>> idle when their mapped cpu is offline, and have the queue
>> there and ready when the cpu comes back. It is the simpler approach and
>> I would like to have it for fabrics too, but to establish a fabrics
>> queue we need to send a request (connect) to the controller. The fact
>> that we cannot simply get a reference to a request for a given hw queue
>> is baffling to me.
>>
>>> Just wondering why we can't re-connect the io queue and set LIVE after
>>> any CPU in this hctx->cpumask becomes online? blk-mq could add one
>>> pair of callbacks for the driver for handling this queue change.
>> Certainly possible, but you are creating yet another interface solely
>> for nvme-fabrics that covers up for the existing interface that does not
>> satisfy what nvme-fabrics (the only consumer of it) would like it to do.
> 
> I guess you meant that the others (rdma and tcp) use non-managed queues,
> so they don't need such a change?
> 
> But it isn't true actually, blk-mq/nvme still can't handle it well. From
> blk-mq's viewpoint, if all CPUs in hctx->cpumask are offline, it will
> treat the hctx as inactive and not workable, and refuses to allocate
> request from this hctx, no matter if the underlying queue irq is managed
> or not.
> 
> Now after 14dc7a18abbe ("block: Fix handling of offline queues in
> blk_mq_alloc_request_hctx()"), it may break controller setup easily if
> any CPU is offline.
> 
> I'd suggest fixing the issue in a unified way since nvme-fabrics needs to be
> covered, then nvme's user experience can be improved.

That is exactly what I want, but unlike pcie, nvmf creates the queue
using a connect request that is not driven from a user context. Hence
it would be nice to have an interface to get it done.

The alternative would be to make nvmf connect not use blk-mq, but that
is not a good alternative in my mind. Having a callback interface for
cpu hotplug is just another interface that every transport will need
to implement, and it makes nvmf different than pci.

> BTW, I guess rdma/tcp/fc's queues may take extra or bigger resources than
> nvme pci; if resources are only allocated once the queue is active, queue
> resource utilization may be improved.

That is not a concern whatsoever. Queue resources are cheap enough
that we shouldn't have to care about it at this scale.
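
For reference, the allocation behavior discussed above (no request for an
hctx when every CPU in hctx->cpumask is offline) can be modeled in
userspace roughly as below. This is only a sketch, not the kernel code:
the type and function names (cpumask_model_t, hctx_model,
alloc_request_hctx_model) are made up for illustration, CPUs are modeled
as bits in a 64-bit mask, and -EXDEV is assumed as the error because it
matches the ret=-18 seen in the log.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical userspace model, not kernel code: one bit per CPU. */
typedef uint64_t cpumask_model_t;

struct hctx_model {
    cpumask_model_t cpumask;    /* CPUs mapped to this hardware queue */
};

/* Return the first CPU set in both masks, or -1 if there is none. */
static int first_cpu_and(cpumask_model_t a, cpumask_model_t b)
{
    cpumask_model_t both = a & b;

    for (int cpu = 0; cpu < 64; cpu++)
        if (both & ((cpumask_model_t)1 << cpu))
            return cpu;
    return -1;
}

/*
 * Model of the allocation decision: return the CPU the request would be
 * bound to if at least one CPU mapped to the hctx is online, otherwise
 * fail with -EXDEV.
 */
static int alloc_request_hctx_model(const struct hctx_model *hctx,
                                    cpumask_model_t online)
{
    int cpu = first_cpu_and(hctx->cpumask, online);

    return cpu < 0 ? -EXDEV : cpu;
}

/* Small self-check covering the cpu0-offline case from the report. */
static void model_selfcheck(void)
{
    struct hctx_model only_cpu0 = { .cpumask = 0x1 };
    struct hctx_model cpu0_and_1 = { .cpumask = 0x3 };
    cpumask_model_t cpu0_offline = 0xE;    /* cpus 1-3 online */

    assert(alloc_request_hctx_model(&only_cpu0, cpu0_offline) == -EXDEV);
    assert(alloc_request_hctx_model(&cpu0_and_1, cpu0_offline) == 1);
}
```

In this model, an hctx mapped only to cpu0 fails with -EXDEV once cpu0 is
offline, while an hctx that also maps an online CPU still succeeds. That
is the same shape as the failure in the report: the connect for the
affected io queue cannot get a request, and the whole controller setup
fails.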


