[PATCH] nvme-fabrics: fix crash for no IO queues
Chao Leng
lengchao at huawei.com
Mon Mar 8 01:30:47 GMT 2021
On 2021/3/6 4:58, Sagi Grimberg wrote:
>
>> A crash happens when a Set Features (NVME_FEAT_NUM_QUEUES) command
>> times out during NVMe over RDMA (RoCE) reconnection; the cause is
>> using a queue that was never allocated.
>>
>> If the queue is not live, queue requests should not be allowed.
>
> Can you describe exactly the scenario here? What is the state
> here? LIVE? or DELETING?
If setting the feature (NVME_FEAT_NUM_QUEUES) fails due to a timeout, or
the target returns 0 I/O queues, nvme_set_queue_count will return 0, and
the reconnection will then continue and succeed. The controller state is
LIVE. Requests continue to be delivered via ->queue_rq(), and then the
crash happens.
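To make the failure sequence concrete, here is a minimal user-space sketch of the control flow described above. It is not the kernel code: the structs and helper names (set_queue_count, reconnect, queue_rq_would_be_safe) are simplified stand-ins that model how a timed-out Set Features (or a target granting 0 I/O queues) lets reconnection succeed with no I/O queues allocated, so a later ->queue_rq() would touch an unallocated queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplification of the reconnect path described above. */

struct queue { bool allocated; };

struct ctrl {
	int queue_count;         /* admin queue + I/O queues */
	struct queue *io_queues; /* NULL when no I/O queues were allocated */
};

/*
 * Models nvme_set_queue_count(): on a Set Features timeout, or when the
 * target grants 0 I/O queues, it reports success with *count == 0 so
 * that reconnection can still proceed.
 */
static int set_queue_count(bool timed_out, int granted, int *count)
{
	if (timed_out || granted == 0) {
		*count = 0;
		return 0; /* not treated as an error: reconnect continues */
	}
	*count = granted;
	return 0;
}

/*
 * Reconnect "succeeds" and the controller goes LIVE even when 0 I/O
 * queues were granted; I/O queue allocation is skipped entirely.
 */
static bool reconnect(struct ctrl *c, bool timed_out, int granted)
{
	static struct queue dummy = { .allocated = true };
	int nr = 0;

	set_queue_count(timed_out, granted, &nr);
	c->queue_count = nr + 1;              /* +1 for the admin queue */
	c->io_queues = nr ? &dummy : NULL;    /* allocation elided */
	return true;                          /* state becomes LIVE */
}

/*
 * ->queue_rq() dereferences an I/O queue; with none allocated that is
 * the crash the patch prevents. Here we only report whether the
 * dereference would be safe.
 */
static bool queue_rq_would_be_safe(struct ctrl *c)
{
	return c->queue_count > 1 && c->io_queues != NULL;
}
```

With timed_out = true the sketch reaches a LIVE controller whose io_queues pointer is NULL, which is exactly the state in which an unguarded ->queue_rq() crashes.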
>
>>
>> Signed-off-by: Chao Leng <lengchao at huawei.com>
>> ---
>> drivers/nvme/host/fabrics.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
>> index 733010d2eafd..2479744fc349 100644
>> --- a/drivers/nvme/host/fabrics.h
>> +++ b/drivers/nvme/host/fabrics.h
>> @@ -189,7 +189,7 @@ static inline bool nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>> {
>> if (likely(ctrl->state == NVME_CTRL_LIVE ||
>> ctrl->state == NVME_CTRL_DELETING))
>> - return true;
>> + return queue_live;
>> return __nvmf_check_ready(ctrl, rq, queue_live);
>> }
>
> There were some issues in the past that made us allow submitting
> requests in DELETING state and introducing DELETING_NOIO. See
> patch ecca390e8056 ("nvme: fix deadlock in disconnect during scan_work and/or ana_work")
This doesn't make any difference. When in the DELETING state the queue is
still live.
>
> The driver should be able to accept I/O in DELETING because the core
> changes the state to DELETING_NOIO _before_ it calls ->delete_ctrl so I
> don't understand how you get to this if the queue is not allocated...
The controller state is LIVE; the deletion process looks fine and is not
involved here.
More information about the Linux-nvme mailing list