[PATCH 5/6] nvme-rdma: fix timeout handler

Chao Leng lengchao at huawei.com
Wed Aug 5 02:27:01 EDT 2020



On 2020/8/5 9:12, Sagi Grimberg wrote:
> 
>>>> may be interrupted by a hard interrupt, and the timeout process may
>>>> then flush the work at that moment. Error recovery and
>>>> nvme_rdma_complete_timed_out may therefore stop the queue concurrently.
>>>> As a result, error recovery may cancel the request, or
>>>> nvme_rdma_complete_timed_out may complete it, while the queue is not
>>>> actually stopped, which leads to abnormal behavior.
>>>
>>> We should be fine and safe to complete the I/O.
>>
>> Completing the request in nvme_rdma_timeout, or cancelling it in
>> nvme_rdma_error_recovery_work or nvme_rdma_reset_ctrl_work, is not safe,
>> because the queue may not really be stopped: the call may only have
>> cleared the flag NVME_RDMA_Q_ALLOCATED for the queue. One request may
>> then be handled by two contexts concurrently, which is not allowed.
> 
> The request being timed out cannot be completed after the queue is
> stopped; that is the point of nvme_rdma_stop_queue. If it is only
> ALLOCATED, we did not yet connect hence there is zero chance for
> any command to complete.
The request may already have completed before the queue is stopped: its
completion is in the CQ but has not yet been processed by software.
Consider concurrent calls to nvme_rdma_stop_queue, for example:
error recovery runs first, clears the flag NVME_RDMA_Q_LIVE, and then
waits for the CQ to drain. At the same time, nvme_rdma_timeout calls
nvme_rdma_stop_queue, which returns immediately because the flag is
already clear, and may then call blk_mq_complete_request. But error
recovery may be draining the CQ at the same time and may also process
the same request.
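
To make the interleaving concrete, here is a minimal userspace sketch of
it. This is not the driver code: struct model_queue and every model_*
function are hypothetical names invented for illustration, the "live"
field only stands in for NVME_RDMA_Q_LIVE, and the counter increment
stands in for completing the request. It assumes the drain handles the
pending completion unconditionally, and it replays one fixed ordering in
a single thread instead of racing real threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model, not the kernel's struct nvme_rdma_queue. */
struct model_queue {
    atomic_bool live;    /* stands in for the NVME_RDMA_Q_LIVE bit */
    int completions;     /* how many times the one request was completed */
};

static void model_init(struct model_queue *q)
{
    atomic_init(&q->live, true);
    q->completions = 0;
}

/* First half of stop: atomically clear "live".
 * Returns true only for the caller that actually cleared it. */
static bool model_stop_begin(struct model_queue *q)
{
    return atomic_exchange(&q->live, false);
}

/* Second half of stop: drain the CQ. The completion that was already
 * sitting in the CQ is handled here (assumed unconditional). */
static void model_stop_finish(struct model_queue *q)
{
    assert(!atomic_load(&q->live));
    q->completions++;
}

/* Timeout path: call stop, then complete the request if nobody has. */
static void model_timeout_path(struct model_queue *q)
{
    /* Returns at once if "live" is already clear; it does not wait
     * for the other caller's drain to finish. */
    (void)model_stop_begin(q);
    if (q->completions == 0)
        q->completions++;   /* stands in for blk_mq_complete_request() */
}

/* Ordering described above: the timeout path runs between the flag
 * clear and the drain. Returns the number of completions. */
static int model_race(void)
{
    struct model_queue q;
    model_init(&q);
    (void)model_stop_begin(&q);   /* error recovery clears the flag */
    model_timeout_path(&q);       /* timeout: stop returns immediately */
    model_stop_finish(&q);        /* error recovery drains the CQ */
    return q.completions;
}

/* Same steps, but the whole stop finishes before the timeout path runs. */
static int model_serialized(void)
{
    struct model_queue q;
    model_init(&q);
    (void)model_stop_begin(&q);
    model_stop_finish(&q);
    model_timeout_path(&q);
    return q.completions;
}
```

In this model, model_race() counts two completions for the same request,
while model_serialized() counts one. The only difference is whether the
second caller of stop can return before the first caller has finished
draining.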


