nvme-rdma corrupts memory upon timeout

Max Gurtovoy maxg at mellanox.com
Thu Mar 29 06:07:39 PDT 2018



On 3/1/2018 11:12 AM, Sagi Grimberg wrote:
> 
>>> This patch still returns to userspace after queuing work and may
>>> result in corruption.
>>
>> That's probably the one request being timed out, since we complete
>> it earlier.
>>
>> Does this patch on top help?
>> -- 
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index e793b0899d4e..50bbf88b82f6 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1590,10 +1590,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
>>          /* queue error recovery */
>>          nvme_rdma_error_recovery(req->queue->ctrl);
>>
>> -       /* fail with DNR on cmd timeout */
>> -       nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
>> -
>> -       return BLK_EH_HANDLED;
>> +       return BLK_EH_RESET_TIMER;
>>   }
>>
>>   /*
>> -- 
> 
> Did this help?

It helped in our setup. Can we push this?
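
For clarity, with that change applied the timeout handler reduces to
roughly the following (a sketch reconstructed from the diff context
above; the blk_mq_rq_to_pdu() line and the variable names are
assumptions on my side):
--
static enum blk_eh_timer_return
nvme_rdma_timeout(struct request *rq, bool reserved)
{
        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);

        /* queue error recovery */
        nvme_rdma_error_recovery(req->queue->ctrl);

        /*
         * Don't complete the request here; keep it alive until
         * error recovery cancels it, so the timeout path can't
         * race with a completion arriving from another context.
         */
        return BLK_EH_RESET_TIMER;
}
--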
Regarding stopping/draining the QP before calling
blk_mq_tagset_busy_iter(): that also seems right to me, since we want
to stop getting completions from the HCA before we cancel all the
in-flight requests.
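
Something like the following ordering is what I have in mind (a
sketch only, not tested; the helper names are assumptions based on
the driver, and nvme_rdma_stop_io_queues() is where ib_drain_qp()
would run):
--
/* sketch: stop HCA completions before cancelling requests */
nvme_stop_queues(&ctrl->ctrl);      /* quiesce the blk-mq queues */
nvme_rdma_stop_io_queues(ctrl);     /* drains each QP (ib_drain_qp) */
blk_mq_tagset_busy_iter(&ctrl->tag_set,
                        nvme_cancel_request, &ctrl->ctrl);
--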

We are completing the request from various contexts and can hit a
NULL dereference in nvme_rdma_process_nvme_rsp (on req->mr).
We added debug prints to check whether the rq was still in flight
during nvme_rdma_process_nvme_rsp, and it wasn't (another context had
already completed the request).
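
To illustrate the kind of check our debug prints were doing (a
sketch, not a proposed fix; it reuses nvme_rdma_tagset() and the
cqe/queue variables from nvme_rdma_process_nvme_rsp()):
--
/* sketch: verify the tag still maps to a started request */
struct request *rq;

rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
if (!rq || !blk_mq_request_started(rq)) {
        /*
         * Another context already completed this request;
         * dereferencing req->mr past this point is the NULL
         * deref we were seeing.
         */
        dev_warn(queue->ctrl->ctrl.device,
                 "tag 0x%x already completed\n", cqe->command_id);
        return;
}
--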

-Max.

> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
> 


