[PATCH 1/1] nvme-rdma: destroy cm id before destroy qp to avoid use after free
liruozhu
liruozhu at huawei.com
Thu Sep 9 19:50:30 PDT 2021
Hi Max,
Yes, I did encounter this problem. I posted the detailed log in the cover letter of this patch.
I tested it with a Huawei storage array, killing the nvme target processes (we use a self-developed user-mode nvme target)
while repeatedly injecting PCIe AER errors on the array's HBA card.
I'm not sure if there is a way to easily reproduce it with a standard server and the kernel target.
But I still think the host side should try to avoid this possibility of a crash.
Thanks.
On 2021/9/10 1:32, Max Gurtovoy wrote:
>
> On 9/9/2021 5:05 PM, Christoph Hellwig wrote:
>> This looks reasonable to me. Sagi, Max: any comments?
>>
>> On Mon, Sep 06, 2021 at 11:51:34AM +0800, Ruozhu Li wrote:
>>> We should always destroy the cm_id before destroying the qp to avoid getting a cma
>>> event after the qp has been destroyed, which may lead to a use-after-free.
>
> Did you really encounter a use-after-free?
>
> I think the code looks correct, but from experience with RDMA-CM I
> would like to know which tests you ran with this code.
>
> Interesting tests are: port toggling with/without IO, unloading the
> mlx5/other RDMA HW drivers during connection establishment or during
> IO, rebooting the target, rebooting the target during connection
> establishment, and rebooting the target during IO.
>
> Thanks.
>
>
>>> In the RDMA connection establishment error flow, don't destroy the qp in the cm
>>> event handler. Just report cm_error to the upper level; the qp will be destroyed
>>> in nvme_rdma_alloc_queue() after the cm_id is destroyed.
>>>
>>> Signed-off-by: Ruozhu Li <liruozhu at huawei.com>
>>> ---
>>> drivers/nvme/host/rdma.c | 16 +++-------------
>>> 1 file changed, 3 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>> index a68704e39084..042c594bc57e 100644
>>> --- a/drivers/nvme/host/rdma.c
>>> +++ b/drivers/nvme/host/rdma.c
>>> @@ -656,8 +656,8 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
>>> if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>>> return;
>>> - nvme_rdma_destroy_queue_ib(queue);
>>> rdma_destroy_id(queue->cm_id);
>>> + nvme_rdma_destroy_queue_ib(queue);
>>> mutex_destroy(&queue->queue_lock);
>>> }
>>> @@ -1815,14 +1815,10 @@ static int nvme_rdma_conn_established(struct nvme_rdma_queue *queue)
>>> for (i = 0; i < queue->queue_size; i++) {
>>> ret = nvme_rdma_post_recv(queue, &queue->rsp_ring[i]);
>>> if (ret)
>>> - goto out_destroy_queue_ib;
>>> + return ret;
>>> }
>>> return 0;
>>> -
>>> -out_destroy_queue_ib:
>>> - nvme_rdma_destroy_queue_ib(queue);
>>> - return ret;
>>> }
>>> static int nvme_rdma_conn_rejected(struct nvme_rdma_queue *queue,
>>> @@ -1916,14 +1912,10 @@ static int nvme_rdma_route_resolved(struct nvme_rdma_queue *queue)
>>> if (ret) {
>>> dev_err(ctrl->ctrl.device,
>>> "rdma_connect_locked failed (%d).\n", ret);
>>> - goto out_destroy_queue_ib;
>>> + return ret;
>>> }
>>> return 0;
>>> -
>>> -out_destroy_queue_ib:
>>> - nvme_rdma_destroy_queue_ib(queue);
>>> - return ret;
>>> }
>>> static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>> @@ -1954,8 +1946,6 @@ static int nvme_rdma_cm_handler(struct rdma_cm_id *cm_id,
>>> case RDMA_CM_EVENT_ROUTE_ERROR:
>>> case RDMA_CM_EVENT_CONNECT_ERROR:
>>> case RDMA_CM_EVENT_UNREACHABLE:
>>> - nvme_rdma_destroy_queue_ib(queue);
>>> - fallthrough;
>>> case RDMA_CM_EVENT_ADDR_ERROR:
>>> dev_dbg(queue->ctrl->ctrl.device,
>>> "CM error event %d\n", ev->event);
>>> --
>>> 2.16.4
>>>
>>>
>> ---end quoted text---
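
For reference, a minimal sketch of the teardown ordering the patch enforces. This is simplified editorial pseudocode rather than the literal rdma.c change; struct nvme_rdma_queue, the cm_id field, rdma_destroy_id(), and nvme_rdma_destroy_queue_ib() are taken from the patch, while the wrapper function name is illustrative only:

/*
 * Destroy the cm_id first: once rdma_destroy_id() returns, rdma-cm will
 * no longer deliver events to nvme_rdma_cm_handler() for this queue, so
 * the QP/CQ can then be torn down without racing against a late CM event.
 */
static void sketch_free_queue(struct nvme_rdma_queue *queue)
{
	rdma_destroy_id(queue->cm_id);		/* stop CM event delivery */
	nvme_rdma_destroy_queue_ib(queue);	/* then free the QP/CQ safely */
}

In the connection-establishment error flow the same ordering applies: the CM handler no longer destroys the QP itself, it only reports cm_error to the upper layer, and the queue allocation path then tears down the cm_id followed by the IB resources.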