target crash / host hang with nvme-all.3 branch of nvme-fabrics

Christoph Hellwig hch at lst.de
Thu Jun 16 13:38:24 PDT 2016


On Thu, Jun 16, 2016 at 10:11:30PM +0300, Sagi Grimberg wrote:
>> I think nvmet_rdma_delete_ctrl is getting the exclusion vs. other callers
>> of __nvmet_rdma_queue_disconnect wrong, as we rely on a queue that
>> is undergoing deletion not being on any list.
>
> How do we rely on that? __nvmet_rdma_queue_disconnect callers are
> responsible for removing the queue from queue_list and queueing the
> release. I don't see where we are getting it wrong.

Thread 1:

Moves the queues off nvmet_rdma_queue_list and onto the
local list in nvmet_rdma_delete_ctrl.

Thread 2:

Gets into nvmet_rdma_cm_handler -> nvmet_rdma_queue_disconnect for one
of the queues now on the local list. list_empty(&queue->queue_list) evaluates
to false because the queue is on the local list, so threads 1 and 2 now
race to disconnect the same queue.
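To make the interleaving above concrete, here is a deterministic userspace model in C. It is a sketch under stated assumptions, not kernel code: the list helpers are a tiny reimplementation of the <linux/list.h> idea, struct queue, disconnect_by_list(), disconnect_by_state() and replay() are hypothetical names, and thread 1 is assumed to tear down every queue it moved onto its local list without re-checking. The per-queue state guard is shown only as one possible way to get the exclusion; it is not necessarily what any patch in this thread does.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal userspace copy of the kernel's intrusive list helpers,
 * just enough to model the race; this is not <linux/list.h> itself. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_unlink(struct list_head *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

static void list_del_init(struct list_head *e)
{
	list_unlink(e);
	INIT_LIST_HEAD(e);
}

static void list_move_tail(struct list_head *e, struct list_head *head)
{
	list_unlink(e);
	list_add_tail(e, head);
}

enum q_state { Q_LIVE, Q_DISCONNECTING };

/* Hypothetical stand-in for struct nvmet_rdma_queue. */
struct queue {
	struct list_head queue_list;
	atomic_int state;
	int disconnects; /* how many times the teardown actually ran */
};

/* Guard based on list membership: "on no list" is taken to mean "already
 * being torn down", but a queue parked on thread 1's local list still
 * passes the !list_empty() test. */
static void disconnect_by_list(struct queue *q)
{
	if (!list_empty(&q->queue_list)) {
		list_del_init(&q->queue_list);
		q->disconnects++;
	}
}

/* Guard based on per-queue state: only one caller can win the
 * LIVE -> DISCONNECTING transition, whatever list the queue is on. */
static void disconnect_by_state(struct queue *q)
{
	int expected = Q_LIVE;

	if (atomic_compare_exchange_strong(&q->state, &expected, Q_DISCONNECTING))
		q->disconnects++;
}

/* Replays the interleaving from the mail one step at a time and returns
 * how many times the single queue was torn down. */
static int replay(bool use_state)
{
	struct list_head global, local;
	struct queue q;

	INIT_LIST_HEAD(&global);
	INIT_LIST_HEAD(&local);
	INIT_LIST_HEAD(&q.queue_list);
	atomic_init(&q.state, Q_LIVE);
	q.disconnects = 0;
	list_add_tail(&q.queue_list, &global);

	/* Thread 1 (delete_ctrl): move the queue onto a local list. */
	list_move_tail(&q.queue_list, &local);

	/* Thread 2 (CM disconnect event) runs against the same queue. */
	if (use_state)
		disconnect_by_state(&q);
	else
		disconnect_by_list(&q);

	/* Thread 1: tear down what it moved, assumed not to re-check. */
	list_del_init(&q.queue_list);
	if (use_state)
		disconnect_by_state(&q);
	else
		q.disconnects++;

	return q.disconnects;
}
```

In this model replay(false) tears the queue down twice, matching the race described above, while replay(true) tears it down once, because the second caller loses the compare-and-exchange.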

>>   static int nvmet_rdma_add_port(struct nvmet_port *port)
>>
>
> Umm, this looks wrong to me. delete_controller should delete _all_
> the ctrl queues (which will usually be more than one). What about
> all the other queues? What am I missing?

Yes, it should - see the patch I just posted.
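As a rough illustration of "delete _all_ the ctrl queues", here is another hypothetical userspace sketch, not the patch referenced above. It assumes a flat array of queues tagged with a controller ID and the same kind of one-shot state guard as a stand-in for whatever exclusion the real code uses; queue_init(), queue_disconnect() and delete_ctrl() are illustrative names only.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum q_state { Q_LIVE, Q_DISCONNECTING };

/* Hypothetical, simplified queue: the owning controller's ID plus a
 * one-shot state that serializes teardown between callers. */
struct queue {
	int cntlid;
	atomic_int state;
	int disconnects;
};

static void queue_init(struct queue *q, int cntlid)
{
	q->cntlid = cntlid;
	atomic_init(&q->state, Q_LIVE);
	q->disconnects = 0;
}

/* Returns true only for the caller that wins LIVE -> DISCONNECTING;
 * that caller alone performs the teardown. */
static bool queue_disconnect(struct queue *q)
{
	int expected = Q_LIVE;

	if (!atomic_compare_exchange_strong(&q->state, &expected, Q_DISCONNECTING))
		return false;
	q->disconnects++;
	return true;
}

/* Hypothetical delete_ctrl: visit every queue and tear down all of the
 * ones owned by cntlid (admin and I/O alike), not just the first match.
 * Returns how many queues this call tore down itself. */
static size_t delete_ctrl(struct queue *queues, size_t n, int cntlid)
{
	size_t torn_down = 0;

	for (size_t i = 0; i < n; i++) {
		if (queues[i].cntlid == cntlid && queue_disconnect(&queues[i]))
			torn_down++;
	}
	return torn_down;
}
```

For example, with three queues owned by controller 1 and one owned by controller 2, if a CM disconnect event has already claimed one of controller 1's queues, delete_ctrl() for controller 1 tears down the remaining two, leaves controller 2's queue alone, and a repeated call does nothing.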



More information about the Linux-nvme mailing list