crash on device removal
Sagi Grimberg
sagi at grimberg.me
Wed Jul 13 03:06:01 PDT 2016
>> We actually missed a kref_get in nvme_get_ns_from_disk().
>>
>> This should fix it. Could you help to verify?
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 4babdf0..b146f52 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -183,6 +183,8 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct gendisk *disk)
>> }
>> spin_unlock(&dev_list_lock);
>>
>> + kref_get(&ns->ctrl->kref);
>> +
>> return ns;
>>
>> fail_put_ns:
>
> Hey Ming. This avoids the crash in nvme_rdma_free_qe(), but now I see another crash:
>
> [ 975.633436] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.0.1.14:4420
> [ 978.463636] nvme nvme0: creating 32 I/O queues.
> [ 979.187826] nvme nvme0: new ctrl: NQN "testnqn", addr 10.0.1.14:4420
> [ 987.778287] nvme nvme0: Got rdma device removal event, deleting ctrl
> [ 987.882202] BUG: unable to handle kernel paging request at ffff880e770e01f8
> [ 987.890024] IP: [<ffffffffa03a1a46>] __ib_process_cq+0x46/0xc0 [ib_core]
>
> This looks like another problem with freeing the tag sets before stopping the QP. I thought we fixed that once and for all, but perhaps there is some other path we missed. :(
The fix doesn't look right to me. But I wonder how you got this crash
now? If anything, this would delay the controller removal...
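
To illustrate why an extra reference should only delay the release, here is
a rough userspace sketch. It is not the kernel's kref implementation; the
struct ref type and the ref_init/ref_get/ref_put helpers are hypothetical
stand-ins for struct kref, kref_init(), kref_get() and kref_put():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct kref. */
struct ref {
	int count;
	bool released;	/* set once the last reference is dropped */
};

/* The creator of the object holds the initial reference. */
static void ref_init(struct ref *r)
{
	r->count = 1;
	r->released = false;
}

/* Take an additional reference; the object must still be alive. */
static void ref_get(struct ref *r)
{
	assert(r->count > 0);
	r->count++;
}

/* Drop a reference; returns true if this put dropped the last one. */
static bool ref_put(struct ref *r)
{
	assert(r->count > 0);
	if (--r->count == 0) {
		r->released = true;
		return true;
	}
	return false;
}
```

With one extra ref_get(), the first ref_put() leaves the object alive and
only the second one releases it. So an additional get on the controller
postpones the teardown until the matching put, and leaks it if that put
never comes.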