Kernel OOPS while creating a NVMe Namespace

Nilay Shroff nilay at linux.ibm.com
Mon Jun 17 02:10:45 PDT 2024



On 6/11/24 01:03, Keith Busch wrote:
> On Mon, Jun 10, 2024 at 10:17:42PM +0300, Sagi Grimberg wrote:
>> On 10/06/2024 22:15, Keith Busch wrote:
>>> On Mon, Jun 10, 2024 at 10:05:00PM +0300, Sagi Grimberg wrote:
>>>>
>>>> On 10/06/2024 21:53, Keith Busch wrote:
>>>>> On Mon, Jun 10, 2024 at 01:21:00PM +0530, Venkat Rao Bagalkote wrote:
>>>>>> Issue is introduced by the patch: be647e2c76b27f409cdd520f66c95be888b553a3.
>>>>> My mistake. The namespace remove list appears to be getting corrupted
>>>>> because I'm using the wrong APIs to replace a "list_move_tail". This is
>>>>> fixing the issue on my end:
>>>>>
>>>>> ---
>>>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>>>> index 7c9f91314d366..c667290de5133 100644
>>>>> --- a/drivers/nvme/host/core.c
>>>>> +++ b/drivers/nvme/host/core.c
>>>>> @@ -3959,9 +3959,10 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
>>>>>    	mutex_lock(&ctrl->namespaces_lock);
>>>>>    	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
>>>>> -		if (ns->head->ns_id > nsid)
>>>>> -			list_splice_init_rcu(&ns->list, &rm_list,
>>>>> -					     synchronize_rcu);
>>>>> +		if (ns->head->ns_id > nsid) {
>>>>> +			list_del_rcu(&ns->list);
>>>>> +			list_add_tail_rcu(&ns->list, &rm_list);
>>>>> +		}
>>>>>    	}
>>>>>    	mutex_unlock(&ctrl->namespaces_lock);
>>>>>    	synchronize_srcu(&ctrl->srcu);
>>>>> --
>>>> Can we add a reproducer for this in blktests? I'm assuming that we can
>>>> easily trigger this with adding/removing nvmet namespaces?
>>> I'm testing this with Namespace Management commands, which nvmet doesn't
>>> support. You can recreate the issue by detaching the last namespace.
>>>
>>
>> I think the same will happen in a test that creates two namespaces and then
>> echo 0 > ns/enable.
> 
> Looks like nvme/016 tests this. It's reporting as "passed" on my end, but
> I don't think it's actually testing the driver as intended. Still
> messing with it.
> 
I believe nvme/016 creates and deletes the namespace; however, there's no
backstore associated with the loop device, and hence nvme/016 is unable to
recreate this issue.

To recreate this issue, we need to associate a backstore (either a block-dev or
a regular-file) with the loop device and then use it for creating and then
deleting the namespace.
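
For illustration only, the steps above could be sketched roughly as follows
through the nvmet configfs interface. This is a minimal sketch, not the blktest
itself: the subsystem NQN, the backing file path, port number 1 and the one
second waits are all assumptions made up for the example. It assumes root,
the nvmet and nvme-loop modules loaded, and nvme-cli installed.

```sh
#!/bin/sh
# Sketch only. Assumes root, nvmet + nvme-loop loaded, configfs mounted,
# and nvme-cli installed. NQN, BACKING and port 1 are made-up names.
set -eu

CFS="${CFS:-/sys/kernel/config/nvmet}"           # nvmet configfs root
NQN="${NQN:-repro-sketch-subsys}"                # hypothetical subsystem NQN
BACKING="${BACKING:-/tmp/repro-sketch-ns.img}"   # hypothetical backing file

# Create a subsystem and expose it on a loop-transport port.
setup_target() {
	mkdir "$CFS/subsystems/$NQN"
	echo 1 > "$CFS/subsystems/$NQN/attr_allow_any_host"
	mkdir "$CFS/ports/1"
	echo loop > "$CFS/ports/1/addr_trtype"
	ln -s "$CFS/subsystems/$NQN" "$CFS/ports/1/subsystems/$NQN"
}

# Create namespace $1 backed by a real backstore and enable it.
add_ns() {
	mkdir "$CFS/subsystems/$NQN/namespaces/$1"
	echo "$BACKING" > "$CFS/subsystems/$NQN/namespaces/$1/device_path"
	echo 1 > "$CFS/subsystems/$NQN/namespaces/$1/enable"
}

# Disable and delete namespace $1 so the connected host has to drop it.
remove_ns() {
	echo 0 > "$CFS/subsystems/$NQN/namespaces/$1/enable"
	rmdir "$CFS/subsystems/$NQN/namespaces/$1"
}

main() {
	truncate -s 1G "$BACKING"
	setup_target
	nvme connect -t loop -n "$NQN"
	add_ns 1
	sleep 1          # crude wait for the host to scan the new namespace
	remove_ns 1
	sleep 1          # crude wait for the host to remove it again
	nvme disconnect -n "$NQN"
}

# Nothing touches the system unless RUN_REPRO=1 is set explicitly.
if [ "${RUN_REPRO:-0}" = 1 ]; then
	main
fi
```

Run it as `RUN_REPRO=1 sh repro.sh` (the file name is arbitrary). The sketch
does not tear down the port, subsystem or backing file afterwards, and a real
test would poll for the namespace instead of sleeping.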

I wrote a blktest for this specific regression and was able to trigger this
crash with it. I will submit this blktest in a separate email.

Thanks,
--Nilay




