[PATCH RFC for-5.8-rc] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work
Sagi Grimberg
sagi at grimberg.me
Mon Jun 29 03:44:29 EDT 2020
>> + /*
>> + * Controller deletion started, we may issue I/O, block and prevent
>> + * the controller deletion process from completing
>> + */
>> + if (ctrl->state == NVME_CTRL_DELETE_START)
>> + return;
>> +
>> /* No tagset on a live ctrl means IO queues could not created */
>> if (ctrl->state != NVME_CTRL_LIVE || !ctrl->tagset)
>
> Can we merge the checks into a single one?
It's actually redundant and already covered by state != LIVE; will remove.
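To see why the extra DELETE_START test adds nothing, here is a minimal sketch of the quoted early-return check. It assumes a simplified, hypothetical state enum and controller struct (the `sketch_` names are illustrative and are not the real `enum nvme_ctrl_state` or `struct nvme_ctrl`). Every state other than LIVE, DELETE_START included, already takes the early return, so the existing single condition covers both cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified subset of controller states; NOT the real
 * enum nvme_ctrl_state from the kernel. */
enum sketch_ctrl_state {
	SKETCH_CTRL_LIVE,
	SKETCH_CTRL_DELETE_START,
	SKETCH_CTRL_DELETING,
	SKETCH_CTRL_DEAD,
};

/* Hypothetical stand-in holding only the two fields the check reads. */
struct sketch_ctrl {
	enum sketch_ctrl_state state;
	void *tagset;
};

/* Returns true when the work may proceed. One condition is enough:
 * DELETE_START is already rejected by "state != LIVE". */
static bool sketch_work_allowed(const struct sketch_ctrl *ctrl)
{
	if (ctrl->state != SKETCH_CTRL_LIVE || !ctrl->tagset)
		return false;
	return true;
}
```

A LIVE controller with a tagset passes; a controller in DELETE_START, or a LIVE one without a tagset, is turned away by the same expression.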
>> @@ -3913,6 +3932,9 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>> if (ctrl->state == NVME_CTRL_DEAD)
>> nvme_kill_queues(ctrl);
>>
>> + /* prevent mpath I/O before removing namespaces */
>> + nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING);
>
> So with the DEAD state above isn't this going to cause problems,
> shouldn't this be:
>
> if (ctrl->state == NVME_CTRL_DEAD)
> nvme_kill_queues(ctrl);
> else
> nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING);
The DEAD state is only designed to get to this point and make sure
nvme_kill_queues is called. There shouldn't be any issue changing the
state here.
> But even with that I'm not sure it does the right thing for the direct
> call from the PCIe code.
From the PCI removal flow it seems OK. The only problem I now see
is when reset_work is only able to set up the admin queue and then
calls nvme_remove_namespaces... Since the controller is in a strange
state at that point, we should probably not move it to the DELETING state...
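One way to reconcile both points above (DEAD moving on to DELETING is harmless, but a half-reset controller should not) is a transition rule that only lets DELETING follow DELETE_START or DEAD. The sketch below is hypothetical: the state names follow this RFC, RESETTING stands in for "mid-reset, only the admin queue is up", and the rule set is an assumption for illustration, not the kernel's `nvme_change_ctrl_state()` table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified states using this RFC's naming. */
enum sketch_ctrl_state {
	SKETCH_CTRL_LIVE,
	SKETCH_CTRL_RESETTING,    /* stand-in for a reset still in progress */
	SKETCH_CTRL_DELETE_START, /* first removal phase */
	SKETCH_CTRL_DELETING,     /* second phase: no more mpath I/O */
	SKETCH_CTRL_DEAD,
};

/* Hypothetical rule: DELETING may only follow DELETE_START, or a DEAD
 * controller whose queues were already killed. Any other transition is
 * refused and the current state is left unchanged. */
static bool sketch_change_state(enum sketch_ctrl_state *state,
				enum sketch_ctrl_state new_state)
{
	bool ok = false;

	switch (new_state) {
	case SKETCH_CTRL_DELETING:
		ok = (*state == SKETCH_CTRL_DELETE_START ||
		      *state == SKETCH_CTRL_DEAD);
		break;
	default:
		/* Other transitions are out of scope for this sketch. */
		break;
	}
	if (ok)
		*state = new_state;
	return ok;
}
```

Under this rule the unconditional call in nvme_remove_namespaces() becomes a no-op for a controller that never entered the deletion path, which is the reset_work case.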
> Also I wonder about the state naming. Shouldn't NVME_CTRL_DELETE_START
> stay as NVME_CTRL_DELETING and the new state could be
> NVME_CTRL_NS_REMOVAL? or NVME_CTRL_DELETED? But with any name we'll
> need to document the difference between the two removal states.
Let me think some more about the naming...
>> + /*
>> + * We don't treat NVME_CTRL_DELETE_START as a disabled path
>> + * as we I/O should still be able to complete assuming that
>> + * the controller is connected, otherwize it'll fail
>> + * immediately and return to the requeue list.
>> + */
>
> This needs to run through a spell and grammar checker :)
Rilly? I kant find aniting rong with it...
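The policy in the quoted comment can be sketched as a path check in which DELETE_START still counts as a usable path (I/O is issued and either completes or fails back to the requeue list), while DELETING and every other non-LIVE state disable the path. This is a hypothetical helper using the RFC's state names, not the multipath code from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified states using this RFC's naming. */
enum sketch_ctrl_state {
	SKETCH_CTRL_LIVE,
	SKETCH_CTRL_DELETE_START,
	SKETCH_CTRL_DELETING,
	SKETCH_CTRL_DEAD,
};

/* Returns true when multipath should not send I/O down this path.
 * DELETE_START is deliberately NOT disabled: if the controller is still
 * connected the I/O completes, otherwise it fails and is requeued. */
static bool sketch_path_is_disabled(enum sketch_ctrl_state state)
{
	return state != SKETCH_CTRL_LIVE &&
	       state != SKETCH_CTRL_DELETE_START;
}
```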