[PATCHv2] NVMe: Fix reset/remove race

sagig sagi at grimberg.me
Sun Apr 3 09:38:46 PDT 2016



On 22/03/16 23:49, Keith Busch wrote:
> This fixes a scenario where device is present and being reset, but a
> request to unbind the driver occurs.
>
> A previous patch series addressing a device failure removal scenario
> flushed reset_work after controller disable to unblock reset_work waiting
> on a completion that wouldn't occur. This isn't safe as-is. The broken
> scenario can potentially be induced with:
>
>    modprobe nvme && modprobe -r nvme
>
> To fix, the reset work is flushed immediately after setting the controller
> removing flag, and any subsequent reset will not proceed with controller
> initialization if the flag is set.
>
> The controller status must be polled while active, so the watchdog timer
> is also left active until the controller is disabled to cleanup requests
> that may be stuck during namespace removal.
>
> [Fixes: ff23a2a15a2117245b4599c1352343c8b8fb4c43]
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
> v1->v2:
>    Removed the untested change on IO timeout handling that skipped queueing
>    reset work.
>
>   drivers/nvme/host/pci.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 24ccda3..660ec84 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1859,6 +1859,9 @@ static void nvme_reset_work(struct work_struct *work)
>   	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
>   		nvme_dev_disable(dev, false);
>   
> +	if (test_bit(NVME_CTRL_REMOVING, &dev->flags))
> +		goto out;
> +
>   	set_bit(NVME_CTRL_RESETTING, &dev->flags);
>   
>   	result = nvme_pci_enable(dev);
> @@ -2078,11 +2081,10 @@ static void nvme_remove(struct pci_dev *pdev)
>   {
>   	struct nvme_dev *dev = pci_get_drvdata(pdev);
>   
> -	del_timer_sync(&dev->watchdog_timer);
> -
>   	set_bit(NVME_CTRL_REMOVING, &dev->flags);
>   	pci_set_drvdata(pdev, NULL);
>   	flush_work(&dev->async_work);
> +	flush_work(&dev->reset_work);

Do we need the same for scan_work? AFAICT it can still
sneak in while we're removing...



More information about the Linux-nvme mailing list