[PATCH 2/3] nvme-multipath: cannot disconnect controller on stuck partition scan
Keith Busch
kbusch at kernel.org
Tue Oct 15 07:33:19 PDT 2024
On Thu, Oct 10, 2024 at 10:57:23AM +0200, Hannes Reinecke wrote:
> On 10/9/24 19:32, Keith Busch wrote:
> > @@ -974,14 +991,14 @@ void nvme_mpath_shutdown_disk(struct nvme_ns_head *head)
> >  		return;
> >  	if (test_and_clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
> >  		nvme_cdev_del(&head->cdev, &head->cdev_device);
> > +		/*
> > +		 * requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared
> > +		 * to allow multipath to fail all I/O.
> > +		 */
> > +		synchronize_srcu(&head->srcu);
> > +		kblockd_schedule_work(&head->requeue_work);
> >  		del_gendisk(head->disk);
> >  	}
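The ordering this hunk relies on (clear NVME_NSHEAD_DISK_LIVE, wait for SRCU readers, then kick the requeue work) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: struct head_model, requeue_work() and shutdown_disk() are hypothetical stand-ins, the requeue work runs synchronously here instead of on kblockd, and it assumes, as the patch comment implies, that a requeued I/O with no usable path is failed once the flag is clear and parked again while the flag is still set.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model of the patched shutdown path; not kernel code. */
enum { DISK_LIVE = 1u << 0 };

struct head_model {
	atomic_uint flags; /* stands in for head->flags */
	int pending;       /* I/Os parked on the requeue list */
	int failed;        /* I/Os completed with an error */
};

/* Models the requeue work: resubmit every parked I/O that had no usable path. */
static void requeue_work(struct head_model *h)
{
	int n = h->pending;

	h->pending = 0;
	for (int i = 0; i < n; i++) {
		if (atomic_load(&h->flags) & DISK_LIVE)
			h->pending++; /* disk still live: a path may return, park again */
		else
			h->failed++;  /* disk going away: fail the I/O */
	}
}

/* Models the patched nvme_mpath_shutdown_disk() ordering. */
static void shutdown_disk(struct head_model *h)
{
	unsigned int old = atomic_fetch_and(&h->flags, ~(unsigned int)DISK_LIVE);

	if (old & DISK_LIVE) {
		/* kernel: nvme_cdev_del(), then synchronize_srcu() */
		requeue_work(h); /* kernel: kblockd_schedule_work(), asynchronous */
		assert(h->pending == 0); /* with the flag clear, nothing is parked again */
		/* kernel: del_gendisk() */
	}
}
```

Running requeue_work() while the flag is set leaves the I/O parked; running shutdown_disk() fails all of it, which is why the requeue has to be kicked only after the flag is cleared.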
>
> I guess we need to split 'test_and_clear_bit()' into a 'test_bit()' when
> testing for the condition and a 'clear_bit()' after del_gendisk().
>
> Otherwise we have a race condition with nvme_mpath_set_live():
>
> 	if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
> 		rc = device_add_disk(&head->subsys->dev, head->disk,
> 				     nvme_ns_attr_groups);
>
> which could sneak in from another controller just after we cleared the
> NVME_NSHEAD_DISK_LIVE flag, causing device_add_disk() to fail as the
> same name is already registered.
> Or causing nvme_cdev_del() to display nice kernel warnings because the
> cdev was not registered.
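To make the window described above concrete, here is a single-threaded model of the two orderings, with a callback standing in for set_live running on another controller between the flag update and del_gendisk(). Everything in it is hypothetical (struct head_model, set_live(), the shutdown_clear_* variants, race_result()); a bool models the disk name being registered, and -EEXIST is only an assumed stand-in for however device_add_disk() reports the duplicate name.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of the interleaving; not kernel code. */
enum { DISK_LIVE = 1u << 0 };

struct head_model {
	atomic_uint flags;    /* stands in for head->flags */
	bool name_registered; /* stands in for the gendisk name being taken */
};

typedef void (*window_fn)(struct head_model *h);

/* Models nvme_mpath_set_live(): only the caller that sets the bit adds the disk. */
static int set_live(struct head_model *h)
{
	if (!(atomic_fetch_or(&h->flags, DISK_LIVE) & DISK_LIVE)) {
		if (h->name_registered)
			return -EEXIST; /* device_add_disk() would fail here */
		h->name_registered = true;
	}
	return 0;
}

/* Patch as posted: bit cleared before del_gendisk(). */
static void shutdown_clear_first(struct head_model *h, window_fn window)
{
	if (atomic_fetch_and(&h->flags, ~(unsigned int)DISK_LIVE) & DISK_LIVE) {
		window(h);                  /* another controller runs here */
		h->name_registered = false; /* del_gendisk() */
	}
}

/* Suggested ordering: test_bit(), tear down, then clear_bit(). */
static void shutdown_clear_last(struct head_model *h, window_fn window)
{
	if (atomic_load(&h->flags) & DISK_LIVE) {
		window(h);                  /* another controller runs here */
		h->name_registered = false; /* del_gendisk() */
		atomic_fetch_and(&h->flags, ~(unsigned int)DISK_LIVE);
	}
}

static int window_rc; /* result of the set_live() that ran inside the window */

static void racing_set_live(struct head_model *h)
{
	window_rc = set_live(h);
}

/* Bring the disk up, then shut it down with a racing set_live() in the window. */
static int race_result(void (*shutdown_fn)(struct head_model *, window_fn))
{
	struct head_model h;
	int rc;

	atomic_init(&h.flags, 0u);
	h.name_registered = false;
	rc = set_live(&h);
	assert(rc == 0 && h.name_registered);
	window_rc = 0;
	shutdown_fn(&h, racing_set_live);
	return window_rc;
}
```

race_result(shutdown_clear_first) returns -EEXIST and race_result(shutdown_clear_last) returns 0. The model only captures the name collision; it says nothing about whether this interleaving can actually occur, which is what the reply questions.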
Is that actually happening for you? I don't think it's supposed to
happen, because nvme_mpath_shutdown_disk() only runs once the head has
been detached from the subsystem, so no other thread should be able to
access that head's disk through a different controller path. A new
controller would instead have to allocate an entirely new head.
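This argument rests on lookup order: once a head is unlinked from the subsystem, a controller path arriving later cannot find it and allocates a new head, so nothing can race on the old head's LIVE bit. A rough sketch of that lookup, assuming a simple per-subsystem list keyed by namespace ID; struct subsys_model, attach_path(), detach_head() and the rest are hypothetical names, and the locking the real driver needs is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of the head lookup; names are illustrative, locking omitted. */
struct head_model {
	unsigned int nsid;
	struct head_model *next; /* link in the subsystem's list of heads */
};

struct subsys_model {
	struct head_model *heads;
};

static struct head_model *find_head(struct subsys_model *s, unsigned int nsid)
{
	for (struct head_model *h = s->heads; h; h = h->next)
		if (h->nsid == nsid)
			return h;
	return NULL;
}

/* A controller path for nsid: reuse an attached head, else allocate a new one. */
static struct head_model *attach_path(struct subsys_model *s, unsigned int nsid)
{
	struct head_model *h = find_head(s, nsid);

	if (h)
		return h;
	h = calloc(1, sizeof(*h));
	if (!h)
		return NULL;
	h->nsid = nsid;
	h->next = s->heads;
	s->heads = h;
	return h;
}

/* Unlink the head; in this model its disk is shut down only after this. */
static void detach_head(struct subsys_model *s, struct head_model *h)
{
	for (struct head_model **pp = &s->heads; *pp; pp = &(*pp)->next) {
		if (*pp == h) {
			*pp = h->next;
			h->next = NULL;
			return;
		}
	}
}

/* Does a path that attaches after detach_head() see the old head? */
static bool new_path_reuses_detached_head(void)
{
	struct subsys_model s = { NULL };
	struct head_model *old = attach_path(&s, 1);
	struct head_model *again = attach_path(&s, 1);
	struct head_model *fresh;
	bool reused;

	assert(old && again == old); /* paths share the head while it is attached */
	detach_head(&s, old);
	fresh = attach_path(&s, 1);
	reused = (fresh == old);
	free(old);
	free(fresh);
	return reused;
}
```

new_path_reuses_detached_head() returns false: the path that attaches after the detach gets a distinct head, so the old head's flag is never touched from that side.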
More information about the Linux-nvme mailing list