Leaking minor device numbers

Keith Busch kbusch at kernel.org
Fri Oct 7 12:15:43 PDT 2022


On Fri, Oct 07, 2022 at 02:37:10PM +0200, Daniel Wagner wrote:
> I've noticed the nvme subsystem is leaking minor device numbers:
> 
> # nvme connect-all
> 
> brw-rw---- 1 root disk 259,   3 Oct  6 18:58 /dev/nvme2n1
> brw-rw---- 1 root disk 259,   5 Oct  6 18:58 /dev/nvme2n2
> brw-rw---- 1 root disk 259,   7 Oct  6 18:58 /dev/nvme2n3
> brw-rw---- 1 root disk 259,   9 Oct  6 18:58 /dev/nvme2n4
> 
> # while true; do nvme connect-all; nvme disconnect-all; done
> 
> brw-rw---- 1 root disk 259, 7913 Oct  6 19:13 /dev/nvme2n1
> brw-rw---- 1 root disk 259, 7929 Oct  6 19:13 /dev/nvme2n2
> brw-rw---- 1 root disk 259, 7943 Oct  6 19:13 /dev/nvme2n3
> brw-rw---- 1 root disk 259, 7945 Oct  6 19:13 /dev/nvme2n4
> 
> After a bit printk debugging I found the source of the
> leak. nvme_alloc_ns() and nvme_alloc_ns_head() (nvme_mpath_set_live)
> will allocate for each disk via blk_alloc_ext_minor() a minor device
> number, though using two different interfaces.
> 
> nvme_mpath_set_live() will uses device_add_disk() which is also setting
> the major number to 259/BLOCK_EXT_MAJOR. This allocation will be freed.
> 
> Though the other allocation happens via blk_mq_alloc_disk() which
> doesn't set the major number to 259/BOCK_EXT_MAJOR and eventually when
> the disk is freed, we end up in bdev_free_inode() and do not call
> blk_free_ext_minor():
> 
> static void bdev_free_inode(struct inode *inode)
> {
> 	[...]
> 
> 	if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR)
> 		blk_free_ext_minor(MINOR(bdev->bd_dev));
> 
> 	[...]
> }
> 
> I am a bit lost what is supposed to happen or how those things are
> connected together. Any ideas?

Thanks for the detailed report.

I see we are leaking the minors of the HIDDEN block devices as we never
set the dev_t for them in device_add_disk():

	if (!(disk->flags & GENHD_FL_HIDDEN))
		ddev->devt = MKDEV(disk->major, disk->first_minor);

We could remove that HIDDEN check in order to set it unconditionally,
but then the device nodes would appear under /dev/, which is what we
don't want to happen.

It sounds like the del_gendisk needs to take on this responsibility
directly for HIDDEN disks. The below appears to work, but I'm not sure
this is respecting the necessary dev_t lifetime if user space is holding
a reference on the namespace.

---
diff --git a/block/genhd.c b/block/genhd.c
index 514395361d7c..b40b66ed30ec 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -611,6 +611,8 @@ void del_gendisk(struct gendisk *disk)
 		 * get reused and we'd get clashes in sysfs).
 		 */
 		bdi_unregister(disk->bdi);
+	} else if (disk->major == BLOCK_EXT_MAJOR) {
+		blk_free_ext_minor(disk->first_minor);
 	}

 	blk_unregister_queue(disk);
--



More information about the Linux-nvme mailing list