Leaking minor device numbers
Keith Busch
kbusch at kernel.org
Fri Oct 7 12:15:43 PDT 2022
On Fri, Oct 07, 2022 at 02:37:10PM +0200, Daniel Wagner wrote:
> I've noticed the nvme subsystem is leaking minor device numbers:
>
> # nvme connect-all
>
> brw-rw---- 1 root disk 259, 3 Oct 6 18:58 /dev/nvme2n1
> brw-rw---- 1 root disk 259, 5 Oct 6 18:58 /dev/nvme2n2
> brw-rw---- 1 root disk 259, 7 Oct 6 18:58 /dev/nvme2n3
> brw-rw---- 1 root disk 259, 9 Oct 6 18:58 /dev/nvme2n4
>
> # while true; do nvme connect-all; nvme disconnect-all; done
>
> brw-rw---- 1 root disk 259, 7913 Oct 6 19:13 /dev/nvme2n1
> brw-rw---- 1 root disk 259, 7929 Oct 6 19:13 /dev/nvme2n2
> brw-rw---- 1 root disk 259, 7943 Oct 6 19:13 /dev/nvme2n3
> brw-rw---- 1 root disk 259, 7945 Oct 6 19:13 /dev/nvme2n4
>
> After a bit printk debugging I found the source of the
> leak. nvme_alloc_ns() and nvme_alloc_ns_head() (nvme_mpath_set_live)
> will allocate for each disk via blk_alloc_ext_minor() a minor device
> number, though using two different interfaces.
>
> nvme_mpath_set_live() will uses device_add_disk() which is also setting
> the major number to 259/BLOCK_EXT_MAJOR. This allocation will be freed.
>
> Though the other allocation happens via blk_mq_alloc_disk() which
> doesn't set the major number to 259/BOCK_EXT_MAJOR and eventually when
> the disk is freed, we end up in bdev_free_inode() and do not call
> blk_free_ext_minor():
>
> static void bdev_free_inode(struct inode *inode)
> {
> [...]
>
> if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR)
> blk_free_ext_minor(MINOR(bdev->bd_dev));
>
> [...]
> }
>
> I am a bit lost what is supposed to happen or how those things are
> connected together. Any ideas?
Thanks for the detailed report.
I see we are leaking the minors of the HIDDEN block devices as we never
set the dev_t for them in device_add_disk():
if (!(disk->flags & GENHD_FL_HIDDEN))
ddev->devt = MKDEV(disk->major, disk->first_minor);
We could remove that HIDDEN check in order to set it unconditionally,
but then the device nodes would appear under /dev/, which is what we
don't want to happen.
It sounds like the del_gendisk needs to take on this responsibility
directly for HIDDEN disks. The below appears to work, but I'm not sure
this is respecting the necessary dev_t lifetime if user space is holding
a reference on the namespace.
---
diff --git a/block/genhd.c b/block/genhd.c
index 514395361d7c..b40b66ed30ec 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -611,6 +611,8 @@ void del_gendisk(struct gendisk *disk)
* get reused and we'd get clashes in sysfs).
*/
bdi_unregister(disk->bdi);
+ } else if (disk->major == BLOCK_EXT_MAJOR) {
+ blk_free_ext_minor(disk->first_minor);
}
blk_unregister_queue(disk);
--
More information about the Linux-nvme
mailing list