[PATCHv6 RFC 0/3] Add visibility for native NVMe multipath using sysfs

Nilay Shroff nilay at linux.ibm.com
Sun Jan 12 04:18:13 PST 2025



On 1/10/25 9:17 PM, Keith Busch wrote:
> On Wed, Jan 08, 2025 at 09:47:48AM -0700, Keith Busch wrote:
>> On Fri, Dec 13, 2024 at 09:48:33AM +0530, Nilay Shroff wrote:
>>> This RFC propose adding new sysfs attributes for adding visibility of
>>> nvme native multipath I/O.
>>>
>>> The changes are divided into three patches.
>>> The first patch adds visibility for round-robin io-policy.
>>> The second patch adds visibility for numa io-policy.
>>> The third patch adds the visibility for queue-depth io-policy.
>>
>> Thanks, applied to nvme-6.14.
> 
> I think I have to back this out of nvme-6.14 for now. This appears to be
> causing a problem with blktests, test case trtype = loop nvme/058, as
> reported by Chaitanya.
> 
> Here's a snippet of the kernel messages related to this:
> 
> [ 9031.706759] sysfs: cannot create duplicate filename '/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n2/multipath/nvme1c4n2'
> [ 9031.706767] CPU: 41 UID: 0 PID: 52494 Comm: kworker/u192:61 Tainted:G        W  O     N 6.13.0-rc4nvme+ #109
> [ 9031.706775] Tainted: [W]=WARN, [O]=OOT_MODULE, [N]=TEST
> [ 9031.706777] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 9031.706781] Workqueue: async async_run_entry_fn
> [ 9031.706790] Call Trace:
> [ 9031.706795]  <TASK>
> [ 9031.706798]  dump_stack_lvl+0x94/0xb0
> [ 9031.706806]  sysfs_warn_dup+0x5b/0x70
> [ 9031.706812]  sysfs_do_create_link_sd+0xce/0xe0
> [ 9031.706817]  sysfs_add_link_to_group+0x35/0x60
> [ 9031.706823]  nvme_mpath_add_sysfs_link+0xc3/0x160 [nvme_core]
> [ 9031.706848]  nvme_mpath_set_live+0xb9/0x1f0 [nvme_core]
> [ 9031.706865]  nvme_mpath_add_disk+0x10b/0x130 [nvme_core]
> [ 9031.706883]  nvme_alloc_ns+0x8d5/0xc80 [nvme_core]
> [ 9031.706904]  nvme_scan_ns+0x280/0x350 [nvme_core]
> [ 9031.706920]  ? do_raw_spin_unlock+0x4e/0xc0
> [ 9031.706929]  async_run_entry_fn+0x31/0x130
> [ 9031.706934]  process_one_work+0x1f9/0x630
> [ 9031.706943]  worker_thread+0x191/0x330
> [ 9031.706948]  ? __pfx_worker_thread+0x10/0x10
> [ 9031.706952]  kthread+0xe1/0x120
> [ 9031.706956]  ? __pfx_kthread+0x10/0x10
> [ 9031.706959]  ret_from_fork+0x31/0x50
> [ 9031.706965]  ? __pfx_kthread+0x10/0x10
> [ 9031.706968]  ret_from_fork_asm+0x1a/0x30
> [ 9031.706980]  </TASK>
> [ 9031.707062] block nvme1n2: failed to create link to nvme1c4n2
> 
> 
Thank you for the report! Yes, indeed it fails with trtype=loop and nvme/058.
I investigated further and found that nvme/058 creates 3 shared namespaces and
attaches them to 6 different controllers. It then rapidly, in quick succession,
unmaps and remaps those namespaces in random order, which causes multiple nvme
paths to be added and removed simultaneously on the host. During those
concurrent add/remove operations we sometimes hit the symptom reported above.

So we have to protect nvme_mpath_add_sysfs_link() against simultaneous addition
and removal of ns paths. Fortunately that is not difficult. There are two things
we need to ensure:

1. Don't try to recreate the sysfs link if it has already been created:
The current code sets the NVME_NS_SYSFS_ATTR_LINK flag in ns->flags once the link
from the head node to the ns path node has been added, and uses test_bit() to check
whether that flag is set; if it is not set, it creates the link and then sets the
flag. That check-then-set sequence is not atomic, so two concurrent callers can
both observe the flag clear and both attempt to create the link. We need to replace
test_bit() with test_and_set_bit(), which tests and sets the flag in a single
atomic operation.

2. Don't create the link from the head node to the ns path node before the disks
are added:
Since the sysfs link is created between the kobjects of the head dev node and the
ns path dev node, we must ensure that device_add_disk() has successfully returned
for both the head disk and the path disk before we attempt to create the link;
otherwise sysfs/kernfs complains loudly. So we simply test the GD_ADDED flag on
both the head disk and the path disk, and only attempt to create the sysfs link
once both disks have been added.

I have made the above two changes and tested the code against blktests nvme/058.
The test now passes without any issue; I ran it hundreds of times and it passed
every iteration. With these two changes in place, I will spin a new version of
the patch and submit it upstream. Please help review it.

Thanks,
--Nilay
