[PATCH] nvmet-loop: avoid using mutex in IO hotpath

Hannes Reinecke hare at suse.de
Tue Dec 3 00:38:57 PST 2024


On 11/29/24 11:48, Nilay Shroff wrote:
> Using a mutex lock in the IO hot path triggers a kernel "sleeping while
> atomic" BUG. Shinichiro[1] first encountered this issue while running
> blktests test nvme/052, as shown below:
> 
> BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
> in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
> preempt_count: 0, expected: 0
> RCU nest depth: 1, expected: 0
> 2 locks held by (udev-worker)/996:
>   #0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
>   #1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
> CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
> Call Trace:
>   <TASK>
>   dump_stack_lvl+0x6a/0x90
>   __might_resched.cold+0x1f7/0x23d
>   ? __pfx___might_resched+0x10/0x10
>   ? vsnprintf+0xdeb/0x18f0
>   __mutex_lock+0xf4/0x1220
>   ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
>   ? __pfx_vsnprintf+0x10/0x10
>   ? __pfx___mutex_lock+0x10/0x10
>   ? snprintf+0xa5/0xe0
>   ? xas_load+0x1ce/0x3f0
>   ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
>   nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
>   ? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
>   nvmet_req_find_ns+0x24e/0x300 [nvmet]
>   nvmet_req_init+0x694/0xd40 [nvmet]
>   ? blk_mq_start_request+0x11c/0x750
>   ? nvme_setup_cmd+0x369/0x990 [nvme_core]
>   nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
>   ? __pfx___lock_acquire+0x10/0x10
>   ? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
>   __blk_mq_issue_directly+0xe2/0x1d0
>   ? __pfx___blk_mq_issue_directly+0x10/0x10
>   ? blk_mq_request_issue_directly+0xc2/0x140
>   blk_mq_plug_issue_direct+0x13f/0x630
>   ? lock_acquire+0x2d/0xc0
>   ? blk_mq_flush_plug_list+0xa75/0x1950
>   blk_mq_flush_plug_list+0xa9d/0x1950
>   ? __pfx_blk_mq_flush_plug_list+0x10/0x10
>   ? __pfx_mpage_readahead+0x10/0x10
>   __blk_flush_plug+0x278/0x4d0
>   ? __pfx___blk_flush_plug+0x10/0x10
>   ? lock_release+0x460/0x7a0
>   blk_finish_plug+0x4e/0x90
>   read_pages+0x51b/0xbc0
>   ? __pfx_read_pages+0x10/0x10
>   ? lock_release+0x460/0x7a0
>   page_cache_ra_unbounded+0x326/0x5c0
>   force_page_cache_ra+0x1ea/0x2f0
>   filemap_get_pages+0x59e/0x17b0
>   ? __pfx_filemap_get_pages+0x10/0x10
>   ? lock_is_held_type+0xd5/0x130
>   ? __pfx___might_resched+0x10/0x10
>   ? find_held_lock+0x2d/0x110
>   filemap_read+0x317/0xb70
>   ? up_write+0x1ba/0x510
>   ? __pfx_filemap_read+0x10/0x10
>   ? inode_security+0x54/0xf0
>   ? selinux_file_permission+0x36d/0x420
>   blkdev_read_iter+0x143/0x3b0
>   vfs_read+0x6ac/0xa20
>   ? __pfx_vfs_read+0x10/0x10
>   ? __pfx_vm_mmap_pgoff+0x10/0x10
>   ? __pfx___seccomp_filter+0x10/0x10
>   ksys_read+0xf7/0x1d0
>   ? __pfx_ksys_read+0x10/0x10
>   do_syscall_64+0x93/0x180
>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on+0x78/0x100
>   ? do_syscall_64+0x9f/0x180
>   ? lockdep_hardirqs_on_prepare+0x16d/0x400
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> RIP: 0033:0x7f565bd1ce11
> Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
> RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
> RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
> RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
> R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
> R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
>   </TASK>
> 
> Apparently, the above issue is caused by taking a mutex lock while
> we're in the IO hot path. It's a regression introduced by commit
> 505363957fad ("nvmet: fix nvme status code when namespace is disabled").
> The mutex ->su_mutex is used to check whether a disabled nsid exists in
> the config group or not, so that a nsid that is disabled can be
> differentiated from one that is non-existent.
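
[For context, the check that drags ->su_mutex into the hot path works
roughly like the sketch below; this is a paraphrase of
nvmet_subsys_nsid_exists(), not the exact code, and the demo_* name is
made up:

/* Paraphrase of the old lookup: build the nsid name and search the
 * configfs group under su_mutex.  mutex_lock() may sleep, which is
 * what blows up when this runs from queue_rq() inside an RCU read
 * section. */
static bool demo_nsid_exists_in_configfs(struct nvmet_subsys *subsys, u32 nsid)
{
	struct config_item *item;
	char name[12];
	bool found;

	snprintf(name, sizeof(name), "%u", nsid);
	mutex_lock(&subsys->namespaces_group.cg_subsys->su_mutex);
	item = config_group_find_item(&subsys->namespaces_group, name);
	mutex_unlock(&subsys->namespaces_group.cg_subsys->su_mutex);

	found = item != NULL;
	config_item_put(item);	/* config_item_put() tolerates NULL */
	return found;
}
]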
> 
> To mitigate the above issue, we've worked on a fix[2] where we now
> insert the nsid into the subsys Xarray as soon as it's created under the
> config group. Later, when that nsid is enabled, we add an Xarray mark on
> it and set ns->enabled to true. The Xarray mark is useful when we need
> to loop through all enabled namespaces under a subsystem using the
> xa_for_each_marked() API. If a nsid is later disabled, we clear the
> Xarray mark from it and also set ns->enabled to false. Only when the
> nsid is deleted from the config group do we delete it from the Xarray.
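
[A minimal sketch of that lifecycle, assuming the existing
subsys->namespaces XArray and the ns->enabled flag; the NVMET_NS_ENABLED
mark and the demo_* helpers are illustrative, not lifted from the patch:

/* Illustrative only.  Assumes one XArray mark is reserved to mean
 * "namespace is enabled". */
#define NVMET_NS_ENABLED	XA_MARK_1

/* configfs: nsid directory created -> insert into the subsystem XArray */
static int demo_ns_create(struct nvmet_subsys *subsys, struct nvmet_ns *ns)
{
	return xa_insert(&subsys->namespaces, ns->nsid, ns, GFP_KERNEL);
}

/* nsid enabled -> mark the entry and flag the namespace */
static void demo_ns_enable(struct nvmet_subsys *subsys, struct nvmet_ns *ns)
{
	ns->enabled = true;
	xa_set_mark(&subsys->namespaces, ns->nsid, NVMET_NS_ENABLED);
}

/* nsid disabled -> clear the mark but keep the entry */
static void demo_ns_disable(struct nvmet_subsys *subsys, struct nvmet_ns *ns)
{
	ns->enabled = false;
	xa_clear_mark(&subsys->namespaces, ns->nsid, NVMET_NS_ENABLED);
}

/* nsid removed from configfs -> finally drop it from the XArray */
static void demo_ns_delete(struct nvmet_subsys *subsys, struct nvmet_ns *ns)
{
	xa_erase(&subsys->namespaces, ns->nsid);
}

/* walk only the enabled namespaces of a subsystem */
static void demo_for_each_enabled(struct nvmet_subsys *subsys)
{
	struct nvmet_ns *ns;
	unsigned long idx;

	xa_for_each_marked(&subsys->namespaces, idx, ns, NVMET_NS_ENABLED)
		pr_debug("nsid %u is enabled\n", ns->nsid);
}
]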
> 
> So with this change, we can now easily differentiate between a nsid that
> is disabled (i.e. the Xarray entry for the ns exists but ns->enabled is
> set to false) and one that is non-existent (i.e. the Xarray entry for
> the ns doesn't exist).
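
[Which would let the hot-path lookup distinguish the two cases without
ever touching ->su_mutex; again only a sketch, with illustrative status
codes and none of the percpu-ref handling the real nvmet_req_find_ns()
does:

/* Illustrative lookup: differentiate "disabled" from "non-existent"
 * using only the XArray entry and ns->enabled. */
static u16 demo_find_ns(struct nvmet_req *req, u32 nsid)
{
	struct nvmet_ns *ns;

	ns = xa_load(&nvmet_req_subsys(req)->namespaces, nsid);
	if (!ns)		/* no XArray entry: nsid was never created */
		return NVME_SC_INVALID_NS | NVME_SC_DNR;
	if (!ns->enabled)	/* entry exists but unmarked: nsid is disabled */
		return NVME_SC_INTERNAL_PATH_ERROR;

	req->ns = ns;
	return NVME_SC_SUCCESS;
}
]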
> 
> Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
> Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki at wdc.com>
> Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
> Fixes: 505363957fad ("nvmet: fix nvme status code when namespace is disabled")
> Fix-suggested-by: Christoph Hellwig <hch at lst.de>
> Signed-off-by: Nilay Shroff <nilay at linux.ibm.com>
> ---
>   drivers/nvme/target/admin-cmd.c |  13 ++--
>   drivers/nvme/target/core.c      | 108 +++++++++++++++++++-------------
>   drivers/nvme/target/nvmet.h     |   1 +
>   drivers/nvme/target/pr.c        |  10 +--
>   4 files changed, 79 insertions(+), 53 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare at suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


