[bug report] WARNING: possible circular locking at: rdma_destroy_id+0x17/0x20 [rdma_cm] triggered by blktests nvmeof-mp/002
Sagi Grimberg
sagi at grimberg.me
Wed May 25 04:01:03 PDT 2022
> Hello
> Below WARNING triggered by blktests nvmeof-mp/002, it can be
> reproduced with the latest 5.16.0-rc3 and also exists on 5.12, pls
> check it.
iirc this was reported before, based on my analysis lockdep is giving
a false alarm here. The reason is that the id_priv->handler_mutex cannot
be the same for both cm_id that is handling the connect and the cm_id
that is handling the rdma_destroy_id because rdma_destroy_id call
is always called on a already disconnected cm_id, so this deadlock
lockdep is complaining about cannot happen.
I'm not sure how to settle this.
>
> # use_siw=1 ./check nvmeof-mp/002
> nvmeof-mp/002 (File I/O on top of multipath concurrently with logout
> and login (mq)) [passed]
> runtime 41.767s ... 36.877s
> # ./check nvmeof-mp/002
> nvmeof-mp/002 (File I/O on top of multipath concurrently with logout
> and login (mq)) [passed]
> runtime ... 41.277s
>
> [ 609.166100] run blktests nvmeof-mp/002 at 2021-12-03 21:27:52
> [ 609.977883] null_blk: module loaded
> [ 610.538697] rdma_rxe: loaded
> [ 610.602763] infiniband eno1_rxe: set active
> [ 610.602782] infiniband eno1_rxe: added eno1
> [ 610.639520] eno2 speed is unknown, defaulting to 1000
> [ 610.644946] eno2 speed is unknown, defaulting to 1000
> [ 610.650318] eno2 speed is unknown, defaulting to 1000
> [ 610.690546] infiniband eno2_rxe: set down
> [ 610.690553] infiniband eno2_rxe: added eno2
> [ 610.691112] eno2 speed is unknown, defaulting to 1000
> [ 610.714293] eno2 speed is unknown, defaulting to 1000
> [ 610.729535] eno3 speed is unknown, defaulting to 1000
> [ 610.734666] eno3 speed is unknown, defaulting to 1000
> [ 610.740026] eno3 speed is unknown, defaulting to 1000
> [ 610.781372] infiniband eno3_rxe: set down
> [ 610.781379] infiniband eno3_rxe: added eno3
> [ 610.781906] eno3 speed is unknown, defaulting to 1000
> [ 610.806211] eno2 speed is unknown, defaulting to 1000
> [ 610.811508] eno3 speed is unknown, defaulting to 1000
> [ 610.826655] eno4 speed is unknown, defaulting to 1000
> [ 610.831810] eno4 speed is unknown, defaulting to 1000
> [ 610.837176] eno4 speed is unknown, defaulting to 1000
> [ 610.885844] infiniband eno4_rxe: set down
> [ 610.885850] infiniband eno4_rxe: added eno4
> [ 610.886426] eno4 speed is unknown, defaulting to 1000
> [ 611.266756] Loading iSCSI transport class v2.0-870.
> [ 611.335358] iscsi: registered transport (iser)
> [ 611.387969] nvmet: adding nsid 1 to subsystem nvme-test
> [ 611.403878] eno2 speed is unknown, defaulting to 1000
> [ 611.409276] eno3 speed is unknown, defaulting to 1000
> [ 611.414522] eno4 speed is unknown, defaulting to 1000
> [ 611.468712] Rounding down aligned max_sectors from 4294967295 to 4294967288
> [ 611.469203] db_root: cannot open: /etc/target
> [ 611.515816] nvmet_rdma: enabling port 1 (10.16.221.116:7777)
> [ 611.622158] eno2 speed is unknown, defaulting to 1000
> [ 611.628243] eno3 speed is unknown, defaulting to 1000
> [ 611.633422] eno4 speed is unknown, defaulting to 1000
> [ 611.703648] eno2 speed is unknown, defaulting to 1000
> [ 611.708949] eno3 speed is unknown, defaulting to 1000
> [ 611.714267] eno4 speed is unknown, defaulting to 1000
> [ 611.754520] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: address
> resolved (0): status 0 id 00000000d0d64a6d
> [ 611.765454] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: route
> resolved (2): status 0 id 00000000d0d64a6d
> [ 611.767778] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: connect
> request (4): status 0 id 000000007ccc375a
> [ 611.767935] nvmet_rdma:nvmet_rdma_find_get_device: nvmet_rdma:
> added eno1_rxe.
> [ 611.777083] nvmet_rdma:nvmet_rdma_create_queue_ib: nvmet_rdma:
> nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 32 sq_size = 289
> cm_id= 000000007ccc375a
> --snip--
> [ 612.132317] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: established
> (9): status 0 id 00000000cede4b76
> [ 612.132452] nvme nvme0: mapped 32/0/0 default/read/poll queues.
> [ 612.132537] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> established (9): status 0 id 0000000070c2f22c
> [ 612.157849] nvmet:nvmet_execute_io_connect: nvmet: adding queue 1 to ctrl 1.
> [ 612.158194] nvmet:nvmet_execute_io_connect: nvmet: adding queue 2 to ctrl 1.
> [ 612.158548] nvmet:nvmet_execute_io_connect: nvmet: adding queue 3 to ctrl 1.
> [ 612.158854] nvmet:nvmet_execute_io_connect: nvmet: adding queue 4 to ctrl 1.
> [ 612.159173] nvmet:nvmet_execute_io_connect: nvmet: adding queue 5 to ctrl 1.
> [ 612.159510] nvmet:nvmet_execute_io_connect: nvmet: adding queue 6 to ctrl 1.
> [ 612.159853] nvmet:nvmet_execute_io_connect: nvmet: adding queue 7 to ctrl 1.
> [ 612.160225] nvmet:nvmet_execute_io_connect: nvmet: adding queue 8 to ctrl 1.
> [ 612.160572] nvmet:nvmet_execute_io_connect: nvmet: adding queue 9 to ctrl 1.
> [ 612.160885] nvmet:nvmet_execute_io_connect: nvmet: adding queue 10 to ctrl 1.
> [ 612.161216] nvmet:nvmet_execute_io_connect: nvmet: adding queue 11 to ctrl 1.
> [ 612.161759] nvmet:nvmet_execute_io_connect: nvmet: adding queue 12 to ctrl 1.
> [ 612.162123] nvmet:nvmet_execute_io_connect: nvmet: adding queue 13 to ctrl 1.
> [ 612.162473] nvmet:nvmet_execute_io_connect: nvmet: adding queue 14 to ctrl 1.
> [ 612.162793] nvmet:nvmet_execute_io_connect: nvmet: adding queue 15 to ctrl 1.
> [ 612.163135] nvmet:nvmet_execute_io_connect: nvmet: adding queue 16 to ctrl 1.
> [ 612.163458] nvmet:nvmet_execute_io_connect: nvmet: adding queue 17 to ctrl 1.
> [ 612.163776] nvmet:nvmet_execute_io_connect: nvmet: adding queue 18 to ctrl 1.
> [ 612.164103] nvmet:nvmet_execute_io_connect: nvmet: adding queue 19 to ctrl 1.
> [ 612.164417] nvmet:nvmet_execute_io_connect: nvmet: adding queue 20 to ctrl 1.
> [ 612.164700] nvmet:nvmet_execute_io_connect: nvmet: adding queue 21 to ctrl 1.
> [ 612.164971] nvmet:nvmet_execute_io_connect: nvmet: adding queue 22 to ctrl 1.
> [ 612.165275] nvmet:nvmet_execute_io_connect: nvmet: adding queue 23 to ctrl 1.
> [ 612.165689] nvmet:nvmet_execute_io_connect: nvmet: adding queue 24 to ctrl 1.
> [ 612.166082] nvmet:nvmet_execute_io_connect: nvmet: adding queue 25 to ctrl 1.
> [ 612.166383] nvmet:nvmet_execute_io_connect: nvmet: adding queue 26 to ctrl 1.
> [ 612.166663] nvmet:nvmet_execute_io_connect: nvmet: adding queue 27 to ctrl 1.
> [ 612.166963] nvmet:nvmet_execute_io_connect: nvmet: adding queue 28 to ctrl 1.
> [ 612.167273] nvmet:nvmet_execute_io_connect: nvmet: adding queue 29 to ctrl 1.
> [ 612.167529] nvmet:nvmet_execute_io_connect: nvmet: adding queue 30 to ctrl 1.
> [ 612.167804] nvmet:nvmet_execute_io_connect: nvmet: adding queue 31 to ctrl 1.
> [ 612.168114] nvmet:nvmet_execute_io_connect: nvmet: adding queue 32 to ctrl 1.
> [ 612.168523] nvme nvme0: new ctrl: NQN "nvme-test", addr 10.16.221.116:7777
> [ 612.293734] device-mapper: multipath service-time: version 0.3.0 loaded
> [ 612.296099] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [ 612.303676] device-mapper: ioctl: error adding target to table
> [ 612.501014] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [ 612.508751] device-mapper: ioctl: error adding target to table
> [ 612.514717] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [ 612.522632] device-mapper: ioctl: error adding target to table
> [ 612.950887] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [ 612.958616] device-mapper: ioctl: error adding target to table
> [ 613.869348] EXT4-fs (dm-3): mounted filesystem without journal.
> Opts: (null). Quota mode: none.
> [ 613.869619] ext4 filesystem being mounted at
> /root/kernel-tests/storage/blktests/nvme/nvmeof-mp/blktests/results/tmpdir.nvmeof-mp.002.ZYO/mnt0
> supports timestamps until 2038 (0x7fffffff)
> [ 617.144570] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [ 617.272414] nvmet:nvmet_execute_keep_alive: nvmet: ctrl 1 update
> keep-alive timer for 5 secs
> [ 622.775724] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [ 627.895810] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [ 633.015369] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [ 638.138209] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [ 643.254966] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [ 644.822178] eno2 speed is unknown, defaulting to 1000
> [ 644.827494] eno3 speed is unknown, defaulting to 1000
> [ 644.832797] eno4 speed is unknown, defaulting to 1000
> [ 648.377790] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [ 651.364570] nvme nvme0: Removing ctrl: NQN "nvme-test"
> [ 651.430891] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> disconnected (10): status 0 id 000000009ae8bc67
> [ 651.430914] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
> cm_id= 000000009ae8bc67 queue->state= 1
> [ 651.430983] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> disconnected (10): status 0 id 0000000078827089
> [ 651.430994] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
> cm_id= 0000000078827089 queue->state= 1
> --snip--
> [ 651.438145] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0:
> disconnected (10): status 0 id 00000000cede4b76
> [ 651.438150] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: disconnect
> received - connection closed
>
> [ 651.448883] ======================================================
> [ 651.455061] WARNING: possible circular locking dependency detected
> [ 651.461242] 5.16.0-rc3+ #2 Tainted: G S I
> [ 651.466380] ------------------------------------------------------
> [ 651.472559] kworker/14:4/4839 is trying to acquire lock:
> [ 651.477873] ffff8884c0e353e0 (&id_priv->handler_mutex){+.+.}-{3:3},
> at: rdma_destroy_id+0x17/0x20 [rdma_cm]
> [ 651.487623]
> but task is already holding lock:
> [ 651.493453] ffff8883cc21fe00
> ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
> process_one_work+0x888/0x1740
> [ 651.504062]
> which lock already depends on the new lock.
>
> [ 651.512233]
> the existing dependency chain (in reverse order) is:
> [ 651.519715]
> -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
> [ 651.527894] lock_acquire+0x1d2/0x5a0
> [ 651.532079] process_one_work+0x8da/0x1740
> [ 651.536700] worker_thread+0x87/0xbf0
> [ 651.540886] kthread+0x3aa/0x490
> [ 651.544637] ret_from_fork+0x1f/0x30
> [ 651.548737]
> -> #2 ((wq_completion)events){+.+.}-{0:0}:
> [ 651.555349] lock_acquire+0x1d2/0x5a0
> [ 651.559535] flush_workqueue+0x109/0x15f0
> [ 651.564067] nvmet_rdma_queue_connect+0x1c79/0x2610 [nvmet_rdma]
> [ 651.570594] cma_cm_event_handler+0xef/0x530 [rdma_cm]
> [ 651.576261] cma_ib_req_handler+0x1a76/0x4a50 [rdma_cm]
> [ 651.582016] cm_process_work+0x44/0x190 [ib_cm]
> [ 651.587077] cm_req_handler+0x2218/0x7040 [ib_cm]
> [ 651.592312] cm_work_handler+0x1acf/0x4a30 [ib_cm]
> [ 651.597633] process_one_work+0x978/0x1740
> [ 651.602253] worker_thread+0x87/0xbf0
> [ 651.606437] kthread+0x3aa/0x490
> [ 651.610190] ret_from_fork+0x1f/0x30
> [ 651.614290]
> -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
> [ 651.621249] lock_acquire+0x1d2/0x5a0
> [ 651.625435] __mutex_lock+0x155/0x15f0
> [ 651.629708] cma_ib_req_handler+0x1709/0x4a50 [rdma_cm]
> [ 651.635462] cm_process_work+0x44/0x190 [ib_cm]
> [ 651.640521] cm_req_handler+0x2218/0x7040 [ib_cm]
> [ 651.645749] cm_work_handler+0x1acf/0x4a30 [ib_cm]
> [ 651.651071] process_one_work+0x978/0x1740
> [ 651.655689] worker_thread+0x87/0xbf0
> [ 651.659874] kthread+0x3aa/0x490
> [ 651.663628] ret_from_fork+0x1f/0x30
> [ 651.667726]
> -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
> [ 651.674511] check_prevs_add+0x3fd/0x2480
> [ 651.679045] __lock_acquire+0x23f7/0x2f80
> [ 651.683577] lock_acquire+0x1d2/0x5a0
> [ 651.687762] __mutex_lock+0x155/0x15f0
> [ 651.692036] rdma_destroy_id+0x17/0x20 [rdma_cm]
> [ 651.697184] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
> [ 651.703189] nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma]
> [ 651.709802] process_one_work+0x978/0x1740
> [ 651.714422] worker_thread+0x87/0xbf0
> [ 651.718608] kthread+0x3aa/0x490
> [ 651.722358] ret_from_fork+0x1f/0x30
> [ 651.726459]
> other info that might help us debug this:
>
> [ 651.734457] Chain exists of:
> &id_priv->handler_mutex --> (wq_completion)events -->
> (work_completion)(&queue->release_work)
>
> [ 651.748531] Possible unsafe locking scenario:
>
> [ 651.754451] CPU0 CPU1
> [ 651.758984] ---- ----
> [ 651.763515] lock((work_completion)(&queue->release_work));
> [ 651.769174] lock((wq_completion)events);
> [ 651.775786]
> lock((work_completion)(&queue->release_work));
> [ 651.783959] lock(&id_priv->handler_mutex);
> [ 651.788233]
> *** DEADLOCK ***
>
> [ 651.794153] 2 locks held by kworker/14:4/4839:
> [ 651.798596] #0: ffff88810005b948
> ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x854/0x1740
> [ 651.808078] #1: ffff8883cc21fe00
> ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
> process_one_work+0x888/0x1740
> [ 651.819121]
> stack backtrace:
> [ 651.823478] CPU: 14 PID: 4839 Comm: kworker/14:4 Tainted: G S
> I 5.16.0-rc3+ #2
> [ 651.831824] Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS
> 2.11.2 004/21/2021
> [ 651.839476] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
> [ 651.846175] Call Trace:
> [ 651.848628] <TASK>
> [ 651.850735] dump_stack_lvl+0x44/0x57
> [ 651.854401] check_noncircular+0x280/0x320
> [ 651.858501] ? print_circular_bug.isra.47+0x430/0x430
> [ 651.863553] ? lock_chain_count+0x20/0x20
> [ 651.867566] ? wait_for_completion+0x1e0/0x270
> [ 651.872010] ? mark_lock.part.52+0xf7/0x1050
> [ 651.876283] ? sched_clock_cpu+0x15/0x200
> [ 651.880295] ? find_held_lock+0x3a/0x1c0
> [ 651.884223] check_prevs_add+0x3fd/0x2480
> [ 651.888238] ? mark_held_locks+0xb7/0x120
> [ 651.892248] ? check_irq_usage+0xb50/0xb50
> [ 651.896346] ? lockdep_lock+0xcb/0x1c0
> [ 651.900100] ? static_obj+0xc0/0xc0
> [ 651.903591] ? sched_clock_cpu+0x15/0x200
> [ 651.907606] __lock_acquire+0x23f7/0x2f80
> [ 651.911620] ? rcu_read_lock_bh_held+0xc0/0xc0
> [ 651.916063] lock_acquire+0x1d2/0x5a0
> [ 651.919729] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [ 651.924529] ? rcu_read_unlock+0x50/0x50
> [ 651.928456] ? lock_is_held_type+0xd9/0x130
> [ 651.932644] __mutex_lock+0x155/0x15f0
> [ 651.936393] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [ 651.941194] ? kasan_quarantine_put+0xa4/0x180
> [ 651.945641] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [ 651.950442] ? mutex_lock_io_nested+0x1460/0x1460
> [ 651.955146] ? lock_is_held_type+0xd9/0x130
> [ 651.959334] ? rcu_read_lock_sched_held+0xaf/0xe0
> [ 651.964039] ? rcu_read_lock_bh_held+0xc0/0xc0
> [ 651.968487] ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [ 651.973287] rdma_destroy_id+0x17/0x20 [rdma_cm]
> [ 651.977915] nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
> [ 651.983401] nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma]
> [ 651.989493] process_one_work+0x978/0x1740
> [ 651.993595] ? pwq_dec_nr_in_flight+0x270/0x270
> [ 651.998126] worker_thread+0x87/0xbf0
> [ 652.001790] ? __kthread_parkme+0xc3/0x1d0
> [ 652.005890] ? process_one_work+0x1740/0x1740
> [ 652.010248] kthread+0x3aa/0x490
> [ 652.013480] ? set_kthread_struct+0x100/0x100
> [ 652.017841] ret_from_fork+0x1f/0x30
> [ 652.021424] </TASK>
> [ 652.050739] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> disconnected (10): status 0 id 000000007ccc375a
> [ 652.060318] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
> cm_id= 000000007ccc375a queue->state= 1
> [ 652.069937] nvmet_rdma:nvmet_rdma_free_queue: nvmet_rdma: freeing queue 0
> [ 652.082650] nvmet:nvmet_stop_keep_alive_timer: nvmet: ctrl 1 stop keep-alive
> [ 652.488476] eno2 speed is unknown, defaulting to 1000
> [ 652.493657] eno3 speed is unknown, defaulting to 1000
> [ 652.498847] eno4 speed is unknown, defaulting to 1000
> [ 652.558751] rdma_rxe: rxe-ah pool destroyed with unfree'd elem
> [ 652.678415] rdma_rxe: unloaded
>
More information about the Linux-nvme
mailing list