[bug report] WARNING: possible circular locking at: rdma_destroy_id+0x17/0x20 [rdma_cm] triggered by blktests nvmeof-mp/002

Wed May 25 04:01:03 PDT 2022

> Hello
> Below WARNING triggered by blktests nvmeof-mp/002, it can be
> reproduced with the latest 5.16.0-rc3 and also exists on 5.12, pls
> check it.

iirc this was reported before, based on my analysis lockdep is giving
a false alarm here. The reason is that the id_priv->handler_mutex cannot
be the same for both cm_id that is handling the connect and the cm_id
that is handling the rdma_destroy_id because rdma_destroy_id call
is always called on a already disconnected cm_id, so this deadlock
lockdep is complaining about cannot happen.

I'm not sure how to settle this.

> 
> # use_siw=1 ./check nvmeof-mp/002
> nvmeof-mp/002 (File I/O on top of multipath concurrently with logout
> and login (mq)) [passed]
>      runtime  41.767s  ...  36.877s
> # ./check nvmeof-mp/002
> nvmeof-mp/002 (File I/O on top of multipath concurrently with logout
> and login (mq)) [passed]
>      runtime    ...  41.277s
> 
> [  609.166100] run blktests nvmeof-mp/002 at 2021-12-03 21:27:52
> [  609.977883] null_blk: module loaded
> [  610.538697] rdma_rxe: loaded
> [  610.602763] infiniband eno1_rxe: set active
> [  610.602782] infiniband eno1_rxe: added eno1
> [  610.639520] eno2 speed is unknown, defaulting to 1000
> [  610.644946] eno2 speed is unknown, defaulting to 1000
> [  610.650318] eno2 speed is unknown, defaulting to 1000
> [  610.690546] infiniband eno2_rxe: set down
> [  610.690553] infiniband eno2_rxe: added eno2
> [  610.691112] eno2 speed is unknown, defaulting to 1000
> [  610.714293] eno2 speed is unknown, defaulting to 1000
> [  610.729535] eno3 speed is unknown, defaulting to 1000
> [  610.734666] eno3 speed is unknown, defaulting to 1000
> [  610.740026] eno3 speed is unknown, defaulting to 1000
> [  610.781372] infiniband eno3_rxe: set down
> [  610.781379] infiniband eno3_rxe: added eno3
> [  610.781906] eno3 speed is unknown, defaulting to 1000
> [  610.806211] eno2 speed is unknown, defaulting to 1000
> [  610.811508] eno3 speed is unknown, defaulting to 1000
> [  610.826655] eno4 speed is unknown, defaulting to 1000
> [  610.831810] eno4 speed is unknown, defaulting to 1000
> [  610.837176] eno4 speed is unknown, defaulting to 1000
> [  610.885844] infiniband eno4_rxe: set down
> [  610.885850] infiniband eno4_rxe: added eno4
> [  610.886426] eno4 speed is unknown, defaulting to 1000
> [  611.266756] Loading iSCSI transport class v2.0-870.
> [  611.335358] iscsi: registered transport (iser)
> [  611.387969] nvmet: adding nsid 1 to subsystem nvme-test
> [  611.403878] eno2 speed is unknown, defaulting to 1000
> [  611.409276] eno3 speed is unknown, defaulting to 1000
> [  611.414522] eno4 speed is unknown, defaulting to 1000
> [  611.468712] Rounding down aligned max_sectors from 4294967295 to 4294967288
> [  611.469203] db_root: cannot open: /etc/target
> [  611.515816] nvmet_rdma: enabling port 1 (10.16.221.116:7777)
> [  611.622158] eno2 speed is unknown, defaulting to 1000
> [  611.628243] eno3 speed is unknown, defaulting to 1000
> [  611.633422] eno4 speed is unknown, defaulting to 1000
> [  611.703648] eno2 speed is unknown, defaulting to 1000
> [  611.708949] eno3 speed is unknown, defaulting to 1000
> [  611.714267] eno4 speed is unknown, defaulting to 1000
> [  611.754520] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: address
> resolved (0): status 0 id 00000000d0d64a6d
> [  611.765454] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: route
> resolved  (2): status 0 id 00000000d0d64a6d
> [  611.767778] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma: connect
> request (4): status 0 id 000000007ccc375a
> [  611.767935] nvmet_rdma:nvmet_rdma_find_get_device: nvmet_rdma:
> added eno1_rxe.
> [  611.777083] nvmet_rdma:nvmet_rdma_create_queue_ib: nvmet_rdma:
> nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 32 sq_size = 289
> cm_id= 000000007ccc375a
> --snip--
> [  612.132317] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: established
> (9): status 0 id 00000000cede4b76
> [  612.132452] nvme nvme0: mapped 32/0/0 default/read/poll queues.
> [  612.132537] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> established (9): status 0 id 0000000070c2f22c
> [  612.157849] nvmet:nvmet_execute_io_connect: nvmet: adding queue 1 to ctrl 1.
> [  612.158194] nvmet:nvmet_execute_io_connect: nvmet: adding queue 2 to ctrl 1.
> [  612.158548] nvmet:nvmet_execute_io_connect: nvmet: adding queue 3 to ctrl 1.
> [  612.158854] nvmet:nvmet_execute_io_connect: nvmet: adding queue 4 to ctrl 1.
> [  612.159173] nvmet:nvmet_execute_io_connect: nvmet: adding queue 5 to ctrl 1.
> [  612.159510] nvmet:nvmet_execute_io_connect: nvmet: adding queue 6 to ctrl 1.
> [  612.159853] nvmet:nvmet_execute_io_connect: nvmet: adding queue 7 to ctrl 1.
> [  612.160225] nvmet:nvmet_execute_io_connect: nvmet: adding queue 8 to ctrl 1.
> [  612.160572] nvmet:nvmet_execute_io_connect: nvmet: adding queue 9 to ctrl 1.
> [  612.160885] nvmet:nvmet_execute_io_connect: nvmet: adding queue 10 to ctrl 1.
> [  612.161216] nvmet:nvmet_execute_io_connect: nvmet: adding queue 11 to ctrl 1.
> [  612.161759] nvmet:nvmet_execute_io_connect: nvmet: adding queue 12 to ctrl 1.
> [  612.162123] nvmet:nvmet_execute_io_connect: nvmet: adding queue 13 to ctrl 1.
> [  612.162473] nvmet:nvmet_execute_io_connect: nvmet: adding queue 14 to ctrl 1.
> [  612.162793] nvmet:nvmet_execute_io_connect: nvmet: adding queue 15 to ctrl 1.
> [  612.163135] nvmet:nvmet_execute_io_connect: nvmet: adding queue 16 to ctrl 1.
> [  612.163458] nvmet:nvmet_execute_io_connect: nvmet: adding queue 17 to ctrl 1.
> [  612.163776] nvmet:nvmet_execute_io_connect: nvmet: adding queue 18 to ctrl 1.
> [  612.164103] nvmet:nvmet_execute_io_connect: nvmet: adding queue 19 to ctrl 1.
> [  612.164417] nvmet:nvmet_execute_io_connect: nvmet: adding queue 20 to ctrl 1.
> [  612.164700] nvmet:nvmet_execute_io_connect: nvmet: adding queue 21 to ctrl 1.
> [  612.164971] nvmet:nvmet_execute_io_connect: nvmet: adding queue 22 to ctrl 1.
> [  612.165275] nvmet:nvmet_execute_io_connect: nvmet: adding queue 23 to ctrl 1.
> [  612.165689] nvmet:nvmet_execute_io_connect: nvmet: adding queue 24 to ctrl 1.
> [  612.166082] nvmet:nvmet_execute_io_connect: nvmet: adding queue 25 to ctrl 1.
> [  612.166383] nvmet:nvmet_execute_io_connect: nvmet: adding queue 26 to ctrl 1.
> [  612.166663] nvmet:nvmet_execute_io_connect: nvmet: adding queue 27 to ctrl 1.
> [  612.166963] nvmet:nvmet_execute_io_connect: nvmet: adding queue 28 to ctrl 1.
> [  612.167273] nvmet:nvmet_execute_io_connect: nvmet: adding queue 29 to ctrl 1.
> [  612.167529] nvmet:nvmet_execute_io_connect: nvmet: adding queue 30 to ctrl 1.
> [  612.167804] nvmet:nvmet_execute_io_connect: nvmet: adding queue 31 to ctrl 1.
> [  612.168114] nvmet:nvmet_execute_io_connect: nvmet: adding queue 32 to ctrl 1.
> [  612.168523] nvme nvme0: new ctrl: NQN "nvme-test", addr 10.16.221.116:7777
> [  612.293734] device-mapper: multipath service-time: version 0.3.0 loaded
> [  612.296099] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [  612.303676] device-mapper: ioctl: error adding target to table
> [  612.501014] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [  612.508751] device-mapper: ioctl: error adding target to table
> [  612.514717] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [  612.522632] device-mapper: ioctl: error adding target to table
> [  612.950887] device-mapper: table: 253:3: multipath: error getting
> device (-EBUSY)
> [  612.958616] device-mapper: ioctl: error adding target to table
> [  613.869348] EXT4-fs (dm-3): mounted filesystem without journal.
> Opts: (null). Quota mode: none.
> [  613.869619] ext4 filesystem being mounted at
> /root/kernel-tests/storage/blktests/nvme/nvmeof-mp/blktests/results/tmpdir.nvmeof-mp.002.ZYO/mnt0
> supports timestamps until 2038 (0x7fffffff)
> [  617.144570] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [  617.272414] nvmet:nvmet_execute_keep_alive: nvmet: ctrl 1 update
> keep-alive timer for 5 secs
> [  622.775724] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [  627.895810] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [  633.015369] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [  638.138209] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [  643.254966] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [  644.822178] eno2 speed is unknown, defaulting to 1000
> [  644.827494] eno3 speed is unknown, defaulting to 1000
> [  644.832797] eno4 speed is unknown, defaulting to 1000
> [  648.377790] nvmet:nvmet_keep_alive_timer: nvmet: ctrl 1 reschedule
> traffic based keep-alive timer
> [  651.364570] nvme nvme0: Removing ctrl: NQN "nvme-test"
> [  651.430891] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> disconnected (10): status 0 id 000000009ae8bc67
> [  651.430914] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
> cm_id= 000000009ae8bc67 queue->state= 1
> [  651.430983] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> disconnected (10): status 0 id 0000000078827089
> [  651.430994] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
> cm_id= 0000000078827089 queue->state= 1
> --snip--
> [  651.438145] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0:
> disconnected (10): status 0 id 00000000cede4b76
> [  651.438150] nvme_rdma:nvme_rdma_cm_handler: nvme nvme0: disconnect
> received - connection closed
> 
> [  651.448883] ======================================================
> [  651.455061] WARNING: possible circular locking dependency detected
> [  651.461242] 5.16.0-rc3+ #2 Tainted: G S        I
> [  651.466380] ------------------------------------------------------
> [  651.472559] kworker/14:4/4839 is trying to acquire lock:
> [  651.477873] ffff8884c0e353e0 (&id_priv->handler_mutex){+.+.}-{3:3},
> at: rdma_destroy_id+0x17/0x20 [rdma_cm]
> [  651.487623]
>                 but task is already holding lock:
> [  651.493453] ffff8883cc21fe00
> ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
> process_one_work+0x888/0x1740
> [  651.504062]
>                 which lock already depends on the new lock.
> 
> [  651.512233]
>                 the existing dependency chain (in reverse order) is:
> [  651.519715]
>                 -> #3 ((work_completion)(&queue->release_work)){+.+.}-{0:0}:
> [  651.527894]        lock_acquire+0x1d2/0x5a0
> [  651.532079]        process_one_work+0x8da/0x1740
> [  651.536700]        worker_thread+0x87/0xbf0
> [  651.540886]        kthread+0x3aa/0x490
> [  651.544637]        ret_from_fork+0x1f/0x30
> [  651.548737]
>                 -> #2 ((wq_completion)events){+.+.}-{0:0}:
> [  651.555349]        lock_acquire+0x1d2/0x5a0
> [  651.559535]        flush_workqueue+0x109/0x15f0
> [  651.564067]        nvmet_rdma_queue_connect+0x1c79/0x2610 [nvmet_rdma]
> [  651.570594]        cma_cm_event_handler+0xef/0x530 [rdma_cm]
> [  651.576261]        cma_ib_req_handler+0x1a76/0x4a50 [rdma_cm]
> [  651.582016]        cm_process_work+0x44/0x190 [ib_cm]
> [  651.587077]        cm_req_handler+0x2218/0x7040 [ib_cm]
> [  651.592312]        cm_work_handler+0x1acf/0x4a30 [ib_cm]
> [  651.597633]        process_one_work+0x978/0x1740
> [  651.602253]        worker_thread+0x87/0xbf0
> [  651.606437]        kthread+0x3aa/0x490
> [  651.610190]        ret_from_fork+0x1f/0x30
> [  651.614290]
>                 -> #1 (&id_priv->handler_mutex/1){+.+.}-{3:3}:
> [  651.621249]        lock_acquire+0x1d2/0x5a0
> [  651.625435]        __mutex_lock+0x155/0x15f0
> [  651.629708]        cma_ib_req_handler+0x1709/0x4a50 [rdma_cm]
> [  651.635462]        cm_process_work+0x44/0x190 [ib_cm]
> [  651.640521]        cm_req_handler+0x2218/0x7040 [ib_cm]
> [  651.645749]        cm_work_handler+0x1acf/0x4a30 [ib_cm]
> [  651.651071]        process_one_work+0x978/0x1740
> [  651.655689]        worker_thread+0x87/0xbf0
> [  651.659874]        kthread+0x3aa/0x490
> [  651.663628]        ret_from_fork+0x1f/0x30
> [  651.667726]
>                 -> #0 (&id_priv->handler_mutex){+.+.}-{3:3}:
> [  651.674511]        check_prevs_add+0x3fd/0x2480
> [  651.679045]        __lock_acquire+0x23f7/0x2f80
> [  651.683577]        lock_acquire+0x1d2/0x5a0
> [  651.687762]        __mutex_lock+0x155/0x15f0
> [  651.692036]        rdma_destroy_id+0x17/0x20 [rdma_cm]
> [  651.697184]        nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
> [  651.703189]        nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma]
> [  651.709802]        process_one_work+0x978/0x1740
> [  651.714422]        worker_thread+0x87/0xbf0
> [  651.718608]        kthread+0x3aa/0x490
> [  651.722358]        ret_from_fork+0x1f/0x30
> [  651.726459]
>                 other info that might help us debug this:
> 
> [  651.734457] Chain exists of:
>                   &id_priv->handler_mutex --> (wq_completion)events -->
> (work_completion)(&queue->release_work)
> 
> [  651.748531]  Possible unsafe locking scenario:
> 
> [  651.754451]        CPU0                    CPU1
> [  651.758984]        ----                    ----
> [  651.763515]   lock((work_completion)(&queue->release_work));
> [  651.769174]                                lock((wq_completion)events);
> [  651.775786]
> lock((work_completion)(&queue->release_work));
> [  651.783959]   lock(&id_priv->handler_mutex);
> [  651.788233]
>                  *** DEADLOCK ***
> 
> [  651.794153] 2 locks held by kworker/14:4/4839:
> [  651.798596]  #0: ffff88810005b948
> ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x854/0x1740
> [  651.808078]  #1: ffff8883cc21fe00
> ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at:
> process_one_work+0x888/0x1740
> [  651.819121]
>                 stack backtrace:
> [  651.823478] CPU: 14 PID: 4839 Comm: kworker/14:4 Tainted: G S
>   I       5.16.0-rc3+ #2
> [  651.831824] Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS
> 2.11.2 004/21/2021
> [  651.839476] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
> [  651.846175] Call Trace:
> [  651.848628]  <TASK>
> [  651.850735]  dump_stack_lvl+0x44/0x57
> [  651.854401]  check_noncircular+0x280/0x320
> [  651.858501]  ? print_circular_bug.isra.47+0x430/0x430
> [  651.863553]  ? lock_chain_count+0x20/0x20
> [  651.867566]  ? wait_for_completion+0x1e0/0x270
> [  651.872010]  ? mark_lock.part.52+0xf7/0x1050
> [  651.876283]  ? sched_clock_cpu+0x15/0x200
> [  651.880295]  ? find_held_lock+0x3a/0x1c0
> [  651.884223]  check_prevs_add+0x3fd/0x2480
> [  651.888238]  ? mark_held_locks+0xb7/0x120
> [  651.892248]  ? check_irq_usage+0xb50/0xb50
> [  651.896346]  ? lockdep_lock+0xcb/0x1c0
> [  651.900100]  ? static_obj+0xc0/0xc0
> [  651.903591]  ? sched_clock_cpu+0x15/0x200
> [  651.907606]  __lock_acquire+0x23f7/0x2f80
> [  651.911620]  ? rcu_read_lock_bh_held+0xc0/0xc0
> [  651.916063]  lock_acquire+0x1d2/0x5a0
> [  651.919729]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [  651.924529]  ? rcu_read_unlock+0x50/0x50
> [  651.928456]  ? lock_is_held_type+0xd9/0x130
> [  651.932644]  __mutex_lock+0x155/0x15f0
> [  651.936393]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [  651.941194]  ? kasan_quarantine_put+0xa4/0x180
> [  651.945641]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [  651.950442]  ? mutex_lock_io_nested+0x1460/0x1460
> [  651.955146]  ? lock_is_held_type+0xd9/0x130
> [  651.959334]  ? rcu_read_lock_sched_held+0xaf/0xe0
> [  651.964039]  ? rcu_read_lock_bh_held+0xc0/0xc0
> [  651.968487]  ? rdma_destroy_id+0x17/0x20 [rdma_cm]
> [  651.973287]  rdma_destroy_id+0x17/0x20 [rdma_cm]
> [  651.977915]  nvmet_rdma_free_queue+0x7a/0x380 [nvmet_rdma]
> [  651.983401]  nvmet_rdma_release_queue_work+0x3b/0x90 [nvmet_rdma]
> [  651.989493]  process_one_work+0x978/0x1740
> [  651.993595]  ? pwq_dec_nr_in_flight+0x270/0x270
> [  651.998126]  worker_thread+0x87/0xbf0
> [  652.001790]  ? __kthread_parkme+0xc3/0x1d0
> [  652.005890]  ? process_one_work+0x1740/0x1740
> [  652.010248]  kthread+0x3aa/0x490
> [  652.013480]  ? set_kthread_struct+0x100/0x100
> [  652.017841]  ret_from_fork+0x1f/0x30
> [  652.021424]  </TASK>
> [  652.050739] nvmet_rdma:nvmet_rdma_cm_handler: nvmet_rdma:
> disconnected (10): status 0 id 000000007ccc375a
> [  652.060318] nvmet_rdma:__nvmet_rdma_queue_disconnect: nvmet_rdma:
> cm_id= 000000007ccc375a queue->state= 1
> [  652.069937] nvmet_rdma:nvmet_rdma_free_queue: nvmet_rdma: freeing queue 0
> [  652.082650] nvmet:nvmet_stop_keep_alive_timer: nvmet: ctrl 1 stop keep-alive
> [  652.488476] eno2 speed is unknown, defaulting to 1000
> [  652.493657] eno3 speed is unknown, defaulting to 1000
> [  652.498847] eno4 speed is unknown, defaulting to 1000
> [  652.558751] rdma_rxe: rxe-ah pool destroyed with unfree'd elem
> [  652.678415] rdma_rxe: unloaded
>