[PATCH 2/2] nvmet: fix a race condition between release_queue and io_work

Maurizio Lombardi mlombard at redhat.com
Thu Oct 28 00:55:31 PDT 2021


Hi Sagi,

On Tue, Oct 26, 2021 at 06:42:02PM +0300, Sagi Grimberg wrote:
> On 10/21/21 11:41 AM, Maurizio Lombardi wrote:
> > If the initiator executes a reset controller operation while
> > performing I/O, the target kernel will crash because of a race condition
> > between release_queue and io_work;
> 
> Can you add the stack trace?

Yes, here is an example:

[  526.368376] nvmet: creating controller 1 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
[  526.598367] nvmet: creating controller 2 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
[  526.827356] nvmet: creating controller 1 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
[  526.837836] general protection fault, probably for non-canonical address 0x893b50fb71133fa5: 0000 [#1] SMP PTI
[  526.840045] CPU: 1 PID: 225 Comm: kworker/1:1H Kdump: loaded Not tainted 5.14.0_nvme1+ #1
[  526.841814] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  526.843064] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
[  526.844398] RIP: 0010:__kmalloc+0xed/0x3f0
[  526.845300] Code: 08 65 4c 03 05 94 ad 2e 45 49 83 78 10 00 4d 8b 20 0f 84 95 02 00 00 4d 85 e4 0f 84 8c 02 00 00 8b 45 28 48 8b 7d 00 4c 01 e0 <48> 8b 18 48 89 c1 48 33 9d b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
[  526.849296] RSP: 0018:ffffaf66c014fc38 EFLAGS: 00010286
[  526.850427] RAX: 893b50fb71133fa5 RBX: ffff8e1c83f02450 RCX: 0000000000000000
[  526.851959] RDX: 000000000009de38 RSI: 0000000000000cc0 RDI: 00000000000311c0
[  526.853496] RBP: ffff8e1d80042e00 R08: ffff8e1eb7d311c0 R09: 0000000000000000
[  526.855027] R10: 00000000000035ee R11: 0000000000000000 R12: 893b50fb71132fa5
[  526.856561] R13: 0000000000000cc0 R14: 0000000000000000 R15: ffffffffc072a240
[  526.858092] FS:  0000000000000000(0000) GS:ffff8e1eb7d00000(0000) knlGS:0000000000000000
[  526.859804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  526.860913] CR2: 000055aa6316b0e8 CR3: 0000000052010001 CR4: 00000000007706e0
[  526.862292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  526.863658] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  526.865023] PKRU: 55555554
[  526.865564] Call Trace:
[  526.866059]  nvmet_tcp_map_data+0xc0/0x110 [nvmet_tcp]
[  526.867052]  nvmet_tcp_done_recv_pdu+0x13b/0x2b0 [nvmet_tcp]
[  526.868144]  nvmet_tcp_try_recv_pdu+0x17d/0x210 [nvmet_tcp]
[  526.869222]  nvmet_tcp_try_recv.constprop.0+0x67/0x300 [nvmet_tcp]
[  526.870414]  ? inet_sendpage+0x4f/0x80
[  526.871156]  nvmet_tcp_io_work+0x44/0xa69 [nvmet_tcp]
[  526.872134]  ? dequeue_task_fair+0xb1/0x370
[  526.872952]  ? finish_task_switch.isra.0+0xb1/0x290
[  526.873904]  process_one_work+0x1e6/0x380
[  526.874694]  worker_thread+0x53/0x3d0
[  526.875420]  ? process_one_work+0x380/0x380
[  526.876233]  kthread+0x10f/0x130
[  526.876872]  ? set_kthread_struct+0x40/0x40
[  526.877686]  ret_from_fork+0x22/0x30

If you want to reproduce it, just run something like
$ watch -n 0.1 "echo 1 > reset_controller"
on the initiator while doing some I/O with dd.


> 
> > nvmet_tcp_uninit_data_in_cmds() may be executed while io_work
> > is running, calling flush_work(io_work) was not sufficient to
> > prevent this because io_work could requeue itself.
> 
> OK, then this should be sufficient to fix it right?

By "this", do you mean calling cancel_work_sync() before nvmet_tcp_uninit_data_in_cmds()
and after nvmet_sq_destroy(), without checking sk_user_data in nvmet_tcp_queue_response()?
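
I.e., roughly this ordering in nvmet_tcp_release_queue_work()? This is only a
sketch of how I read the suggestion, with the rest of the teardown elided:

static void nvmet_tcp_release_queue_work(struct work_struct *w)
{
	struct nvmet_tcp_queue *queue =
		container_of(w, struct nvmet_tcp_queue, release_work);

	/* ... remove the queue from the global queue list ... */

	nvmet_tcp_restore_socket_callbacks(queue);
	cancel_work_sync(&queue->io_work);	/* before uninit_data_in_cmds() */

	nvmet_tcp_uninit_data_in_cmds(queue);
	nvmet_sq_destroy(&queue->nvme_sq);
	cancel_work_sync(&queue->io_work);	/* and after nvmet_sq_destroy() */

	/* ... release the socket and free the commands ... */
}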

If so, I tried that and it's not sufficient; it still crashes:

[  374.050932] nvmet_tcp: failed cmd 00000000ba978688 id 50 opcode 1, data_len: 1310720
[  374.052699] nvmet_tcp: failed cmd 000000004ba3f36e id 60 opcode 1, data_len: 1310720
[  374.054465] nvmet_tcp: failed cmd 0000000071ea477d id 61 opcode 1, data_len: 1310720
[  374.056253] BUG: kernel NULL pointer dereference, address: 000000000000000c
[  374.057836] #PF: supervisor read access in kernel mode
[  374.059009] #PF: error_code(0x0000) - not-present page
[  374.060178] PGD 0 P4D 0
[  374.060777] Oops: 0000 [#1] SMP PTI
[  374.061578] CPU: 1 PID: 226 Comm: kworker/1:1H Kdump: loaded Not tainted 5.14.0_nvme1+ #1
[  374.063427] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  374.064749] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
[  374.066137] RIP: 0010:nvmet_tcp_map_pdu_iovec+0x66/0x110 [nvmet_tcp]
[  374.067573] Code: 00 00 48 c1 e8 0c 41 c1 ef 0c 44 89 bf e8 01 00 00 48 63 c8 49 c1 e7 05 4c 03 7f 30 89 87 ec 01 00 00 85 ed 0f 84 9d 00 00 00 <45> 8b 67 0c 49 8b 1f 45 29 ec 41 39 ec 44 0f 47 e5 48 83 e3 fc 49
[  374.071780] RSP: 0018:ffffa98c8016fc48 EFLAGS: 00010206
[  374.072965] RAX: 0000000000000140 RBX: ffff91c1da819208 RCX: 0000000000000140
[  374.074579] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff91c1da819208
[  374.076188] RBP: 0000000000140000 R08: 0000000000010230 R09: 0000000000017e3a
[  374.077788] R10: 0000000000000125 R11: 0000000000000010 R12: 0000000000000000
[  374.079396] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  374.081004] FS:  0000000000000000(0000) GS:ffff91c2f7d00000(0000) knlGS:0000000000000000
[  374.082835] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  374.084117] CR2: 000000000000000c CR3: 000000010c79a003 CR4: 00000000007706e0
[  374.085669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  374.087147] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  374.088586] PKRU: 55555554
[  374.089214] Call Trace:
[  374.089807]  nvmet_tcp_done_recv_pdu+0x24c/0x2b0 [nvmet_tcp]
[  374.090971]  nvmet_tcp_try_recv_pdu+0x17d/0x210 [nvmet_tcp]
[  374.092101]  ? update_load_avg+0x7a/0x5a0
[  374.092969]  nvmet_tcp_try_recv.constprop.0+0x67/0x310 [nvmet_tcp]
[  374.094323]  ? set_next_entity+0xac/0x1e0
[  374.095149]  ? pick_next_task_fair+0x18e/0x3b0
[  374.096057]  nvmet_tcp_io_work+0x44/0xabd [nvmet_tcp]
[  374.097141]  ? dequeue_task_fair+0xb1/0x370
[  374.098087]  ? finish_task_switch.isra.0+0xb1/0x290
[  374.099088]  process_one_work+0x1e6/0x380
[  374.099922]  worker_thread+0x53/0x3d0
[  374.100670]  ? process_one_work+0x380/0x380
[  374.101631]  kthread+0x10f/0x130
[  374.102336]  ? set_kthread_struct+0x40/0x40
[  374.103194]  ret_from_fork+0x22/0x30
[  374.103974] Modules linked in: nvmet_tcp nvmet t10_pi rfkill intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm i2c_piix4 pcspkr joydev virtio_balloon drm fuse xfs libcrc32c ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ghash_clmulni_intel net_failover ata_piix virtio_blk libata serio_raw failover dm_mirror dm_region_hash dm_log dm_mod
[  374.110690] CR2: 000000000000000c

> 
> > * Fix this bug by preventing io_work from being enqueued when
> > sk_user_data is NULL (it means that the queue is going to be deleted)
> 
> This is triggered from the completion path, where the commands
> are not in a state where they are still fetching data from the
> host. How does this prevent the crash?

io_work is also triggered every time nvmet_req_init() fails and when
nvmet_sq_destroy() is called; I am not really sure about the state
of the commands in those cases.
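
For reference, the check the patch adds boils down to something like this in
nvmet_tcp_queue_response() (a simplified sketch, not the exact hunk; the driver
actually pins io_work to a CPU with queue_work_on(), which I elide here):

static void nvmet_tcp_queue_response(struct nvmet_req *req)
{
	struct nvmet_tcp_cmd *cmd =
		container_of(req, struct nvmet_tcp_cmd, req);
	struct nvmet_tcp_queue *queue = cmd->queue;

	llist_add(&cmd->lentry, &queue->resp_list);

	/*
	 * sk_user_data is cleared when the socket callbacks are restored
	 * in the release path, so NULL means the queue is going to be
	 * deleted: don't requeue io_work.
	 */
	if (unlikely(!queue->sock->sk->sk_user_data))
		return;

	queue_work(nvmet_tcp_wq, &queue->io_work);
}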


> 
> > * Ensure that all the memory allocated for the commands' iovec is freed
> 
> How is this needed to prevent a crash?

It's not really needed to prevent the crash, but without it
you will see massive memory leaks; sometimes the leak exhausted
the entire system memory before the crash was even hit.
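
Concretely, "freeing the iovec" means releasing, for each command, what
nvmet_tcp_map_data() allocated. Something along these lines (a sketch; the
helper name is mine, but cmd->iov and cmd->req.sg are the buffers the driver
allocates with kmalloc_array() and sgl_alloc()):

static void nvmet_tcp_free_cmd_buffers(struct nvmet_tcp_cmd *cmd)
{
	/* iovec allocated in nvmet_tcp_map_data() for data-in commands */
	kfree(cmd->iov);
	cmd->iov = NULL;

	/* scatterlist allocated with sgl_alloc() in nvmet_tcp_map_data() */
	sgl_free(cmd->req.sg);
	cmd->req.sg = NULL;
}

It would have to be called for every command from the release path (e.g. from
nvmet_tcp_uninit_data_in_cmds()), otherwise the buffers of the commands that
never completed are simply lost.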


Maurizio



