[PATCH 2/2] nvmet: fix a race condition between release_queue and io_work
Sagi Grimberg
sagi at grimberg.me
Wed Nov 3 02:28:35 PDT 2021
On 10/28/21 10:55 AM, Maurizio Lombardi wrote:
> Hi Sagi,
>
> On Tue, Oct 26, 2021 at 06:42:02PM +0300, Sagi Grimberg wrote:
>> On 10/21/21 11:41 AM, Maurizio Lombardi wrote:
>>> If the initiator executes a reset controller operation while
>>> performing I/O, the target kernel will crash because of a race condition
>>> between release_queue and io_work;
>>
>> Can you add the stack trace?
>
> Yes, this is an example:
>
> [ 526.368376] nvmet: creating controller 1 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
> [ 526.598367] nvmet: creating controller 2 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
> [ 526.827356] nvmet: creating controller 1 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
> [ 526.837836] general protection fault, probably for non-canonical address 0x893b50fb71133fa5: 0000 [#1] SMP PTI
> [ 526.840045] CPU: 1 PID: 225 Comm: kworker/1:1H Kdump: loaded Not tainted 5.14.0_nvme1+ #1
> [ 526.841814] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [ 526.843064] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
> [ 526.844398] RIP: 0010:__kmalloc+0xed/0x3f0
> [ 526.845300] Code: 08 65 4c 03 05 94 ad 2e 45 49 83 78 10 00 4d 8b 20 0f 84 95 02 00 00 4d 85 e4 0f 84 8c 02 00 00 8b 45 28 48 8b 7d 00 4c 01 e0 <48> 8b 18 48 89 c1 48 33 9d b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
> [ 526.849296] RSP: 0018:ffffaf66c014fc38 EFLAGS: 00010286
> [ 526.850427] RAX: 893b50fb71133fa5 RBX: ffff8e1c83f02450 RCX: 0000000000000000
> [ 526.851959] RDX: 000000000009de38 RSI: 0000000000000cc0 RDI: 00000000000311c0
> [ 526.853496] RBP: ffff8e1d80042e00 R08: ffff8e1eb7d311c0 R09: 0000000000000000
> [ 526.855027] R10: 00000000000035ee R11: 0000000000000000 R12: 893b50fb71132fa5
> [ 526.856561] R13: 0000000000000cc0 R14: 0000000000000000 R15: ffffffffc072a240
> [ 526.858092] FS: 0000000000000000(0000) GS:ffff8e1eb7d00000(0000) knlGS:0000000000000000
> [ 526.859804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 526.860913] CR2: 000055aa6316b0e8 CR3: 0000000052010001 CR4: 00000000007706e0
> [ 526.862292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 526.863658] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 526.865023] PKRU: 55555554
> [ 526.865564] Call Trace:
> [ 526.866059] nvmet_tcp_map_data+0xc0/0x110 [nvmet_tcp]
> [ 526.867052] nvmet_tcp_done_recv_pdu+0x13b/0x2b0 [nvmet_tcp]
> [ 526.868144] nvmet_tcp_try_recv_pdu+0x17d/0x210 [nvmet_tcp]
> [ 526.869222] nvmet_tcp_try_recv.constprop.0+0x67/0x300 [nvmet_tcp]
> [ 526.870414] ? inet_sendpage+0x4f/0x80
> [ 526.871156] nvmet_tcp_io_work+0x44/0xa69 [nvmet_tcp]
> [ 526.872134] ? dequeue_task_fair+0xb1/0x370
> [ 526.872952] ? finish_task_switch.isra.0+0xb1/0x290
> [ 526.873904] process_one_work+0x1e6/0x380
> [ 526.874694] worker_thread+0x53/0x3d0
> [ 526.875420] ? process_one_work+0x380/0x380
> [ 526.876233] kthread+0x10f/0x130
> [ 526.876872] ? set_kthread_struct+0x40/0x40
> [ 526.877686] ret_from_fork+0x22/0x30
>
> If you want to reproduce it, just run something like
> $ watch -n 0.1 "echo 1 > reset_controller"
> while doing some I/O with dd.
>
>
>>
>>> nvmet_tcp_uninit_data_in_cmds() may be executed while io_work
>>> is running, calling flush_work(io_work) was not sufficient to
>>> prevent this because io_work could requeue itself.
>>
>> OK, then this should be sufficient to fix it, right?
>
> By "this" do you mean calling cancel_work_sync() before nvmet_tcp_uninit_data_in_cmds()
> and again after nvmet_sq_destroy(), without checking sk_user_data in nvmet_tcp_queue_response()?
>
> If yes, I tried that; it's not sufficient, it still crashes:
>
> [ 374.050932] nvmet_tcp: failed cmd 00000000ba978688 id 50 opcode 1, data_len: 1310720
> [ 374.052699] nvmet_tcp: failed cmd 000000004ba3f36e id 60 opcode 1, data_len: 1310720
> [ 374.054465] nvmet_tcp: failed cmd 0000000071ea477d id 61 opcode 1, data_len: 1310720
> [ 374.056253] BUG: kernel NULL pointer dereference, address: 000000000000000c
> [ 374.057836] #PF: supervisor read access in kernel mode
> [ 374.059009] #PF: error_code(0x0000) - not-present page
> [ 374.060178] PGD 0 P4D 0
> [ 374.060777] Oops: 0000 [#1] SMP PTI
> [ 374.061578] CPU: 1 PID: 226 Comm: kworker/1:1H Kdump: loaded Not tainted 5.14.0_nvme1+ #1
> [ 374.063427] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [ 374.064749] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
> [ 374.066137] RIP: 0010:nvmet_tcp_map_pdu_iovec+0x66/0x110 [nvmet_tcp]
> [ 374.067573] Code: 00 00 48 c1 e8 0c 41 c1 ef 0c 44 89 bf e8 01 00 00 48 63 c8 49 c1 e7 05 4c 03 7f 30 89 87 ec 01 00 00 85 ed 0f 84 9d 00 00 00 <45> 8b 67 0c 49 8b 1f 45 29 ec 41 39 ec 44 0f 47 e5 48 83 e3 fc 49
> [ 374.071780] RSP: 0018:ffffa98c8016fc48 EFLAGS: 00010206
> [ 374.072965] RAX: 0000000000000140 RBX: ffff91c1da819208 RCX: 0000000000000140
> [ 374.074579] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff91c1da819208
> [ 374.076188] RBP: 0000000000140000 R08: 0000000000010230 R09: 0000000000017e3a
> [ 374.077788] R10: 0000000000000125 R11: 0000000000000010 R12: 0000000000000000
> [ 374.079396] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 374.081004] FS: 0000000000000000(0000) GS:ffff91c2f7d00000(0000) knlGS:0000000000000000
> [ 374.082835] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 374.084117] CR2: 000000000000000c CR3: 000000010c79a003 CR4: 00000000007706e0
> [ 374.085669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 374.087147] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 374.088586] PKRU: 55555554
> [ 374.089214] Call Trace:
> [ 374.089807] nvmet_tcp_done_recv_pdu+0x24c/0x2b0 [nvmet_tcp]
> [ 374.090971] nvmet_tcp_try_recv_pdu+0x17d/0x210 [nvmet_tcp]
> [ 374.092101] ? update_load_avg+0x7a/0x5a0
> [ 374.092969] nvmet_tcp_try_recv.constprop.0+0x67/0x310 [nvmet_tcp]
> [ 374.094323] ? set_next_entity+0xac/0x1e0
> [ 374.095149] ? pick_next_task_fair+0x18e/0x3b0
> [ 374.096057] nvmet_tcp_io_work+0x44/0xabd [nvmet_tcp]
> [ 374.097141] ? dequeue_task_fair+0xb1/0x370
> [ 374.098087] ? finish_task_switch.isra.0+0xb1/0x290
> [ 374.099088] process_one_work+0x1e6/0x380
> [ 374.099922] worker_thread+0x53/0x3d0
> [ 374.100670] ? process_one_work+0x380/0x380
> [ 374.101631] kthread+0x10f/0x130
> [ 374.102336] ? set_kthread_struct+0x40/0x40
> [ 374.103194] ret_from_fork+0x22/0x30
> [ 374.103974] Modules linked in: nvmet_tcp nvmet t10_pi rfkill intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm i2c_piix4 pcspkr joydev virtio_balloon drm fuse xfs libcrc32c ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ghash_clmulni_intel net_failover ata_piix virtio_blk libata serio_raw failover dm_mirror dm_region_hash dm_log dm_mod
> [ 374.110690] CR2: 000000000000000c
So this means we still get data from the network when
we shouldn't. Maybe we are simply missing a kernel_sock_shutdown
for SHUT_RD?
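Roughly something like this in the release path (untested, just to
illustrate the ordering I have in mind; field and helper names are
taken from my reading of the current nvmet-tcp driver, details elided):

	static void nvmet_tcp_release_queue_work(struct work_struct *w)
	{
		struct nvmet_tcp_queue *queue =
			container_of(w, struct nvmet_tcp_queue, release_work);

		/* ... remove the queue from the list, restore socket callbacks ... */

		/*
		 * Stop the socket from feeding us new data before we start
		 * tearing down the commands, otherwise data_ready keeps
		 * scheduling io_work against state we are about to free.
		 */
		kernel_sock_shutdown(queue->sock, SHUT_RD);

		/*
		 * io_work may requeue itself, so flush_work() only waits for
		 * the current instance; cancel_work_sync() also handles the
		 * self-requeueing case.
		 */
		cancel_work_sync(&queue->io_work);

		nvmet_tcp_uninit_data_in_cmds(queue);
		nvmet_sq_destroy(&queue->nvme_sq);

		/* nvmet_sq_destroy() may schedule io_work again, cancel once more */
		cancel_work_sync(&queue->io_work);

		/* ... release the socket, free the commands and the queue ... */
	}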
>
>>
>>> * Fix this bug by preventing io_work from being enqueued when
>>> sk_user_data is NULL (it means that the queue is going to be deleted)
>>
>> This is triggered from the completion path, where the commands
>> are not in a state where they are still fetching data from the
>> host. How does this prevent the crash?
>
> io_work is also triggered every time nvmet_req_init() fails and when
> nvmet_sq_destroy() is called; I am not really sure about the state
> of the commands in those cases.
But that is from the workqueue context, which means that
cancel_work_sync() should prevent it, right?
>
>
>>
>>> * Ensure that all the memory allocated for the commands' iovec is freed
>>
>> How is this needed to prevent a crash?
>
> It's not really needed to prevent the crash, but without it
> you will see massive memory leaks.
> Sometimes it exhausted the entire system memory
> before hitting the crash.
But that needs to be a separate fix and not combined with other
fixes.
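By a separate fix I mean a standalone patch that just releases the
per-command iovec on teardown, something along these lines (untested
sketch; cmd->iov is my assumption for where the allocation made in
nvmet_tcp_map_data() lives):

	/* hypothetical helper, called for every command when the queue is torn down */
	static void nvmet_tcp_free_cmd_iovec(struct nvmet_tcp_cmd *cmd)
	{
		kfree(cmd->iov);	/* kfree(NULL) is safe if nothing was allocated */
		cmd->iov = NULL;
	}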