[PATCH 2/2] nvmet: fix a race condition between release_queue and io_work
Sagi Grimberg
sagi at grimberg.me
Wed Nov 3 02:28:35 PDT 2021
On 10/28/21 10:55 AM, Maurizio Lombardi wrote:
> Hi Sagi,
>
> On Tue, Oct 26, 2021 at 06:42:02PM +0300, Sagi Grimberg wrote:
>> On 10/21/21 11:41 AM, Maurizio Lombardi wrote:
>>> If the initiator executes a reset controller operation while
>>> performing I/O, the target kernel will crash because of a race condition
>>> between release_queue and io_work;
>>
>> Can you add the stack trace?
>
> Yes, this is an example:
>
> [ 526.368376] nvmet: creating controller 1 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
> [ 526.598367] nvmet: creating controller 2 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
> [ 526.827356] nvmet: creating controller 1 for subsystem test-nqn for NQN nqn.2014-08.org.nvmexpress:uuid:119cb961-04c6-5845-9ed0-66a692ae0ddd.
> [ 526.837836] general protection fault, probably for non-canonical address 0x893b50fb71133fa5: 0000 [#1] SMP PTI
> [ 526.840045] CPU: 1 PID: 225 Comm: kworker/1:1H Kdump: loaded Not tainted 5.14.0_nvme1+ #1
> [ 526.841814] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [ 526.843064] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
> [ 526.844398] RIP: 0010:__kmalloc+0xed/0x3f0
> [ 526.845300] Code: 08 65 4c 03 05 94 ad 2e 45 49 83 78 10 00 4d 8b 20 0f 84 95 02 00 00 4d 85 e4 0f 84 8c 02 00 00 8b 45 28 48 8b 7d 00 4c 01 e0 <48> 8b 18 48 89 c1 48 33 9d b8 00 00 00 4c 89 e0 48 0f c9 48 31 cb
> [ 526.849296] RSP: 0018:ffffaf66c014fc38 EFLAGS: 00010286
> [ 526.850427] RAX: 893b50fb71133fa5 RBX: ffff8e1c83f02450 RCX: 0000000000000000
> [ 526.851959] RDX: 000000000009de38 RSI: 0000000000000cc0 RDI: 00000000000311c0
> [ 526.853496] RBP: ffff8e1d80042e00 R08: ffff8e1eb7d311c0 R09: 0000000000000000
> [ 526.855027] R10: 00000000000035ee R11: 0000000000000000 R12: 893b50fb71132fa5
> [ 526.856561] R13: 0000000000000cc0 R14: 0000000000000000 R15: ffffffffc072a240
> [ 526.858092] FS: 0000000000000000(0000) GS:ffff8e1eb7d00000(0000) knlGS:0000000000000000
> [ 526.859804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 526.860913] CR2: 000055aa6316b0e8 CR3: 0000000052010001 CR4: 00000000007706e0
> [ 526.862292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 526.863658] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 526.865023] PKRU: 55555554
> [ 526.865564] Call Trace:
> [ 526.866059] nvmet_tcp_map_data+0xc0/0x110 [nvmet_tcp]
> [ 526.867052] nvmet_tcp_done_recv_pdu+0x13b/0x2b0 [nvmet_tcp]
> [ 526.868144] nvmet_tcp_try_recv_pdu+0x17d/0x210 [nvmet_tcp]
> [ 526.869222] nvmet_tcp_try_recv.constprop.0+0x67/0x300 [nvmet_tcp]
> [ 526.870414] ? inet_sendpage+0x4f/0x80
> [ 526.871156] nvmet_tcp_io_work+0x44/0xa69 [nvmet_tcp]
> [ 526.872134] ? dequeue_task_fair+0xb1/0x370
> [ 526.872952] ? finish_task_switch.isra.0+0xb1/0x290
> [ 526.873904] process_one_work+0x1e6/0x380
> [ 526.874694] worker_thread+0x53/0x3d0
> [ 526.875420] ? process_one_work+0x380/0x380
> [ 526.876233] kthread+0x10f/0x130
> [ 526.876872] ? set_kthread_struct+0x40/0x40
> [ 526.877686] ret_from_fork+0x22/0x30
>
> If you want to reproduce it, just run something like
> $ watch -n 0.1 "echo 1 > reset_controller"
> while doing some I/O with dd.
>
>
>>
>>> nvmet_tcp_uninit_data_in_cmds() may be executed while io_work
>>> is running, calling flush_work(io_work) was not sufficient to
>>> prevent this because io_work could requeue itself.
>>
>> OK, then this should be sufficient to fix it, right?
>
> By "this" do you mean calling cancel_work_sync() before nvmet_tcp_uninit_data_in_cmds()
> and again after nvmet_sq_destroy(), without checking sk_user_data in nvmet_tcp_queue_response()?
>
> If yes, I tried that; it's not sufficient, it still crashes:
>
> [ 374.050932] nvmet_tcp: failed cmd 00000000ba978688 id 50 opcode 1, data_len: 1310720
> [ 374.052699] nvmet_tcp: failed cmd 000000004ba3f36e id 60 opcode 1, data_len: 1310720
> [ 374.054465] nvmet_tcp: failed cmd 0000000071ea477d id 61 opcode 1, data_len: 1310720
> [ 374.056253] BUG: kernel NULL pointer dereference, address: 000000000000000c
> [ 374.057836] #PF: supervisor read access in kernel mode
> [ 374.059009] #PF: error_code(0x0000) - not-present page
> [ 374.060178] PGD 0 P4D 0
> [ 374.060777] Oops: 0000 [#1] SMP PTI
> [ 374.061578] CPU: 1 PID: 226 Comm: kworker/1:1H Kdump: loaded Not tainted 5.14.0_nvme1+ #1
> [ 374.063427] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [ 374.064749] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
> [ 374.066137] RIP: 0010:nvmet_tcp_map_pdu_iovec+0x66/0x110 [nvmet_tcp]
> [ 374.067573] Code: 00 00 48 c1 e8 0c 41 c1 ef 0c 44 89 bf e8 01 00 00 48 63 c8 49 c1 e7 05 4c 03 7f 30 89 87 ec 01 00 00 85 ed 0f 84 9d 00 00 00 <45> 8b 67 0c 49 8b 1f 45 29 ec 41 39 ec 44 0f 47 e5 48 83 e3 fc 49
> [ 374.071780] RSP: 0018:ffffa98c8016fc48 EFLAGS: 00010206
> [ 374.072965] RAX: 0000000000000140 RBX: ffff91c1da819208 RCX: 0000000000000140
> [ 374.074579] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff91c1da819208
> [ 374.076188] RBP: 0000000000140000 R08: 0000000000010230 R09: 0000000000017e3a
> [ 374.077788] R10: 0000000000000125 R11: 0000000000000010 R12: 0000000000000000
> [ 374.079396] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 374.081004] FS: 0000000000000000(0000) GS:ffff91c2f7d00000(0000) knlGS:0000000000000000
> [ 374.082835] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 374.084117] CR2: 000000000000000c CR3: 000000010c79a003 CR4: 00000000007706e0
> [ 374.085669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 374.087147] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 374.088586] PKRU: 55555554
> [ 374.089214] Call Trace:
> [ 374.089807] nvmet_tcp_done_recv_pdu+0x24c/0x2b0 [nvmet_tcp]
> [ 374.090971] nvmet_tcp_try_recv_pdu+0x17d/0x210 [nvmet_tcp]
> [ 374.092101] ? update_load_avg+0x7a/0x5a0
> [ 374.092969] nvmet_tcp_try_recv.constprop.0+0x67/0x310 [nvmet_tcp]
> [ 374.094323] ? set_next_entity+0xac/0x1e0
> [ 374.095149] ? pick_next_task_fair+0x18e/0x3b0
> [ 374.096057] nvmet_tcp_io_work+0x44/0xabd [nvmet_tcp]
> [ 374.097141] ? dequeue_task_fair+0xb1/0x370
> [ 374.098087] ? finish_task_switch.isra.0+0xb1/0x290
> [ 374.099088] process_one_work+0x1e6/0x380
> [ 374.099922] worker_thread+0x53/0x3d0
> [ 374.100670] ? process_one_work+0x380/0x380
> [ 374.101631] kthread+0x10f/0x130
> [ 374.102336] ? set_kthread_struct+0x40/0x40
> [ 374.103194] ret_from_fork+0x22/0x30
> [ 374.103974] Modules linked in: nvmet_tcp nvmet t10_pi rfkill intel_rapl_msr intel_rapl_common isst_if_common nfit libnvdimm i2c_piix4 pcspkr joydev virtio_balloon drm fuse xfs libcrc32c ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel virtio_net ghash_clmulni_intel net_failover ata_piix virtio_blk libata serio_raw failover dm_mirror dm_region_hash dm_log dm_mod
> [ 374.110690] CR2: 000000000000000c
So this means we still get data from the network when
we shouldn't. Maybe we are simply missing a kernel_sock_shutdown
for SHUT_RD?
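Roughly something like this in the release path (untested, just to
illustrate the ordering I have in mind; field and helper names are
taken from my reading of the current nvmet-tcp driver, details elided):

	static void nvmet_tcp_release_queue_work(struct work_struct *w)
	{
		struct nvmet_tcp_queue *queue =
			container_of(w, struct nvmet_tcp_queue, release_work);

		/* ... remove the queue from the list, restore socket callbacks ... */

		/*
		 * Stop the socket from feeding us new data before we start
		 * tearing down the commands, otherwise data_ready keeps
		 * scheduling io_work against state we are about to free.
		 */
		kernel_sock_shutdown(queue->sock, SHUT_RD);

		/*
		 * io_work may requeue itself, so flush_work() only waits for
		 * the current instance; cancel_work_sync() also handles the
		 * self-requeueing case.
		 */
		cancel_work_sync(&queue->io_work);

		nvmet_tcp_uninit_data_in_cmds(queue);
		nvmet_sq_destroy(&queue->nvme_sq);

		/* nvmet_sq_destroy() may schedule io_work again, cancel once more */
		cancel_work_sync(&queue->io_work);

		/* ... release the socket, free the commands and the queue ... */
	}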
>
>>
>>> * Fix this bug by preventing io_work from being enqueued when
>>> sk_user_data is NULL (it means that the queue is going to be deleted)
>>
>> This is triggered from the completion path, where the commands
>> are not in a state where they are still fetching data from the
>> host. How does this prevent the crash?
>
> io_work is also triggered every time nvmet_req_init() fails and when
> nvmet_sq_destroy() is called; I am not really sure about the state
> of the commands in those cases.
But that is from the workqueue context, which means that
cancel_work_sync() should prevent it, right?
>
>
>>
>>> * Ensure that all the memory allocated for the commands' iovec is freed
>>
>> How is this needed to prevent a crash?
>
> It's not really needed to prevent the crash, but without it
> you will see massive memory leaks.
> Sometimes it exhausted the entire system memory
> before hitting the crash.
But that needs to be a separate fix and not combined with other
fixes.
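By a separate fix I mean a standalone patch that just releases the
per-command iovec on teardown, something along these lines (untested
sketch; cmd->iov is my assumption for where the allocation made in
nvmet_tcp_map_data() lives):

	/* hypothetical helper, called for every command when the queue is torn down */
	static void nvmet_tcp_free_cmd_iovec(struct nvmet_tcp_cmd *cmd)
	{
		kfree(cmd->iov);	/* kfree(NULL) is safe if nothing was allocated */
		cmd->iov = NULL;
	}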