kernel panic in nvmet_tcp when receiving malformed admin command

Wed Sep 30 07:28:08 EDT 2020

Thanks you Sagi.

-----Original Message-----
From: Sagi Grimberg <sagi at grimberg.me> 
Sent: Tuesday, September 29, 2020 11:48 PM
To: Anner, Ran; linux-nvme at lists.infradead.org
Cc: Zinger, Eldad; Grupi, Elad; Engel, Amit
Subject: Re: kernel panic in nvmet_tcp when receiving malformed admin command

[EXTERNAL EMAIL] 

> Hello all.

Hey Ran, thanks for reporting

> 
> We are investigating kernel panic when a malformed admin command with inline write data is sent from host to target using nvmeof tcp.
> Following a code review of nvmet_tcp from latest kernel and adding debug logs along the way I came up with the following flow:
> 
> nvmet_tcp_try_recv_pdu sends an IO with inline data to nvmet_req_init.
> nvmet_req_init send to parse which fails (ex. host sends malformed command).
> nvmet_req_init calls __nvmet_req_complete, and in turn calls queue_response.
> nvmet_tcp_queue_response add the command to queue response list.
> nvmet_req_init return error status and nvmet_tcp_handle_req_failure is called.
> nvmet_tcp_handle_req_failure checks if the command has inline data and will allocate pages and iovs.
> nvmet_tcp_handle_req_failure return and nvmet_tcp_done_recv_pdu returns -EAGAIN.
> nvmet_tcp_try_recv_pdu returns -EAGAIN and in that case nvmet_tcp_try_recv_one returns 0.
> In case nvmet_tcp_try_recv_one returns 0, nvmet_tcp_try_recv breaks from loop and return zero as well.
> In nvmet_tcp_io_work we continue to nvmet_tcp_try_send in case the return value is zero.
> nvmet_tcp_try_send_one will call nvmet_tcp_fetch_cmd which can assign snd_cmd as the failed command.
> nvmet_tcp_need_data_out and nvmet_tcp_need_data_in return false so we call nvmet_setup_response_pdu.
> nvmet_setup_response_pdu set the cmd state to NVMET_TCP_SEND_RESPONSE and return.
> Back to nvmet_tcp_try_send_one, cmd->state == NVMET_TCP_SEND_RESPONSE so nvmet_try_send_response is called.
> After kernel_sendpage we free the cmd iov and sgl.
> Eventually nvmet_tcp_try_recv_data will be called but the iov was already removed and general protection fault is triggered.
> 
> I was wondering if this is a knows issue or if someone else encountered this before.
> It seems to me a fix for this would be to not queue the response before receiving all data in this specific flow.
> Added below snippet of the kernel trace.
> Alternatively if we do not want to delay the response, free the iov after both flows are done, maybe using reference counting to do so.
> I will be glad to hear your thoughts and comments.

This is indeed a bug. We definitely need a refcount here. I recall that this used to work though..

What we probably need to do is to defer the response send until we complete fetching the entire incoming data and then safely free the command.

I'll try to spin a patch soon.

> 
> Thank you,
> Ran.
> 
> [13538.046364] general protection fault: 0000 [#1] SMP NOPTI 
> [13538.051762] CPU: 10 PID: 603 Comm: kworker/10:1H Kdump: loaded 
> Tainted: G        W  OE     4.19.134-coreos-r9999.1600774095-1610 #1 
> [13538.063574] Hardware name: EMC 900-532-002/110-551-910C-01, BIOS 
> 16.35 04/08/2020 [13538.071049] Workqueue: nvmet_tcp_wq 
> nvmet_tcp_io_work [nvmet_tcp] [13538.077142] RIP: 
> 0010:memcpy_erms+0x6/0x10 [13538.081237] Code: 90 90 90 90 eb 1e 0f 1f 
> 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 
> 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 
> f8 48 83 fa 20 72 7e> [13538.099981] RSP: 0018:ffffc9000781bc70 
> EFLAGS: 00010202 [13538.105199] RAX: 31d247ca1958f63f RBX: 
> 0000000000000510 RCX: 0000000000000510 [13538.112322] RDX: 
> 0000000000000510 RSI: ffff8886445f60ca RDI: 31d247ca1958f63f 
> [13538.119446] RBP: ffff888308c41920 R08: 0000000000000510 R09: 
> ffff88850b96e2bc [13538.126570] R10: ffff88850b96de48 R11: 
> ffff8886445f60ca R12: 0000000000000510 [13538.133693] R13: 
> 0000000000000510 R14: ffff8886445f65da R15: ffff8886902ddd90 
> [13538.140819] FS:  0000000000000000(0000) GS:ffff8897dfe80000(0000) 
> knlGS:0000000000000000 [13538.148906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [13538.154650] CR2: 00007f547169beb0 CR3: 000000000220a001 CR4: 00000000007606e0 [13538.161772] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [13538.168896] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [13538.176022] PKRU: 55555554 [13538.178726] Call Trace:
> [13538.181171]  _copy_to_iter+0x30e/0x3b0 [13538.184925]  ? 
> wake_up_klogd.part.24+0x30/0x40 [13538.189372]  
> skb_copy_datagram_iter+0x7b/0x260 [13538.193815]  
> tcp_recvmsg+0x238/0xc80 [13538.197395]  ? 
> nvmet_tcp_io_work+0x400/0xa80 [nvmet_tcp] [13538.202707]  ? 
> release_sock+0x43/0x90 [13538.206373]  ? tcp_sendpage+0x41/0x50 
> [13538.210038]  inet_recvmsg+0x5b/0xd0 [13538.213522]  
> nvmet_tcp_io_work+0xea/0xa80 [nvmet_tcp] [13538.218570]  
> process_one_work+0x206/0x400 [13538.222580]  worker_thread+0x2d/0x3e0 
> [13538.226243]  ? process_one_work+0x400/0x400 [13538.230421]  
> kthread+0x112/0x130 [13538.233647]  ? kthread_bind+0x20/0x20 
> [13538.237312]  ret_from_fork+0x1f/0x40
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>