[RFC] nvme-tcp: fix a possible double-free after failed to send request

Mon Jan 1 01:58:30 PST 2024

On 12/26/23 08:26, zhangyanjun at cestc.cn wrote:
> From: Yanjun Zhang <zhangyanjun at cestc.cn>
> 
> Thanks for your reply and attention!
> 
>>> In storage clusters constructed by nvme-tcp driver, we have encountered
>>> the following crash on the host kernel severval times.
>>>
>>> [248514.030873] nvme nvme1: failed to send request -13
>>> [248514.035916] ------------[ cut here ]------------
>>> [248514.035918] kernel BUG at mm/slub.c:379!
>>> [248514.037647] invalid opcode: 0000 [#1] SMP NOPTI
>>> [248514.039416] CPU: 0 PID: 9 Comm: kworker/0:1H Kdump: loaded Tainted: G S 5.15.67-6.cl9.x86_64 #1
>>> [248514.041376] Hardware name: CECLOUD CeaStor 16114/BC13MBSBC, BIOS 1.37 02/24/2023
>>> [248514.043433] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
>>> [248514.045576] RIP: 0010:__slab_free+0x16a/0x320
>>> [248514.047751] Code: 24 20 e8 69 28 78 00 44 8b 44 24 0c 4c 8b 54 24 10 44 0f b6 5c 24 1b 0f b6 74 24 1c 48 89 04 24 4c 8b 4c 24 20 e9 28 ff ff ff <0f> 0b 41 f7 46 08 00 0d 21 00 75 a0 4d 85 ed 75 9b 80 4c 24 5b 80
>>> [248514.052500] RSP: 0018:ff51b1a6c0273bf0 EFLAGS: 00010246
>>> [248514.054798] RAX: ff2378e68268b800 RBX: 0000000080080004 RCX: ff2378e68268b000
>>> [248514.057038] RDX: ff2378e68268b000 RSI: ffca59110c09a200 RDI: ff2378a480034d00
>>> [248514.059245] RBP: ff51b1a6c0273c90 R08: 0000000000000001 R09: ffffffffc0901a0a
>>> [248514.061386] R10: ff2378e68268b000 R11: ffffffff86e06000 R12: ffca59110c09a200
>>> [248514.063423] R13: ff2378e68268b000 R14: ff2378a480034d00 R15: 0000000000000078
>>> [248514.065428] FS: 0000000000000000(0000) GS:ff2378d32fe00000(0000) knlGS:0000000000000000
>>> [248514.067456] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [248514.069531] CR2: 00007f4759e1c800 CR3: 0000001b5e5a6005 CR4: 0000000000771ef0
>>> [248514.071706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [248514.073916] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [248514.076130] PKRU: 55555554
>>> [248514.078392] Call Trace:
>>> [248514.080640] <TASK>
>>> [248514.082898] ? sk_stream_alloc_skb+0x66/0x2e0
>>> [248514.085231] ? tcp_skb_entail+0x11d/0x130
>>> [248514.087595] ? tcp_build_frag+0xf0/0x390
>>> [248514.089980] ? nvme_complete_rq+0x1a/0x1f0 [nvme_core]
>>> [248514.092433] kfree+0x215/0x240
>>> [248514.094918] nvme_complete_rq+0x1a/0x1f0 [nvme_core]
>>> [248514.097469] nvme_tcp_recv_pdu+0x534/0x570 [nvme_tcp]
>>> [248514.100070] nvme_tcp_recv_skb+0x4f/0x23e [nvme_tcp]
>>> [248514.102699] ? nvme_tcp_recv_pdu+0x570/0x570 [nvme_tcp]
>>> [248514.105317] tcp_read_sock+0xa0/0x270
>>> [248514.107958] nvme_tcp_try_recv+0x65/0xa0 [nvme_tcp]
>>> [248514.110666] ? nvme_tcp_try_send+0x16b/0x200 [nvme_tcp]
>>> [248514.113431] nvme_tcp_io_work+0x4d/0xa0 [nvme_tcp]
>>> [248514.116247] process_one_work+0x1e8/0x390
>>> [248514.119085] worker_thread+0x53/0x3d0
>>> [248514.121980] ? process_one_work+0x390/0x390
>>> [248514.124887] kthread+0x124/0x150
>>> [248514.127835] ? set_kthread_struct+0x50/0x50
>>> [248514.130836] ret_from_fork+0x1f/0x30
>>> [248514.133841] </TASK>
>>>
>>> By analyzing the vmcore, we know the direct cause is that the slab object
>>> request->special_vec was freed twicely. According to the error message
>>> "nvme nvme1: failed to send request -13" and nvme_tcp_request->state =
>>> NVME_TCP_SEND_DATA, the pdu has been send by nvme_tcp_try_send_cmd_pdu.
>>
>> So what exactly failed to send? incapsule date? Or h2cdata?
>>
> According to the nvme_tcp_request->state is NVME_TCP_SEND_DATA currently, I think
> it shoule be incapsule data failed to send after the cmd pdu has been send.

Sounds like it, can you verify that though?

> 
>>> And the nvme_tcp_fail_request would execute nvme_complete_rq after failed
>>> to send data.
>>
>> That is correct.
>>
>>> Then the nvme_tcp_recv_pdu may receive the responding pdu
>>
>> Which PDU was that? Isn't the controller expecting request data?
>>
> The received pdu type is nvme_tcp_rsp. It should not reveive the data matched with
> the above request that has been completed because of failing to send data or take
> some extra actions for this request? The sent pdu with matched request is nvme_tcp_cmd.

What is unclear to me is why does the controller send a response if it
is expecting incapsule data?

> crash> nvme_tcp_request.pdu ff2378b7cee425d0
>    pdu = 0xff2378dfe9c5f9d0,
> crash> nvme_tcp_hdr.type 0xff2378dfe9c5f9d0
>    type = 4 '\004',
> 
>>> and the nvme_tcp_process_nvme_cqe would complete the request again. To
>>> avoid this slab object double-free issuse, we try to make the following
>>> code modifications, can you give some suggestions, thanks!
>>>
>>> Signed-off-by: Yanjun Zhang <zhangyanjun at cestc.cn>
>>> ---
>>> drivers/nvme/host/tcp.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 08805f027..84f724558 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -581,6 +581,9 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>> return -EINVAL;
>>> }
>>>
>>> + if (!blk_mq_request_started(rq))
>>> + return 0;
>>
>> First, I want to understand if this is a spurious completion, meaning is
>> this suggesting a protection against a misbehaving controller? Or there
>> was actually something that the host got wrong?
>>
> What can we identify is the nvme_ctrl->state = NVME_CTRL_LIVE from vmcore, and there is no other
> error with nvme ctrl before the message "nvme nvme1: failed to send request -13".
> It looks like always doing well, suddenly an error occurred, but we can not find why it failed?

Can you check why did the controller send a response for a command
capsule where it is still expecting data?

What I'm understanding from you is that the controller sends a spurious
command completion, and you are suggesting that the host protects
against it. This is not tcp specific as far as I can see.

> 
> [237163.158966] systemd-journald[379035]: /var/log/journal/a888731b91ba4a55bf056a48723bbc51/system.journal: Journal header limits reached or header out-of-date, rotating.
> [248514.030873] nvme nvme1: failed to send request -13
> 
>> Because this sort of check does not belong in the tcp driver.
> 
> Or should we add check by comparing nvme_request status with NVME_SC_HOST_PATH_ERROR?
> NVME_SC_HOST_PATH_ERROR is set by nvme_tcp_fail_request, but I am not sure if this
> check has other impact?

I don't know. It is true that the host does not really protects against
a case where the controller sends it a response although it was still 
expecting data. I'm trying to understand why this is a tcp specific issue.

Can you check why does the controller sends back a completion although
it should be receiving data?