kernel panic due to an nvmet race
Engel, Amit
Amit.Engel at Dell.com
Tue May 17 08:05:22 PDT 2022
Right. We are working on a fix for tcp/rdma
Thanks
Amit
-----Original Message-----
From: Sagi Grimberg <sagi at grimberg.me>
Sent: Tuesday, May 17, 2022 3:52 PM
To: Engel, Amit; linux-nvme at lists.infradead.org
Cc: Grupi, Elad
Subject: Re: kernel panic due to an nvmet race
> Hi All,
>
> We observed a kernel panic which, based on our analysis, is due to an nvmet race.
> The race is between nvme connect and nvmet tcp port removal.
> The scenario:
> If nvmet_port_release frees the nvmet port just before nvme connect calls nvmet_find_get_subsys (as part of nvmet_alloc_ctrl), nvmet_find_get_subsys ends up accessing a port that has already been freed:
>
> nvme/target/core.c:
> static struct nvmet_subsys *nvmet_find_get_subsys(struct nvmet_port *port,
> 		const char *subsysnqn)
> ...snip
> 	down_read(&nvmet_config_sem);
> 	list_for_each_entry(p, &port->subsystems, entry) {
> 		if (!strncmp(p->subsys->subsysnqn, subsysnqn,
>
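To make the window concrete, the interleaving can be sketched like this (illustrative only, not the exact call sequence in the driver):

```c
/* CPU A: nvme connect                  CPU B: configfs port removal
 *
 * nvmet_alloc_ctrl()
 *   nvmet_find_get_subsys(port, ...)
 *                                      nvmet_port_release()
 *                                        kfree(port)       <-- port freed
 *   down_read(&nvmet_config_sem);
 *   list_for_each_entry(p, &port->subsystems, entry)
 *                             ^-- dereferences freed port: use-after-free
 */
```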
> crash> bt
> PID: 30216 TASK: ffff888c1e163f00 CPU: 0 COMMAND: "nt"
> #0 [ffffc90020153858] machine_kexec at ffffffff81062fcc
> #1 [ffffc900201538b0] __crash_kexec at ffffffff811273ef
> #2 [ffffc90020153978] panic at ffffffff810851f7
> #3 [ffffc90020153a18] no_context at ffffffff8107104f
> #4 [ffffc90020153a80] page_fault at ffffffff81801184
> [exception RIP: nvmet_find_get_subsys+161]
> RIP: ffffffffa0bbce01 RSP: ffffc90020153b38 RFLAGS: 00010282
> RAX: ffff888c1e163f01 RBX: 0000000000000000 RCX: 0000000000000020
> RDX: 0000000000000000 RSI: ffffffffa0bc5895 RDI: ffffffffa0bce040
> RBP: ffff88aeafc3f520 R8: ffffc90020153ba0 R9: 0000000000000000
> R10: ffffc90020153bf8 R11: ffff888cb8e97b00 R12: ffff888bb3469a00
> R13: ffff888bb3469900 R14: ffffc9000c41ba70 R15: ffffc90020153ba0
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #5 [ffffc90020153b58] nvmet_alloc_ctrl at ffffffffa0bbe4c2 [nvmet]
>
> Can you please review and provide your inputs ?
Indeed it seems that nothing is preventing this from happening.
The main issue is that ->remove_port() does not currently guarantee teardown of all the resources; it only triggers teardown asynchronously. At least this is the case for tcp/rdma (and it appears fc as well).