nvme-tcp: io_work NULL pointer when racing with queue stop
Sagi Grimberg
sagi at grimberg.me
Thu Jan 27 15:05:21 PST 2022
>> Thank you for the following detailed description. I'm going to go back
>> to my crash report and take another look at this one.
>
> No worries Chris, perhaps I can assist.
>
> Is the dmesg log prior to the BUG available? Does it tell us anything
> to what was going on leading to this?
>
> Any more information about the test case? (load + controller reset)
> Is the reset in a loop? Any more info about the load?
> Any other 'interference' during the test?
> How reproducible is this?
> Is this Linux nvmet as the controller?
> How many queues does the controller have? (it will help me understand
> how easy it is to reproduce on a vm setup)
I took another look at the code and I think I see how io_work may be
triggered after the socket was released. The issue might be the
.submit_async_event callback from the core.
When we start a reset, the first thing we do is stop the pending
work elements that may trigger I/O by calling nvme_stop_ctrl, and
then we continue to tear down the I/O queues and then the admin
queue (in nvme_tcp_teardown_ctrl).
So the sequence is:
nvme_stop_ctrl(ctrl);
nvme_tcp_teardown_ctrl(ctrl, false);
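For context, the teardown path on my end looks roughly like the below
(paraphrased from my tree, so the exact shape may differ in yours); the
point is that the admin queue, and with it the socket that io_work uses,
is released at the end of this sequence:

static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
{
	/* I/O queues go first... */
	nvme_tcp_teardown_io_queues(ctrl, shutdown);
	blk_mq_quiesce_queue(ctrl->admin_q);
	if (shutdown)
		nvme_shutdown_ctrl(ctrl);
	else
		nvme_disable_ctrl(ctrl);
	/* ...then the admin queue, which frees the socket */
	nvme_tcp_teardown_admin_queue(ctrl, shutdown);
}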
However, there is a possibility, after nvme_stop_ctrl but before
we tear down the admin queue, that the controller sends an AEN
which is processed by the host; that processing includes automatically
submitting another AER, which in turn calls into the driver via
.submit_async_event (instead of the normal .queue_rq, as AERs don't have
timeouts).
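To spell out the window I mean: when the AEN completes, the core
schedules ctrl->async_event_work, and that work item re-submits the AER
straight into the transport. Roughly (paraphrasing the core code, details
may vary by tree):

static void nvme_async_event_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl =
		container_of(work, struct nvme_ctrl, async_event_work);

	/* no controller/queue state check before calling into the transport */
	ctrl->ops->submit_async_event(ctrl);
}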
In nvme_tcp_submit_async_event we do not check the controller or
queue state to see that it is ready to accept a new submission like
we do in .queue_rq (see the snippet below), so we blindly prepare the
AER cmd, queue it and schedule io_work, but at this point I don't see
what guarantees that the queue (i.e. the socket) is not released.
Unless I'm missing something, this flow will trigger a use-after-free
when io_work attempts to access the socket.
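Compare with the entry of .queue_rq, which does gate on the queue and
controller state before touching the queue (again paraphrased, names
from my tree):

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct nvme_tcp_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;
	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);

	/* fail/requeue instead of touching a queue that may be torn down */
	if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
		return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
	...
}

That is the check the patch below mirrors for the AER submission path.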
I see we also don't flush async_event_work in the error recovery
flow, which we probably should do so we can avoid such a race.
I think that the below patch should address the issue:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 96725c3f1e77..bf380ca0e0d1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 	nvme_auth_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	flush_work(&ctrl->async_event_work);
 	nvme_tcp_teardown_io_queues(ctrl, false);
 	/* unquiesce to fail fast pending requests */
 	nvme_start_queues(ctrl);
@@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
 	struct nvme_command *cmd = &pdu->cmd;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+
+	if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
+		return;
 	memset(pdu, 0, sizeof(*pdu));
 	pdu->hdr.type = nvme_tcp_cmd;
--
Chris, can you take this for some testing?