nvme-tcp: io_work NULL pointer when racing with queue stop
Sagi Grimberg
sagi at grimberg.me
Thu Jan 27 15:05:21 PST 2022
>> Thank you for the following detailed description. I'm going to go back
>> to my crash report and take another look at this one.
>
> No worries Chris, perhaps I can assist.
>
> Is the dmesg log prior to the BUG available? Does it tell us anything
> to what was going on leading to this?
>
> Any more information about the test case? (load + controller reset)
> Is the reset in a loop? Any more info about the load?
> Any other 'interference' during the test?
> How reproducible is this?
> Is this Linux nvmet as the controller?
> How many queues does the controller have? (it will help me understand
> how easy it is to reproduce on a vm setup)
I took another look at the code and I think I see how io_work may be
triggered after the socket was released. The issue might be the
.submit_async_event callback from the core.
When we start a reset, the first thing we do is stop the pending
work elements that may trigger I/O by calling nvme_stop_ctrl, and
then we continue to tear down the I/O queues and then the admin
queue (in nvme_tcp_teardown_ctrl).
So the sequence is:
nvme_stop_ctrl(ctrl);
nvme_tcp_teardown_ctrl(ctrl, false);
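For context, the teardown path on my end looks roughly like the below
(paraphrased from my tree, so the exact shape may differ in yours); the
point is that the admin queue, and with it the socket that io_work uses,
is released at the end of this sequence:

static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
{
	/* I/O queues go first... */
	nvme_tcp_teardown_io_queues(ctrl, shutdown);
	blk_mq_quiesce_queue(ctrl->admin_q);
	if (shutdown)
		nvme_shutdown_ctrl(ctrl);
	else
		nvme_disable_ctrl(ctrl);
	/* ...then the admin queue, which frees the socket */
	nvme_tcp_teardown_admin_queue(ctrl, shutdown);
}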
However, there is a possibility, after nvme_stop_ctrl but before
we tear down the admin queue, that the controller sends an AEN
which is processed by the host; that processing includes automatically
submitting another AER, which in turn calls into the driver via
.submit_async_event (instead of the normal .queue_rq, as AERs don't have
timeouts).
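To spell out the window I mean: when the AEN completes, the core
schedules ctrl->async_event_work, and that work item re-submits the AER
straight into the transport. Roughly (paraphrasing the core code, details
may vary by tree):

static void nvme_async_event_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl =
		container_of(work, struct nvme_ctrl, async_event_work);

	/* no controller/queue state check before calling into the transport */
	ctrl->ops->submit_async_event(ctrl);
}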
In nvme_tcp_submit_async_event we do not check the controller or
queue state to see that it is ready to accept a new submission like
we do in .queue_rq (see the snippet below), so we blindly prepare the
AER cmd, queue it and schedule io_work, but at this point I don't see
what guarantees that the queue (i.e. the socket) is not released.
Unless I'm missing something, this flow will trigger a use-after-free
when io_work attempts to access the socket.
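Compare with the entry of .queue_rq, which does gate on the queue and
controller state before touching the queue (again paraphrased, names
from my tree):

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct nvme_tcp_queue *queue = hctx->driver_data;
	struct request *rq = bd->rq;
	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);

	/* fail/requeue instead of touching a queue that may be torn down */
	if (!nvme_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
		return nvme_fail_nonready_command(&queue->ctrl->ctrl, rq);
	...
}

That is the check the patch below mirrors for the AER submission path.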
I see we also don't flush async_event_work in the error recovery
flow, which we probably should do so we can avoid such a race.
I think that the below patch should address the issue:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 96725c3f1e77..bf380ca0e0d1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 	nvme_auth_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	flush_work(&ctrl->async_event_work);
 	nvme_tcp_teardown_io_queues(ctrl, false);
 	/* unquiesce to fail fast pending requests */
 	nvme_start_queues(ctrl);
@@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
 	struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
 	struct nvme_command *cmd = &pdu->cmd;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
+	bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
+
+	if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
+		return;
 	memset(pdu, 0, sizeof(*pdu));
 	pdu->hdr.type = nvme_tcp_cmd;
--
Chris, can you take this for some testing?