nvme-tcp: io_work NULL pointer when racing with queue stop

Chris Leech cleech at redhat.com
Fri Jan 28 15:55:01 PST 2022


This patch completed a day of automated testing without any failures.

Tested-by: Chris Leech <cleech at redhat.com>

On Thu, Jan 27, 2022 at 7:25 PM Chris Leech <cleech at redhat.com> wrote:
>
> Thanks Sagi, this looks promising.
>
> It also might fit with a new backtrace I was just looking at from the
> same testing, where nvme_tcp_submit_async_event hit a NULL
> ctrl->async_req.pdu, which I can only see happening if it was racing
> with nvme_tcp_error_recovery_work.
>
> I'll get this into some testing here at Red Hat and let you know the results.
>
> - Chris
>
> On Thu, Jan 27, 2022 at 3:05 PM Sagi Grimberg <sagi at grimberg.me> wrote:
> >
> >
> > >> Thank you for the following detailed description. I'm going to go back
> > >> to my crash report and take another look at this one.
> > >
> > > No worries Chris, perhaps I can assist.
> > >
> > > Is the dmesg log prior to the BUG available? Does it tell us anything
> > > to what was going on leading to this?
> > >
> > > Any more information about the test case? (load + controller reset)
> > > Is the reset in a loop? Any more info about the load?
> > > Any other 'interference' during the test?
> > > How reproducible is this?
> > > Is this Linux nvmet as the controller?
> > > How many queues does the controller have? (it will help me understand
> > > how easy it is to reproduce on a vm setup)
> >
> > I took another look at the code and I think I see how io_work may be
> > triggered after a socket was released. The issue might be the
> > .submit_async_event callback from the core.
> >
> > When we start a reset, the first thing we do is stop the pending
> > work elements that may trigger I/O by calling nvme_stop_ctrl, and
> > then we continue to tear down the I/O queues and then the admin
> > queue (in nvme_tcp_teardown_ctrl).
> >
> > So the sequence is:
> >          nvme_stop_ctrl(ctrl);
> >          nvme_tcp_teardown_ctrl(ctrl, false);
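> >
> > For reference, the teardown side looks roughly like this (paraphrased
> > from drivers/nvme/host/tcp.c; the exact code varies by kernel version):
> >
> > static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
> > {
> >         /* tear down the I/O queues first... */
> >         nvme_tcp_teardown_io_queues(ctrl, shutdown);
> >         blk_mq_quiesce_queue(ctrl->admin_q);
> >         if (shutdown)
> >                 nvme_shutdown_ctrl(ctrl);
> >         else
> >                 nvme_disable_ctrl(ctrl);
> >         /* ...then the admin queue, which releases its socket */
> >         nvme_tcp_teardown_admin_queue(ctrl, shutdown);
> > }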
> >
> > However, there is a window, after nvme_stop_ctrl but before we tear
> > down the admin queue, in which the controller may send an AEN that is
> > processed by the host. Processing it includes automatically submitting
> > another AER, which in turn calls the driver's .submit_async_event
> > (instead of the normal .queue_rq, as AERs don't have timeouts).
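> >
> > Roughly, the resubmission path in the core looks like this (from
> > memory, drivers/nvme/host/core.c; details may differ per version):
> >
> > static void nvme_async_event_work(struct work_struct *work)
> > {
> >         struct nvme_ctrl *ctrl =
> >                 container_of(work, struct nvme_ctrl, async_event_work);
> >
> >         /* re-arm the AER directly via the transport, bypassing .queue_rq */
> >         ctrl->ops->submit_async_event(ctrl);
> > }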
> >
> > In nvme_tcp_submit_async_event we do not check that the controller or
> > queue state is ready to accept a new submission, as we do in
> > .queue_rq; we blindly prepare the AER cmd, queue it, and schedule
> > io_work, and at that point I don't see what guarantees that the queue
> > (i.e. the socket) has not been released.
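> >
> > For comparison, .queue_rq gates every submission on the controller and
> > queue state, something like this (paraphrased, not the exact code):
> >
> > static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
> >                 const struct blk_mq_queue_data *bd)
> > {
> >         struct nvme_tcp_queue *queue = hctx->driver_data;
> >         bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
> >
> >         /* fail or requeue instead of touching a dead queue's socket */
> >         if (!nvme_check_ready(&queue->ctrl->ctrl, bd->rq, queue_ready))
> >                 return nvme_fail_nonready_command(&queue->ctrl->ctrl, bd->rq);
> >         ...
> > }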
> >
> > Unless I'm missing something, this flow will trigger a use-after-free
> > when io_work attempts to access the socket.
> >
> > I see we also don't flush async_event_work in the error recovery
> > flow, which we probably should do so we can avoid such a race.
> >
> > I think that the below patch should address the issue:
> > --
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 96725c3f1e77..bf380ca0e0d1 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2097,6 +2097,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> >
> >          nvme_auth_stop(ctrl);
> >          nvme_stop_keep_alive(ctrl);
> > +       flush_work(&ctrl->async_event_work);
> >          nvme_tcp_teardown_io_queues(ctrl, false);
> >          /* unquiesce to fail fast pending requests */
> >          nvme_start_queues(ctrl);
> > @@ -2212,6 +2213,10 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg)
> >          struct nvme_tcp_cmd_pdu *pdu = ctrl->async_req.pdu;
> >          struct nvme_command *cmd = &pdu->cmd;
> >          u8 hdgst = nvme_tcp_hdgst_len(queue);
> > +       bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);
> > +
> > +       if (ctrl->ctrl.state != NVME_CTRL_LIVE || !queue_ready)
> > +               return;
> >
> >          memset(pdu, 0, sizeof(*pdu));
> >          pdu->hdr.type = nvme_tcp_cmd;
> > --
> >
> > Chris, can you take this for some testing?
> >