[PATCH 2/2] nvmet: fix a race condition between release_queue and io_work
Maurizio Lombardi
mlombard at redhat.com
Wed Nov 3 04:31:25 PDT 2021
On Wed, Nov 03, 2021 at 11:28:35AM +0200, Sagi Grimberg wrote:
>
> So this means we still get data from the network when
> we shouldn't. Maybe we are simply missing a kernel_sock_shutdown
> for SHUT_RD?
Hmm, right: kernel_sock_shutdown(queue->sock) is executed in
nvmet_tcp_delete_ctrl(), while sock_release(queue->sock) is called
in nvmet_tcp_release_queue_work(), so there could be a race between
the two. I will try to move kernel_sock_shutdown(queue->sock) into
nvmet_tcp_release_queue_work() and test it.
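Something along these lines (untested sketch; the surrounding lines of
nvmet_tcp_release_queue_work() are paraphrased from memory, and the
exact placement of the shutdown is what I still have to verify):

static void nvmet_tcp_release_queue_work(struct work_struct *w)
{
	struct nvmet_tcp_queue *queue =
		container_of(w, struct nvmet_tcp_queue, release_work);

	mutex_lock(&nvmet_tcp_queue_mutex);
	list_del_init(&queue->queue_list);
	mutex_unlock(&nvmet_tcp_queue_mutex);

	/* stop receiving before we start tearing the queue down */
	kernel_sock_shutdown(queue->sock, SHUT_RD);

	nvmet_tcp_restore_socket_callbacks(queue);
	cancel_work_sync(&queue->io_work);

	nvmet_tcp_uninit_data_in_cmds(queue);
	nvmet_sq_destroy(&queue->nvme_sq);
	cancel_work_sync(&queue->io_work);

	sock_release(queue->sock);
	/* ... rest of the teardown unchanged ... */
}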
>
> >
> > >
> > > > * Fix this bug by preventing io_work from being enqueued when
> > > > sk_user_data is NULL (it means that the queue is going to be deleted)
> > >
> > > This is triggered from the completion path, where the commands
> > > are not in a state where they are still fetching data from the
> > > host. How does this prevent the crash?
> >
> > io_work is also triggered every time a nvmet_req_init() fails and when
> > nvmet_sq_destroy() is called, I am not really sure about the state
> > of the commands in those cases.
>
> But that is from the workqueue context - which means that
> cancel_work_sync should prevent it right?
But nvmet_sq_destroy() is called from the release_work context;
we do call cancel_work_sync() immediately after it, but we can't be
sure the work will actually be canceled: io_work might have started
already, in which case cancel_work_sync() will simply block until
io_work finishes its job, right?
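
For reference, what this patch does is roughly the following (the
helper below is a sketch; queue_cpu() and nvmet_tcp_wq are the same
ones the existing callers use, and sk_user_data is cleared under
sk_callback_lock when the socket callbacks are restored on release):

static inline void nvmet_tcp_queue_io_work(struct nvmet_tcp_queue *queue)
{
	struct sock *sk = queue->sock->sk;

	read_lock_bh(&sk->sk_callback_lock);
	/* sk_user_data == NULL means the queue is going to be deleted */
	if (sk->sk_user_data)
		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
			      &queue->io_work);
	read_unlock_bh(&sk->sk_callback_lock);
}

Every place that currently calls queue_work_on() on &queue->io_work
directly would go through this helper instead, so a failed
nvmet_req_init() or nvmet_sq_destroy() can no longer requeue io_work
once the release path has started.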
>
> But that needs to be a separate fix and not combined with other
> fixes.
OK, I will submit it as a separate patch.
Maurizio