[PATCH 2/2] nvmet: fix a race condition between release_queue and io_work

Maurizio Lombardi mlombard at redhat.com
Wed Nov 3 04:31:25 PDT 2021


On Wed, Nov 03, 2021 at 11:28:35AM +0200, Sagi Grimberg wrote:
> 
> So this means we still get data from the network when
> we shouldn't. Maybe we are simply missing a kernel_sock_shutdown
> for SHUT_RD?

Hmm, right: kernel_sock_shutdown(queue->sock) is executed in
nvmet_tcp_delete_ctrl() while sock_release(queue->sock) is called
in nvmet_tcp_release_queue_work(), so the two can race.

I will try moving kernel_sock_shutdown(queue->sock) into
nvmet_tcp_release_queue_work() and test it.
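
Something like the sketch below (untested; function and field names are
the ones in drivers/nvme/target/tcp.c, and the rest of the teardown is
left unchanged):

static void nvmet_tcp_release_queue_work(struct work_struct *w)
{
	struct nvmet_tcp_queue *queue =
		container_of(w, struct nvmet_tcp_queue, release_work);

	mutex_lock(&nvmet_tcp_queue_mutex);
	list_del_init(&queue->queue_list);
	mutex_unlock(&nvmet_tcp_queue_mutex);

	nvmet_tcp_restore_socket_callbacks(queue);
	/* moved here from nvmet_tcp_delete_ctrl(): stop the socket first,
	 * so no further data arrives while the queue is torn down */
	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
	cancel_work_sync(&queue->io_work);

	nvmet_tcp_uninit_data_in_cmds(queue);
	nvmet_sq_destroy(&queue->nvme_sq);	/* may kick io_work again */
	cancel_work_sync(&queue->io_work);	/* so wait for it once more */

	/* only now is it safe to drop the socket */
	sock_release(queue->sock);
	/* remaining teardown (cmds, crypto, ida, kfree) unchanged */
}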

> 
> > 
> > > 
> > > > * Fix this bug by preventing io_work from being enqueued when
> > > > sk_user_data is NULL (it means that the queue is going to be deleted)
> > > 
> > > This is triggered from the completion path, where the commands
> > > are not in a state where they are still fetching data from the
> > > host. How does this prevent the crash?
> > 
> > io_work is also triggered every time nvmet_req_init() fails and when
> > nvmet_sq_destroy() is called; I am not really sure about the state
> > of the commands in those cases.
> 
> But that is from the workqueue context, which means that
> cancel_work_sync() should prevent it, right?


But nvmet_sq_destroy() is called from the release_work context;
we call cancel_work_sync() immediately after, but we can't be sure
the work will actually be canceled: io_work might have started already,
in which case cancel_work_sync() will block until io_work finishes, right?
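
For context, the guard from the patch is essentially the sketch below
(the helper name is hypothetical; it assumes io_work is queued with
queue_work_on() on nvmet_tcp_wq, as elsewhere in the driver):

/* sk_user_data is cleared under sk_callback_lock when the queue is
 * being released, so a NULL pointer here means teardown has started
 * and io_work must not be re-queued.
 */
static inline void nvmet_tcp_try_queue_io_work(struct nvmet_tcp_queue *queue)
{
	struct sock *sk = queue->sock->sk;

	read_lock_bh(&sk->sk_callback_lock);
	if (sk->sk_user_data)
		queue_work_on(queue->io_cpu, nvmet_tcp_wq, &queue->io_work);
	read_unlock_bh(&sk->sk_callback_lock);
}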

> 
> But that needs to be a separate fix and not combined with other
> fixes.

Ok I will submit it as a separate patch.

Maurizio



