[PATCH v2] RDMA/cma: prevent rdma id destroy during cma_iw_handler
Jason Gunthorpe
jgg at ziepe.ca
Wed Jun 14 10:36:58 PDT 2023
On Wed, Jun 14, 2023 at 07:53:49AM +0000, Shinichiro Kawasaki wrote:
> On Jun 13, 2023 / 21:07, Leon Romanovsky wrote:
> > On Tue, Jun 13, 2023 at 10:30:37AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 13, 2023 at 01:43:43AM +0000, Shinichiro Kawasaki wrote:
> > > > > I think there is likely some much larger issue with the IW CM if the
> > > > > cm_id can be destroyed while the iwcm_id is in use? It is weird that
> > > > > there are two id memories for this :\
> > > >
> > > > My understanding of the call chain to rdma id destroy is as follows. I guess
> > > > _destroy_id calls iw_destroy_cm_id before destroying the rdma id, but I am not
> > > > sure why it does not wait for the cm_id deref by cm_work_handler.
> > > >
> > > > nvme_rdma_teardown_io_queues
> > > >   nvme_rdma_stop_io_queues -> chained to cma_iw_handler
> > > >   nvme_rdma_free_io_queues
> > > >     nvme_rdma_free_queue
> > > >       rdma_destroy_id
> > > >         mutex_lock(&id_priv->handler_mutex)
> > > >         destroy_id_handler_unlock
> > > >           mutex_unlock(&id_priv->handler_mutex)
> > > >           _destroy_id
> > > >             iw_destroy_cm_id
> > > >             wait_for_completion(&id_priv->comp)
> > > >             kfree(id_priv)
> > >
> > > Once a destroy_cm_id() has returned that layer is no longer
> > > permitted to run or be running in its handlers. The iw cm is broken if
> > > it allows this, and that is the cause of the bug.
> > >
> > > Taking more refs within handlers that are already not allowed to be
> > > running is just racy.
> >
> > So we need to revert that patch from our rdma-rc.
>
> I see, thanks for the clarifications.
>
> As another fix approach, I reverted the commit 59c68ac31e15 ("iw_cm: free cm_id
> resources on the last deref") so that iw_destroy_cm_id() waits for deref of
> cm_id. With that revert, the KASAN slab-use-after-free disappeared. Is this
> the right fix approach?

That seems like it would bring back the bug it was fixing, though it
isn't totally clear what that is.

There is something wrong with the iwarp cm if it is destroying IDs in
handlers. IB cm avoids doing that to avoid the deadlock; the same
solution will be needed for iwarp too.

Also, the code this patch removed is quite ugly. If we are going back
to waiting, it should be written in a more modern way, without the test
bit and so on.

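
To illustrate the "destroy waits for the last deref" shape, here is a
minimal userspace sketch, not kernel code. Everything named demo_* is
hypothetical; a pthread mutex and condition variable stand in for a
refcount plus a struct completion, and a thread stands in for the
event handler. It only shows the ordering: block new handlers, drop the
owner's reference, wait for the last put, then free.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an id object; not the kernel's structs. */
struct demo_id {
    pthread_mutex_t lock;
    pthread_cond_t done_cv;   /* with 'done', plays the role of a completion */
    bool done;                /* set when the last reference is dropped */
    bool destroying;          /* once set, no new references are handed out */
    int refcount;             /* starts at 1: the owner's reference */
    int events_handled;
};

static struct demo_id *demo_id_create(void)
{
    struct demo_id *id = calloc(1, sizeof(*id));

    if (!id)
        return NULL;
    pthread_mutex_init(&id->lock, NULL);
    pthread_cond_init(&id->done_cv, NULL);
    id->refcount = 1;
    return id;
}

/* Take a reference for a handler; fails once destroy has started. */
static bool demo_id_get(struct demo_id *id)
{
    bool ok;

    pthread_mutex_lock(&id->lock);
    ok = !id->destroying;
    if (ok)
        id->refcount++;
    pthread_mutex_unlock(&id->lock);
    return ok;
}

/* Drop a reference; the last put wakes the waiter in demo_id_destroy(). */
static void demo_id_put(struct demo_id *id)
{
    pthread_mutex_lock(&id->lock);
    assert(id->refcount > 0);
    if (--id->refcount == 0) {
        id->done = true;
        pthread_cond_broadcast(&id->done_cv);
    }
    pthread_mutex_unlock(&id->lock);
}

/* Wait until no handler holds a reference, then free the object.
 * Returns how many events had been handled by that point. */
static int demo_id_destroy(struct demo_id *id)
{
    int handled;

    pthread_mutex_lock(&id->lock);
    id->destroying = true;
    pthread_mutex_unlock(&id->lock);

    demo_id_put(id);                      /* drop the owner's reference */

    pthread_mutex_lock(&id->lock);
    while (!id->done)
        pthread_cond_wait(&id->done_cv, &id->lock);
    handled = id->events_handled;
    pthread_mutex_unlock(&id->lock);

    pthread_cond_destroy(&id->done_cv);
    pthread_mutex_destroy(&id->lock);
    free(id);
    return handled;
}

/* Handler thread: runs with a reference taken by the dispatcher. */
static void *demo_handler(void *arg)
{
    struct demo_id *id = arg;

    pthread_mutex_lock(&id->lock);
    id->events_handled++;
    pthread_mutex_unlock(&id->lock);
    demo_id_put(id);                      /* id must not be touched after this */
    return NULL;
}

/* Dispatch one event, then destroy at once. Returns -1 on setup failure. */
static int demo_run(void)
{
    struct demo_id *id = demo_id_create();
    pthread_t thread;
    int handled;

    if (!id)
        return -1;
    if (!demo_id_get(id) ||
        pthread_create(&thread, NULL, demo_handler, id) != 0) {
        demo_id_destroy(id);
        return -1;
    }
    handled = demo_id_destroy(id);        /* blocks until the handler's put */
    pthread_join(thread, NULL);
    return handled;
}
```

Calling demo_run() dispatches one event and destroys right away; since
the reference is taken before the thread starts, destroy cannot free
the object until the handler has done its put. Note that this sketch
would deadlock if demo_id_destroy() were called from inside
demo_handler(), which is the case the IB cm approach is meant to avoid.
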
Jason