[PATCH v2] RDMA/cma: prevent rdma id destroy during cma_iw_handler

Shinichiro Kawasaki shinichiro.kawasaki at wdc.com
Wed Jun 14 00:53:49 PDT 2023


On Jun 13, 2023 / 21:07, Leon Romanovsky wrote:
> On Tue, Jun 13, 2023 at 10:30:37AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 13, 2023 at 01:43:43AM +0000, Shinichiro Kawasaki wrote:
> > > > I think there is likely some much larger issue with the IW CM if the
> > > > cm_id can be destroyed while the iwcm_id is in use? It is weird that
> > > > there are two id memories for this :\
> > > 
> > > My understanding of the call chain to rdma id destroy is as follows. I guess
> > > _destroy_id calls iw_destroy_cm_id before destroying the rdma id, but I am not
> > > sure why it does not wait for the cm_id deref by cm_work_handler.
> > > 
> > > nvme_rdma_teardown_io_queues
> > >  nvme_rdma_stop_io_queues -> chained to cma_iw_handler
> > >  nvme_rdma_free_io_queues
> > >   nvme_rdma_free_queue
> > >    rdma_destroy_id
> > >     mutex_lock(&id_priv->handler_mutex)
> > >     destroy_id_handler_unlock
> > >      mutex_unlock(&id_priv->handler_mutex)
> > >      _destroy_id
> > >        iw_destroy_cm_id
> > >        wait_for_completion(&id_priv->comp)
> > >        kfree(id_priv)
> > 
> > Once a destroy_cm_id() has returned that layer is no longer
> > permitted to run or be running in its handlers. The iw cm is broken if
> > it allows this, and that is the cause of the bug.
> > 
> > Taking more refs within handlers that are already not allowed to be
> > running is just racy.
> 
> So we need to revert that patch from our rdma-rc.

I see, thanks for the clarifications.
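
Just to restate the failure as I understand it, based on the call chain above and
the KASAN report (the exact interleaving may differ):

  CPU A (nvme_rdma teardown)              CPU B (iw_cm workqueue)

  rdma_destroy_id()
    destroy_id_handler_unlock()
      _destroy_id()
        iw_destroy_cm_id()                cm_work_handler()
          /* returns without waiting */     cma_iw_handler(iw_id, iw_event)
        wait_for_completion(&id_priv->comp)
        kfree(id_priv)
                                             /* handler still running and
                                                dereferencing the freed
                                                id_priv -> use-after-free */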

As another fix approach, I reverted commit 59c68ac31e15 ("iw_cm: free cm_id
resources on the last deref") so that iw_destroy_cm_id() waits for the last
deref of the cm_id before freeing it. With that revert, the KASAN
slab-use-after-free no longer reproduces. Is this the right fix approach?
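
For reference, my understanding of what the revert restores in iwcm.c is roughly
the following. This is a hand-written sketch, not the actual code; names such as
free_cm_id(), destroy_comp and the refcount handling are approximations:

  /*
   * Sketch of the pre-59c68ac31e15 destroy path (approximate): the last
   * deref signals a completion, and iw_destroy_cm_id() blocks on it before
   * freeing, so cm_work_handler() can no longer be running the handler once
   * iw_destroy_cm_id() returns.
   */
  static void iwcm_deref_id(struct iwcm_id_private *cm_id_priv)
  {
          if (refcount_dec_and_test(&cm_id_priv->refcount))
                  complete(&cm_id_priv->destroy_comp);    /* wake the destroyer */
  }

  void iw_destroy_cm_id(struct iw_cm_id *cm_id)
  {
          struct iwcm_id_private *cm_id_priv =
                  container_of(cm_id, struct iwcm_id_private, id);

          destroy_cm_id(cm_id);                           /* drop our reference */
          wait_for_completion(&cm_id_priv->destroy_comp); /* wait for last deref */
          free_cm_id(cm_id_priv);                         /* only now free it */
  }

With the commit applied, the final free instead happens from the last deref
itself, so iw_destroy_cm_id() can return while cm_work_handler() is still
delivering an event to cma_iw_handler().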


