[PATCH v1 2/2] nvme-fc: fix race with connectivity loss with nvme_fc_create_association

Wed Jun 26 02:26:53 PDT 2024

On Wed, Jun 26, 2024 at 11:28:33AM GMT, Sagi Grimberg wrote:
> > +static void nvme_fc_defer_reset_work(struct work_struct *work)
> > +{
> > +	struct nvme_fc_ctrl *ctrl =
> > +		container_of(work, struct nvme_fc_ctrl, fc_reset_work);
> > +
> > +	nvme_reset_ctrl(&ctrl->ctrl);
> > +}
> 
> I'm not entirely sure I understand what you're trying to solve here,
> but scheduling a work that in turn schedules a work looks bogus.

nvme_fc_ctrl_connectivity_loss is called from an interrupt context. If we
are in the connecting state, we know that the connect work is still
executing. I figured we should wait until it has finished (see next
paragraph) and then issue a reset. If nvme_fc_ctrl_connectivity_loss
were called from normal context, a flush_delayed_work would do the
trick. The nvme_fc_defer_reset_work is queued/executed on the normal
nvme_wq, thus the connecting work has finished by the time it runs.
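
To make the intended ordering concrete, here is a rough userspace model
of the idea, not driver code. Every name in it is hypothetical, and it
assumes a strictly FIFO queue that runs one item at a time, which is a
simplification and not a statement about how nvme_wq actually behaves.
The caller standing in for the interrupt handler never waits; it only
queues the deferred reset behind the pending connect work.

```c
#include <assert.h>
#include <string.h>

/* Userspace model only: every name here is hypothetical, not a kernel API. */
enum model_state { MODEL_CONNECTING, MODEL_LIVE, MODEL_RESETTING };

struct model_ctrl {
	enum model_state state;
	char log[64];		/* order in which the work items ran */
};

typedef void (*model_work_fn)(struct model_ctrl *ctrl);

/* Assumption: a strictly FIFO queue that runs one item at a time. */
struct model_wq {
	model_work_fn fn[8];
	int head, tail;
};

static void model_queue_work(struct model_wq *wq, model_work_fn fn)
{
	assert(wq->tail < 8);
	wq->fn[wq->tail++] = fn;
}

static void model_run_all(struct model_wq *wq, struct model_ctrl *ctrl)
{
	while (wq->head < wq->tail)
		wq->fn[wq->head++](ctrl);
}

static void model_connect_work(struct model_ctrl *ctrl)
{
	strcat(ctrl->log, "connect;");
	ctrl->state = MODEL_LIVE;
}

static void model_defer_reset_work(struct model_ctrl *ctrl)
{
	strcat(ctrl->log, "reset;");
	ctrl->state = MODEL_RESETTING;
}

/* Stands in for the interrupt-context caller: it may queue, never wait. */
static void model_connectivity_loss(struct model_wq *wq,
				    struct model_ctrl *ctrl)
{
	if (ctrl->state == MODEL_CONNECTING)
		model_queue_work(wq, model_defer_reset_work);
}

/* Connect is pending, connectivity is lost, then the queue drains. */
void model_scenario(struct model_ctrl *ctrl)
{
	struct model_wq wq = { .head = 0, .tail = 0 };

	ctrl->state = MODEL_CONNECTING;
	ctrl->log[0] = '\0';
	model_queue_work(&wq, model_connect_work);
	model_connectivity_loss(&wq, ctrl);
	model_run_all(&wq, ctrl);
}
```

Calling model_scenario() leaves the log as "connect;reset;", i.e. the
reset only runs once the connect work has completed. Whether the real
workqueue gives the same guarantee depends on its ordering properties,
which this model simply assumes.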

The other idea I had was to issue the reset immediately and handle the
fallout in the connect code path. But it turns out the connect path
expects some errors and deliberately ignores them, and thus still
reaches the connected state.

I agree, it's dodgy, but I haven't found a better solution so far. Any
good ideas on how to solve this are highly appreciated.