NVMe fabric multipathing
Christoph Hellwig
hch at infradead.org
Tue Oct 10 00:49:23 PDT 2023
On Fri, Oct 06, 2023 at 09:43:58AM +0100, Mark Syms wrote:
> > > Made unavailable for any reason. So failed switch port, failed HBA,
> > > failed SAN controller, etc. We've been "emulating" this for testing
> > > purposes by unbinding the PCI device for the HBA on the NVMe target
> > > device but I expect the same happens for any of those reasons.
> >
> > If you unbind the device there is no way NVMe can keep any knowledge
> > about the connections it has. If you have a real path failure on
> > the underlying fabric, the controller stays around as long as it
> > hasn't given up on reconnects. You can set the max_reconnects value
> > to -1 to never stop reconnecting.
>
> That was an unbind on the remote system, not on the system we were
> pulling status from. From the POV of the client host, that should be
> essentially the same as dropping the switch port (or indeed a
> controller in an HA pair dropping offline), should it not? We can
> certainly try dropping a switch port via SNMP and see if the
> behaviour is different.
So how does your controller on the host disappear? It should be trying
to reconnect, and if you set the max_reconnects value to -1 it should
never go away.
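
For reference, max_reconnects is not set directly on the command line;
nvme-cli derives it from ctrl_loss_tmo and reconnect_delay, and a
negative ctrl_loss_tmo maps to max_reconnects = -1. A sketch, with
placeholder transport details (traddr, trsvcid and NQN are made up):

    nvme connect --transport=tcp --traddr=192.168.0.10 --trsvcid=4420 \
        --nqn=nqn.2023-10.io.example:subsys1 \
        --ctrl-loss-tmo=-1 --reconnect-delay=10

Here --reconnect-delay controls how many seconds pass between reconnect
attempts while the path is down; with ctrl_loss_tmo=-1 those attempts
never stop.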
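
And for what it's worth, the unbind test described above is just the
generic sysfs driver unbind. A sketch, run on the target system, with
0000:3b:00.0 standing in for the HBA's actual PCI address:

    echo 0000:3b:00.0 > /sys/bus/pci/devices/0000:3b:00.0/driver/unbind

From the host's point of view the transport simply goes silent, which
is why the question of what the host-side controller does next (keep
reconnecting, or give up) is the interesting one.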