NVMe fabric multipathing

Mark Syms mark.syms at cloud.com
Fri Oct 6 01:43:58 PDT 2023


On Fri, 6 Oct 2023 at 08:46, Christoph Hellwig <hch at infradead.org> wrote:
>
> On Thu, Oct 05, 2023 at 11:16:41AM +0100, Mark Syms wrote:
> > > > We have a requirement to report information about the status of NVMe
> > > > multipath controllers on an NVMe over fabrics deployment. In
> > > > particular the total number of "known" paths and the number of
> > > > "currently active" paths. As far as we can see right now, when a path
> > > > to an NVMe target device, when using Fibre Channel, is blocked, all
> > > > local information about the affected controller is flushed leaving
> > > > just the currently active controller(s) present.
> > >
> > > What do you mean with "blocked"?
> > >
> > Made unavailable for any reason. So failed switch port, failed HBA,
> > failed SAN controller, etc. We've been "emulating" this for testing
> > purposes by unbinding the PCI device for the HBA on the NVMe target
> > device but I expect the same happens for any of those reasons.
>
> If you unbind the device there is no way NVMe can keep any knowledge
> about the connections it has.  If you have a real path failure on the
> underlying fabric, the controller is kept around as long as it hasn't
> given up on reconnects.  You can set the max_reconnects value to -1 to
> never stop reconnecting.
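
(Aside: if I am reading the fabrics code right, the user-facing knob for
that is ctrl_loss_tmo; nvme-cli's --ctrl-loss-tmo=-1 is what ends up as
max_reconnects=-1.  A rough sketch of what I mean, with placeholder FC
addresses and NQN, not taken from our actual setup:)

    # Sketch only: request an indefinite reconnect policy at connect time.
    # --ctrl-loss-tmo=-1 maps to max_reconnects=-1 in the fabrics layer.
    # The transport addresses and NQN below are placeholders.
    import subprocess

    subprocess.run(
        [
            "nvme", "connect",
            "--transport", "fc",
            "--traddr", "nn-0x2000000000000001:pn-0x1000000000000001",      # placeholder target port
            "--host-traddr", "nn-0x2000000000000002:pn-0x1000000000000002", # placeholder host port
            "--nqn", "nqn.2023-10.com.example:subsys1",                     # placeholder subsystem NQN
            "--ctrl-loss-tmo", "-1",  # negative value: never give up reconnecting
        ],
        check=True,
    )

With that in place the failed controller should presumably sit in the
connecting state indefinitely instead of being torn down, which is the
behaviour we want to be able to observe from the reporting side.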

That was an unbind on the remote system, not on the system we were
pulling status from. From the point of view of the client host, that
should be essentially the same as dropping the switch port (or indeed a
controller in an HA pair going offline), should it not? We can
certainly try dropping a switch port via SNMP and see whether the
behaviour is different.
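
For concreteness, the kind of report we are after is roughly the
following sketch.  It just walks the standard sysfs layout (one
controller entry with a "state" attribute per path under each
/sys/class/nvme-subsystem/nvme-subsys* directory) and is illustrative
only, not a proposed interface:

    # Sketch: count known vs. live paths per NVMe subsystem from sysfs.
    import glob
    import os

    for subsys in sorted(glob.glob("/sys/class/nvme-subsystem/nvme-subsys*")):
        states = {}
        for entry in sorted(glob.glob(os.path.join(subsys, "nvme*"))):
            state_attr = os.path.join(entry, "state")
            if not os.path.isfile(state_attr):
                continue  # namespace heads have no "state" attribute; controllers do
            with open(state_attr) as f:
                states[os.path.basename(entry)] = f.read().strip()
        live = sum(1 for s in states.values() if s == "live")
        print(f"{os.path.basename(subsys)}: {live} live of {len(states)} known paths")
        for name, state in sorted(states.items()):
            print(f"    {name}: {state}")

This can of course only count paths the kernel still has a record of,
which is exactly why we care whether a failed controller is retained
(reconnecting) or torn down.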


