[PATCH 2/2] nvme_fc: add uevent for auto-connect
James Smart
jsmart2021 at gmail.com
Mon May 8 11:17:19 PDT 2017
On 5/8/2017 4:17 AM, Hannes Reinecke wrote:
> On 05/06/2017 01:13 AM, jsmart2021 at gmail.com wrote:
>> From: James Smart <jsmart2021 at gmail.com>
>>
>> To support auto-connecting to FC-NVME devices upon their dynamic
>> appearance, add a uevent that can kick off connection scripts.
>> uevent is posted against the nvme_fc transport device.
>>
> I'm not sure if that will work for the proposed multipath extensions for
> NVMe.
I don't know why this should conflict with multipath.
>
> From my understanding NVMe will drop all queues and do a full
> reconnection upon failure, correct?
Sure... but these are all actions "below the nvme device".

The nvme storage device presented to the OS, the namespace, issues I/Os
to the controller's block queues. When a controller errors or is lost,
the nvme fabrics level (currently the transport) stops the controller's
block queues, and all outstanding requests are terminated and returned
to the block layer for requeuing. The transport initially tries to
reconnect, and if the reconnect fails, a timer is started to retry the
reconnect in a few seconds. This repeats until a controller time limit
(ctrl_loss_tmo) expires, at which point the controller's block queues
are torn down. FC differs from RDMA in two ways: it won't try to
reconnect if there is no connectivity, and it sets the controller time
limit to the smaller of the SCSI FC transport dev_loss_tmo (passed to
it via the driver) and the ctrl_loss_tmo requested on the connection.

So, while the reconnect is pending, the block queues are stopped and
idle. If the transport successfully completes a reconnect before the
time limit expires, the controller's block queues are released and I/O
starts again.
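
A minimal sketch of that retry behavior, in kernel style; the structure
and names here (my_ctrl, my_attempt_reconnect, and so on) are my own
illustrative placeholders and stubs, not the actual nvme-fc symbols:

/*
 * Hypothetical sketch of the reconnect retry loop described above.
 * Names and fields are illustrative, not the real nvme-fc code.
 */
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/types.h>
#include <linux/errno.h>

struct my_ctrl {
	struct delayed_work	connect_work;
	unsigned long		deadline;	 /* jiffies when ctrl_loss_tmo expires */
	unsigned int		reconnect_delay; /* seconds between attempts */
	bool			port_connected;	 /* FC: is the target port still seen? */
};

/* stub: would create the admin/io associations to the target */
static int my_attempt_reconnect(struct my_ctrl *ctrl)
{
	return -ENODEV;
}

/* stub: would tear down the controller's block queues */
static void my_teardown_ctrl(struct my_ctrl *ctrl)
{
}

static void my_connect_work(struct work_struct *work)
{
	struct my_ctrl *ctrl =
		container_of(to_delayed_work(work), struct my_ctrl, connect_work);

	/* FC-specific: don't retry while there is no connectivity */
	if (!ctrl->port_connected)
		return;		/* a later "device appeared" event re-kicks us */

	if (!my_attempt_reconnect(ctrl))
		return;		/* success: blk queues released, I/O restarts */

	if (time_after(jiffies, ctrl->deadline)) {
		my_teardown_ctrl(ctrl);	/* ctrl_loss_tmo exceeded */
		return;
	}

	/* otherwise retry again in a few seconds */
	schedule_delayed_work(&ctrl->connect_work, ctrl->reconnect_delay * HZ);
}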
This patch changes nothing in that behavior - it only keys the FC
reconnect attempt to device appearance (note: if the FC port is
connected, the same timers used by RDMA still apply on FC).
The patch does add one other thing, though. If the time limit did
expire and the controller was torn down, a new create_controller
request has to be made in order to "get the path back". For FC, the
patch keys this to device appearance as well, so it's automated. This
is likely different from RDMA, where a system script/daemon has to
periodically retry the connect (the device was there, so keep trying to
see if it comes back), or some administrative action is needed to
create the controller.
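
As a rough illustration of the mechanism, the kernel side amounts to
posting a uevent against the nvme_fc transport device carrying the FC
addresses of the host and target ports. The helper and environment
variable names below are my own placeholders, not necessarily what the
patch uses:

/*
 * Hypothetical sketch of posting a "discovery controller appeared"
 * uevent against the nvme_fc transport device.
 */
#include <linux/kobject.h>
#include <linux/device.h>
#include <linux/kernel.h>

static void my_fc_signal_discovery(struct device *tdev,
				   u64 local_wwnn, u64 local_wwpn,
				   u64 remote_wwnn, u64 remote_wwpn)
{
	char *envp[4];
	char host_addr[64], tgt_addr[64];

	snprintf(host_addr, sizeof(host_addr),
		 "MY_HOST_TRADDR=nn-0x%016llx:pn-0x%016llx",
		 local_wwnn, local_wwpn);
	snprintf(tgt_addr, sizeof(tgt_addr),
		 "MY_TRADDR=nn-0x%016llx:pn-0x%016llx",
		 remote_wwnn, remote_wwpn);

	envp[0] = "MY_EVENT=discovery";
	envp[1] = host_addr;
	envp[2] = tgt_addr;
	envp[3] = NULL;

	/* udev sees a "change" event on the nvme_fc device with these vars */
	kobject_uevent_env(&tdev->kobj, KOBJ_CHANGE, envp);
}

On the userspace side, a udev rule keyed on those variables would simply
hand the announced target port to a connection script (for example one
that invokes the nvme cli "connect-all" against that port).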
>
> So if there is a multipath failure NVMe will have to drop all failed
> paths and reconnect.
> Which means that if we have an all paths down scenario _all_ paths are
> down, and need to be reconnect.
> Consequently the root-fs becomes inaccessible for a brief period of
> time, and relying on things like udev to do a reconnect will probably
> not work.
As for multipathing:
1) if md-like multipathing is done, I believe it means there are
separate nvme storage devices (each a namespace on top of a
controller). Thus each device is a "path". Paths would be added with
the appearance of a new nvme storage device, and when the device is
torn down, the path would go away. I assume multipathing would also
become aware of when the device is "stopped/blocked" due to its
controller queues being stopped.
2) if a lighter-weight multipathing is done, say within the nvme layer,
rescanning of the nvme namespaces would pair each namespace up with the
nvme storage device, so each set of controller block queues would be
the "path". Thus, when a controller's queues are "stopped/blocked", the
nvme device knows and stops using that path, and when they go away, the
"path" is removed.
We could talk further about options for when the last path is gone,
but... back to this patch - you'll note nothing in this section has
anything to do with the patch. The patch changes nothing in the overall
nvme device or controller behaviors. The only thing the patch does is
specific to the bottom levels of the FC transport - keying reconnects
and/or new device scanning to FC target device connectivity
announcements.
> Also, any other driver (with the notable exception of S/390 ones)(ok,
> and iSCSI) does an autoconnect.
> And went into _soo_ many configuration problems due to that fact.
> zfcp finally caved in and moved to autoconnect, too, precisely to avoid
> all these configuration issues.
>
> So what makes NVMe special that it cannot do autoconnect within the driver?
Well, this gets back to the response I just sent to Johannes. NVMe
discovery requires connecting to a discovery controller and reading its
discovery log records (sounds similar to iSCSI and iSNS), and then,
from the discovery log records, connecting to the nvme subsystems,
which results in the nvme controllers. This functionality is currently
not in the kernel; it lives in the nvme cli as the "connect-all"
functionality when talking to a discovery controller. For FC, as it
knows the presence of discovery controllers on the FC fabric, this
patch is how we're invoking that functionality. Note: there's nothing
in FC that would provide the content of the discovery log records
itself, so it can't skip the discovery controller connection. For RDMA,
its lack of connectivity knowledge prevents it from doing an
auto-connect once it has torn down the controller after ctrl_loss_tmo.
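
To make that concrete, here is a rough userspace sketch of what
"connect-all" boils down to, under the assumption that connects are
driven by writing option strings to /dev/nvme-fabrics (as the nvme cli
does); the pared-down record layout and option keys are simplifications
on my part, not the real nvme-cli code:

/*
 * Userspace sketch: for each discovery log entry, write a connect
 * string to the fabrics device so the kernel creates the controller.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

struct my_disc_entry {		/* pared-down discovery log record */
	char subnqn[224];	/* subsystem NQN to connect to */
	char traddr[128];	/* e.g. "nn-0x...:pn-0x..." for FC */
};

static int my_connect_one(const struct my_disc_entry *e,
			  const char *host_traddr)
{
	char opts[512];
	ssize_t ret;
	int fd;

	snprintf(opts, sizeof(opts),
		 "nqn=%s,transport=fc,traddr=%s,host_traddr=%s",
		 e->subnqn, e->traddr, host_traddr);

	fd = open("/dev/nvme-fabrics", O_RDWR);
	if (fd < 0)
		return -1;

	ret = write(fd, opts, strlen(opts));	/* kernel creates the controller */
	close(fd);
	return ret < 0 ? -1 : 0;
}

The real tool first connects to the discovery controller, fetches the
discovery log page, and then loops over its entries doing something
like the above for each one.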
-- james