[PATCH 0/8] nvme_fc: add dev_loss_tmo support
James Smart
jsmart2021 at gmail.com
Tue May 23 10:01:00 PDT 2017
On 5/23/2017 12:20 AM, Christoph Hellwig wrote:
> On Sat, May 13, 2017 at 12:07:14PM -0700, James Smart wrote:
>> As the fabrics implementation already has a similar behavior,
>> introduced on rdma, via ctrl_loss_tmo, which may be set on a
>> per-controller basis (finer granularity than the FC port used for
>> the connection), the nvme_fc transport will mediate and choose the
>> lesser of the controller's value and the remoteport's value.
>
> I would much prefer if nvme-fc could stick to the same controller
> concept as rdma. Especially as it needs to be synchronized with
> the reconnect delay and the keep alive timeout (and we need to
> do a better job of synchronizing the latter to start with, I think).
I'm not sure which controller concept you think this isn't staying in
sync with. ctrl_loss_tmo is retained, but it is augmented by having to
deal with a node-level timeout that exists on FC, which rdma doesn't
have. reconnect_delay is still used, but it is disabled when there's no
connectivity, as there's no point in retrying a connect while the
remote port is unreachable. The only other semantic nvme-fc handles
differently (and it's somewhat a different topic) is on controller
resets: don't tear down the controller if the 1st reconnect attempt
fails; instead, wait at least the duration of ctrl_loss_tmo, using
reconnect_delay between attempts while connected.
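To make that concrete, here's a rough sketch of the decision logic run
when a (re)connect attempt fails; the helper names
(ctrl_loss_tmo_expired(), remoteport_connected(), connect_work) are
illustrative placeholders, not the exact code in the patches:

static void nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl)
{
	/* Tear down only once the merged ctrl_loss_tmo window expires. */
	if (ctrl_loss_tmo_expired(ctrl)) {
		nvme_delete_ctrl(&ctrl->ctrl);
		return;
	}

	/*
	 * Retry after reconnect_delay only while the remote port has
	 * connectivity; with no connectivity a retry can't succeed, so
	 * sit idle until a connectivity event reschedules the work.
	 */
	if (remoteport_connected(ctrl->rport))
		queue_delayed_work(nvme_wq, &ctrl->connect_work,
				   ctrl->ctrl.opts->reconnect_delay * HZ);
}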
If the main issue is that you don't want a 2nd timeout value merged
with the connect request's ctrl_loss_tmo value, then we need to work
out a solution. I don't want to make admins learn a new way to set
per-node timeout values, which should apply to SCSI and NVME equally
and can be dynamic. Right now they come from the SCSI FC transport, and
there's a lot of admin and infrastructure built around managing them in
that area; I don't believe you can just ignore it. So the simple
choice, which is what was proposed, was to merge them in the transport.
The result is still ctrl_loss_tmo with reconnect_delay, but nvme-fc:
a) lowers the value from what the connect request specified if the
node-level value is smaller; and b) changes it dynamically if the
node-level value changes.
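In code terms the merge is nothing more than taking the minimum; a
minimal sketch (the helper name is hypothetical):

#include <linux/kernel.h>	/* min() */

/*
 * Effective loss window for a controller: the lesser of the connect
 * request's ctrl_loss_tmo and the FC remote port's node-level
 * dev_loss_tmo. Recomputing this whenever the admin changes the
 * node-level value is what makes the merge dynamic.
 */
static unsigned int nvme_fc_effective_loss_tmo(unsigned int ctrl_loss_tmo,
					       unsigned int dev_loss_tmo)
{
	return min(ctrl_loss_tmo, dev_loss_tmo);
}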
One thing I can propose, if we're using uevents to trigger connect
requests, is to have the uevents specify the ctrl_loss_tmo value to use
for the connect request, based on the node-level devloss value. This
would keep all the timeout values coming in via the connect request. I
dislike pushing this much information through udev to a cli and back,
but it would work. It still won't deal with dynamic updates, so some
thought is needed to address that aspect.
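For illustration only, a kernel-side sketch of such a uevent; the event
and variable names (FC_EVENT, NVMEFC_TRADDR, NVMEFC_CTRL_LOSS_TMO) are
made up for the example:

#include <linux/device.h>
#include <linux/kobject.h>

/* Sketch: emit a connectivity uevent carrying the node-level value. */
static void nvme_fc_signal_connectivity(struct device *dev,
					const char *traddr,
					unsigned int dev_loss_tmo)
{
	char traddr_env[80], tmo_env[40];
	char *envp[] = { "FC_EVENT=nvmediscovery", traddr_env, tmo_env,
			 NULL };

	snprintf(traddr_env, sizeof(traddr_env), "NVMEFC_TRADDR=%s",
		 traddr);
	snprintf(tmo_env, sizeof(tmo_env), "NVMEFC_CTRL_LOSS_TMO=%u",
		 dev_loss_tmo);
	kobject_uevent_env(&dev->kobj, KOBJ_CHANGE, envp);
}

A udev rule matching FC_EVENT could then invoke an nvme-cli connect
with --ctrl-loss-tmo taken from NVMEFC_CTRL_LOSS_TMO.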
Thoughts?
-- james