Reconnect on RDMA device reset
Chuck Lever
chuck.lever at oracle.com
Thu Jan 25 11:06:17 PST 2018
> On Jan 25, 2018, at 10:13 AM, Doug Ledford <dledford at redhat.com> wrote:
>
> On Wed, 2018-01-24 at 22:52 +0200, Sagi Grimberg wrote:
>>>>> Today host and target stacks will respond to RDMA device reset (or
>>>>> plug out and plug in) by cleaning all resources related to that
>>>>> device, and sitting idle waiting for administrator intervention to
>>>>> reconnect (host stack) or rebind the subsystem to a port (target
>>>>> stack).
>>>>>
>>>>> I'm thinking that maybe the right behaviour should be to try and
>>>>> restore everything as soon as the device becomes available again. I
>>>>> don't think a device reset should look any different to users than a
>>>>> port going down and coming back up again.
>>>>
>>>>
>>>> Hmm, not sure I fully agree here. In my mind device removal means the
>>>> device is going away, which means there is no point in keeping the
>>>> controller around...
>>>
>>> The same could be said about a port going down. You don't know if it
>>> will come back up connected to the same network...
>>
>> That's true. However in my mind port events are considered transient,
>> and we do give up at some point. I'm simply arguing that device removal
>> has different semantics. I don't argue that we need to support it.
>
> I think it depends on how you view yourself (meaning the target or
> initiator stacks). It's my understanding that if device eth0
> disappeared completely, and then device eth1 was plugged in and eth1
> got the same IP address as eth0, then as long as any TCP sockets hadn't
> gone into a reset state, the iSCSI devices across the existing connection
> would simply keep working. This is correct, yes?
For NFS/RDMA, I think of the "failover" case where a device is
removed, then a new one is plugged in (or an existing cold
replacement is made available) with the same IP configuration.
On a "hard" NFS mount, we want the upper layers to wait for
a new suitable device to be made available, and then to use
it to resend any pending RPCs. The workload should continue
after a new device is available.
Feel free to tell me I'm full of turtles.
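To make that a little more concrete, here is roughly the shape of it on
the ULP's side, in the transport's rdma_cm event handler. This is an
illustrative sketch only, not the actual xprtrdma code; the my_xprt_*
names are invented placeholders for whatever disconnect and reconnect
machinery the transport already has:

#include <rdma/rdma_cm.h>

struct my_xprt;                                 /* hypothetical transport  */
void my_xprt_disconnect(struct my_xprt *xprt);          /* placeholders,   */
void my_xprt_schedule_reconnect(struct my_xprt *xprt);  /* not real APIs   */

static int my_cm_event_handler(struct rdma_cm_id *id,
                               struct rdma_cm_event *event)
{
        struct my_xprt *xprt = id->context;

        switch (event->event) {
        case RDMA_CM_EVENT_DISCONNECTED:
                /* Transient loss: tear down the QP, park pending RPCs,
                 * and let the normal reconnect logic retry. */
                my_xprt_disconnect(xprt);
                my_xprt_schedule_reconnect(xprt);
                return 0;
        case RDMA_CM_EVENT_DEVICE_REMOVAL:
                /* Treat removal the same way: park the pending RPCs and
                 * keep retrying address/route resolution until a device
                 * that can reach the server's address shows up again.
                 * Returning non-zero lets the CM destroy this cm_id;
                 * the reconnect path creates a fresh one. */
                my_xprt_disconnect(xprt);
                my_xprt_schedule_reconnect(xprt);
                return 1;
        default:
                return 0;
        }
}

The point of the sketch is just that the DEVICE_REMOVAL arm schedules a
reconnect instead of tearing the transport down for good.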
> If so, then maybe you
> want iSER at least to operate the same way. The problem, of course, is
> that iSER may use the IP address and ports for the connection, but then
> it transitions to queue pairs for data transfer. Because iSER does that,
> it is sitting at the same level as, say, the net core that *did* know
> about the eth change in the above example and transitioned the TCP
> socket from the old device to the new. That means iSER now has to take
> that same responsibility on itself if it wants the user-visible
> behavior of iSER devices to be the same as that of iSCSI devices. And
> that would be true even if the old RDMA device went away and a new RDMA
> device came up with the old IP address, so the less drastic form of
> bouncing the existing device should certainly fall under the same
> umbrella.
>
> I *think* for SRP this is already the case. The SRP target uses the
> kernel LIO framework, so if you bounce the device under the SRPt layer,
> doesn't the config get preserved? So that when the device comes back up,
> the LIO configuration would still be there and the SRPt driver would see
> it? Bart?
>
> For the SRP client, I'm almost certain it will try to reconnect since it
> uses a user space daemon with a shell script that restarts the daemon on
> various events. That might have changed... didn't we just take a patch
> to rdma-core to drop the shell script? It might not reconnect
> automatically with the latest rdma-core; I'd have to check. Bart should
> know though...
>
> I haven't the faintest clue about NVMe over Fabrics, though. But, again,
> I think that's up to you guys to decide what semantics you want. With
> iSER it's a little easier since you can use the TCP semantics as a
> guideline and you have IP/port discovery, so it doesn't even have to
> be the same controller that comes back. With SRP it must be the same
> controller that comes back or else your login information will be all
> wrong (well, we did just take RDMA_CM support patches for SRP that will
> allow IP/port addressing instead, so theoretically it could now do the
> same thing if you are using RDMA_CM mode logins). I don't know the
> details of the NVMe addressing though.
>
>>>> AFAIK device resets are usually expected to quiesce inflight I/O,
>>>> clean up resources, and restore when the reset sequence completes
>>>> (which is what we do in nvme controller resets).
>
> I think your perspective here might be a bit skewed by the way the NVMe
> stack is implemented (which was intentional for speed, as I understand
> it). As a contrasting example, in the SCSI stack, when the LLD does a
> SCSI host reset, it resets the host but does not restore or restart any
> commands that were aborted. It is up to the upper layer SCSI drivers to
> do so (if they choose, they might send them back to the block layer).
> From the way you wrote the above, it sounds like the NVMe layer is almost
> monolithic in nature, with no separation between an upper level consumer
> layer and a lower level driver layer, so you can reset/restart it all
> internally. I would argue that's rare in the Linux kernel; in most
> places the low level driver resets, and some other upper layer has to
> restart things if it wants to, or error out if it doesn't.
>
>>>> I'm not sure I understand why
>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>> rdma_cm, or via .remove_one in the ib_client API). I think the correct
>>>> interface would be suspend/resume semantics for RDMA device resets
>>>> (similar to the pm interface).
>
> No, we can't do this. Suspend/Resume is not the right model for an RDMA
> device reset. An RDMA device reset is a hard action that stops all
> ongoing DMA regardless of its source. Those sources include kernel
> layer consumers, user space consumers acting without the kernel's direct
> intervention, and ongoing DMA with remote RDMA peers (which will throw
> the remote queue pairs into an error state almost immediately). In the
> future it very likely could include RDMA between things like GPU offload
> processors too. We can't restart that stuff even if we wanted to. So
> suspend/resume semantics for an RDMA device-level reset is a non-
> starter.
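For reference, the ib_client side of the interface Sagi mentions looks
more or less like this from a ULP's point of view (a minimal sketch;
"my_ulp" and its callbacks are invented, not from any in-tree driver,
and the exact callback signatures vary a bit by kernel version):

#include <rdma/ib_verbs.h>

static void my_ulp_add_one(struct ib_device *device)
{
        /* A device (re)appeared: set up per-device state and, if we
         * were waiting for a device like this, kick off reconnects. */
}

static void my_ulp_remove_one(struct ib_device *device, void *client_data)
{
        /* The device is going away (reset, hot unplug, driver unload):
         * drain and free everything tied to it.  Today this is where
         * ULPs give up; the proposal amounts to remembering enough
         * state here to resume in .add when the device comes back. */
}

static struct ib_client my_ulp_client = {
        .name   = "my_ulp",
        .add    = my_ulp_add_one,
        .remove = my_ulp_remove_one,
};

/* Registered once, typically from module_init():
 *      ib_register_client(&my_ulp_client);
 * and ib_unregister_client(&my_ulp_client) on module exit.
 */

As discussed above, a device reset shows up here as a .remove followed
by an .add, which is why it looks the same to the ULP as a hot
unplug/replug.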
>
>>>> I think that it would make a much cleaner semantics and ULPs should be
>>>> able to understand exactly what to do (which is what you suggested
>>>> above).
>>>>
>>>> CCing linux-rdma.
>>>
>>> Maybe so. I don't know what the "standard" is here for Linux in general
>>> and networking devices in particular. Let's see if linux-rdma agrees.
>>
>> I would like to hear more opinions on the current interface.
>
> There is a difference between RDMA devices and other network devices.
> The net stack is much more like the SCSI stack in that you have an upper
> layer connection (socket or otherwise), a lower layer transport, and
> the net core code, which is free to move your upper layer abstraction
> from one lower layer transport to another. With the RDMA subsystem,
> your upper layer is connecting directly into the low level hardware. If
> you want a semantic that includes reconnection on an event, then it has
> to be handled in your upper layer as there is no intervening middle
> layer to abstract out the task of moving your connection from one low
> level device to another (that's not to say we couldn't create one, and
> several actually already exist, like SMC-R and RDS, but direct hooks
> into the core ib stack are not abstracted out and you are talking
> directly to the hardware). And if you want to support moving your
> connection from an old removed device to a new replacement device that
> is not simply the same physical device being plugged back in, then you
> need an addressing scheme that doesn't rely on the link layer hardware
> address of the device.
>
>>>> Regardless of ib_client vs. rdma_cm, we can't simply perform normal
>>>> reconnects, because we have DMA mappings we need to unmap for each
>>>> request in the tagset, which we don't tear down on every reconnect (as
>>>> we may have inflight I/O). We could theoretically use reinit_tagset
>>>> to do that, though.
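For what it's worth, the per-request piece of that teardown is just the
usual ib_dma unmap; the open question is only where the walk over the
tagset hooks in (reinit_tagset or otherwise). A sketch, with an invented
per-request pdu (the struct and its fields are not from any real driver):

#include <linux/blk-mq.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

struct my_rdma_request {                 /* invented per-request pdu        */
        struct ib_device        *dev;    /* device the sg was mapped on     */
        struct scatterlist      *sg;
        int                      nents;
        enum dma_data_direction  dir;
};

/*
 * Called for each request in the tagset during teardown: drop the DMA
 * mapping that was set up against the old ib_device so a fresh mapping
 * can be created against the replacement device on reconnect.
 */
static void my_rdma_unmap_request(struct request *rq)
{
        struct my_rdma_request *req = blk_mq_rq_to_pdu(rq);

        if (req->nents) {
                ib_dma_unmap_sg(req->dev, req->sg, req->nents, req->dir);
                req->nents = 0;
        }
}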
>>>
>>> Obviously it isn't that simple... Just trying to agree on the right direction
>>> to go.
>>
>> Yeah, I agree. It shouldn't be too hard, either.
>>
>>>>> In the reconnect flow the stack already repeats creating the cm_id
>>>>> and resolving the address and route, so when the RDMA device comes
>>>>> back up, and assuming it is configured with the same address and
>>>>> connected to the same network (as is the case in a device reset),
>>>>> connections will be restored automatically.
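That re-resolution is really the key, and it also speaks to Doug's point
about addressing: nothing in the reconnect path names a specific
ib_device, only IP addresses, so a replacement device that owns the same
address gets picked up automatically. Roughly (a sketch with invented
names; the QP creation and connect steps are elided):

#include <linux/err.h>
#include <linux/socket.h>
#include <net/net_namespace.h>
#include <rdma/rdma_cm.h>

#define MY_CM_TIMEOUT_MS 3000                    /* arbitrary */

struct my_ctrl {                                 /* invented controller state */
        struct sockaddr_storage dst_addr;        /* target's IP address */
};

/* Defined elsewhere; drives the ADDR/ROUTE_RESOLVED steps. */
int my_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *ev);

static int my_reconnect(struct my_ctrl *ctrl)
{
        struct rdma_cm_id *id;

        id = rdma_create_id(&init_net, my_cm_handler, ctrl,
                            RDMA_PS_TCP, IB_QPT_RC);
        if (IS_ERR(id))
                return PTR_ERR(id);

        /*
         * Kicks off asynchronous resolution against whichever device
         * currently provides a route to dst_addr.  my_cm_handler() then
         * sees ADDR_RESOLVED, calls rdma_resolve_route(), and on
         * ROUTE_RESOLVED creates the QP and calls rdma_connect().
         */
        return rdma_resolve_addr(id, NULL,
                                 (struct sockaddr *)&ctrl->dst_addr,
                                 MY_CM_TIMEOUT_MS);
}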
>>>>
>>>>
>>>> As I said, I think that the problem is the interface of RDMA device
>>>> resets. IMO, device removal means we need to delete all the nvme
>>>> controllers associated with the device.
>>>
>>> Do you think all associated controllers should be deleted when a TCP socket
>>> gets disconnected in NVMe-over-TCP? Do they?
>>
>> Nope, but that is equivalent to a QP going into the error state IMO,
>> and we don't do that in nvme-rdma either.
>
> There is no equivalent in the TCP realm of an RDMA controller reset or
> an RDMA controller permanent removal event. When dealing with TCP, if
> the underlying ethernet device is reset, you *might* get a TCP socket
> reset, you might not. If the underlying ethernet is removed, you might
> get a socket reset, you might not, depending on how the route to the
> remote host is re-established. If all IP-capable devices in the entire
> system are removed, your TCP socket will get a reset, and attempts to
> reconnect will get an error.
>
> None of those sound semantically comparable to RDMA device
> unplug/replug. Again, that's just because the net core never percolates
> that up to the TCP layer.
>
> When you have a driver that has both TCP and RDMA transports, the truth
> is you are plugging into two very different levels of the kernel and the
> work you have to do to support one is very different from the other. I
> don't think it's worthwhile to even talk about trying to treat them
> equivalently unless you want to take on an addressing scheme and
> reset/restart capability in the RDMA side of things that you don't have
> to have in the TCP side of things.
>
> As a user of things like iSER/SRP/NVMe, I would personally like
> connections to persist across non-fatal events. But the RDMA stack, as
> it stands, can't reconnect things for you; you would have to do that in
> your own code.
>
> --
> Doug Ledford <dledford at redhat.com>
> GPG KeyID: B826A3330E572FDD
> Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
--
Chuck Lever