Reconnect on RDMA device reset
Chuck Lever
chuck.lever at oracle.com
Thu Jan 25 11:06:17 PST 2018
> On Jan 25, 2018, at 10:13 AM, Doug Ledford <dledford at redhat.com> wrote:
>
> On Wed, 2018-01-24 at 22:52 +0200, Sagi Grimberg wrote:
>>>>> Today host and target stacks will respond to RDMA device reset (or
>>>>> plug out and plug in) by cleaning all resources related to that
>>>>> device, and sitting idle waiting for administrator intervention to
>>>>> reconnect (host stack) or rebind the subsystem to a port (target
>>>>> stack).
>>>>>
>>>>> I'm thinking that maybe the right behaviour should be to try and
>>>>> restore everything as soon as the device becomes available again. I
>>>>> don't think a device reset should look any different to users than a
>>>>> port going down and coming back up again.
>>>>
>>>>
>>>> Hmm, not sure I fully agree here. In my mind device removal means the
>>>> device is going away, which means there is no point in keeping the
>>>> controller around...
>>>
>>> The same could be said about a port going down. You don't know if it
>>> will come back up connected to the same network...
>>
>> That's true. However in my mind port events are considered transient,
>> and we do give up at some point. I'm simply arguing that device removal
>> has different semantics. I don't argue that we need to support it.
>
> I think it depends on how you view yourself (meaning the target or
> initiator stacks). It's my understanding that if device eth0
> disappeared completely, and then device eth1 was plugged in and eth1
> got the same IP address as eth0, then as long as any TCP sockets hadn't
> gone into a reset state, the iSCSI devices across the existing connection
> would simply keep working. This is correct, yes?
For NFS/RDMA, I think of the "failover" case where a device is
removed, then a new one is plugged in (or an existing cold
replacement is made available) with the same IP configuration.
On a "hard" NFS mount, we want the upper layers to wait for
a new suitable device to be made available, and then to use
it to resend any pending RPCs. The workload should continue
after a new device is available.
Feel free to tell me I'm full of turtles.
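To make that a little more concrete, here is roughly the shape of it on
the ULP's side, in the transport's rdma_cm event handler. This is an
illustrative sketch only, not the actual xprtrdma code; the my_xprt_*
names are invented placeholders for whatever disconnect and reconnect
machinery the transport already has:

#include <rdma/rdma_cm.h>

struct my_xprt;                                 /* hypothetical transport  */
void my_xprt_disconnect(struct my_xprt *xprt);          /* placeholders,   */
void my_xprt_schedule_reconnect(struct my_xprt *xprt);  /* not real APIs   */

static int my_cm_event_handler(struct rdma_cm_id *id,
                               struct rdma_cm_event *event)
{
        struct my_xprt *xprt = id->context;

        switch (event->event) {
        case RDMA_CM_EVENT_DISCONNECTED:
                /* Transient loss: tear down the QP, park pending RPCs,
                 * and let the normal reconnect logic retry. */
                my_xprt_disconnect(xprt);
                my_xprt_schedule_reconnect(xprt);
                return 0;
        case RDMA_CM_EVENT_DEVICE_REMOVAL:
                /* Treat removal the same way: park the pending RPCs and
                 * keep retrying address/route resolution until a device
                 * that can reach the server's address shows up again.
                 * Returning non-zero lets the CM destroy this cm_id;
                 * the reconnect path creates a fresh one. */
                my_xprt_disconnect(xprt);
                my_xprt_schedule_reconnect(xprt);
                return 1;
        default:
                return 0;
        }
}

The point of the sketch is just that the DEVICE_REMOVAL arm schedules a
reconnect instead of tearing the transport down for good.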
> If so, then maybe you
> want iSER at least to operate the same way. The problem, of course, is
> that iSER may use the IP address and ports for the connection, but then
> it transitions to queue pairs for data transfer. Because iSER does that,
> it is sitting at the same level as, say, the net core that *did* know
> about the eth change in the above example and transitioned the TCP
> socket from the old device to the new. That means iSER now has to take
> that same responsibility on itself if it wants the user-visible
> behavior of iSER devices to be the same as that of iSCSI devices. And
> that would be true even if the old RDMA device went away and a new RDMA
> device came up with the old IP address, so the less drastic form of
> bouncing the existing device should certainly fall under the same
> umbrella.
>
> I *think* for SRP this is already the case. The SRP target uses the
> kernel LIO framework, so if you bounce the device under the SRPt layer,
> doesn't the config get preserved? So that when the device comes back up,
> the LIO configuration would still be there and the SRPt driver would see
> it? Bart?
>
> For the SRP client, I'm almost certain it will try to reconnect since it
> uses a user space daemon with a shell script that restarts the daemon on
> various events. That might have changed... didn't we just take a patch
> to rdma-core to drop the shell script? It might not reconnect
> automatically with the latest rdma-core; I'd have to check. Bart should
> know though...
>
> I haven't the faintest clue about NVMe over Fabrics, though. But, again,
> I think that's up to you guys to decide what semantics you want. With
> iSER it's a little easier since you can use the TCP semantics as a
> guideline and you have IP/port discovery, so it doesn't even have to
> be the same controller that comes back. With SRP it must be the same
> controller that comes back or else your login information will be all
> wrong (well, we did just take RDMA_CM support patches for SRP that will
> allow IP/port addressing instead, so theoretically it could now do the
> same thing if you are using RDMA_CM mode logins). I don't know the
> details of the NVMe addressing though.
>
>>>> AFAIK device resets are usually expected to quiesce inflight I/O,
>>>> clean up resources, and restore when the reset sequence completes
>>>> (which is what we do in nvme controller resets).
>
> I think your perspective here might be a bit skewed by the way the NVMe
> stack is implemented (which was intentional for speed, as I understand
> it). As a contrasting example, in the SCSI stack, when the LLD does a
> SCSI host reset, it resets the host but does not restore or restart any
> commands that were aborted. It is up to the upper layer SCSI drivers to
> do so (if they choose, they might send them back to the block layer).
> From the way you wrote the above, it sounds like the NVMe layer is almost
> monolithic in nature, with no separation between an upper level consumer
> layer and a lower level driver layer, so you can reset/restart it all
> internally. I would argue that's rare in the Linux kernel; in most
> places the low level driver resets, and some other upper layer has to
> restart things if it wants to, or error out if it doesn't.
>
>>>> I'm not sure I understand why
>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>> rdma_cm, or via .remove_one in the ib_client API). I think the correct
>>>> interface would be suspend/resume semantics for RDMA device resets
>>>> (similar to the pm interface).
>
> No, we can't do this. Suspend/Resume is not the right model for an RDMA
> device reset. An RDMA device reset is a hard action that stops all
> ongoing DMA regardless of its source. Those sources include kernel
> layer consumers, user space consumers acting without the kernel's direct
> intervention, and ongoing DMA with remote RDMA peers (which will throw
> the remote queue pairs into an error state almost immediately). In the
> future it very likely could include RDMA between things like GPU offload
> processors too. We can't restart that stuff even if we wanted to. So
> suspend/resume semantics for an RDMA device-level reset is a non-
> starter.
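For reference, the ib_client side of the interface Sagi mentions looks
more or less like this from a ULP's point of view (a minimal sketch;
"my_ulp" and its callbacks are invented, not from any in-tree driver,
and the exact callback signatures vary a bit by kernel version):

#include <rdma/ib_verbs.h>

static void my_ulp_add_one(struct ib_device *device)
{
        /* A device (re)appeared: set up per-device state and, if we
         * were waiting for a device like this, kick off reconnects. */
}

static void my_ulp_remove_one(struct ib_device *device, void *client_data)
{
        /* The device is going away (reset, hot unplug, driver unload):
         * drain and free everything tied to it.  Today this is where
         * ULPs give up; the proposal amounts to remembering enough
         * state here to resume in .add when the device comes back. */
}

static struct ib_client my_ulp_client = {
        .name   = "my_ulp",
        .add    = my_ulp_add_one,
        .remove = my_ulp_remove_one,
};

/* Registered once, typically from module_init():
 *      ib_register_client(&my_ulp_client);
 * and ib_unregister_client(&my_ulp_client) on module exit.
 */

As discussed above, a device reset shows up here as a .remove followed
by an .add, which is why it looks the same to the ULP as a hot
unplug/replug.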
>
>>>> I think that it would make a much cleaner semantics and ULPs should be
>>>> able to understand exactly what to do (which is what you suggested
>>>> above).
>>>>
>>>> CCing linux-rdma.
>>>
>>> Maybe so. I don't know what the "standard" is here for Linux in general
>>> and networking devices in particular. Let's see if linux-rdma agrees.
>>
>> I would like to hear more opinions on the current interface.
>
> There is a difference between RDMA devices and other network devices.
> The net stack is much more like the SCSI stack in that you have an upper
> layer connection (socket or otherwise), a lower layer transport, and
> the net core code, which is free to move your upper layer abstraction
> from one lower layer transport to another. With the RDMA subsystem,
> your upper layer is connecting directly into the low level hardware. If
> you want a semantic that includes reconnection on an event, then it has
> to be handled in your upper layer as there is no intervening middle
> layer to abstract out the task of moving your connection from one low
> level device to another (that's not to say we couldn't create one, and
> several actually already exist, like SMC-R and RDS, but direct hooks
> into the core ib stack are not abstracted out and you are talking
> directly to the hardware). And if you want to support moving your
> connection from an old removed device to a new replacement device that
> is not simply the same physical device being plugged back in, then you
> need an addressing scheme that doesn't rely on the link layer hardware
> address of the device.
>
>>>> Regardless of ib_client vs. rdma_cm, we can't simply perform normal
>>>> reconnects, because we have DMA mappings we need to unmap for each
>>>> request in the tagset, which we don't tear down on every reconnect (as
>>>> we may have inflight I/O). We could theoretically use reinit_tagset
>>>> to do that, though.
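For what it's worth, the per-request piece of that teardown is just the
usual ib_dma unmap; the open question is only where the walk over the
tagset hooks in (reinit_tagset or otherwise). A sketch, with an invented
per-request pdu (the struct and its fields are not from any real driver):

#include <linux/blk-mq.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

struct my_rdma_request {                 /* invented per-request pdu        */
        struct ib_device        *dev;    /* device the sg was mapped on     */
        struct scatterlist      *sg;
        int                      nents;
        enum dma_data_direction  dir;
};

/*
 * Called for each request in the tagset during teardown: drop the DMA
 * mapping that was set up against the old ib_device so a fresh mapping
 * can be created against the replacement device on reconnect.
 */
static void my_rdma_unmap_request(struct request *rq)
{
        struct my_rdma_request *req = blk_mq_rq_to_pdu(rq);

        if (req->nents) {
                ib_dma_unmap_sg(req->dev, req->sg, req->nents, req->dir);
                req->nents = 0;
        }
}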
>>>
>>> Obviously it isn't that simple... Just trying to agree on the right direction
>>> to go.
>>
>> Yeah, I agree. It shouldn't be too hard, either.
>>
>>>>> In the reconnect flow the stack already repeats creating the cm_id
>>>>> and resolving the address and route, so when the RDMA device comes
>>>>> back up, and assuming it is configured with the same address and
>>>>> connected to the same network (as is the case in a device reset),
>>>>> connections will be restored automatically.
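That re-resolution is really the key, and it also speaks to Doug's point
about addressing: nothing in the reconnect path names a specific
ib_device, only IP addresses, so a replacement device that owns the same
address gets picked up automatically. Roughly (a sketch with invented
names; the QP creation and connect steps are elided):

#include <linux/err.h>
#include <linux/socket.h>
#include <net/net_namespace.h>
#include <rdma/rdma_cm.h>

#define MY_CM_TIMEOUT_MS 3000                    /* arbitrary */

struct my_ctrl {                                 /* invented controller state */
        struct sockaddr_storage dst_addr;        /* target's IP address */
};

/* Defined elsewhere; drives the ADDR/ROUTE_RESOLVED steps. */
int my_cm_handler(struct rdma_cm_id *id, struct rdma_cm_event *ev);

static int my_reconnect(struct my_ctrl *ctrl)
{
        struct rdma_cm_id *id;

        id = rdma_create_id(&init_net, my_cm_handler, ctrl,
                            RDMA_PS_TCP, IB_QPT_RC);
        if (IS_ERR(id))
                return PTR_ERR(id);

        /*
         * Kicks off asynchronous resolution against whichever device
         * currently provides a route to dst_addr.  my_cm_handler() then
         * sees ADDR_RESOLVED, calls rdma_resolve_route(), and on
         * ROUTE_RESOLVED creates the QP and calls rdma_connect().
         */
        return rdma_resolve_addr(id, NULL,
                                 (struct sockaddr *)&ctrl->dst_addr,
                                 MY_CM_TIMEOUT_MS);
}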
>>>>
>>>>
>>>> As I said, I think that the problem is the interface of RDMA device
>>>> resets. IMO, device removal means we need to delete all the nvme
>>>> controllers associated with the device.
>>>
>>> Do you think all associated controllers should be deleted when a TCP socket
>>> gets disconnected in NVMe-over-TCP? Do they?
>>
>> Nope, but that is equivalent to a QP going into the error state IMO,
>> and we don't do that in nvme-rdma either.
>
> There is no equivalent in the TCP realm of an RDMA controller reset or
> an RDMA controller permanent removal event. When dealing with TCP, if
> the underlying ethernet device is reset, you *might* get a TCP socket
> reset, you might not. If the underlying ethernet is removed, you might
> get a socket reset, you might not, depending on how the route to the
> remote host is re-established. If all IP-capable devices in the entire
> system are removed, your TCP socket will get a reset, and attempts to
> reconnect will get an error.
>
> None of those sound semantically comparable to RDMA device
> unplug/replug. Again, that's just because the net core never percolates
> that up to the TCP layer.
>
> When you have a driver that has both TCP and RDMA transports, the truth
> is you are plugging into two very different levels of the kernel and the
> work you have to do to support one is very different from the other. I
> don't think it's worthwhile to even talk about trying to treat them
> equivalently unless you want to take on an addressing scheme and
> reset/restart capability in the RDMA side of things that you don't have
> to have in the TCP side of things.
>
> As a user of things like iSER/SRP/NVMe, I would personally like
> connections to persist across non-fatal events. But the RDMA stack, as
> it stands, can't reconnect things for you; you would have to do that in
> your own code.
>
> --
> Doug Ledford <dledford at redhat.com>
> GPG KeyID: B826A3330E572FDD
> Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD
--
Chuck Lever