Reconnect on RDMA device reset

Sagi Grimberg sagi at grimberg.me
Mon Jan 29 12:36:39 PST 2018


> I *think* for SRP this is already the case.  The SRP target uses the
> kernel LIO framework, so if you bounce the device under the SRPt layer,
> doesn't the config get preserved?  So that when the device came back up,
> the LIO configuration would still be there and the SRPt driver would see
> that? Bart?

I think you're right. We could do that if we keep the listener cm_id's
device node_guid; when a new device comes in we can check whether we
already had a cm listener on that device and re-listen. That is a good
idea Doug.
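
To make the idea concrete, here is a minimal sketch (not existing SRPt
code) of an ib_client whose .add callback compares a newly registered
device's node_guid against saved listener state and re-creates the
listener on it. srpt_saved_listener, srpt_cm_handler and the
srpt_relisten names are made up for illustration, and the client would
be registered with ib_register_client() at module init:

#include <linux/err.h>
#include <net/net_namespace.h>
#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>

/* State saved when the original listener's device was removed. */
struct srpt_saved_listener {
	__be64			node_guid;	/* GUID of the lost device */
	struct sockaddr_storage	addr;		/* address we listened on */
};

static struct srpt_saved_listener saved;

/* Connection request handling elided; only the re-listen path is shown. */
static int srpt_cm_handler(struct rdma_cm_id *cm_id,
			   struct rdma_cm_event *event)
{
	return 0;
}

/* Called by the core for every newly registered RDMA device. */
static void srpt_add_one(struct ib_device *device)
{
	struct rdma_cm_id *cm_id;

	/* Ignore devices other than the one we lost. */
	if (device->node_guid != saved.node_guid)
		return;

	cm_id = rdma_create_id(&init_net, srpt_cm_handler, NULL,
			       RDMA_PS_TCP, IB_QPT_RC);
	if (IS_ERR(cm_id))
		return;

	/* Re-establish the listener on the re-appeared device. */
	if (rdma_bind_addr(cm_id, (struct sockaddr *)&saved.addr) ||
	    rdma_listen(cm_id, 128))
		rdma_destroy_id(cm_id);
}

static struct ib_client srpt_relisten_client = {
	.name	= "srpt-relisten",
	.add	= srpt_add_one,
};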

> For the SRP client, I'm almost certain it will try to reconnect since it
> uses a user space daemon with a shell script that restarts the daemon on
> various events.  That might have changed...didn't we just take a patch
> to rdma-core to drop the shell script?  It might not reconnect
> automatically with the latest rdma-core, I'd have to check.  Bart should
> know though...

The srp driver relies on srp_daemon to rediscover targets and reconnect
over the new device. iSER relies on iscsiadm to reconnect. I guess that
is the correct approach for nvme as well (which we don't have at the
moment)...

>>>> AFAIK device resets usually are expected to quiesce inflight I/O,
>>>> cleanup resources and restore when the reset sequence completes (which is
>>>> what we do in nvme controller resets).
> 
> I think your perspective here might be a bit skewed by the way the NVMe
> stack is implemented (which was intentional for speed as I understand
> it).  As a differing example, in the SCSI stack when the LLD does a SCSI
> host reset, it resets the host but does not restore or restart any
> commands that were aborted.  It is up to the upper layer SCSI drivers to
> do so (if they chose, they might send it back to the block layer).  From
> the way you wrote the above, it sounds like the NVMe layer is almost
> monolithic in nature with no separation between upper level consumer
> layer and lower level driver layer, and so you can reset/restart all
> internally.  I would argue that's rare in the linux kernel and most
> places the low level driver resets, and some other upper layer has to
> restart things if it wants or error out if it doesn't.

That is the case for nvme as well, but I was merely saying that a device
reset is not really a device removal, which makes it hard for the ULP to
understand what to do (or for me at least...)
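
To illustrate the confusion: here is a minimal sketch of what a ULP's
rdma_cm event handler sees today (hypothetical ULP code, with
ulp_teardown_queues() as a made-up placeholder). Whether the HCA is
being reset or physically pulled, the same DEVICE_REMOVAL event
arrives, so all the ULP can safely do is tear down:

#include <rdma/rdma_cm.h>

void ulp_teardown_queues(void *ctx);	/* placeholder for the ULP's teardown */

static int ulp_cm_handler(struct rdma_cm_id *cm_id,
			  struct rdma_cm_event *event)
{
	switch (event->event) {
	case RDMA_CM_EVENT_DEVICE_REMOVAL:
		/*
		 * Reset or hot-unplug?  The ULP cannot tell the difference,
		 * so the only safe reaction is to give up all resources.
		 */
		ulp_teardown_queues(cm_id->context);
		break;
	default:
		break;
	}
	return 0;
}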

>>>>   I'm not sure I understand why
>>>> RDMA device resets manifest as DEVICE_REMOVAL events to ULPs (via
>>>> rdma_cm or .remove_one via ib_client API). I think the correct interface
>>>> would be suspend/resume semantics for RDMA device resets (similar to pm
>>>> interface).
> 
> No, we can't do this.  Suspend/Resume is not the right model for an RDMA
> device reset.  An RDMA device reset is a hard action that stops all
> ongoing DMA regardless of its source.

Suspend also requires that.

> Those sources include kernel
> layer consumers, user space consumers acting without the kernel's direct
> intervention, and ongoing DMA with remote RDMA peers (which will throw
> the remote queue pairs into an error state almost immediately).  In the
> future it very likely could include RDMA between things like GPU offload
> processors too.  We can't restart that stuff even if we wanted to.  So
> suspend/resume semantics for an RDMA device level reset is a non-
> starter.

I see. I can understand the argument "we are stuck with what we have"
for user space, but does that mandate that we must live with it for
kernel consumers as well, even if the semantics are confusing? (Just
asking, it's only my opinion :))

>>>> I think that it would make a much cleaner semantics and ULPs should be
>>>> able to understand exactly what to do (which is what you suggested
>>>> above).
>>>>
>>>> CCing linux-rdma.
>>>
>>> Maybe so. I don't know what's the "standard" here for Linux in general and
>>> networking devices in particular. Let's see if linux-rdma agree here.
>>
>> I would like to hear more opinions on the current interface.
> 
> There is a difference between RDMA device and other network devices.
> The net stack is much more like the SCSI stack in that you have an upper
> layer connection (socket or otherwise) and a lower layer transport and
> the net core code which is free to move your upper layer abstraction
> from one lower layer transport to another.  With the RDMA subsystem,
> your upper layer is connecting directly into the low level hardware.  If
> you want a semantic that includes reconnection on an event, then it has
> to be handled in your upper layer as there is no intervening middle
> layer to abstract out the task of moving your connection from one low
> level device to another (that's not to say we couldn't create one, and
> several actually already exist, like SMC-R and RDS, but direct hooks
> into the core ib stack are not abstracted out and you are talking
> directly to the hardware).  And if you want to support moving your
> connection from an old removed device to a new replacement device that
> is not simply the same physical device being plugged back in, then you
> need an addressing scheme that doesn't rely on the link layer hardware
> address of the device.

Actually, I didn't suggest that at all. I fully agree that the ULP needs
to cooperate with the core and the HW as it's holding physical resources.
All I suggested is that the core reflect that the device is resetting
rather than that the device is going away and that afterwards a new
device comes in which happens to be the same device...
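
Purely as an illustration of the semantics I mean (nothing like this
exists in the rdma_cm or ib_client APIs today, and all the ulp_* names
are made up), something along these lines:

struct ulp_ctrl;			/* the ULP's per-controller state */

void ulp_quiesce(struct ulp_ctrl *ctrl);	/* placeholder */
void ulp_restart(struct ulp_ctrl *ctrl);	/* placeholder */
void ulp_teardown(struct ulp_ctrl *ctrl);	/* placeholder */

/* Hypothetical core -> ULP notification, does not exist today. */
enum ulp_device_event {
	ULP_DEVICE_RESET,	/* device is resetting and will come back */
	ULP_DEVICE_RESUME,	/* reset done, HW resources must be re-created */
	ULP_DEVICE_REMOVAL,	/* device is really going away */
};

static void ulp_device_notify(struct ulp_ctrl *ctrl,
			      enum ulp_device_event event)
{
	switch (event) {
	case ULP_DEVICE_RESET:
		ulp_quiesce(ctrl);	/* stop I/O, keep software state */
		break;
	case ULP_DEVICE_RESUME:
		ulp_restart(ctrl);	/* re-create QPs/MRs, restart I/O */
		break;
	case ULP_DEVICE_REMOVAL:
		ulp_teardown(ctrl);	/* free everything, device is gone */
		break;
	}
}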

> As a user of things like iSER/SRP/NVMe, I would personally like
> connections to persist across non-fatal events.  But the RDMA stack, as
> it stands, can't reconnect things for you, you would have to do that in
> your own code.

Again, I fully agree. I didn't mean that the core would handle
everything for the consumer of the device. I just think the interface
could be improved so that the consumer's life (and code) would be
easier.
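
For example, a minimal sketch of the ULP-driven approach, roughly what
nvme-rdma does with its delayed reconnect work; the ulp_* names are
placeholders and reconnect_work is assumed to have been initialized
elsewhere with INIT_DELAYED_WORK():

#include <linux/workqueue.h>

struct ulp_ctrl {
	struct delayed_work	reconnect_work;
	unsigned long		reconnect_delay;	/* seconds between attempts */
};

int ulp_setup_queues(struct ulp_ctrl *ctrl);	/* placeholder: re-resolve, re-create QPs */
void ulp_teardown_queues(struct ulp_ctrl *ctrl);	/* placeholder */

static void ulp_reconnect_work(struct work_struct *work)
{
	struct ulp_ctrl *ctrl = container_of(to_delayed_work(work),
					     struct ulp_ctrl, reconnect_work);

	/* Try to re-establish everything on whatever device answers. */
	if (!ulp_setup_queues(ctrl))
		return;

	/* Not back yet (or a different device): try again later. */
	queue_delayed_work(system_wq, &ctrl->reconnect_work,
			   ctrl->reconnect_delay * HZ);
}

/* Called from the ULP's error/removal handling path. */
static void ulp_error_recovery(struct ulp_ctrl *ctrl)
{
	ulp_teardown_queues(ctrl);
	queue_delayed_work(system_wq, &ctrl->reconnect_work,
			   ctrl->reconnect_delay * HZ);
}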
