Unexpected issues with 2 NVME initiators using the same target

Tue Jun 27 00:22:52 PDT 2017

>> I don't understand, is this new with the patch applied?
> 
> I applied your patch to 4.12-rc6 on the initiator, but my targets are
> still 4.9.33 since it looked like the patch only affected the
> initiator. I did not see this before your patch, but I also didn't try
> rebooting the targets multiple times before because of the previous
> messages.

That sounds like a separate issue. Should we move forward with the
suggested patch?

>>> After this and a reboot of the target, the initiator would drop the
>>> connection after 1.5-2 minutes then faster and faster until it was
>>> every 5 seconds. It is almost like it set up the connection then lost
>>> the first ping, or the ping wasn't set up right. I tried rebooting the
>>> target multiple times.
>>
>>
>> So the initiator could not recover even after the target was available
>> again?
> 
> The initiator recovered the connection when the target came back, but
> the connection was not stable. I/O would happen on the connection,
> then it would get shaky and then finally disconnect. Then it would
> reconnect, pass more I/O, then get shaky and go down again. With the 5
> second disconnects, it would pass traffic for 5 seconds, then as soon
> as I saw the ping timeout, the I/O would stop until it reconnected. At
> that point it seems that the lack of pings would kill the I/O unlike
> earlier where there was a stall in I/O and then the connection would
> be torn down. I can try to see if I can get it to happen again.

So it looks like the target is not responding to NOOP_OUTs (or to any
traffic at all, for that matter).

Messages like:
[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260

indicate that something is stuck in the login thread, though I'm not
sure where. Did you see a watchdog pop on a hang?

And the message:
[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
transport retry counter exceeded (12) vend_err 81

is an indication that the RDMA fabric is in some error state.
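
If it helps to line these events up against the reboots, here is a rough
sketch for pulling the two signatures out of a saved target log. It assumes
timestamps in the human-readable form quoted above and an English/C locale
for the day and month names. The patterns come only from the two lines
quoted in this mail, and the names find_events and sample are hypothetical:

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted lines, e.g.
# "[Tue Jun 20 10:11:20 2017]". Assumption: the whole log uses this form.
TS = r"\[(?P<ts>[A-Z][a-z]{2} [A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d \d{4})\]"

# Signatures taken verbatim from the two messages quoted in this thread.
PATTERNS = {
    "login_timeout": re.compile(
        TS + r" iSCSI Login timeout on Network Portal"),
    "isert_send_failure": re.compile(
        TS + r" isert: isert_print_wc: login send failure"),
}


def find_events(lines):
    """Return (timestamp, event name) pairs, sorted by time."""
    events = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                when = datetime.strptime(
                    match.group("ts"), "%a %b %d %H:%M:%S %Y")
                events.append((when, name))
    return sorted(events)


# Sample input: the lines quoted above, nothing else.
sample = [
    "[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260",
    "[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:",
    "transport retry counter exceeded (12) vend_err 81",
]

for when, name in find_events(sample):
    print(when.isoformat(), name)
```

Feeding it the full log from each reboot attempt should show whether the
login timeouts start before or after the first isert send failure.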

On which reboot attempt did all this happen? The first one?

Again, CCing target-devel.