Unexpected issues with 2 NVME initiators using the same target
Robert LeBlanc
robert at leblancnet.us
Tue Jun 20 10:28:24 PDT 2017
On Tue, Jun 20, 2017 at 11:19 AM, Sagi Grimberg <sagi at grimberg.me> wrote:
>
>> Testing this patch I didn't see these new messages even when rebooting
>> the targets multiple times. It also resolved some performance problems
>> I was seeing (I think our switches have bugs with IPv6 and
>> routing), and I was getting the expected performance. At one point in the
>> test, one target (4.9.33) showed:
>> [Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
>> transport retry counter exceeded (12) vend_err 81
>
>
> I don't understand, is this new with the patch applied?

I applied your patch to 4.12-rc6 on the initiator, but my targets are
still 4.9.33 since it looked like the patch only affected the
initiator. I did not see this before your patch, but I also didn't try
rebooting the targets multiple times before because of the previous
messages.

>> After this and a reboot of the target, the initiator would drop the
>> connection after 1.5-2 minutes, then faster and faster until it was
>> every 5 seconds. It is almost as if it set up the connection and then
>> lost the first ping, or the ping wasn't set up right. I tried rebooting
>> the target multiple times.
>
>
> So the initiator could not recover even after the target was available
> again?

The initiator recovered the connection when the target came back, but
the connection was not stable. I/O would happen on the connection,
then it would get shaky and then finally disconnect. Then it would
reconnect, pass more I/O, then get shaky and go down again. With the 5
second disconnects, it would pass traffic for 5 seconds, then as soon
as I saw the ping timeout, the I/O would stop until it reconnected. At
that point it seems that the lack of pings would kill the I/O, unlike
earlier, where there was a stall in I/O and then the connection would
be torn down. I can try to see if I can get it to happen again.
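
One way to quantify the "faster and faster" disconnect pattern would be
to compute the gaps between matching kernel log lines. The sketch below
is illustrative only: it assumes the log was captured with human-readable
timestamps in the same format as the target messages quoted above (as
printed by `dmesg -T`), and `event_intervals` is a hypothetical helper
name, not part of any tool mentioned in this thread.

```python
import re
from datetime import datetime

# Matches the human-readable timestamp prefix seen in the quoted log,
# e.g. "[Tue Jun 20 10:11:20 2017] iSCSI Login timeout ..."
STAMP = re.compile(
    r"^\[(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4})\] (.*)$"
)


def event_intervals(lines, needle):
    """Return the seconds between consecutive log lines containing `needle`.

    Lines without a recognizable timestamp prefix are skipped.
    """
    times = []
    for line in lines:
        match = STAMP.match(line)
        if match and needle in match.group(2):
            times.append(
                datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            )
    # Pair each event with the next one and take the difference.
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Sample input: the three target-side lines quoted earlier in this message.
sample = [
    "[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260",
    "[Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260",
    "[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:",
]

print(event_intervals(sample, "Login timeout"))  # [19.0]
```

Run against the initiator log with a needle that matches the disconnect
message, a shrinking list of intervals would confirm the pattern described
above. The needle string is left to the caller because the exact initiator
message text is not shown in this thread.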
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1