Unexpected issues with 2 NVME initiators using the same target

Robert LeBlanc robert at leblancnet.us
Mon Jun 19 10:21:34 PDT 2017


I ran into this with 4.9.32 when I rebooted the target. I tested
4.12-rc6 and this particular error seems to have been resolved, but I
now get a new one on the initiator. This one doesn't seem as
impactful.

[Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
[Mon Jun 19 11:17:20 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:20 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001e7 45dd82d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017]  connection4:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001f4 004915d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:44 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 93005204 0a0001f6 004519d2
[Mon Jun 19 11:17:44 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:17:44 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:18:55 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 93005204 0a0001f7 01934fd2
[Mon Jun 19 11:18:55 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:18:55 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:20:25 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 93005204 0a0001f8 0274edd2
[Mon Jun 19 11:20:25 2017] iser: iser_err_comp: command failure: local protection error (4) vend_err 52
[Mon Jun 19 11:20:25 2017]  connection3:0: detected conn error (1011)
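
For reference, the last row of each "dump error cqe" block can be decoded with a short script. This is only a minimal sketch: the field layout is an assumption (the last 16 bytes of a 64-byte mlx5 error CQE, printed as big-endian 32-bit words), and the function name and dictionary keys are made up for illustration. Check the layout against the mlx5 headers in your kernel tree before relying on it.

```python
import re

def decode_err_cqe_row(line):
    """Decode the last row of an mlx5 'dump error cqe' block.

    ASSUMPTION: the row holds the final 16 bytes of the error CQE as four
    big-endian 32-bit words; field positions below are not taken from the log.
    """
    # Pick out the 8-digit hex words; the dmesg timestamp contains none.
    words = [int(w, 16) for w in re.findall(r"\b[0-9a-fA-F]{8}\b", line)]
    if len(words) < 4:
        raise ValueError("expected four 32-bit hex words")
    _, w1, w2, w3 = words[-4:]
    return {
        "vendor_err_synd": (w1 >> 8) & 0xFF,  # second-lowest byte of word 1
        "syndrome": w1 & 0xFF,                # lowest byte of word 1
        "wqe_opcode": w2 >> 24,               # top byte of word 2
        "qpn": w2 & 0xFFFFFF,                 # low 24 bits of word 2
        "wqe_counter": w3 >> 16,              # top 16 bits of word 3
        "cqe_opcode": (w3 & 0xFF) >> 4,       # high nibble of the last byte
    }

row = "[Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2"
fields = decode_err_cqe_row(row)
print({name: hex(value) for name, value in fields.items()})
```

For the first dump this gives vendor_err_synd 0x52 and syndrome 0x4, which agrees with the "local protection error (4) vend_err 52" line printed by iser. If the assumed layout is right, the qpn field also differs from dump to dump (0x1bd, 0x1e7, 0x1f4, ...).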

I'm going to try to cherry-pick the fix to 4.9.x and do some testing there.

Thanks,

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, May 18, 2017 at 7:34 AM, Leon Romanovsky <leon at kernel.org> wrote:
> On Wed, May 17, 2017 at 02:56:36PM +0200, Marta Rybczynska wrote:
>> > On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
>> >> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
>> >> > I understand you, and both Max and I feel the same way. For more
>> >> > than 2 months we asked the architecture group for a solution
>> >> > constantly (almost daily), but received different answers. The
>> >> > proposals ranged widely, from needing a strong fence on all cards to
>> >> > needing no strong fence at all.
>> >>
>> >> So let's get the patch to do a strong fence everywhere now, and relax
>> >> it later where possible.
>> >>
>> >> Correctness before speed.
>> >
>> > OK, please give me and Max till EOW to stop this saga. One of the two
>> > options will be: Max will resend original patch, or Max will send patch
>> > blessed by architecture group.
>> >
>>
>> Good luck with this Max & Leon! It seems to be a complicated problem.
>> Just an idea: in our case it *seems* that the problem started appearing
>> after a firmware upgrade, older ones do not seem to have the same
>> behaviour. Maybe it's a hint for you.
>
> OK, we have agreed on which capability bits we should add. Max
> will return to the office in the middle of next week, and we will
> proceed with submitting a proper patch once our shared code is
> accepted.
>
> In the meantime, I have made the original patch part of our regression testing.
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=testing/queue-next&id=a40ac569f243db552661e6efad70080bb406823c
>
> Thank you for your patience.
>
>>
>> Thanks!
>> Marta