Unexpected issues with 2 NVME initiators using the same target

Gruher, Joseph R joseph.r.gruher at intel.com
Fri Mar 24 11:30:02 PDT 2017


> > >> From: Max Gurtovoy [mailto:maxg at mellanox.com]
> > >>
> > >> I think we need to add fence to the UMR wqe.
> > >>
> > >> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> > >> index ad8a263..c38c4fa 100644
> > >> --- a/drivers/infiniband/hw/mlx5/qp.c
> > >> +++ b/drivers/infiniband/hw/mlx5/qp.c
> > >> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx, int size_16)
> > >>
> > >>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
> > >>   {
> > >> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> > >> -                    wr->send_flags & IB_SEND_FENCE))
> > >> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
> > >>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
> > >>
> > >>          if (unlikely(fence)) {
> > >>
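(For anyone following along: with that change applied, the whole helper would look roughly like the snippet below.  The first branch is taken directly from the hunk above; the remaining branches are my reconstruction of the surrounding 4.10-era code and have not been checked against the tree, so treat this as a sketch rather than the exact source.)

static u8 get_fence(u8 fence, struct ib_send_wr *wr)
{
        /* Patched behavior: request strong ordering after any local
         * invalidate or memory-registration (UMR) work request, not only
         * when the caller set IB_SEND_FENCE on a local invalidate. */
        if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
                return MLX5_FENCE_MODE_STRONG_ORDERING;

        /* Reconstructed fall-through: carry a pending fence forward, or
         * honor an explicit IB_SEND_FENCE on this work request. */
        if (unlikely(fence)) {
                if (wr->send_flags & IB_SEND_FENCE)
                        return MLX5_FENCE_MODE_SMALL_AND_FENCE;
                else
                        return fence;
        } else if (unlikely(wr->send_flags & IB_SEND_FENCE)) {
                return MLX5_FENCE_MODE_FENCE;
        }

        return 0;
}

In other words, every local-invalidate or fast-register (UMR) WQE now gets the strong fence unconditionally, which is what the testing below exercises.
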
> > >> Joseph,
> > >> please update after trying the 2 patches (separately), along with perf numbers.
> > >>
> > >> I'll take it internally and run some more tests on more powerful
> > >> servers using ConnectX-4 NICs.
> > >>
> > >> These patches are only for testing and not for submission yet. If
> > >> we find them good enough for upstream, then we need to distinguish
> > >> between ConnectX-4/IB and ConnectX-5 (we probably won't see the issue there).
> > >
> > > Hi Max-
> > >
> > > Our testing on this patch looks good; the failures appear to be
> > > completely alleviated.  We are not detecting any real performance
> > > impact on small-block read workloads.  The data below uses a 50Gb
> > > CX4 initiator and target, with FIO generating the load.  Each disk
> > > runs 4KB random reads with 4 jobs and a queue depth of 32 per job.
> > > The initiator uses 16 IO queues per attached subsystem.  We tested
> > > with 2 P3520 disks attached, and again with 7 disks attached.
> > >
> > > Config		Disks		IOPS		Latency (usec)
> > > 4.10-RC8	2 disks		545,695		466.0
> > > With Patch	2 disks		587,663		432.8
> > > 4.10-RC8	7 disks		1,074,311	829.5
> > > With Patch	7 disks		1,080,099	825.4
> >
> > Very nice.
> > We also ran tests on null devices in our labs over iSER/NVMf (which
> > show the network/transport layer performance more clearly), and the
> > impact was tolerable.
> >
> > >
> > > You mention these patches are only for testing.  How do we get to
> > > something that can be submitted upstream?
> >
> > Yes, we need to be careful and not apply the strong fence when it isn't
> > strictly required.
> > I'll be out for the upcoming week, but I'll ask our mlx5 maintainers
> > to prepare a suitable patch and check the performance numbers of some
> > other applications.
> > Thanks for the testing; you can use this patch in the meantime, until
> > we push the formal solution.
> 
> With additional testing on this patch we are now encountering what seems to
> be a new failure.  It takes hours of testing to reproduce, but it has shown up
> in 2 out of 2 overnight runs of continuous testing.  I cannot say conclusively
> that the failure is due to the patch, as we cannot run this exact configuration
> without the patch, but I can say we did not see this failure mode in previous
> testing.  Either the patch induced the new failure mode, or perhaps the problem
> was always present and was simply exposed by adding the patch.  We will
> continue trying to characterize the failure.
> 
> I've attached a couple of dmesg logs from two different reproductions.  The t01
> files are from the target system and the i03 files are from the initiator
> system.  The failure seems to start with a sequence like the one below on the
> target.  Thoughts?
> 
> [38875.102023] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [38875.108750] nvmet: ctrl 1 fatal error occurred!
> [39028.696921] INFO: task kworker/7:3:10534 blocked for more than 120 seconds.
> [39028.704813]       Not tainted 4.10.0-rc8patch-2-get-fence #5
> [39028.711147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [39028.719900] kworker/7:3     D    0 10534      2 0x00000000
> [39028.719908] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
> [39028.719909] Call Trace:
> [39028.719914]  __schedule+0x233/0x6f0
> [39028.719918]  ? sched_clock+0x9/0x10
> [39028.719919]  schedule+0x36/0x80
> [39028.719921]  schedule_timeout+0x22a/0x3f0
> [39028.719924]  ? vprintk_emit+0x312/0x4a0
> [39028.719927]  ? __kfifo_to_user_r+0xb0/0xb0
> [39028.719929]  wait_for_completion+0xb4/0x140
> [39028.719930]  ? wake_up_q+0x80/0x80
> [39028.719933]  nvmet_sq_destroy+0x41/0xf0 [nvmet]
> [39028.719935]  nvmet_rdma_free_queue+0x28/0xa0 [nvmet_rdma]
> [39028.719936]  nvmet_rdma_release_queue_work+0x25/0x50 [nvmet_rdma]
> [39028.719939]  process_one_work+0x1fc/0x4b0
> [39028.719940]  worker_thread+0x4b/0x500
> [39028.719942]  kthread+0x101/0x140
> [39028.719943]  ? process_one_work+0x4b0/0x4b0
> [39028.719945]  ? kthread_create_on_node+0x60/0x60
> [39028.719946]  ret_from_fork+0x2c/0x40

Hey folks.  Apologies if this message comes through twice.  When I originally sent it, the list flagged it as too large because of the dmesg log attachments, and a coworker just told me they never saw it, so I don't think it made it through on the first attempt.

Please see the last note above and the attached dmesg examples: after more extensive testing with Max's patch we are still able to produce CQE dump errors (at a much lower frequency), as well as a new failure mode involving a crash dump.
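
Reading the hung-task trace quoted above, the workqueue item that releases the RDMA queue appears to be stuck in nvmet_sq_destroy(), waiting on a completion that never fires, which would mean some request or reference on that submission queue is never completed or released.  As a rough illustration only (the function names come from the stack frames; the struct and field names here are placeholders rather than the actual nvmet sources), the blocked path has this shape:

void nvmet_sq_destroy(struct nvmet_sq *sq)
{
        /* The kworker parks here until every outstanding request on this
         * submission queue has been completed and freed; the hung-task
         * warning means that never happens. */
        wait_for_completion(&sq->free_done);
}

static void nvmet_rdma_free_queue(struct nvmet_rdma_queue *queue)
{
        /* Tear down the nvmet submission queue backing this RDMA queue. */
        nvmet_sq_destroy(&queue->nvme_sq);
}

static void nvmet_rdma_release_queue_work(struct work_struct *w)
{
        /* Runs on the "events" workqueue (kworker/7:3 in the trace). */
        struct nvmet_rdma_queue *queue =
                container_of(w, struct nvmet_rdma_queue, release_work);

        nvmet_rdma_free_queue(queue);
}

That would be consistent with the keep-alive expiry and fatal-error messages at the top of the log: the controller teardown starts, but never finishes because something still appears to hold an uncompleted request on the queue.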
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: full-std-50g-round2-t01-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170324/6a4aff65/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: full-std-50g-single-t01-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170324/6a4aff65/attachment-0003.txt>

