Unexpected issues with 2 NVME initiators using the same target
shahar.salzman
shahar.salzman at kaminario.com
Tue Mar 28 03:21:48 PDT 2017
Hi all,
Sorry for taking so long to reply; we had issues with ConnectX-4
availability, so I had a hard time getting hardware for testing.
I have only applied the patch on the initiator side (over 4.9.6); the
target is untouched.
Performance also looks good in my setup (no noticeable change
with/without the patch). I haven't seen the issue recur, but I'll
leave the system running overnight just in case.
Also bumping the question about whether this fix (or a variation) will
reach upstream.
Cheers,
Shahar
On 03/28/2017 11:38 AM, Max Gurtovoy wrote:
>
>
> On 3/27/2017 6:39 PM, Gruher, Joseph R wrote:
>>
>>
>>> On 3/24/2017 9:30 PM, Gruher, Joseph R wrote:
>>>>>>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>>>>>>>
>>>>>>>> I think we need to add fence to the UMR wqe.
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>>>>>>> index ad8a263..c38c4fa 100644
>>>>>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>>>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>>>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx, int size_16)
>>>>>>>>
>>>>>>>>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>>>>>  {
>>>>>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>>>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>>>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>>>>>>                 return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>>>>>
>>>>>>>>         if (unlikely(fence)) {
>>>>>>>>
>>>>>>>> Joseph,
>>>>>>>> please update after trying the 2 patches (separately), with perf
>>>>>>>> numbers.
>>>>>>>>
>>>>>>>> I'll take it internally and run some more tests with stronger
>>>>>>>> servers using ConnectX-4 NICs.
>>>>>>>>
>>>>>>>> These patches are only for testing and not for submission yet. If
>>>>>>>> we find them good enough for upstream then we need to distinguish
>>>>>>>> between ConnectX-4/IB and ConnectX-5 (we probably won't see the
>>>>>>>> issue there).
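[Note: when testing across HCA generations, it can help to record exactly
which adapter each side is using; a quick way to do that, assuming the
rdma-core/ibverbs utilities and pciutils are installed:]

  # vendor_part_id/board_id distinguish ConnectX-4, ConnectX-4 Lx and ConnectX-5
  ibv_devinfo -v | grep -E 'hca_id|fw_ver|board_id|vendor_part_id'
  lspci -nn | grep -i mellanox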
>>>>>>>
>>>>>>> Hi Max-
>>>>>>>
>>>>>>> Our testing on this patch looks good; failures seem completely
>>>>>>> alleviated. We are not really detecting any performance impact on
>>>>>>> small-block read workloads. The data below uses a 50Gb CX4
>>>>>>> initiator and target, with FIO generating the load. Each disk runs
>>>>>>> 4KB random reads with 4 jobs and queue depth 32 per job. The
>>>>>>> initiator uses 16 IO queues per attached subsystem. We tested with
>>>>>>> 2 P3520 disks attached, and again with 7 disks attached.
>>>>>>>
>>>>>>>                        IOPS       Latency (usec)
>>>>>>> 4.10-RC8    2 disks    545,695    466.0
>>>>>>> With Patch  2 disks    587,663    432.8
>>>>>>> 4.10-RC8    7 disks    1,074,311  829.5
>>>>>>> With Patch  7 disks    1,080,099  825.4
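[Note: a minimal sketch of how this kind of load could be reproduced; the
address, NQN, device path, and runtime below are placeholders, not the
actual values from this setup.]

  # initiator side: attach one subsystem with 16 IO queues
  nvme connect -t rdma -a 192.168.1.10 -s 4420 -n testnqn1 -i 16

  # randread-4k.fio: 4KB random reads, 4 jobs x QD32 against one disk
  [global]
  ioengine=libaio
  direct=1
  rw=randread
  bs=4k
  numjobs=4
  iodepth=32
  group_reporting
  time_based
  runtime=600

  [nvme0n1]
  filename=/dev/nvme0n1

  # run with: fio randread-4k.fio (one such job file per attached disk)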
>>>>>>
>>>>>> Very nice.
>>>>>> We also ran tests on null devices in our labs over iSER/NVMf (these
>>>>>> show the network/transport layer performance better), and the impact
>>>>>> was tolerable.
>>>>>>
>>>>>>>
>>>>>>> You mention these patches are only for testing. How do we get to
>>>>>>> something that can be submitted upstream?
>>>>>>
>>>>>> Yes, we need to be careful and not use the strong fence if it's not
>>>>>> a must.
>>>>>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers
>>>>>> to prepare a suitable patch and check performance numbers for some
>>>>>> other applications.
>>>>>> Thanks for the testing; you can use this patch in the meantime until
>>>>>> we push the formal solution.
>>>>>
>>>>> With additional testing on this patch we are now encountering what
>>>>> seems to be a new failure. It takes hours of testing to reproduce,
>>>>> but we've been able to reproduce it on 2 out of 2 overnight runs of
>>>>> continuous testing. I cannot say conclusively that the failure is due
>>>>> to the patch, as we cannot run this exact configuration without the
>>>>> patch, but I can say we did not see this failure mode in previous
>>>>> testing. Either the patch induced the new failure mode, or perhaps
>>>>> this problem was always present and was just exposed by adding the
>>>>> patch. We will continue trying to characterize the failure.
>>>>>
>>>>> I've attached a couple of dmesg logs of two different reproductions.
>>>>> The t01 files are the target system and the i03 files are the
>>>>> initiator system. Failure seems to start with a sequence like the one
>>>>> below on the target. Thoughts?
>>>>>
>>>>> [38875.102023] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>>>>> [38875.108750] nvmet: ctrl 1 fatal error occurred!
>>>>> [39028.696921] INFO: task kworker/7:3:10534 blocked for more than 120
>>>>> seconds.
>>>>> [39028.704813] Not tainted 4.10.0-rc8patch-2-get-fence #5
>>>>> [39028.711147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [39028.719900] kworker/7:3     D    0 10534      2 0x00000000
>>>>> [39028.719908] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
>>>>> [39028.719909] Call Trace:
>>>>> [39028.719914]  __schedule+0x233/0x6f0
>>>>> [39028.719918]  ? sched_clock+0x9/0x10
>>>>> [39028.719919]  schedule+0x36/0x80
>>>>> [39028.719921]  schedule_timeout+0x22a/0x3f0
>>>>> [39028.719924]  ? vprintk_emit+0x312/0x4a0
>>>>> [39028.719927]  ? __kfifo_to_user_r+0xb0/0xb0
>>>>> [39028.719929]  wait_for_completion+0xb4/0x140
>>>>> [39028.719930]  ? wake_up_q+0x80/0x80
>>>>> [39028.719933]  nvmet_sq_destroy+0x41/0xf0 [nvmet]
>>>>> [39028.719935]  nvmet_rdma_free_queue+0x28/0xa0 [nvmet_rdma]
>>>>> [39028.719936]  nvmet_rdma_release_queue_work+0x25/0x50 [nvmet_rdma]
>>>>> [39028.719939]  process_one_work+0x1fc/0x4b0
>>>>> [39028.719940]  worker_thread+0x4b/0x500
>>>>> [39028.719942]  kthread+0x101/0x140
>>>>> [39028.719943]  ? process_one_work+0x4b0/0x4b0
>>>>> [39028.719945]  ? kthread_create_on_node+0x60/0x60
>>>>> [39028.719946]  ret_from_fork+0x2c/0x40
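[Note: when the hang above reproduces, one simple way to see everything
stuck in D state on the target (not just this worker) is the blocked-task
SysRq dump; this assumes SysRq is enabled on the box.]

  # enable SysRq if needed, then dump all blocked (uninterruptible) tasks
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 300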
>>>>
>>>> Hey folks. Apologies if this message comes through twice. When I
>>>> originally sent it, the list flagged it as too large due to the dmesg
>>>> log attachments, and a coworker just told me they never saw it, so I
>>>> don't think it made it through on the first attempt.
>>>>
>>>> Please see the last note above and the attached dmesg example - after
>>>> more extensive testing with Max's patch we are still able to produce
>>>> cqe dump errors (at a much lower frequency), as well as a new failure
>>>> mode involving a crash dump.
>>>>
>>>
>>> Hi Joseph,
>>>
>>> You mentioned that you saw the cqe dump with my patch, but I can't find
>>> it in the attached dmesg; I only see a hang in wait_for_completion. Can
>>> you share the initiator log (since the fix was done on the initiator
>>> side)?
>>>
>>> Max.
>>
>> Hey Max, sure, here are both the initiator and target side dmesg logs
>> for the two test runs. The logs tagged i03 are the initiator and t01 is
>> the target. We weren't aware the fix was only for the initiator side.
>
> The fix is for the initiator but it's ok that you patched the target.
>
>> We've been applying the patch to both sides and trying to get an
>> overall clean run. As you say, no errors are directly observed on the
>> initiator end; we do see disconnects/reconnects and various timeouts
>> (I am assuming these are due to the errors on the target end).
>
> Are you running a simple fio workload to 16 different devices (each
> device being a different subsystem with 1 ns on the target)?
>
> Can you try 1 subsystem with 16 ns?
>
> What is the workload that causes the dump error?
>
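[Note: for the 1-subsystem/16-namespace layout suggested above, a minimal
target-side configfs sketch could look like the following; the NQN, port
number, IP address, and backing device names are placeholders.]

  modprobe nvmet-rdma
  cd /sys/kernel/config/nvmet
  mkdir subsystems/testnqn1
  echo 1 > subsystems/testnqn1/attr_allow_any_host
  for i in $(seq 1 16); do
      mkdir subsystems/testnqn1/namespaces/$i
      echo /dev/nvme${i}n1 > subsystems/testnqn1/namespaces/$i/device_path
      echo 1 > subsystems/testnqn1/namespaces/$i/enable
  done
  mkdir ports/1
  echo rdma         > ports/1/addr_trtype
  echo ipv4         > ports/1/addr_adrfam
  echo 192.168.1.10 > ports/1/addr_traddr
  echo 4420         > ports/1/addr_trsvcid
  ln -s /sys/kernel/config/nvmet/subsystems/testnqn1 ports/1/subsystems/testnqn1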
>> At the target end, each log does have an instance of "dump error cqe",
>> plus stack traces and other obvious problems.
>>
>
> You are getting a "remote access error" for some unclear reason; we need
> to find out why. The reason should be found on the initiator, but I only
> see IO errors and reconnects there. My guess is that some IO timeout
> expires and the command is aborted (while the target doesn't really
> abort the cmd), and after some time the target finally tries to access
> some invalid resource.
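[Note: if the timeout/abort theory is right, it may be worth confirming
what the initiator is actually using; the paths below assume a stock
nvme_core module and are meant as a starting point, not a fix.]

  # block-layer timeout applied to nvme commands on the initiator (seconds)
  cat /sys/module/nvme_core/parameters/io_timeout

  # watch the initiator log for aborts/reconnects while the workload runs
  dmesg -w | grep -iE 'nvme|abort|reconnect'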
>
> What is the CPU utilization on the target side with your load?
>
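[Note: a quick way to answer the CPU-utilization question on the target,
assuming the sysstat package is installed.]

  # per-CPU utilization sampled once a second while the fio load is running
  mpstat -P ALL 1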
> I think this is a different issue now...
>
> Max.
>