Unexpected issues with 2 NVME initiators using the same target

Tue Mar 7 01:27:49 PST 2017

Hi,

Shahar/Joseph, what is your link layer configuration (IB/Eth)?
In the Eth case, have you configured PFC? If not, can you try it?
I suspect that this is the root cause, and it might help you avoid
this case while we look for the best solution.

Adding Vladimir, who will run iSER on his performance setup with the new
fencing patch (this is not an NVMEoF-specific issue).
We can also run NVMEoF later on if needed.

Max.

On 3/6/2017 1:28 PM, Sagi Grimberg wrote:
>> Hi Sagi,
>>
>> I think we need to add a fence to the UMR WQE.
>>
>> So let's try this one:
>>
>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>> index ad8a263..c38c4fa 100644
>> --- a/drivers/infiniband/hw/mlx5/qp.c
>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx, int size_16)
>>
>>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>  {
>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>> -                    wr->send_flags & IB_SEND_FENCE))
>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>                 return MLX5_FENCE_MODE_STRONG_ORDERING;
>>
>>         if (unlikely(fence)) {
>
> This will kill performance, isn't there another fix that can
> be applied just for retransmission flow?
>
>> Couldn't repro that case, but I ran some initial tests in my lab (with my
>> patch above), not on performance servers:
>>
>> Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets),
>> Connect IB (same driver mlx5_ib), kernel 4.10.0, fio test with 24 jobs
>> and 128 iodepth.
>> register_always=N
>>
>> Target - 1 subsystem with 1 ns (null_blk)
>>
>> bs    read (without / with patch)   write (without / with patch)
>> ----  ---------------------------   ----------------------------
>> 512   1019k / 1008k                 1004k / 992k
>> 1k    1021k / 1013k                 1002k / 991k
>> 4k    1030k / 1022k                  978k / 969k
>>
>> CPU usage is 100% in both cases on the initiator side.
>> I haven't seen a difference with bs = 16k.
>> Not as big a drop as we would expect,
>
> Obviously you won't see a drop without registering memory
> for small IO (register_always=N), this would bypass registration
> altogether... Please retest with register_always=Y.
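
To make the point about register_always=N concrete, here is a minimal
user-space sketch of the condition the patch changes. It is not kernel
code: the enum values, the SEND_FENCE flag, struct send_wr and the helper
names are hypothetical stand-ins for IB_WR_LOCAL_INV, IB_WR_REG_MR,
IB_SEND_FENCE and struct ib_send_wr, and it models only the first "if" in
get_fence() as shown in the diff. It counts how many work requests in a
batch would get a different fence decision with the patch applied.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's opcodes and send flag. */
enum wr_opcode { WR_SEND, WR_LOCAL_INV, WR_REG_MR };
#define SEND_FENCE 0x1u

struct send_wr {
    enum wr_opcode opcode;
    unsigned int send_flags;
};

/* Condition before the patch: only a LOCAL_INV that carries the
 * fence flag is forced to strong ordering. */
static bool strong_fence_before(const struct send_wr *wr)
{
    return wr->opcode == WR_LOCAL_INV && (wr->send_flags & SEND_FENCE);
}

/* Condition after the patch: every LOCAL_INV and every REG_MR is
 * forced to strong ordering, regardless of flags. */
static bool strong_fence_after(const struct send_wr *wr)
{
    return wr->opcode == WR_LOCAL_INV || wr->opcode == WR_REG_MR;
}

/* Number of work requests whose fence decision differs with the patch. */
static int count_changed(const struct send_wr *wrs, int n)
{
    int changed = 0;

    for (int i = 0; i < n; i++) {
        if (strong_fence_before(&wrs[i]) != strong_fence_after(&wrs[i]))
            changed++;
    }
    return changed;
}
```

In this model, a batch that contains no registration or invalidation
work requests gives a count of zero, so the two code paths behave the
same. That matches the objection above: a run that bypasses registration
cannot show the cost of the extra fencing.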