Unexpected issues with 2 NVME initiators using the same target
Vladimir Neyelov
vladimirn at mellanox.com
Sun Mar 12 05:33:45 PDT 2017
Hi,
I tested for performance regression with and without Max's patch.
I used two HP servers.
Initiator:
CPUs: 48
CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Hardware: ConnectX-4
Interface : Infiniband Ethernet
Kernel: 3.10.0-327.el7.x86_64
OFED: MLNX_OFED_LINUX-4.0-1.5.6.0:
OS: RHEL 7.2
Tuning commands:
modprobe ib_iser always_register=N
# run from /sys/block
for i in `ls sd*`; do echo 2 > /sys/block/$i/queue/nomerges; done
for i in `ls sd*`; do echo 0 > /sys/block/$i/queue/add_random; done
for i in `ls sd*`; do echo 1 > /sys/block/$i/queue/rq_affinity; done
for i in `ls sd*`; do echo noop > /sys/block/$i/queue/scheduler; done
for g in /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do echo performance > $g; done
service irqbalance stop
set_irq_affinity.sh ens4f0
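The per-device queue tuning above can be consolidated into one helper script. This is only a sketch: the SYSBLOCK variable and the fake directory tree are there so it can be exercised anywhere; on the actual test machines the same writes go directly under /sys/block as root.

```shell
#!/bin/sh
# Sketch of the per-device block-queue tuning from the mail above.
# SYSBLOCK defaults to a throwaway fake tree so the script is runnable
# on any machine; on a real host run as root with SYSBLOCK=/sys/block
# (and drop the mkdir, since the tree already exists there).
SYSBLOCK=${SYSBLOCK:-/tmp/fakeblk}
mkdir -p "$SYSBLOCK/sda/queue" "$SYSBLOCK/sdb/queue"   # demo tree only

for dev in "$SYSBLOCK"/sd*; do
    [ -d "$dev/queue" ] || continue
    echo 2    > "$dev/queue/nomerges"     # never try to merge requests
    echo 0    > "$dev/queue/add_random"   # skip entropy accounting
    echo 1    > "$dev/queue/rq_affinity"  # complete near the submitting CPU
    echo noop > "$dev/queue/scheduler"    # lowest-overhead elevator
done
```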
Target:
Hardware: ConnectX-4
CPUs: 48
CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Interface : Infiniband
OS: RHEL 7.2
Kernel: 3.10.0-327.el7.x86_64
Tuning commands:
service irqbalance stop
set_irq_affinity.sh ens4f0
Target type: LIO
Command:
fio --rw=write --bs=4K --numjobs=3 --iodepth=128 --runtime=60 --time_based --size=300k --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall `cat disks`
Results (4K block):

With Max's patch:
                 always_register=Y  always_register=N
  write IOPS     1902K              1923.3K
  read IOPS      1315K              2009K

Original OFED code:
                 always_register=Y  always_register=N
  write IOPS     1947K              1982K
  read IOPS      1273K              1978K
Thanks,
Vladimir
-----Original Message-----
From: Max Gurtovoy
Sent: Tuesday, March 7, 2017 11:28 AM
To: Sagi Grimberg <sagi at grimberg.me>; Gruher, Joseph R <joseph.r.gruher at intel.com>; shahar.salzman <shahar.salzman at gmail.com>; Laurence Oberman <loberman at redhat.com>; Riches Jr, Robert M <robert.m.riches.jr at intel.com>
Cc: linux-rdma at vger.kernel.org; linux-nvme at lists.infradead.org; Robert LeBlanc <robert at leblancnet.us>; Vladimir Neyelov <vladimirn at mellanox.com>
Subject: Re: Unexpected issues with 2 NVME initiators using the same target
Hi,
Shahar/Joseph, what is your link layer configuration (IB/Eth)?
In the Ethernet case, have you configured PFC? If not, can you try it?
I suspect that this is the root cause, and enabling PFC might help you avoid this case while we look for the best solution.
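For the Ethernet case, one common way to enable PFC on a MLNX_OFED host is the mlnx_qos tool. The interface name and the choice of priority 3 below are illustrative only; the lossless priority must match whatever priority the RoCE/iSER traffic is actually tagged with, and the same setting is needed on both initiator and target (and on the switch).

```shell
# Illustrative only: enable PFC on priority 3 for port ens4f0.
# Requires MLNX_OFED's mlnx_qos; run as root on both sides of the link.
mlnx_qos -i ens4f0 --pfc 0,0,0,1,0,0,0,0

# Show the resulting per-priority PFC state to verify.
mlnx_qos -i ens4f0
```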
Adding Vladimir, who will run iSER on his performance setup with the new fencing patch (this is not an NVMEoF-specific issue).
We can also run NVMEoF later if needed.
Max.
On 3/6/2017 1:28 PM, Sagi Grimberg wrote:
>> Hi Sagi,
>>
>> I think we need to add fence to the UMR wqe.
>>
>> so lets try this one:
>>
>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>> index ad8a263..c38c4fa 100644
>> --- a/drivers/infiniband/hw/mlx5/qp.c
>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx, int size_16)
>>
>>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>  {
>> -	if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>> -		     wr->send_flags & IB_SEND_FENCE))
>> +	if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>  		return MLX5_FENCE_MODE_STRONG_ORDERING;
>>
>>  	if (unlikely(fence)) {
>
> This will kill performance, isn't there another fix that can be
> applied just for the retransmission flow?
>
>> Couldn't repro that case, but I ran some initial tests in my lab (with
>> my patch above) - not performance servers:
>>
>> Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets),
>> Connect-IB (same mlx5_ib driver), kernel 4.10.0, fio test with 24
>> jobs and 128 iodepth.
>> register_always=N
>>
>> Target - 1 subsystem with 1 ns (null_blk)
>>
>> bs read (without/with patch) write (without/with patch)
>> --- -------------------------- ---------------------------
>> 512 1019k / 1008k 1004k / 992k
>> 1k 1021k / 1013k 1002k / 991k
>> 4k 1030k / 1022k 978k / 969k
>>
>> CPU usage is 100% in both cases on the initiator side.
>> Haven't seen a difference with bs = 16k.
>> Not as big a drop as we would expect,
>
> Obviously you won't see a drop without registering memory for small IO
> (register_always=N), this would bypass registration altogether...
> Please retest with register_always=Y.