[PATCH v2 0/3] nvmet-rdma: SRQ per completion vector
Max Gurtovoy
maxg at mellanox.com
Thu Nov 16 09:21:22 PST 2017
Since there is an active discussion regarding the CQ pool architecture, I decided to push
this feature (maybe it can be pushed before CQ pool).
This is a new feature for the NVMEoF RDMA target, intended to save resources (by sharing
them) and to exploit the locality of completions to get the best performance with
Shared Receive Queues (SRQs). We'll create an SRQ per completion vector (and not per device)
using a new API (SRQ pool, also added in this patchset) and associate each created QP/CQ with
an appropriate SRQ. This will also reduce the lock contention on the single SRQ per device
(today's solution).
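As a rough illustration of the association described above (not the code in this patchset),
the idea can be modeled in a few lines of Python. The class names, the helper functions and
the round-robin choice of completion vector are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Srq:
    """Toy stand-in for one shared receive queue, tied to one completion vector."""
    comp_vector: int
    queues: list = field(default_factory=list)  # ids of queues sharing this SRQ


@dataclass
class Queue:
    """Toy stand-in for a QP/CQ pair."""
    qid: int
    comp_vector: int
    srq: Srq


def create_srqs(num_comp_vectors):
    # One SRQ per completion vector instead of one per device.
    return [Srq(comp_vector=v) for v in range(num_comp_vectors)]


def create_queue(qid, srqs):
    # Assumption: queues are spread over completion vectors round-robin by id.
    vector = qid % len(srqs)
    srq = srqs[vector]  # the SRQ that lives on the same completion vector
    srq.queues.append(qid)
    return Queue(qid=qid, comp_vector=vector, srq=srq)


srqs = create_srqs(4)
queues = [create_queue(qid, srqs) for qid in range(8)]
for srq in srqs:
    print(f"comp vector {srq.comp_vector}: queues {srq.queues}")
```

Each queue posts receives to, and completes on, the SRQ of its own vector, so queues on
different vectors never touch the same SRQ lock.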
My testing environment included 4 initiators (CX5, CX5, CX4, CX3) that were connected to 4
subsystems (1 ns per sub) through 2 ports (each initiator connected to a unique subsystem
backed by a different null_blk device), using a switch to the NVMEoF target (CX5).
I used RoCE link layer.
Configuration:
- Irqbalancer stopped on each server
- set_irq_affinity.sh on each interface
- 2 initiators run traffic through port 1
- 2 initiators run traffic through port 2
- On initiator set register_always=N
- Fio with 12 jobs, iodepth 128
Memory consumption calculation for recv buffers (target):
- Multiple SRQ: SRQ_size * comp_num * ib_devs_num * inline_buffer_size
- Single SRQ: SRQ_size * 1 * ib_devs_num * inline_buffer_size
- MQ: RQ_size * CPU_num * ctrl_num * inline_buffer_size
Cases:
1. Multiple SRQ with 1024 entries:
- Mem = 1024 * 24 * 2 * 4k = 192MiB (constant, does not depend on the number of initiators)
2. Multiple SRQ with 256 entries:
- Mem = 256 * 24 * 2 * 4k = 48MiB (constant, does not depend on the number of initiators)
3. MQ:
- Mem = 256 * 24 * 8 * 4k = 192MiB (Mem grows with every newly created ctrl)
4. Single SRQ (current SRQ implementation):
- Mem = 4096 * 1 * 2 * 4k = 32MiB (constant, does not depend on the number of initiators)
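The arithmetic behind the four cases can be reproduced with a short script. This is only a
sketch: the function and variable names are made up for illustration, and the inputs (24
completion vectors/CPUs, 2 IB devices, 8 controllers, 4 KiB inline buffers) are simply read
off the formulas and cases above.

```python
KIB = 1024
MIB = 1024 * KIB
INLINE_BUFFER_SIZE = 4 * KIB  # the "4k" factor used in every case


def srq_recv_mem(srq_size, comp_num, ib_devs_num, inline_buffer_size):
    """Recv buffer bytes with SRQs (comp_num is 1 for the single-SRQ case)."""
    return srq_size * comp_num * ib_devs_num * inline_buffer_size


def mq_recv_mem(rq_size, cpu_num, ctrl_num, inline_buffer_size):
    """Recv buffer bytes without SRQs: one RQ per CPU per controller."""
    return rq_size * cpu_num * ctrl_num * inline_buffer_size


cases = {
    "1. Multiple SRQ, 1024 entries": srq_recv_mem(1024, 24, 2, INLINE_BUFFER_SIZE),
    "2. Multiple SRQ, 256 entries": srq_recv_mem(256, 24, 2, INLINE_BUFFER_SIZE),
    "3. MQ": mq_recv_mem(256, 24, 8, INLINE_BUFFER_SIZE),
    "4. Single SRQ": srq_recv_mem(4096, 1, 2, INLINE_BUFFER_SIZE),
}

for name, nbytes in cases.items():
    print(f"{name}: {nbytes // MIB} MiB")
```

Only the MQ result changes when ctrl_num changes; the SRQ cases depend solely on the
target's own configuration.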
Results:
BS   1.read (target CPU)    2.read (target CPU)    3.read (target CPU)    4.read (target CPU)
---  ---------------------  ---------------------  ---------------------  ---------------------
1k   5.88M (80%)            5.45M (72%)            6.77M (91%)            2.2M (72%)
2k   3.56M (65%)            3.45M (59%)            3.72M (64%)            2.12M (59%)
4k   1.8M (33%)             1.87M (32%)            1.88M (32%)            1.59M (34%)
BS   1.write (target CPU)   2.write (target CPU)   3.write (target CPU)   4.write (target CPU)
---  ---------------------  ---------------------  ---------------------  ---------------------
1k   5.42M (63%)            5.14M (55%)            7.75M (82%)            2.14M (74%)
2k   4.15M (56%)            4.14M (51%)            4.16M (52%)            2.08M (73%)
4k   2.17M (28%)            2.17M (27%)            2.16M (28%)            1.62M (24%)
We can see the perf improvement between Case 2 and Case 4 (same order of resources).
We can see the benefit in resource consumption (mem and CPU), at the cost of a small perf
loss, between Cases 2 and 3.
There is still an open question about the perf difference for 1k between Case 1 and
Case 3, but I guess we can investigate and improve it incrementally.
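To put numbers on these comparisons, the ratios can be computed directly from the tables.
The snippet below is an illustrative sketch: the figures are the "M" values copied from the
results, and the dictionary layout and helper name are assumptions of this example.

```python
# Throughput figures from the results tables: {block size: (case1, case2, case3, case4)}
READ = {
    "1k": (5.88, 5.45, 6.77, 2.2),
    "2k": (3.56, 3.45, 3.72, 2.12),
    "4k": (1.8, 1.87, 1.88, 1.59),
}
WRITE = {
    "1k": (5.42, 5.14, 7.75, 2.14),
    "2k": (4.15, 4.14, 4.16, 2.08),
    "4k": (2.17, 2.17, 2.16, 1.62),
}


def ratio(table, bs, case_a, case_b):
    """Throughput of case_a relative to case_b (cases are numbered 1..4)."""
    row = table[bs]
    return row[case_a - 1] / row[case_b - 1]


for name, table in (("read", READ), ("write", WRITE)):
    for bs in table:
        print(
            f"{name} {bs}: "
            f"case2/case4 = {ratio(table, bs, 2, 4):.2f}x, "
            f"case2/case3 = {ratio(table, bs, 2, 3):.2f}x, "
            f"case1/case3 = {ratio(table, bs, 1, 3):.2f}x"
        )
```

The case1/case3 column at 1k is the gap referred to as the open question.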
Thanks to Idan Burstein and Oren Duer for suggesting this nice feature.
Changes from V1:
- Added SRQ pool per protection domain for IB/core
- Addressed a few comments from Christoph and Sagi
Max Gurtovoy (3):
  IB/core: add a simple SRQ pool per PD
  nvmet-rdma: use srq pointer in rdma_cmd
  nvmet-rdma: use SRQ per completion vector

 drivers/infiniband/core/Makefile   |   2 +-
 drivers/infiniband/core/srq_pool.c | 106 +++++++++++++++++++++
 drivers/infiniband/core/verbs.c    |   4 +
 drivers/nvme/target/rdma.c         | 190 +++++++++++++++++++++++++++----------
 include/rdma/ib_verbs.h            |   5 +
 include/rdma/srq_pool.h            |  46 +++++++++
 6 files changed, 301 insertions(+), 52 deletions(-)
 create mode 100644 drivers/infiniband/core/srq_pool.c
 create mode 100644 include/rdma/srq_pool.h
--
1.8.3.1