[PATCHv1] nvmet-rdma: Support 16K worth of inline data for write commands
Sagi Grimberg
sagi at grimberg.me
Wed Feb 8 01:58:59 PST 2017
> This patch adds support for 16K bytes of inline data for write commands.
>
> With a null target, below are the performance improvements achieved.
> Workload: random write, 70-30 mixed IOs
> null target: 250GB, 64 core CPU, single controller.
> Queue depth: 256 commands
>
>              cpu idle %        iops (K)          latency (usec)
>              (higher better)   (higher better)   (lower better)
>
> Inline size  16K     4K        16K     4K        16K     4K
> io_size      random write      random write      random write
> 512          78      79        2349    2343      5.45    5.45
> 1K           78      78        2438    2417      5.78    5.29
> 2K           78      78        2437    2387      5.78    5.35
> 4K           78      79        2332    2274      5.75    5.62
> 8K           78      87        1308    711       11      21.65
> 16K          79      90        680     538       22      28.64
> 32K          80      95        337     333       47      47.41
>
>              mix RW-30/70      mix RW-30/70      mix RW-30/70
> 512          78      78        2389    2349      5.43    5.45
> 1K           78      78        2250    2354      5.61    5.42
> 2K           79      78        2261    2294      5.62    5.60
> 4K           77      78        2180    2131      5.8     6.28
> 8K           78      79        1746    797       8.5     18.42
> 16K          78      86        943     628       15.90   23.76
> 32K          92      92        440     440       32      33.39
>
> This was tested with a modified Linux initiator that supports 16K
> worth of inline data. Applications with a typical 8K or 16K block
> size will benefit the most from this performance improvement.
>
> Additionally, when IOPS are throttled to 700K, the CPU utilization
> and latency numbers are the same for both inline sizes, confirming
> that the larger inline size does not consume any extra CPU to serve
> the same number of IOPS.
>
>              cpu idle %        iops (K)          latency (usec)
>              (higher better)   (higher better)   (lower better)
>
> Inline size  16K     4K        16K     4K        16K     4K
> io_size      random write      random write      random write
> 4K           93      93        700     700       5.75    5.62
> 8K           86      87        700     700       11      21.65
> 16K          83      88        680     538       22      28.64
> 32K          94      94        337     333       47      47.41
Parav,
I think the value is evident here; however, I share Christoph's
concern about memory usage. Moreover, I think we should avoid
higher-order allocations and be more friendly to slub/slab.
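
For a rough sense of the footprint behind that concern, here is a
back-of-the-envelope sketch (not from the patch). The queue depth and
queue count are taken from the benchmark setup quoted above and are
only illustrative; it assumes each queued command carries its own
physically contiguous inline receive buffer:

/*
 * Illustrative only: per-queue inline buffer footprint and the page
 * allocation order needed for a contiguous buffer of each size.
 */
#include <stdio.h>

static unsigned int alloc_order(unsigned long size, unsigned long page_size)
{
	unsigned int order = 0;
	unsigned long span = page_size;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

int main(void)
{
	const unsigned long page_size = 4096;
	const unsigned long queue_depth = 256;	/* from the posted test */
	const unsigned long nr_queues = 64;	/* e.g. one queue per core */
	const unsigned long inline_sizes[] = { 4096, 16384 };

	for (int i = 0; i < 2; i++) {
		unsigned long sz = inline_sizes[i];
		unsigned long per_queue = sz * queue_depth;

		printf("inline %5luB: order-%u pages, %lu KB per queue, %lu MB per controller\n",
		       sz, alloc_order(sz, page_size),
		       per_queue / 1024,
		       per_queue * nr_queues / (1024 * 1024));
	}
	return 0;
}

On 4K pages a 16K inline buffer is an order-2 allocation, and a
256-deep queue goes from roughly 1MB to 4MB of pinned receive
buffers, multiplied by the number of queues.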
I think this can impact the scalability of the target; however,
using a per-core SRQ where we can would ease that limitation. I have
code that makes nvme-rdma use an SRQ per core, but I was kind of
hoping we could get a generic interface for it so other ULPs can
enjoy it as well. I thought about some hook into the CQ pool API but
didn't follow up on it.
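
Purely as a hypothetical sketch of the kind of generic interface
meant here (the ib_srq_pool_* names are made up for illustration;
only ib_create_srq()/ib_destroy_srq() are existing verbs), a
per-core SRQ pool in the RDMA core could look roughly like:

#include <linux/slab.h>
#include <rdma/ib_verbs.h>

struct ib_srq_pool {
	struct ib_srq **srqs;		/* one SRQ per possible CPU */
	unsigned int nr_srqs;
};

static struct ib_srq_pool *ib_srq_pool_create(struct ib_pd *pd,
					      u32 max_wr, u32 max_sge)
{
	struct ib_srq_init_attr init_attr = {
		.attr = { .max_wr = max_wr, .max_sge = max_sge },
		.srq_type = IB_SRQT_BASIC,
	};
	struct ib_srq_pool *pool;
	unsigned int i;

	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
	if (!pool)
		return ERR_PTR(-ENOMEM);

	pool->nr_srqs = num_possible_cpus();
	pool->srqs = kcalloc(pool->nr_srqs, sizeof(*pool->srqs), GFP_KERNEL);
	if (!pool->srqs)
		goto out_free_pool;

	for (i = 0; i < pool->nr_srqs; i++) {
		pool->srqs[i] = ib_create_srq(pd, &init_attr);
		if (IS_ERR(pool->srqs[i]))
			goto out_destroy;
	}
	return pool;

out_destroy:
	/* error propagation simplified for the sketch */
	while (i--)
		ib_destroy_srq(pool->srqs[i]);
	kfree(pool->srqs);
out_free_pool:
	kfree(pool);
	return ERR_PTR(-ENOMEM);
}

/* a queue created on @cpu would share the SRQ assigned to that core */
static struct ib_srq *ib_srq_pool_get(struct ib_srq_pool *pool, int cpu)
{
	return pool->srqs[cpu % pool->nr_srqs];
}

Each queue created on a given CPU would then post its receives to
that core's SRQ instead of carrying a private receive queue, which
keeps the inline buffer footprint proportional to the number of
cores rather than the number of queues.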