NVMe array support

Kevin M. Hildebrand kevin at umd.edu
Mon Nov 27 05:41:43 PST 2017


I have indeed looked at the IRQ affinity.  At the moment, without
doing anything special, and even with irqbalance running, it appears
that IRQs are well spread across all of the CPU cores.
I've checked on both of my test boxes, one running kernel 4.14.1 and
the other running 3.10.0-693.5.2 (both Red Hat 7.4 systems).
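
In case it's useful, this is roughly what I've been looking at to
confirm the spread (the exact interrupt names vary by kernel and
driver):

  # Per-CPU interrupt counts for the NVMe queue interrupts
  grep nvme /proc/interrupts

  # Which CPUs each NVMe interrupt is allowed to fire on
  for irq in $(awk -F: '/nvme/ {print $1}' /proc/interrupts); do
      echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/smp_affinity_list)"
  done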

As I originally mentioned, I am able to get good performance with
multiple fio jobs running in direct mode, but that's only useful for
benchmarks.  I'm looking to hear from others who are getting good
real-world performance out of their arrays in buffered mode.
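
For reference, the kind of direct-mode run that does perform well for
me looks roughly like this (device name, block size, job count, and
CPU list are just examples):

  # Destructive: writes directly to the array device
  fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=128k \
      --ioengine=libaio --iodepth=32 --direct=1 \
      --numjobs=8 --cpus_allowed=0-7 --cpus_allowed_policy=split \
      --runtime=60 --time_based --group_reporting

Dropping --direct=1 from the same command is the buffered case where
throughput collapses for me.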

Do you have a filesystem on your arrays (if so, which one?), and are
you able to get anywhere close to your measured performance when using
other applications?
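
For context, the buffered, through-a-filesystem pattern I'm trying to
make fast looks something like this (mount point, sizes, and job count
are placeholders):

  # Filesystem (ext4 here) mounted on the array at /mnt/md0
  fio --name=buffered --directory=/mnt/md0 --rw=write --bs=1m \
      --size=16g --numjobs=8 --nrfiles=4 \
      --ioengine=psync --direct=0 --end_fsync=1 \
      --group_reporting

Runs like this are where I'm still stuck at roughly single-drive
throughput.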

Thanks!
Kevin



On Wed, Nov 22, 2017 at 10:08 PM, Joshua Mora <joshua_mora at usa.net> wrote:
> Did you play with IRQ affinity on the NVMEs?
> By default they may all go to a single core.
> You have to spread them across several cores.
>
> I get 13.6GB/s read with 4 NVMEs.
> I get 52GB/s read with 16 NVMEs.
>
> I get 9M IOPS with 16 NVMEs using kernel mode.
>
> I am assuming you are running multiple jobs, not a single one, and
> using cpus_allowed to pin each fio job to a different core.
>
> All the tests I do are direct (O_DIRECT), not buffered through memory.
>
> Joshua
>
>
> ------ Original Message ------
> Received: 02:09 PM CST, 11/22/2017
> From: "Kevin M. Hildebrand" <kevin at umd.edu>
> To: Joshua Mora <joshua_mora at usa.net>
> Cc: linux-nvme at lists.infradead.org
> Subject: Re: NVMe array support
>
>
> You're using Linux MD RAID? Have you been able to get good
> performance with something other than "fio -direct"?
>
> I have a RAID 0 array with eight member drives (see below for details).
>
> Running fio on an individual drive in direct mode gives me okay
> performance for that drive, around 1.8-1.9GB/s sequential write.
> Running fio on an individual drive in buffered mode gives me wildly
> variable performance according to fio, but iostat shows similar write
> rates to the drive, around 1.8-1.9GB/s.
>
> Running fio to the array in direct mode gives me performance for the
> array at around 12GB/s, which is reasonable, and approximately what
> I'd expect.
> Running fio to the array in buffered mode also gives varied
> performance according to fio, but iostat shows write rates to the
> array at around 2GB/s, barely better than a single drive.
>
> If I put a filesystem (ext4, for example, though I've also tried
> others...) on top of the array and run fio with multiple files and
> multiple threads, I get slightly better performance in buffered mode,
> but nowhere near the 12-14GB/s I'm looking for. Playing with CPU
> affinity helps a little too, but still nowhere near what I need.
>
> Running fdt or gridftp or other actual applications, I am able to get
> no better than around 2GB/s, which again is about the speed of a
> single drive.
>
> Thanks,
> Kevin
>
> # mdadm --detail /dev/md0
> /dev/md0:
> Version : 1.2
> Creation Time : Wed Nov 22 14:57:35 2017
> Raid Level : raid0
> Array Size : 12501458944 (11922.32 GiB 12801.49 GB)
> Raid Devices : 8
> Total Devices : 8
> Persistence : Superblock is persistent
>
> Update Time : Wed Nov 22 14:57:35 2017
> State : clean
> Active Devices : 8
> Working Devices : 8
> Failed Devices : 0
> Spare Devices : 0
>
> Chunk Size : 512K
>
> Consistency Policy : none
>
> Name : XXX
> UUID : 2a2234a4:78d2bbb2:9e1b3031:022b3315
> Events : 0
>
> Number   Major   Minor   RaidDevice   State
>      0     259       2            0   active sync   /dev/nvme0n1
>      1     259       7            1   active sync   /dev/nvme1n1
>      2     259       5            2   active sync   /dev/nvme2n1
>      3     259       1            3   active sync   /dev/nvme3n1
>      4     259       4            4   active sync   /dev/nvme4n1
>      5     259       3            5   active sync   /dev/nvme5n1
>      6     259       0            6   active sync   /dev/nvme6n1
>      7     259       6            7   active sync   /dev/nvme7n1
>
>
>
> On Wed, Nov 22, 2017 at 2:20 PM, Joshua Mora <joshua_mora at usa.net> wrote:
>> Hi Kevin.
>> I did, they are great.
>> I get to max them out for both reads and writes.
>> I have used the 1.6TB models (so ~3.4GB/s seq reads and ~2.2GB/s seq
>> writes with a 128k record length). You don't need a large iodepth.
>> I tested, for instance, RAID 10 with 4 drives, and tested surprise
>> removal while I was doing writes.
>> I'm using an AMD EPYC based platform, leveraging the many PCIe lanes
>> it has.
>> You want to use 1 core for every 2 NVMEs to max them out at large
>> record lengths.
>> You will need more cores for a 4k record length.
>>
>> Joshua
>>
>>
>> ------ Original Message ------
>> Received: 11:57 AM CST, 11/22/2017
>> From: "Kevin M. Hildebrand" <kevin at umd.edu>
>> To: linux-nvme at lists.infradead.org
>> Subject: NVMe array support
>>
>>
>> I've got eight Samsung PM1725a NVMe drives I'm trying to combine into
>> an array so I can aggregate the performance of multiple drives. My
>> initial experiments have yielded abysmal performance in most cases.
>> I've tried creating RAID 0 arrays with MD RAID, ZFS, and a few others,
>> and most of the time I'm getting somewhere around the performance of
>> a single drive, even though I've got more than one.
>> The only way I can get decent performance is when writing to the array
>> in direct mode (O_DIRECT). I've been using fio, fdt, and dd for
>> running tests. Has anyone successfully created software arrays of
>> NVMe drives and been able to get usable performance from them? The
>> drives are all in a Dell R940 server, which has 4 Skylake CPUs, and
>> all of the drives are connected to a single CPU, with full PCIe
>> bandwidth.
>>
>> Sorry if this isn't the right place to send this message; I'm having
>> a hard time finding anyone who's doing this.
>>
>> If anyone's doing this successfully, I'd love to hear more about your
>> configuration.
>>
>> Thanks!
>> Kevin
>>
>> --
>> Kevin Hildebrand
>> University of Maryland
>> Division of IT
>>
>> _______________________________________________
>> Linux-nvme mailing list
>> Linux-nvme at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>>
>>
>
>


