Higher block layer latency in kernel v4.8-rc6 vs. v4.4.16 for NVMe

Alana Alexander-Rutledge Alana.Alexander-Rutledge at microsemi.com
Tue Nov 8 17:43:55 PST 2016


Hi,

I have been profiling the performance of the NVMe and SAS IO stacks on Linux.  I used blktrace and blkparse to collect block-layer trace points, and a custom analysis script to average the latencies of each trace-point interval across IOs.

I started with Linux kernel v4.4.16 but then switched to v4.8-rc6.  One thing that stood out is that for measurements at queue depth = 1, the average Q2D latency was quite a bit higher in the NVMe path with the newer version of the kernel.

The Q, G, I, and D below refer to blktrace/blkparse trace points (queued, get request, inserted, and issued).
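For illustration, a minimal sketch of the kind of analysis script described above: it walks blkparse text output, records the timestamp of each Q/G/I/D event per starting sector, and averages the per-IO intervals in microseconds.  The field positions are assumed from the default blkparse output format (dev, cpu, seq, timestamp, pid, action, RWBS, sector); this is not the actual script used for the measurements.

```python
from collections import defaultdict

def average_intervals(lines):
    """Average Q2G, G2I, I2D, and Q2D (in us) from blkparse text output."""
    events = defaultdict(dict)            # sector -> {action: timestamp}
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[5] not in ("Q", "G", "I", "D"):
            continue
        ts, action, sector = float(fields[3]), fields[5], fields[7]
        events[sector][action] = ts
        if action == "D":                 # issue closes out this IO's chain
            t = events.pop(sector)
            for start, end, name in (("Q", "G", "Q2G"), ("G", "I", "G2I"),
                                     ("I", "D", "I2D"), ("Q", "D", "Q2D")):
                if start in t and end in t:
                    sums[name] += (t[end] - t[start]) * 1e6   # s -> us
                    counts[name] += 1
    return {k: sums[k] / counts[k] for k in sums}
```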

Queue Depth = 1
Interval   Average - v4.4.16 (us)   Average - v4.8-rc6 (us)
Q2G        0.212                    0.573
G2I        0.944                    1.507
I2D        0.435                    0.837
Q2D        1.592                    2.917

For other queue depths, Q2D was similar for both versions of the kernel.

Queue Depth   Average Q2D - v4.4.16 (us)   Average Q2D - v4.8-rc6 (us)
2             1.893                        1.736
4             1.289                        1.38
8             1.223                        1.162
16            1.14                         1.178
32            1.007                        1.425
64            0.964                        0.978
128           0.915                        0.941

I did not see this problem with the 12G SAS SSD that I measured.

Queue Depth = 1
Interval   Average - v4.4.16 (us)   Average - v4.8-rc6 (us)
Q2G        0.264                    0.301
G2I        0.917                    0.864
I2D        0.432                    0.397
Q2D        1.613                    1.561

Is this a known change, or do you know what might be causing it?

My workloads were 4KB random reads, 4KB aligned, generated with fio/libaio.  I am running IOs against a 4G file on an ext4 file system.  The above measurements are averaged over 1 million IOs.
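A fio job file along these lines would reproduce the described workload; the job name, filename, and direct=1 setting are assumptions, and iodepth was varied from 1 to 128 across runs.

```ini
[randread-qd1]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=1
size=4g
filename=/mnt/ext4/testfile
```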

I am using Ubuntu 16.04.1.

I am running on a Supermicro server with an Intel Xeon CPU E5-2690 v3 @ 2.6 GHz, 12 cores.  Hyperthreading is enabled and SpeedStep is disabled.

My NVMe drive is an Intel SSD P3700 Series, 400 GB.

Thanks,

Alana
