[PATCH 0/5] nvmet_tcp: Introduce poll groups and polling optimizations

Wunderlich, Mark mark.wunderlich at intel.com
Thu Aug 27 21:00:49 EDT 2020


nvmet_tcp: Introduce poll groups and polling optimizations

Currently nvmet-tcp spreads the incoming queues across CPUs in a
round-robin fashion, independent of the underlying network details.
While this helps to spread the processing load across available CPU cores,
it is not optimized for latency with respect to NIC affinity, nor for the
polling optimizations this patch set aims to introduce.

This patch series introduces the 'poll group' concept, where a single
kworker processes multiple connections associated with a group.
Each group is aligned with the CPU core that matches the
sk_incoming_cpu value of an established network connection.
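
As a rough sketch of the idea (the helper name is assumed, not the exact
patch code), the CPU used to process a queue becomes a property of its
socket:

static int nvmet_tcp_queue_cpu(struct nvmet_tcp_queue *queue)
{
	/*
	 * sk_incoming_cpu records the CPU on which the network stack
	 * received this connection, i.e. the NIC's steering decision.
	 */
	return READ_ONCE(queue->sock->sk->sk_incoming_cpu);
}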

The motivation here is to let NIC steering methods drive the load
spreading while we maintain a matching threading affinity.  In addition,
having our I/O context work on a group of connections allows us to
minimize the context switching involved with having a single connection
per work element.

The new group I/O context will still be bound by a quota, but one that
is sized to take multiple connections into account and is configurable
by the user (via a module parameter).  At a later stage, we will make
the polling quota adaptive based on load statistics that will be added
to the driver itself.
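
A minimal sketch of what such a module parameter could look like (the
default shown here is only illustrative; the measurements below set it
to 50000 usec):

static int io_work_poll_budget = 50000;	/* usec, illustrative default */
module_param(io_work_poll_budget, int, 0644);
MODULE_PARM_DESC(io_work_poll_budget,
		 "io_work() polling budget per poll group (usec)");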

This model improves both IOPS and latency, as shown in the
measurements below.

Patch 1: Uses the socket's sk_incoming_cpu value as the queue CPU for processing.

Patch 2: Defines and allocates grouping structures.

Patch 3: Implements how connections are associated with groups.

Patch 4: Adds how connections are added to and removed from the poll
group list for processing, and provides a module option to increase the
io_work() poll period (io_work_poll_budget).  A rough sketch of the
grouping structures and scheduling path follows this list.

Patch 5: Adds how connections are transitioned out of a group to
support disconnection.
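
The sketch below illustrates the grouping structures and the per-group
worker (Patches 2-4).  The structure layout, the field names (including
the queue's group_entry list member), and the nvmet_tcp_try_recv_one()
helper are assumptions made for illustration, not the actual patch code:

struct nvmet_tcp_poll_group {
	int			cpu;		/* matches sk_incoming_cpu */
	struct mutex		lock;
	struct list_head	queue_list;	/* connections in this group */
	struct work_struct	io_work;	/* one kworker per group */
};

static void nvmet_tcp_group_add_queue(struct nvmet_tcp_poll_group *group,
				      struct nvmet_tcp_queue *queue)
{
	mutex_lock(&group->lock);
	list_add_tail(&queue->group_entry, &group->queue_list);
	mutex_unlock(&group->lock);

	/* Kick the group's worker on its dedicated CPU. */
	queue_work_on(group->cpu, nvmet_tcp_wq, &group->io_work);
}

static void nvmet_tcp_group_io_work(struct work_struct *w)
{
	struct nvmet_tcp_poll_group *group =
		container_of(w, struct nvmet_tcp_poll_group, io_work);
	unsigned long deadline =
		jiffies + usecs_to_jiffies(io_work_poll_budget);
	struct nvmet_tcp_queue *queue;
	bool pending;

	do {
		pending = false;
		mutex_lock(&group->lock);
		/*
		 * nvmet_tcp_try_recv_one(): assumed helper that processes
		 * one connection's pending work and returns true if more
		 * work remains.
		 */
		list_for_each_entry(queue, &group->queue_list, group_entry)
			pending |= nvmet_tcp_try_recv_one(queue);
		mutex_unlock(&group->lock);
	} while (pending && time_before(jiffies, deadline));

	/* Poll budget exhausted with work still pending: reschedule. */
	if (pending)
		queue_work_on(group->cpu, nvmet_tcp_wq, &group->io_work);
}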

This patch series was developed and tested on the nvme_5.9 branch.

Resiliency testing of controller resets and target disconnections was
performed while running more than 10 active connections at over 2M IOPS
for over an hour.  The target was not compromised during this testing.

Performance testing was performed against a single ramdisk, using a
random read pattern with a 4k block size and the FIO io_uring engine
with the hipri option.
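
For reference, the fio job was along these lines (a sketch reconstructed
from the description above; the device name, runtime, and exact batching
options are assumptions, with the queue depth/batch values shown
corresponding to the 32/8 case below):

[global]
ioengine=io_uring
hipri=1
rw=randread
bs=4k
direct=1
time_based=1
runtime=60
[qd32_batch8]
filename=/dev/nvme1n1
iodepth=32
iodepth_batch_submit=8
iodepth_batch_complete_max=8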

The new nvmet_tcp module parameter (io_work_poll_budget) was set to
50000 usec.

Single connection/group performance:

For the following data, the results are the min and max observed across
100 host fio threads, for each of the two queue depth/batch (QD/B)
combinations.  Each host thread's single connection is processed by a
single target core/group.

Baseline before applying the group series:

QD/B : IOPS (k) : Avg Lat (usec) : 99.99 Lat (usec)
1/1  (min) :  30.6 :  31.56 :  64.25
1/1  (max) :  42.1 :  22.63 :  25.47
32/8 (min) :   186 : 154.42 :    433
32/8 (max) :   277 : 100.05 :    188

Grouping patches applied on target, with a single connection active in
each target poll group:

QD/B : IOPS (k) : Avg Lat (usec) : 99.99 Lat (usec)
1/1  (min) :  43.5 :  21.93 :  29.82
1/1  (max) :  47.8 :  19.87 :  21.12
32/8 (min) :   212 : 126.19 :    265
32/8 (max) :   285 :  97.85 :    188

Group scaling performance:

To test group scaling, the number of online system cores was reduced to
four (two in each NUMA node).  A total of 24 connections were
established to ensure at least 5 connections per target poll group.

Scaling is measured by increasing the number of FIO jobs, one per target
poll group.  Each job issues 5 threads on different host cores to drive
I/O over 5 separate connections.  The numbers reported are an average
across 5 test runs.

Baseline:
In the baseline case, some up-front testing was required to first learn
which connections were processed by the same target CPU (forming a
logical group of connections).

All latencies are in usecs.  IOPS in K.

#groups : IOPS(k) : Avg.Lat. : 99.99 Lat. : 99.95 Lat.
Baseline:
1 : 659 : 223.02 : 1214 : 468
2 : 963 : 312.79 : 1582 : 657
3 : 1364 : 330.19 : 1539 : 763
4 : 1652 : 359.27 : 1280 : 827

With Grouping:
1 : 749 : 208.32 : 881 : 237
2 : 1277 : 243.57 : 922 : 285
3 : 1896 : 246.27 : 881 : 297
4 : 2337 : 266.20 : 857 : 338

Signed-off-by: Mark Wunderlich <mark.wunderlich at intel.com>


