[RFC PATCH 0/2] nvmet-tcp: introduce idle poll option

Wunderlich, Mark mark.wunderlich at intel.com
Mon Dec 14 20:11:08 EST 2020


This RFC series of 2 patches was created as an alternative to the previous series I submitted back on 8/27/2020 [nvmet_tcp: introduce poll groups and polling optimizations], of which only the first patch was accepted.  This series is greatly simplified compared to that earlier one: it does not explicitly create managed/tracked poll groups, and io_work() continues to process a single queue.  Instead, it relies on the previously accepted queue context assignment to so_incoming_cpu to keep a logical group of queues processed together on the same CPU core.

What this new patch pair provides is twofold.  First, it introduces a new nvmet-tcp module option, 'idle_poll_period', that lets the kworker for a given queue persistently re-queue itself until the queue is sensed as 'idle'.  The queue is considered idle when there have been no new requests and no pending backend device completions for the specified 'idle_poll_period' length of time.  Allowing this persistent polling behavior then makes it possible to introduce the capability contained in the second patch of the set: having the kworker directly poll a backend NVMe device for completions instead of waiting for the NVMe device to interrupt and push out completions.
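
To make the first patch's mechanism concrete, below is a minimal C sketch of the idle tracking described above.  It is illustrative only: the struct and helper names (nvmet_tcp_queue_sketch, sketch_should_requeue, poll_end) are placeholders and not the literal patch code; only the general scheme (arm a ktime deadline whenever io_work() does real work, keep re-queueing until that deadline passes) follows the description above.

/*
 * Minimal sketch only.  'nvmet_tcp_queue_sketch' and the sketch_*()
 * helpers are illustrative placeholders, not the literal patch code.
 */
#include <linux/ktime.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static int idle_poll_period_usecs;	/* 0 == original behavior, no idle polling */
module_param(idle_poll_period_usecs, int, 0644);

struct nvmet_tcp_queue_sketch {
	struct work_struct	io_work;
	ktime_t			poll_end;	/* queue considered idle after this time */
};

/* Push the idle deadline out whenever the queue did real work. */
static void sketch_arm_idle_deadline(struct nvmet_tcp_queue_sketch *queue)
{
	queue->poll_end = ktime_add_us(ktime_get(), idle_poll_period_usecs);
}

/*
 * End-of-io_work() decision: with idle polling enabled, keep re-queueing
 * on the same CPU even when this pass found no new requests or pending
 * completions, until the idle deadline expires.
 */
static bool sketch_should_requeue(struct nvmet_tcp_queue_sketch *queue,
				  bool did_work)
{
	if (did_work)
		sketch_arm_idle_deadline(queue);

	if (!idle_poll_period_usecs)
		return did_work;	/* option not set: original behavior */

	return did_work || ktime_before(ktime_get(), queue->poll_end);
}

With the option left at zero the helper reduces to the original "re-queue only if work was done" behavior, which is the intent behind the "no impact when not used" claim in the results below.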

The data below shows that this new polling option, together with removing NVMe device interrupts, can benefit performance.  It is understood that this prototype has some drawbacks, one specifically being that the nvmet-tcp module is allowed to directly call the newly introduced block layer poll-once function, which can be considered a layering violation.  But in its current state the patch series is at least a means to open discussion on the benefits of enabling such a polling model.  Is the concept sound, but better implemented in the nvme target core layer instead?  Should we perhaps move away from kworkers toward dedicated per-core service tasks for logical queue groups?
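
For discussion purposes, here is a hedged sketch of what the second patch's direct backend polling could look like.  The cookie bookkeeping (poll_q, poll_cookie) and the sketch_*() names are hypothetical; the only real kernel interfaces used are submit_bio(), bdev_get_queue() and the nvme-5.10-era blk_poll(), which stands in here for the newly introduced poll-once function mentioned above.

/*
 * Hedged sketch, not the literal patch.  blk_poll() is used here as a
 * stand-in for the poll-once helper referenced above.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

struct nvmet_tcp_backend_io_sketch {
	struct request_queue	*poll_q;	/* backend namespace request queue */
	blk_qc_t		poll_cookie;	/* cookie returned by submit_bio() */
};

/* Remember where a backend bio went so io_work() can poll for it later. */
static void sketch_submit_backend_bio(struct nvmet_tcp_backend_io_sketch *io,
				      struct block_device *bdev,
				      struct bio *bio)
{
	io->poll_q = bdev_get_queue(bdev);
	io->poll_cookie = submit_bio(bio);
}

/*
 * Called from io_work() while the queue is inside its idle-poll window:
 * reap completions directly from the backend device instead of waiting
 * for its interrupt.  spin=false keeps this a single, bounded poll pass
 * so io_work() can return to socket processing.
 */
static int sketch_poll_backend(struct nvmet_tcp_backend_io_sketch *io)
{
	if (!io->poll_q)
		return 0;

	return blk_poll(io->poll_q, io->poll_cookie, false);
}

Polling with spin=false from the same kworker keeps the existing single-queue io_work() model intact, in line with the simplification goal described at the top of this letter.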

The test data below was collected using the following config:
. FIO io_uring, using HIPRI option
. 4 FIO jobs, each directed at 3 NVMe device namespaces, each job with 4 threads pinned to isolated NUMA-node CPUs close to the NIC, for a total of 16 active thread connections.
. 4K, random read I/O pattern
. Queue Depth 32, Batch size 8
. Initiator NIC HW queues set to 52 (number of local NUMA cores), HW queue affinity set to local NUMA cores.
. Target NIC HW queues set to 8, HW queue affinity set to local NUMA cores.
. Initiator to Target connected using 1 default queue and 32 poll queues.
. When testing with 'idle_poll_period' enabled and polling of NVMe devices, the target NVMe module was loaded with the module option enabling 16 poll queues.
. Patches tested and developed on linux-nvme nvme-5.10 branch.
. All measured times in usecs unless otherwise noted.
. Fewer target HW queues than active connections, to measure the impact of virtual grouping of connection processing on shared target cores.

IOPS(k)   Avg Lat   99.99   99.90   99.00   Stdev   Context Switch   Tgt. CPU Util.

Baseline - patches not applied:
1910      252.92    685     578     502     98.45   ~215K            801.37

Patches applied on target - module option 'idle_poll_period' not set:
2152      223.31    685     611     424     92.11   ~120K            882.14

Patches applied on target - module option 'idle_poll_period' set to 750000 usecs (NVMe module not enabling poll queues):
2227      216.65    627     553     388     69.48   ~100-400         934.35

Patches applied on target - module option 'idle_poll_period' set to 750000 usecs (NVMe module loaded enabling 16 poll queues):
2779      168.60    545     474     375     53.24   ~100-400         802

This data shows that adding the new idle poll option does not impact performance when not used.  When used with the addition of direct NVMe device polling for completions, there is a nice performance benefit.
It was noted during testing that the logical grouping of queues to cores on the target can vary as a result of RSS, with or without this patch series, which in turn varies the performance achieved.  When using the patch set with direct NVMe device polling, however, the logical queue grouping was more consistent and performance varied less.

The patches included in this series:
1/2: nvmet-tcp: enable io_work() idle period tracking
2/2: nvmet_tcp: enable polling nvme for completions

Cheers --- Mark
