[RFC PATCH 0/2] nvmet-tcp: introduce idle poll option
Sagi Grimberg
sagi at grimberg.me
Wed Jan 13 21:22:30 EST 2021
Hey Mark,
On 12/14/20 5:11 PM, Wunderlich, Mark wrote:
> This second RFC patch series was created as an alternative to the series I submitted back on 8/27/2020 [nvmet_tcp: introduce poll groups and polling optimizations], of which only the first patch was accepted. This series is greatly simplified compared to that one: it does not explicitly create managed/tracked poll groups, and io_work() still processes a single queue. Instead it relies on the accepted change that assigns each queue's work to the CPU indicated by so_incoming_cpu, which logically keeps a group of queues being processed on the same CPU core.
>
> What this new patch pair provides is twofold. First, it introduces a new nvmet-tcp module option, 'idle_poll_period', which causes the kworker for a given queue to be persistently re-queued until the queue is sensed as 'idle'. Idle means there have been no new requests and no pending backend device completions for the specified 'idle_poll_period' length of time. This persistent polling behavior then makes it possible to introduce the capability contained in the second patch of the set: having the kworker directly poll a backend NVMe device for completions instead of waiting for the NVMe device to interrupt and push out completions.
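Just to restate the mechanism as I read it (the names below are illustrative, they are not taken from the actual patches), io_work() ends up doing roughly:

static void nvmet_tcp_io_work(struct work_struct *w)
{
        struct nvmet_tcp_queue *queue =
                container_of(w, struct nvmet_tcp_queue, io_work);
        bool pending = false;

        /* usual socket recv/send processing, sets pending if it
         * made any forward progress (illustrative helper name) */
        pending |= nvmet_tcp_process_socket(queue);

        /* patch 2: reap backend completions directly instead of
         * waiting for the device interrupt (illustrative names) */
        if (queue->nr_outstanding_bdev_cmds)
                pending |= nvmet_tcp_poll_backend(queue);

        if (pending)
                queue->idle_deadline = jiffies +
                        usecs_to_jiffies(idle_poll_period);

        /* keep re-arming until we have been idle for a full
         * idle_poll_period */
        if (pending || time_before(jiffies, queue->idle_deadline))
                queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
                              &queue->io_work);
}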
>
> The data below shows that this new polling option, together with the removal of NVMe device interrupts, can benefit performance. It is understood that this prototype has some drawbacks, one specifically being that the nvmet-tcp module directly calls the newly introduced block layer poll-once function, which can be considered a layering violation. In its current state the series is at least a means to open discussion on the benefits of enabling such a polling model. Is it a sound concept, but one best implemented in the nvme target core layer instead? Maybe by moving away from kworkers toward more dedicated per-core service tasks for logical queue groups?
>
> The following test data was collected using the following config:
> . FIO io_uring, using HIPRI option
> . 4 FIO jobs, each directed at 3 NVMe device namespaces, each job with 4 threads pinned to isolated CPUs on the NUMA node local to the NIC, for a total of 16 active thread connections.
> . 4K, random read I/O pattern
> . Queue Depth 32, Batch size 8
> . Initiator NIC HW queues set to 52 (number of local NUMA cores), HW queue affinity set to local NUMA cores.
> . Target NIC HW queues set to 8, HW queue affinity set to local NUMA cores.
> . Initiator to Target connected using 1 default, 32 poll queues.
> . When testing with 'idle_poll_period' enabled, to test polling of the NVMe devices, the target NVMe module was loaded with the module option enabling 16 poll queues.
> . Patches tested and developed on linux-nvme nvme-5.10 branch.
> . All measured times in usecs unless otherwise noted.
> . Fewer target HW queues than active connections, to measure the impact of virtual grouping of connection processing on shared target cores.
>
>              IOPS(k)  Avg Lat  99.99  99.90  99.00  Stdev  Ctx Switch  Tgt CPU Util
> Baseline - patches not applied:
>                 1910   252.92    685    578    502  98.45       ~215K        801.37
> Patches applied on target - 'idle_poll_period' not set:
>                 2152   223.31    685    611    424  92.11       ~120K        882.14
> Patches applied on target - 'idle_poll_period' = 750000 usecs, NVMe module without poll queues:
>                 2227   216.65    627    553    388  69.48    ~100-400        934.35
> Patches applied on target - 'idle_poll_period' = 750000 usecs, NVMe module with 16 poll queues:
>                 2779   168.60    545    474    375  53.24    ~100-400           802
>
> This data shows that adding the new idle poll option does not impact performance when it is not used. When it is used, together with direct NVMe device polling for completions, there is a nice performance benefit.
> It was noted during testing that the logical grouping of queues to cores on the target can vary as a result of RSS, with or without this patch series, which can vary the performance achieved. When using the patch set with direct NVMe device polling, however, the logical queue grouping was more consistent and performance varied less.
So it does seem that polling the backend devices has merit, and the
command tracking indicator for whether we should poll the backend devices
is an interesting idea.
However, this is something that should really be handled by the nvmet
core. We could potentially add some counters to nvmet_sq and/or
nvmet_cq.
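Something along these lines perhaps (strawman, field and helper names are
made up):

struct nvmet_sq {
        /* ... existing fields ... */

        /* commands handed to the backend and not yet completed;
         * incremented at submission, decremented at completion */
        atomic_t        nr_outstanding;
};

/* lets a transport ask "is there anything worth polling for?"
 * without knowing anything about the backend implementation */
static inline bool nvmet_sq_has_pending(struct nvmet_sq *sq)
{
        return atomic_read(&sq->nr_outstanding) > 0;
}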
My difficulty is understanding whether we can move the I/O threads from
the transport drivers to the core. Right now the I/O contexts are provided
by either the transports or the backend devices. IFF we want to explore
this (and as I said, you clearly showed it has merit), we would need to
look into moving the I/O context into the core and building a
transport_poll/transport_wait sort of interface for it.
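Strawman of what that could look like (names invented, just to sketch the
shape of it):

struct nvmet_fabrics_ops {
        /* ... existing ops ... */

        /* process transport work without sleeping, return true if
         * any forward progress was made */
        bool (*poll)(struct nvmet_sq *sq);

        /* re-arm notifications (data_ready, CQ interrupts, ...) and
         * go back to being event driven */
        void (*wait)(struct nvmet_sq *sq);
};

/* the core-owned I/O context then boils down to: */
static void nvmet_poll_loop(struct nvmet_sq *sq)
{
        const struct nvmet_fabrics_ops *ops = sq->ctrl->ops;
        unsigned long deadline = jiffies +
                        usecs_to_jiffies(idle_poll_period);

        do {
                bool progress = ops->poll(sq);

                /* backend poll hook, hypothetical */
                if (nvmet_sq_has_pending(sq))
                        progress |= nvmet_bdev_poll(sq);

                if (progress)
                        deadline = jiffies +
                                usecs_to_jiffies(idle_poll_period);
                cond_resched();
        } while (time_before(jiffies, deadline));

        ops->wait(sq);
}

For tcp the poll op would be more or less today's io_work body, and wait
would re-arm the socket callbacks.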
This can be done for tcp/rdma and loop for sure, but I'm not sure
what we can do for FC (maybe it is possible, I need to look further).
Adding Chaitanya, who looked into device polling in the past, to get
his feedback on whether this is something interesting.