[RFC PATCH 0/2] nvmet-tcp: introduce idle poll option

Wunderlich, Mark mark.wunderlich at intel.com
Mon Jan 25 12:50:24 EST 2021


>Hey Mark,

>On 12/14/20 5:11 PM, Wunderlich, Mark wrote:
>> This RFC 2-patch series was created as an alternative to the previous series I submitted back on 8/27/2020 [nvmet_tcp: introduce poll groups and polling optimizations], of which only the first patch was accepted.  This series is greatly simplified compared to that earlier one: it does not explicitly create managed/tracked poll groups, and the io_work() function still processes a single queue.  Instead, it relies on the already accepted queue context assignment via so_incoming_cpu to keep a logical group of queues being processed on the same CPU core.
>> 
>> What this new patch pair provides is twofold.  First, it introduces a new nvmet-tcp module option, 'idle_poll_period', that allows the kworker for a given queue to persistently re-queue itself until the queue is sensed as being 'idle'.  Idle here means that there have been no new requests and no pending backend device completions for the specified 'idle_poll_period' length of time.  Allowing this kind of persistent poll behavior then makes it possible to introduce the capability contained in the second patch of the set: the ability to have the kworker directly poll a backend NVMe device for completions instead of waiting for the NVMe device to interrupt and push out completions.
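>>
>> Roughly, the first patch changes the tail end of io_work() along the lines of the sketch below (simplified to illustrate the idea, not the actual diff; 'poll_end' is just an illustrative field name):
>>
>> static void nvmet_tcp_io_work(struct work_struct *w)
>> {
>> 	struct nvmet_tcp_queue *queue =
>> 		container_of(w, struct nvmet_tcp_queue, io_work);
>> 	bool pending = false;
>>
>> 	/*
>> 	 * ... the existing bounded recv/send processing loop runs here
>> 	 * and sets 'pending' if it made forward progress ...
>> 	 */
>>
>> 	if (idle_poll_period) {
>> 		if (pending)
>> 			/* activity seen: push the idle deadline out */
>> 			queue->poll_end = jiffies +
>> 				usecs_to_jiffies(idle_poll_period);
>> 		else if (time_before(jiffies, queue->poll_end))
>> 			/* quiet, but not idle for long enough yet */
>> 			pending = true;
>> 	}
>>
>> 	/* requeue on the CPU chosen from so_incoming_cpu */
>> 	if (pending)
>> 		queue_work_on(queue_cpu(queue), nvmet_tcp_wq,
>> 			      &queue->io_work);
>> }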
>> 
>> The data below shows that this new polling option, and the removal of NVMe device interrupts, can benefit performance.  It is understood that this prototype has some drawbacks, one specifically being that the nvmet-tcp module is allowed to directly call the newly introduced block layer 'poll once' function, which can be considered a layering violation.  But in its current state the patch series is at least a means to open discussion on the benefits of enabling such a polling model.  Is it a sound concept, but one best implemented in the nvme target core layer instead?  Maybe by moving away from kworkers and toward more dedicated per-core service tasks for logical queue groups?
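>>
>> For reference, the second patch's direct backend poll is conceptually similar to the sketch below (again illustrative only: the actual series adds a dedicated block layer "poll once" helper rather than calling blk_poll() from the transport as shown here, and 'poll_cookie' is a made-up field holding the blk_qc_t returned by submit_bio() in the bdev backend):
>>
>> static void nvmet_tcp_poll_backend(struct nvmet_tcp_cmd *cmd)
>> {
>> 	struct nvmet_req *req = &cmd->req;
>> 	struct request_queue *q;
>>
>> 	if (!req->ns || !req->ns->bdev)
>> 		return;
>>
>> 	q = bdev_get_queue(req->ns->bdev);
>> 	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>> 		return;
>>
>> 	/* one non-blocking pass over the device poll queue */
>> 	blk_poll(q, req->poll_cookie, false);
>> }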
>> 
>> The following test data was collected using the following config:
>> . FIO io_uring, using HIPRI option
>> . 4 FIO jobs, each directed at 3 NVMe device namespaces, each with 4 threads pinned to isolated NUMA node CPUs close to the NIC, for a total of 16 active thread connections.
>> . 4K, random read I/O pattern
>> . Queue Depth 32, Batch size 8
>> . Initiator NIC HW queues set to 52 (number of local NUMA cores), HW queue affinity set to local NUMA cores.
>> . Target NIC HW queues set to 8, HW queue affinity set to local NUMA cores.
>> . Initiator to Target connected using 1 default, 32 poll queues.
>> . When testing with 'idle_poll_period' enabled and polling the NVMe devices, the target NVMe module was loaded with the module option enabling 16 poll queues.
>> . Patches tested and developed on linux-nvme nvme-5.10 branch.
>> . All measured times in usecs unless otherwise noted.
>> . Fewer target HW queues than active connections were used to measure the impact of virtual grouping of connection processing on shared target cores.
>> 
>> IOPS(k)   Avg Lat   99.99   99.90   99.00   Stdev   Context Switch   Tgt. CPU Util.
>>
>> Baseline - patches not applied:
>> 1910      252.92    685     578     502     98.45   ~215K            801.37
>>
>> Patches applied on target - module option 'idle_poll_period' not set:
>> 2152      223.31    685     611     424     92.11   ~120K            882.14
>>
>> Patches applied on target - module option 'idle_poll_period' set to 750000 usecs (NVMe module not enabling poll queues):
>> 2227      216.65    627     553     388     69.48   ~100-400         934.35
>>
>> Patches applied on target - module option 'idle_poll_period' set to 750000 usecs, NVMe module loaded enabling 16 poll queues:
>> 2779      168.60    545     474     375     53.24   ~100-400         802
>> 
>> This data shows that adding the new idle poll option does not impact performance when not used.  When used with the addition of direct NVMe device polling for completions, there is a nice performance benefit.
>> It was noted during testing that the logical grouping of queues to cores on the target can vary as a result of RSS, with or without this patch series, which can vary the performance achieved.  But when using the patch set with direct NVMe device polling, the logical queue grouping was more consistent and performance varied less.

>So it does seem that polling the backend devices has merit, and the command tracking indicator of whether we should poll the backend devices is an interesting idea.

>However, this is something that should really be handled by the nvmet core.  We could potentially add some counters to nvmet_sq and/or nvmet_cq.
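
If I follow the counter idea, it would be something roughly like the below (purely illustrative on my part; the field and helper names are made up and not in the current code):

struct nvmet_sq {
	/* ... existing fields ... */
	atomic_t	inflight;	/* backend submissions not yet completed */
};

static inline void nvmet_sq_io_submitted(struct nvmet_sq *sq)
{
	atomic_inc(&sq->inflight);
}

static inline void nvmet_sq_io_completed(struct nvmet_sq *sq)
{
	atomic_dec(&sq->inflight);
}

/* the transport (or core) would poll the backend only while this is true */
static inline bool nvmet_sq_should_poll(struct nvmet_sq *sq)
{
	return atomic_read(&sq->inflight) > 0;
}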

>My difficulty is to understand if we can move the I/O threads from the transport drivers to the core.  Right now the I/O contexts are provided by either the transports or the backend devices.  IFF we want to explore this (and as I said, you clearly showed it has merit), we would need to look at moving the I/O context into the core and building a transport_poll/transport_wait sort of interface for this.
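
Something like the below is how I picture such an interface (again just a rough sketch; the 'poll'/'wait' ops and the core-owned I/O loop are hypothetical, not existing code):

struct nvmet_fabrics_ops {
	/* ... existing ops (add_port, queue_response, delete_ctrl, ...) ... */

	/* non-blocking: make forward progress on a queue, return true if still busy */
	bool (*poll)(struct nvmet_sq *sq);

	/* blocking: sleep until the queue has work to do again */
	int (*wait)(struct nvmet_sq *sq);
};

/* a core-owned I/O context built on top of such an interface */
static int nvmet_io_thread(void *data)
{
	struct nvmet_sq *sq = data;
	const struct nvmet_fabrics_ops *ops = sq->ctrl->ops;

	while (!kthread_should_stop()) {
		if (!ops->poll(sq))
			ops->wait(sq);
	}
	return 0;
}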

I can understand the architectural direction that such a polling thread is better performed within the core; my initial goal was deliberately limited to demonstrating the benefit within the TCP implementation itself, with as little change as possible in other areas.  I hope a move to the core could remain an optional feature that other transports would not be expected, or forced, to adopt.  Would the best approach in moving to a core-centric model be to first introduce the core-specific thread context option, and then follow that with the changes that allow non-blocking device polling (and whatever request tracking may still be required) from such thread contexts?

This RFC would align with such a phased approach, given that its first part is targeted at improving the io_work() thread to better service transport interfaces that benefit from a more active polled model, and its second part is focused on enabling non-blocking polled access to NVMe device poll queues.

>This can be done for tcp/rdma and loop for sure, but I'm not sure what we can do for FC (but maybe this is possible, need to look further).

>Adding Chaitanya who looked into device polling in the past to get his feedback if this is something interesting.

Ah, it would be great to hear your thoughts on this, Chaitanya, given your experience.

