[PATCH v3 0/9] Introduce per-device completion queue pools
Sagi Grimberg
sagi at grimberg.me
Mon Nov 20 04:08:20 PST 2017
> Recall that NFS is limited to a single QP per client-server
> pair.
>
> ib_alloc_cq(compvec) determines which CPU will handle Receive
> completions for a QP. Let's call this CPU R.
>
> I assume any CPU can initiate an RPC Call. For example, let's
> say an application is running on CPU C != R.
>
> The Receive completion occurs on CPU R. Suppose the Receive
> matches to an incoming RPC that had no registered MRs. The
> Receive completion can invoke xprt_complete_rqst in the
> Receive completion handler to complete the RPC on CPU R
> without another context switch.
>
> The problem is that the RPC completes on CPU R because the
> RPC stack uses a BOUND workqueue, and so does NFS. Thus at
> least the RPC and NFS completion processing are competing
> for CPU R, instead of being handled on other CPUs, and
> maybe the requesting application is also likely to migrate
> onto CPU R.
>
> I observed this behavior experimentally.
>
> Today, the xprtrdma Receive completion handler processes
> simple RPCs (ie, RPCs with no MRs) immediately, but finishes
> completion processing for RPCs with MRs by re-scheduling
> them on an UNBOUND secondary workqueue.
>
> I thought it would save me a context switch if the Receive
> completion handler dealt with an RPC with only one MR that
> had been remotely invalidated as a simple RPC, and allowed
> it to complete immediately (all it needs to do is DMA unmap
> that already-invalidated MR) rather than re-scheduling.
>
> Assuming NFS READs and WRITEs are less than 1MB and the
> payload can be registered in a single MR, I can avoid
> that context switch for every I/O (and this assumption
> is valid for my test system, using CX-3 Pro).
>
> Except when I tried this, the IOPS throughput dropped
> considerably, even while the measured per-RPC latency was
> lower by the expected 5-15 microseconds. CPU R was running
> flat out handling Receives, RPC completions, and NFS I/O
> completions. In one case I recall seeing a 12 thread fio
> run not using CPU on any other core on the client.
I see your point, Chuck. The design choice here assumes that
other CPUs are equally occupied (even with NFS/RPC context), in
which case the preferred CPU to complete on is almost always the
local one.
If that assumption does not hold, then this design does not apply.
>> My baseline assumption is that other cpu cores have their own tasks
>> that they are handling, and having RDMA completions processed
>> on a different cpu is blocking something, maybe not the submitter,
>> but something else. So under the assumption that completion processing
>> always comes at the expense of something, choosing anything other
>> than the cpu core that the I/O was submitted on is an inferior choice.
>>
>> Is my understanding correct that you are trying to emphasize that
>> unbound workqueues make sense on some use-cases for initiator drivers
>> (like xprtrdma)?
>
> No, I'm just searching for the right tool for the job.
>
> I think what you are saying is that when a file system
> like XFS resides on an RDMA-enabled block device, you
> have multiple QPs and CQs to route the completion
> workload back to the CPUs that dispatched the work. There
> shouldn't be an issue there similar to NFS, even though
> XFS might also use BOUND workqueues. Fair enough.
The issue I've seen with unbound workqueues is that the
worker thread can migrate between CPUs, which defeats the
locality we are trying to achieve. However, we could
easily add an IB_POLL_UNBOUND_WORKQUEUE polling context if
that helps your use case.
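For reference, IB_POLL_WORKQUEUE today queues the CQ poll work on the bound ib_comp_wq, so the poll runs on the CPU the CQ's vector fires on; an unbound variant would only change which workqueue the completion handler targets. A rough sketch, assuming an ib_comp_unbound_wq allocated with WQ_UNBOUND (the unbound handler and workqueue name here are hypothetical, not merged code):

```c
/* Existing behavior: bound workqueue, poll work stays on the
 * CPU that the CQ's completion vector is steered to. */
static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
{
	queue_work(ib_comp_wq, &cq->work);
}

/* Hypothetical IB_POLL_UNBOUND_WORKQUEUE: same work item, but an
 * unbound (WQ_UNBOUND) workqueue lets the scheduler pick the CPU,
 * trading away the locality discussed above. */
static void ib_cq_completion_unbound_workqueue(struct ib_cq *cq, void *private)
{
	queue_work(ib_comp_unbound_wq, &cq->work);
}
```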
> Latency is also introduced when ib_comp_wq cannot get
> scheduled for some time because of competing work on
> the same CPU. Soft IRQ, Send completions, or other
> HIGHPRI work can delay the dispatch of RPC and NFS work
> on a particular CPU.
True, but again, the design assumes that other cores can (and
will) run similar tasks. The overhead of trying to select an
"optimal" CPU at exactly that moment is something we would want
to avoid for fast storage devices. Moreover, under high stress these
decisions are not guaranteed to be optimal and might be
counterproductive (as estimations often are).
>> I'm stating the obvious here, but this issue historically existed in
>> various devices ranging from network to storage and more. The solution
>> is using multiple queues (ideally per-cpu) and try to have minimal
>> synchronization in the submission path (like XPS for networking) and
>> keep completions as local as possible to the submission cores (like flow
>> steering).
>
> For the time being, the Linux NFS client does not support
> multiple connections to a single NFS server. There is some
> protocol standards work to be done to help clients discover
> all distinct network paths to a server. We're also looking
> at safe ways to schedule NFS RPCs over multiple connections.
>
> To get multiple connections today you can use pNFS with
> block devices, but that doesn't help the metadata workload
> (GETATTRs, LOOKUPs, and the like), and not everyone wants
> to use pNFS.
>
> Also, there are some deployment scenarios where "creating
> another connection" has an undesirable scalability impact:
I can understand that.
> - The NFS client has dozens or hundreds of CPUs. Typical
> for a single large host running containers, where the
> host's kernel NFS client manages the mounts, which are
> shared among containers.
>
> - The NFS client has mounted dozens or hundreds of NFS
> servers, and thus wants to conserve its connection count
> to avoid managing MxN connections.
So in this use case, do you really see non-local CPU
selection for completion processing performing better?
In my experience, linear scaling is much harder to achieve
when work bounces between CPUs, given all the context-switch
overhead involved.
More information about the Linux-nvme mailing list