[PATCH v3 0/9] Introduce per-device completion queue pools

Sagi Grimberg sagi at grimberg.me
Thu Nov 9 09:06:37 PST 2017


> Hi Sagi, glad to see progress on this!

Hi Chuck,

> When running on the same CPU, Send and Receive completions compete
> for the same finite CPU resource. In addition, they compete with
> soft IRQ tasks that are also pinned to that CPU, and any other
> BOUND workqueue tasks that are running there.

That's true.

> Send and Receive completions often have significant work to do
> (for example, DMA syncing or unmapping followed by some parsing
> of the completion results) and are all serialized on ib_poll_wq or
> by soft IRQ.

Yes, that's correct.

> This limits IOPS, and restricts other users of that shared CQ.

I agree that's true in the single-queue case. When multiple queues
are used, keeping each queue's context on its own cpu core is usually
the best way to achieve linear scalability; otherwise we pay more for
context switches, cacheline bounces, resource contention, etc.

> I recognize that handling interrupts on the same core where they
> fired is best, but some of this work has to be allowed to migrate
> when this CPU core is already fully utilized. A lot of the RDMA
> core and ULP workqueues are BOUND, which prevents task migration,
> even in the upper layers.

The ib_comp_wq started out as an UNBOUND workqueue, but the fact
that unbound workqueue workers are not cpu bound did not fit well
with the cpu/numa locality used with high-end storage devices and was
a source of latency.

See:
--
commit b7363e67b23e04c23c2a99437feefac7292a88bc
Author: Sagi Grimberg <sagi at grimberg.me>
Date:   Wed Mar 8 22:03:17 2017 +0200

     IB/device: Convert ib-comp-wq to be CPU-bound

     This workqueue is used by our storage target mode ULPs
     via the new CQ API. Recent observations when working
     with very high-end flash storage devices reveal that
     UNBOUND workqueue threads can migrate between cpu cores
     and even numa nodes (although some numa locality is accounted
     for).

     While this attribute can be useful in some workloads,
     it does not fit in very nicely with the normal
     run-to-completion model we usually use in our target-mode
     ULPs and the block-mq irq<->cpu affinity facilities.

     The whole block-mq concept is that the completion will
     land on the same cpu where the submission was performed.
     The fact that our submitter thread is migrating cpus
     can break this locality.

     We assume that as a target mode ULP, we will serve multiple
     initiators/clients and we can spread the load enough without
     having to use unbound kworkers.

     Also, while we're at it, expose this workqueue via sysfs which
     is harmless and can be useful for debug.
--

The rationale is that storage targets (or file servers) usually serve
multiple clients, and the spreading across cpu cores for more efficient
utilization would come from spreading the completion vectors.
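
To make the spreading concrete, here is a minimal userspace sketch (not
kernel code). The helper names pick_comp_vector() and count_vector_load()
and the plain round-robin policy are assumptions for illustration only;
a real ULP may choose its vectors differently.

```c
#include <assert.h>

/*
 * Hypothetical helper: choose a completion vector for queue number
 * queue_idx by simple round-robin over the device's vectors.
 */
static int pick_comp_vector(int queue_idx, int num_comp_vectors)
{
	assert(queue_idx >= 0 && num_comp_vectors > 0);
	return queue_idx % num_comp_vectors;
}

/*
 * Count how many of nr_queues queues land on each vector.
 * load[] must have room for num_comp_vectors entries.
 */
static void count_vector_load(int nr_queues, int num_comp_vectors, int *load)
{
	for (int v = 0; v < num_comp_vectors; v++)
		load[v] = 0;
	for (int q = 0; q < nr_queues; q++)
		load[pick_comp_vector(q, num_comp_vectors)]++;
}
```

With many clients (and so many queues) the per-vector load evens out,
which is the assumption behind relying on bound workers.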

However, if this is not the case, then by all means we need a knob for
it (maybe have two ib completion workqueues and let the ULP choose).
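
Roughly what I have in mind for such a knob, as a userspace model only.
Everything here (enum comp_wq_kind, struct comp_wq, comp_wq_for() and the
queue names) is hypothetical and does not exist in the RDMA core; it just
shows a ULP selecting between a bound and an unbound completion workqueue.

```c
#include <assert.h>

/* Hypothetical: which completion workqueue a ULP asks for. */
enum comp_wq_kind {
	COMP_WQ_BOUND,		/* workers stay on the cpu that queued the work */
	COMP_WQ_UNBOUND,	/* workers may migrate to other cpus */
	COMP_WQ_NR,
};

/* Hypothetical descriptor standing in for a real workqueue. */
struct comp_wq {
	const char *name;
	int cpu_bound;
};

static const struct comp_wq comp_wqs[COMP_WQ_NR] = {
	[COMP_WQ_BOUND]   = { "comp-wq-bound",   1 },
	[COMP_WQ_UNBOUND] = { "comp-wq-unbound", 0 },
};

/* The ULP picks the flavour that matches its workload. */
static const struct comp_wq *comp_wq_for(enum comp_wq_kind kind)
{
	assert(kind >= COMP_WQ_BOUND && kind < COMP_WQ_NR);
	return &comp_wqs[kind];
}
```

A ULP with per-core queues would ask for COMP_WQ_BOUND, while one that
funnels everything through a single QP might prefer COMP_WQ_UNBOUND.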

> I would like to see a capability of intelligently spreading the
> CQ workload for a single QP onto more CPU cores.

That is a different use case than what I was trying to achieve. ULP
consumers such as nvme-rdma (or srp and the like) will use multiple
qp-cq pairs (usually even per-core), and for that use case cpu
locality is probably the better approach imo.

How likely is it that multiple NFS mount points will be used on a
single server? Is that something you are looking to optimize? Or is
a single (or few) mount points per server the common use case?
If it's the latter, then I completely agree with you, and we should
come up with a core API for it (rds or smc will probably want it
too).

> As an example, I've found that ensuring that NFS/RDMA's Receive
> and Send completions are handled on separate CPU cores results in
> slightly higher IOPS (~5%) and lower latency jitter on one mount
> point.

That is valuable information. I do agree that what you are proposing
is useful. I'll need some time to think on that.

> This is more critical now that our ULPs are handling more Send
> completions.

We still need to fix some more...

>> In addition, we introduce a configfs knob to our nvme-target to
>> bound I/O threads to a given cpulist (can be a subset). This is
>> useful for numa configurations where the backend device access is
>> configured with care to numa affinity, and we want to restrict rdma
>> device and I/O threads affinity accordingly.
>>
>> The patch set convert iser, isert, srpt, svcrdma, nvme-rdma and
>> nvmet-rdma to use the new API.
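
For anyone unfamiliar with the cpulist format mentioned above (e.g.
"0-3,8"), here is a small userspace sketch of turning such a string into
a cpu mask. parse_cpulist() is a hypothetical helper written for this
mail, it handles only single cpus and ranges for cpus 0..63, and it is
not the parser the patch set uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a cpulist string such as "0-3,8" into a 64-bit mask
 * (bit n set means cpu n is allowed). Returns 0 on success and
 * -1 on malformed input or cpu numbers >= 64. An empty string
 * yields an empty mask. Userspace illustration only.
 */
static int parse_cpulist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	assert(s && mask);
	while (*s) {
		char *end;
		long lo, hi;

		if (*s < '0' || *s > '9')
			return -1;
		lo = strtol(s, &end, 10);
		hi = lo;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			if (*s < '0' || *s > '9')
				return -1;
			hi = strtol(s, &end, 10);
		}
		if (lo > hi || hi >= 64)
			return -1;
		for (long c = lo; c <= hi; c++)
			m |= UINT64_C(1) << c;
		if (*end == ',') {
			end++;
			if (*end == '\0')	/* trailing comma */
				return -1;
		} else if (*end != '\0') {
			return -1;
		}
		s = end;
	}
	*mask = m;
	return 0;
}
```

The resulting mask is what one would then intersect with the cpus the
I/O threads are allowed to run on.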
> 
> Is there a straightforward way to assess whether this work
> improves scalability and performance when multiple ULPs share a
> device?

I guess the only way is running multiple ULPs in parallel? I tried
running iser+nvme-rdma in parallel, but my 2 poor VMs are not the best
platform for evaluating performance...


