[PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask

Wed Sep 7 08:38:40 PDT 2016

On Tue, 6 Sep 2016, Christoph Hellwig wrote:

> [adding Thomas as it's about the affinity_mask he (we) added to the
>  IRQ core]
> 
> On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > > Always the previous one.  Below is a patch to get us back to the
> > > previous behavior:
> > 
> > No, that's not right.
> > 
> > Here's my topology info:
> > 
> >   # numactl --hardware
> >   available: 2 nodes (0-1)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> >   node 0 size: 15745 MB
> >   node 0 free: 15319 MB
> >   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> >   node 1 size: 16150 MB
> >   node 1 free: 15758 MB
> >   node distances:
> >   node   0   1
> >     0:  10  21
> >     1:  21  10
> 
> How do you get that mapping?  Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask?  As that's the
> only thing the old code used for any sort of special casing.

That's a normal Intel mapping with two sockets and HT enabled. The CPU
enumeration is:

Socket0 - physical cores
Socket1 - physical cores

Socket0 - HT siblings
Socket1 - HT siblings
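
To make that enumeration concrete, here is a minimal sketch for the
2-socket, 8-cores-per-socket box from Keith's numactl output. The constants
and the example_* helper names are made up for illustration, they are not
kernel interfaces; real code would ask topology_sibling_cpumask() instead
of doing arithmetic on CPU numbers.

```c
/* Hypothetical constants matching the numactl output quoted above. */
enum {
	SOCKETS          = 2,
	CORES_PER_SOCKET = 8,
	NR_CORES         = SOCKETS * CORES_PER_SOCKET,	/* 16 physical cores */
	NR_CPUS_EXAMPLE  = 2 * NR_CORES,		/* 32 logical CPUs with HT */
};

/* Socket (here also the NUMA node) that a logical CPU belongs to. */
static unsigned int example_cpu_to_socket(unsigned int cpu)
{
	/* CPUs 0-15 are the physical cores, 16-31 repeat the same layout. */
	return (cpu % NR_CORES) / CORES_PER_SOCKET;
}

/* The other hardware thread on the same physical core. */
static unsigned int example_ht_sibling(unsigned int cpu)
{
	return (cpu + NR_CORES) % NR_CPUS_EXAMPLE;
}
```

So CPU 0 and CPU 16 are threads of the same core on socket 0, CPU 8 and
CPU 24 are threads of the same core on socket 1.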

> I'll need to see if I can find a system with such a mapping to reproduce.

Any 2-socket Intel system with HT enabled will do. If you need access to
one, let me know.

> > If I have 16 vectors, the affinity_mask generated by what you're doing
> > looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
> > of those are the first unique CPU, getting a unique vector just like you
> > wanted. If an unset bit just means share with the previous, then all of
> > my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> > 
> > What we want for my CPU topology is the 16th CPU to pair with CPU 0,
> > 17 pairs with 1, 18 with 2, and so on. You can't convey that information
> > with this scheme. We need affinity_masks per vector.
> 
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use.  We could do the get_first_sibling magic
> in the block-mq queue mapping (and in fact with the current code I guess
> we need to).  Or take a step back from trying to emulate the old code
> and look at NUMA nodes instead of siblings, which some folks
> suggested a while ago.

I think you want both.

NUMA nodes are certainly the first decision factor. You split the vectors
across the nodes:

  vecs_per_node = num_vector / num_nodes;

Then you spread each node's vectors across the CPUs of that node:

  cpus_per_vec = cpus_on(node) / vecs_per_node;

If the number of cpus per vector is <= 1 you just use a round robin
scheme. If not, you need to look at siblings.
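
A rough sketch of the scheme above, hard-wired to the 2-node HT box from
this thread. Everything here is an assumption for illustration: the
constants, the sketch_cpu_to_vector() name, and in particular the way
siblings are grouped when vectors have to be shared. It is not what the
IRQ core does today.

```c
/* Hypothetical topology: 2 nodes, 8 cores per node, 2 threads per core. */
enum {
	NUM_NODES        = 2,
	CORES_PER_NODE   = 8,
	THREADS_PER_CORE = 2,
	CPUS_PER_NODE    = CORES_PER_NODE * THREADS_PER_CORE,
	TOTAL_CORES      = NUM_NODES * CORES_PER_NODE,
};

/*
 * Map a logical CPU to a vector index. Assumes num_vectors >= NUM_NODES
 * and the "physical cores first, then HT siblings" enumeration.
 */
static unsigned int sketch_cpu_to_vector(unsigned int cpu,
					 unsigned int num_vectors)
{
	unsigned int vecs_per_node = num_vectors / NUM_NODES;
	unsigned int cpus_per_vec, core, node, core_in_node, slot;

	if (vecs_per_node == 0)
		return 0;	/* fewer vectors than nodes: not handled here */

	cpus_per_vec = CPUS_PER_NODE / vecs_per_node;
	core = cpu % TOTAL_CORES;	/* a thread and its sibling share this */
	node = core / CORES_PER_NODE;
	core_in_node = core % CORES_PER_NODE;

	if (cpus_per_vec <= 1) {
		/* Enough vectors: round robin over the node's CPUs. */
		unsigned int thread = cpu / TOTAL_CORES;

		slot = thread * CORES_PER_NODE + core_in_node;
	} else {
		/* Vectors are shared: keep HT siblings on the same vector. */
		unsigned int cores_per_vec = cpus_per_vec / THREADS_PER_CORE;

		slot = core_in_node / cores_per_vec;
	}

	return node * vecs_per_node + slot % vecs_per_node;
}
```

With 16 vectors this puts CPU 16 on the same vector as CPU 0, CPU 17 with
CPU 1 and so on, which is the pairing Keith asked for. With 32 vectors
every CPU gets its own vector and each vector stays on its node.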

Looking at the whole thing, I think we need to be more clever when setting
up the MSI descriptor affinity masks.

I'll send an RFC series soon.

Thanks,

	tglx