irqbalancer subset policy and CPU lock up on storage controller.

Kashyap Desai kashyap.desai at avagotech.com
Tue Oct 13 06:10:06 PDT 2015


+0530, Kashyap Desai wrote:
> > > On Mon, Oct 12, 2015 at 11:52:30PM +0530, Kashyap Desai wrote:
> > > > > > What should be the solution if we really want to slow down IO
> > > > > > submission to avoid CPU lockup. We don't want only one CPU to
> > > > > > keep busy for completion.
> > > > > >
> > > > > > Any suggestion ?
> > > > > >
> > > > > Yup, file a bug with Oracle :)
> > > >
> > > > Neil -
> > > >
> > > > Thanks for the info. I understood to use the latest <irqbalance>...
> > > > that was already attempted. I tried with the latest irqbalance and I
> > > > see the expected behavior as long as I provide <exact> or <subset> +
> > > > <--policyscript>.
> > > > We are planning for the same, but wanted to understand what the
> > > > latest <irqbalancer> default settings are. Is there any reason the
> > > > default setting changed from subset to ignore?
> > > >
> > >
> > > Latest defaults are that hinting is ignored by default, but hinting
> > > can also be set via a policyscript on an irq-by-irq basis.
> > >
> > > The reasons for changing the default behavior are documented in
> > > commit d9138c78c3e8cb286864509fc444ebb4484c3d70.  Irq affinity
> > > hinting is effectively a holdover from back in the days when
> > > irqbalance couldn't understand a device's locality and irq count
> > > easily.  Now that it can, there is really no need for an irq
> > > affinity hint, unless your driver doesn't properly participate in
> > > sysfs device enumeration.
> >
> > Neil - I went through those details, but could not understand how
> > the <ignore> policy is useful. I may be missing something here. :-(
> Yes, what you are missing is the fact that affinity hinting is an
> outdated method of assigning affinity.  On any modern kernel it's not
> needed at all, so the default policy is to ignore it.

Now it is clear. I understand that no affinity hint is required from the
driver any more, and <irqbalance> can manage affinity using the details
populated in <sysfs>.
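
As a sanity check that the sysfs enumeration Neil mentions is populated for
this controller, I looked at the attributes below (device path from my
setup; which of these irqbalance actually consults is my assumption from a
quick skim, so take it as a rough guide only):

DEV=/sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0

cat $DEV/class       # PCI class code (drives the default balance level)
cat $DEV/numa_node   # NUMA node the device is local to
cat $DEV/local_cpus  # cpumask of CPUs local to the device
ls  $DEV/msi_irqs    # per-vector irq numbers for MSI/MSI-X devices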

>
> > With the <ignore> policy, the mpt3sas driver on a 32 logical CPU
> > system has the affinity masks below. As you said, the driver hint is
> > ignored.  That is understood, since <ignore> hints for exactly that,
> > but why is the affinity mask localized to the local node (Node 0 in
> > this case)?
> This has nothing to do with ignoring hint policy.  The reasons the
> below might occur are:
>
> 1) the class of the device on the pci bus is such that irqbalance is
> deciding that numa node is the level at which it should be balanced.
> Currently there are no such devices that get balanced at that level.
> There are however package level balanced devices, and if you have a
> single cpu package (with multiple cores) on a single numa node, you
> might see this behavior. What is the pci class of the mpt3sas adapter?

<mpt3sas> is a <storage> class adapter. See the <class> sysfs details:

[root]# cat /sys/devices/pci0000:00/0000:00:03.0/0000:02:00.0/class
0x010700
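
For completeness, the class can be cross-checked from the command line as
well (the decode of 0x010700 below is from the standard PCI class code
tables; the exact lspci wording is from memory, so treat it as approximate):

# 24-bit PCI class code 0x010700, as read from sysfs above:
#   base class 0x01 -> Mass Storage Controller
#   subclass   0x07 -> Serial Attached SCSI (SAS) controller
#   prog-if    0x00
lspci -nn -s 0000:02:00.0   # class id shows up in brackets, e.g. [0107]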


>
> 2) The interrupt controller on your system doesn't allow for user
> setting of interrupt affinity.  I don't think that would be the case
> given that other interrupts can be affined.  If you can manually set
> the affinity of these irqs you can discount this possibility.


The affinity hint from the driver is honored by the <exact> policy, and
manually setting the affinity works on my setup, so we can skip this part.
From the storage controller requirement side, we are looking for the
MSI-X vector to logical CPU# mapping to be in the same sequence.

>
> 3) You are using a policyscript that assigns these affinities.  As I
> previously requested, are you using a policy script and can you post it
> here?

I have attached the policy script (a very basic script; we created it just
to understand irqbalance, and it got our work done). What I required was:
"balance at the core level and distribute the policy across each NUMA
node." A rough sketch of its shape follows below.
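
For reference, the attached script is roughly of this shape (a simplified
sketch, not the attached file itself; the argument convention and the
key=value output are my reading of the irqbalance --policyscript
documentation, and the node-spreading rule is purely illustrative):

#!/bin/sh
# irqbalance invokes the policy script per irq as: <script> <sysfs devpath> <irq>
# and reads key=value pairs from its stdout.
DEVPATH=$1
IRQ=$2

case "$DEVPATH" in
*/0000:02:00.0)
        echo "balance_level=core"       # balance this device's irqs at core level
        # spread vectors across the 4 NUMA nodes (illustrative rule only)
        echo "numa_node=$((IRQ % 4))"
        ;;
esac
exit 0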

From the attached irqbalance debug output, I can see that irqbalance is
able to work as expected with the policy script.

>
> > What is confusing me is - the "cpu affinity mask" is localized to
> > NUMA Node-0, as PCI device enumeration detected that the pci device
> > is local to numa_node 0.
> I really don't know what you mean by this.  Yes, your masks seem to be
> following what could be your numa node layout, but you're assuming (or
> it sounds like you're assuming) that irqbalance is doing that
> intentionally.  It's not, one of the above things is going on.
>
> >
> >
> > When you say "Driver does not participate in sysfs enumeration" - does
> > it mean "numa_node" exposure in sysfs, or anything more than that?
> > Sorry for the basics, and thanks for helping me understand things.
> >
> I mean, does your driver register itself as a pci device?  If so, it
> should have a directory in sysfs in /sys/bus/pci/<pci b:d:f>/.  As long
> as that directory exists and is properly populated, irqbalance should
> have everything it needs to properly assign a cpu to all of your irqs.

Yes, the driver registers the device as a pci device, and I can see all
the /sys/bus/pci/devices/ entries for the mpt3sas-attached device.

> Note that the RHEL6 kernel did not always properly populate that
> directory.  I added sysfs code to expose needed irq information in the
> kernel, and if you have an older kernel and newer irqbalance, that
> might be part of the problem - another reason to contact oracle.
>
>
> another thing you can try is posting the output of irqbalance while
> running it with -f and -d.  That will give us some insight as to what
> it's doing (note I'm referring here to upstream irqbalance, not the old
> version).  And you still didn't answer my question regarding the
> policyscript.

I have attached the irqbalance debug output (latest from GitHub, last
commit 8922ff13704dd0e069c63d46a7bdad89df5f151c) and the policy script.

For various reasons, I had to move to a different server, which has 4
NUMA sockets.

Here are the details of my setup:

[root]# lstopo-no-graphics
Machine (64GB)
  NUMANode L#0 (P#0 16GB)
    Socket L#0 + L3 L#0 (10MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
    HostBridge L#0
      PCIBridge
        PCI 8086:0953
      PCIBridge
        PCI 1000:0097
          Block L#0 "sdb"
          Block L#1 "sdc"
          Block L#2 "sdd"
          Block L#3 "sde"
          Block L#4 "sdf"
          Block L#5 "sdg"
          Block L#6 "sdh"
      PCIBridge
        PCI 102b:0532
      PCI 8086:1d02
        Block L#7 "sda"
  NUMANode L#1 (P#1 16GB)
    Socket L#1 + L3 L#1 (10MB)
      L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    HostBridge L#4
      PCIBridge
        PCI 1000:005b
  NUMANode L#2 (P#2 16GB) + Socket L#2 + L3 L#2 (10MB)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
      PU L#16 (P#8)
      PU L#17 (P#24)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
      PU L#18 (P#9)
      PU L#19 (P#25)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
      PU L#20 (P#10)
      PU L#21 (P#26)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
      PU L#22 (P#11)
      PU L#23 (P#27)
  NUMANode L#3 (P#3 16GB)
    Socket L#3 + L3 L#3 (10MB)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)
    HostBridge L#6
      PCIBridge
        PCI 8086:1528
          Net L#8 "enp193s0f0"
        PCI 8086:1528
          Net L#9 "enp193s0f1"
      PCIBridge
        PCI 8086:0953
      PCIBridge


Things are a bit clearer now, but what I am seeing here is: with <ignore>
and the attached policy hint, the CPU-to-MSI-X mask is not in the same
logical sequence as the CPU numbers. It is random within the core/cache
domain. I guess it is based on some linked list in irqbalance which I am
not able to understand.
E.g., the piece of code below hints that, per cache domain, irq numbers
are stored in the linked list "d->interrupts". This list is not traversed
in sequential order, but rather based on how the interrupts were
generated. Right?

static void dump_cache_domain(struct topo_obj *d, void *data)
{
        char *buffer = data;
        cpumask_scnprintf(buffer, 4095, d->mask);
        log(TO_CONSOLE, LOG_INFO,
            "%s%sCache domain %i:  numa_node is %d cpu mask is %s  (load %lu) \n",
            log_indent, log_indent, d->number,
            cache_domain_numa_node(d)->number, buffer,
            (unsigned long)d->load);
        if (d->children)
                for_each_object(d->children, dump_balance_obj, NULL);
        if (g_list_length(d->interrupts) > 0)
                for_each_irq(d->interrupts, dump_irq, (void *)10);
}

I sometimes see different cpu masks, as in the snippets below. The cpu
mask on my setup varies from run to run; the good thing is that the mask
stays within the <core> domain, but it is not like <exact>.

	msix index = 0, irq number =  355, cpu affinity mask = 00000008  hint = 00000001
	msix index = 1, irq number =  356, cpu affinity mask = 00000004  hint = 00000002
	msix index = 2, irq number =  357, cpu affinity mask = 00000002  hint = 00000004
	msix index = 3, irq number =  358, cpu affinity mask = 00000001  hint = 00000008


	msix index = 0, irq number =  355, cpu affinity mask = 00000002  hint = 00000001
	msix index = 1, irq number =  356, cpu affinity mask = 00000008  hint = 00000002
	msix index = 2, irq number =  357, cpu affinity mask = 00000004  hint = 00000004
	msix index = 3, irq number =  358, cpu affinity mask = 00000001  hint = 00000008

I am expecting the mapping below, because when the <mpt3sas> driver sends
an IO it hints the FW about the completion queue. E.g., if an IO is
submitted from logical CPU #X, the driver uses smp_processor_id() to get
that logical CPU #X and expects the completion on the same CPU, for better
performance. Is this expectation possible with the existing latest
<irqbalance>?

	msix index = 0, irq number =  355, cpu affinity mask = 00000001  hint = 00000001
	msix index = 1, irq number =  356, cpu affinity mask = 00000002  hint = 00000002
	msix index = 2, irq number =  357, cpu affinity mask = 00000004  hint = 00000004
	msix index = 3, irq number =  358, cpu affinity mask = 00000008  hint = 00000008

~ Kashyap

>
> Neil
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: irqbalance.debug
Type: application/octet-stream
Size: 50807 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/irqbalance/attachments/20151013/f214af70/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: set_numa_node.sh
Type: application/octet-stream
Size: 1440 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/irqbalance/attachments/20151013/f214af70/attachment-0003.obj>

