[PATCH] nvme-pci: do not set the NUMA node of device if it has none
Pratyush Yadav
ptyadav at amazon.de
Fri Jul 28 11:09:32 PDT 2023
Hi,
On Wed, Jul 26 2023, Keith Busch wrote:
> On Wed, Jul 26, 2023 at 09:32:33PM +0200, Pratyush Yadav wrote:
>> On Wed, Jul 26 2023, Keith Busch wrote:
>> > Could you send the output of:
>> >
>> > numactl --hardware
>>
>> $ numactl --hardware
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
>> node 0 size: 245847 MB
>> node 0 free: 245211 MB
>> node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
>> node 1 size: 245932 MB
>> node 1 free: 245328 MB
>> node distances:
>> node 0 1
>> 0: 10 21
>> 1: 21 10
>>
>> >
>> > and then with and without your patch:
>> >
>> > for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>> > cat /proc/irq/$i/{smp,effective}_affinity_list; \
>> > done
>>
>> Without my patch:
>>
>> $ for i in $(cat /proc/interrupts | grep nvme0 | sed "s/^ *//g" | cut -d":" -f 1); do \
>> > cat /proc/irq/$i/{smp,effective}_affinity_list; \
>> > done
>> [...]
>
> Hm, I wonder if there's something wrong with my script. All the CPUs
> should be accounted for in the smp_affinity_list, assuming it captured
> all the vectors of the nvme device, but both examples are missing half
> the CPUs. It looks like you have 32 vectors. Does that sound right?
Yes, there are 32 vectors, from nvme0q0 to nvme0q31. Should there be one
vector for each CPU? Perhaps the device does not support that many
queues?
FWIW,
$ sudo nvme get-feature /dev/nvme0n1 -f 7 -H
get-feature:0x7 (Number of Queues), Current value:0x1e001e
Number of IO Completion Queues Allocated (NCQA): 31
Number of IO Submission Queues Allocated (NSQA): 31
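That lines up with the vector count: 31 allocated I/O queues plus the
admin queue (nvme0q0) gives the 32 vectors. A quick cross-check of the
two numbers (a sketch; adjust the device name as needed):

    $ grep -c nvme0q /proc/interrupts
    32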
>
> This does show the effective affinity is indeed always on node 0 without
> your patch. I don't see why, though: the "group_cpus_evenly()" function
> that spreads the interrupts doesn't know anything about the device the
> resource is being grouped for, so it shouldn't even take its NUMA node
> into consideration. It's just supposed to ensure all CPUs have a shared
> resource, preferring to not share across numa nodes.
I am guessing you are looking at irq_create_affinity_masks(). Yeah, it
does not take the NUMA information into account. In fact, even if it
did, the NUMA node associated with the IRQ is NUMA_NO_NODE
(/proc/irq/$irq/node == -1).
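That can be confirmed directly with the same loop shape as your script
above; every nvme0 vector reports -1 here:

    $ for i in $(grep nvme0 /proc/interrupts | sed "s/^ *//g" | cut -d":" -f 1); do \
    >     cat /proc/irq/$i/node; \
    > done
    -1
    -1
    [... -1 for every vector ...]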
I did some more digging over the week to figure out what is going on. It
seems like the kernel _does_ in fact allow all CPUs in the affinity. I
added some prints in set_affinity_irq() in
drivers/xen/events/events_base.c (since that is the irqchip for the
interrupt). I see it being called initially with the mask
ffffffff,ffffffff (all CPUs allowed). But I later see the function
being called again with a much narrower mask: 00000000,00008000, i.e.
CPU 15 only. The stack trace shows the call is coming from
ksys_write(). The process doing the write is irqbalance.
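For reference, the instrumentation was roughly the following (a sketch
from memory, not the exact diff; the pr_info() and dump_stack() sit at
the top of set_affinity_irq() in drivers/xen/events/events_base.c):

    static int set_affinity_irq(struct irq_data *data,
                                const struct cpumask *dest, bool force)
    {
            /* Debug only: log who changes the affinity and to what. */
            pr_info("%s: irq %u mask %*pb\n", __func__,
                    data->irq, cpumask_pr_args(dest));
            dump_stack();   /* this is what exposed ksys_write()/irqbalance */

            /* ... original function body unchanged ... */
    }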
So I think your earlier statement was incorrect. irqbalance does in fact
balance these interrupts and it probably looks at the NUMA information
of the device to make that decision. My original reasoning holds and
irqbalance is the one picking the affinity.
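What irqbalance does through that write path is equivalent to setting
the mask by hand, e.g. (with $i being one of the nvme0 IRQ numbers;
00000000,00008000 selects CPU 15, a node 0 CPU per the numactl output
above):

    $ echo 00000000,00008000 | sudo tee /proc/irq/$i/smp_affinity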
With this explanation, do you think the patch is good to go?
BTW, could you please also add the below when applying? I forgot to add
it when sending the patch.
Fixes: a4aea5623d4a5 ("NVMe: Convert to blk-mq")
>
> I'll emulate a similar CPU topology with similar nvme vector count and
> see if I can find anything suspicious. I'm a little concerned we may
> have the same problem for devices that have an associated NUMA node that
> your patch isn't addressing.
>
[...]
--
Regards,
Pratyush Yadav
Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879