[PATCH v7 1/4] arm64, numa: adding numa support for arm64 platforms.

Tue Dec 22 05:43:12 PST 2015

On Tue, Dec 22, 2015 at 3:25 PM, Will Deacon <will.deacon at arm.com> wrote:
> On Tue, Dec 22, 2015 at 03:04:48PM +0530, Ganapatrao Kulkarni wrote:
>> On Fri, Dec 18, 2015 at 12:00 AM, Ganapatrao Kulkarni
>> <gpkulkarni at gmail.com> wrote:
>> > On Thu, Dec 17, 2015 at 10:41 PM, Will Deacon <will.deacon at arm.com> wrote:
>> >> This all looks pretty reasonable, but I'd like to see an Ack from a
>> >> devicetree maintainer on the binding before I merge anything (and I see
>> >> that there are outstanding comments from Rutland on that).
>> > IIUC, there are no open comments for the binding.
>> > Mark Rutland: please let me know, if there any open comments.
>> > otherwise, can you please Ack the binding.
>> >>
>> >> On Tue, Nov 17, 2015 at 10:50:40PM +0530, Ganapatrao Kulkarni wrote:
>> >>> Adding numa support for arm64 based platforms.
>> >>> This patch adds by default the dummy numa node and
>> >>> maps all memory and cpus to node 0.
>> >>> using this patch, numa can be simulated on single node arm64 platforms.
>> >>>
>> >>> Reviewed-by: Robert Richter <rrichter at cavium.com>
>> >>> Signed-off-by: Ganapatrao Kulkarni <gkulkarni at caviumnetworks.com>
>> >>> ---
>> >>>  arch/arm64/Kconfig              |  25 +++
>> >>>  arch/arm64/include/asm/mmzone.h |  17 ++
>> >>>  arch/arm64/include/asm/numa.h   |  47 +++++
>> >>>  arch/arm64/kernel/setup.c       |   4 +
>> >>>  arch/arm64/kernel/smp.c         |   2 +
>> >>>  arch/arm64/mm/Makefile          |   1 +
>> >>>  arch/arm64/mm/init.c            |  30 +++-
>> >>>  arch/arm64/mm/numa.c            | 384 ++++++++++++++++++++++++++++++++++++++++
>> >>>  8 files changed, 506 insertions(+), 4 deletions(-)
>> >>>  create mode 100644 arch/arm64/include/asm/mmzone.h
>> >>>  create mode 100644 arch/arm64/include/asm/numa.h
>> >>>  create mode 100644 arch/arm64/mm/numa.c
>> >>>
>> >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> >>> index 9ac16a4..7d8fb42 100644
>> >>> --- a/arch/arm64/Kconfig
>> >>> +++ b/arch/arm64/Kconfig
>> >>> @@ -71,6 +71,7 @@ config ARM64
>> >>>       select HAVE_GENERIC_DMA_COHERENT
>> >>>       select HAVE_HW_BREAKPOINT if PERF_EVENTS
>> >>>       select HAVE_MEMBLOCK
>> >>> +     select HAVE_MEMBLOCK_NODE_MAP if NUMA
>> >>>       select HAVE_PATA_PLATFORM
>> >>>       select HAVE_PERF_EVENTS
>> >>>       select HAVE_PERF_REGS
>> >>> @@ -482,6 +483,30 @@ config HOTPLUG_CPU
>> >>>         Say Y here to experiment with turning CPUs off and on.  CPUs
>> >>>         can be controlled through /sys/devices/system/cpu.
>> >>>
>> >>> +# Common NUMA Features
>> >>> +config NUMA
>> >>> +     bool "Numa Memory Allocation and Scheduler Support"
>> >>> +     depends on SMP
>> >>> +     help
>> >>> +       Enable NUMA (Non Uniform Memory Access) support.
>> >>> +
>> >>> +       The kernel will try to allocate memory used by a CPU on the
>> >>> +       local memory controller of the CPU and add some more
>> >>> +       NUMA awareness to the kernel.
>> >>
>> >> I appreciate that this is copied from x86, but what exactly do you mean
>> >> by "memory controller" here?
>> > ok, it is fair enough to say local memory.
>> >>
>> >>> diff --git a/arch/arm64/include/asm/mmzone.h b/arch/arm64/include/asm/mmzone.h
>> >>> new file mode 100644
>> >>> index 0000000..6ddd468
>> >>> --- /dev/null
>> >>> +++ b/arch/arm64/include/asm/mmzone.h
>> >>> @@ -0,0 +1,17 @@
>> >>> +#ifndef __ASM_ARM64_MMZONE_H_
>> >>> +#define __ASM_ARM64_MMZONE_H_
>> >>
>> >> Please try to follow the standard naming for header guards under arm64
>> >> (yes, it's not perfect, but we've made some effort for consistency).
>> > sure, will follow as in other code.
>> >>
>> >>> +
>> >>> +#ifdef CONFIG_NUMA
>> >>> +
>> >>> +#include <linux/mmdebug.h>
>> >>> +#include <linux/types.h>
>> >>> +
>> >>> +#include <asm/smp.h>
>> >>> +#include <asm/numa.h>
>> >>> +
>> >>> +extern struct pglist_data *node_data[];
>> >>> +
>> >>> +#define NODE_DATA(nid)               (node_data[(nid)])
>> >>
>> >> This is the same as m32r, metag, parisc, powerpc, s390, sh, sparc, tile
>> >> and x86. Can we make this the default in the core code instead and then
>> >> replace this header file with asm-generic or something?
>> > IIUC, it is same in most but not in all arch.
>
> Yes, so we should make the core code take on the behaviour that most
> architectures want, and then allow the few remaining architectures that
> need to do something differently override those macros. We do this all
> over the place in the kernel.
i totally agree with you.
we will do these in subsequent clean up patches, once this series is upstreamed.
please let us not mix in this series.
>
>> >>
>> >>> +
>> >>> +#endif /* CONFIG_NUMA */
>> >>> +#endif /* __ASM_ARM64_MMZONE_H_ */
>> >>> diff --git a/arch/arm64/include/asm/numa.h b/arch/arm64/include/asm/numa.h
>> >>> new file mode 100644
>> >>> index 0000000..c00f3a4
>> >>> --- /dev/null
>> >>> +++ b/arch/arm64/include/asm/numa.h
>> >>> @@ -0,0 +1,47 @@
>> >>> +#ifndef _ASM_NUMA_H
>> >>> +#define _ASM_NUMA_H
>> >>
>> >> Same comment on the guards.
>> > ok
>> >>
>> >>> +#include <linux/nodemask.h>
>> >>> +#include <asm/topology.h>
>> >>> +
>> >>> +#ifdef CONFIG_NUMA
>> >>> +
>> >>> +#define NR_NODE_MEMBLKS              (MAX_NUMNODES * 2)
>> >>
>> >> This is only used by the ACPI code afaict, so maybe include it when you
>> >> add that?
>> > ok
>> >>
>> >>> +#define ZONE_ALIGN (1UL << (MAX_ORDER + PAGE_SHIFT))
>> >>
>> >> Where is this used?
>> > sorry, was used in v6, missed to delete.
>> >>
>> >>> +
>> >>> +/* currently, arm64 implements flat NUMA topology */
>> >>> +#define parent_node(node)    (node)
>> >>> +
>> >>> +extern int __node_distance(int from, int to);
>> >>> +#define node_distance(a, b) __node_distance(a, b)
>> >>> +
>> >>> +/* dummy definitions for pci functions */
>> >>> +#define pcibus_to_node(node) 0
>> >>> +#define cpumask_of_pcibus(bus)       0
>> >>
>> >> There's a bunch of these dummy definitions already available in
>> >> asm-generic/topology.h. Can we use those instead of rolling our own
>> >> please?
>> these are dummy in this patch, and i will post separate patch for numa-pci.
>
> So you're doing to use the generic definitions?
generic are different than what we need.
for pci, i have sent a patch.
>
>> >> This is certainly more readable then the non-numa zone_sizes_init. Is
>> >> there a reason we can't always select HAVE_MEMBLOCK_NODE_MAP and avoid
>> >> having to handle the zone holds explicitly?
>> > yes, i can think off to have select HAVE_MEMBLOCK_NODE_MAP always
>> > instead of for only numa.
>> > i can experiment to have the zone_sizes_init of numa to non-numa case
>> > also and delete the current zone_sizes_init.
>> if i enable HAVE_MEMBLOCK_NODE_MAP for non-numa case, i see issues in
>> sparse_init and could be some dependency due to explicit call of
>> memory_present(note sure) in non-numa.
>> IMO, this needs to be worked out as a separate patch.
>
> Ok, please can you investigate and send a separate patch then?
sure, not in this series please.
>
>> >>> +#ifdef CONFIG_DEBUG_PER_CPU_MAPS
>> >>> +/*
>> >>> + * Returns a pointer to the bitmask of CPUs on Node 'node'.
>> >>> + */
>> >>> +const struct cpumask *cpumask_of_node(int node)
>> >>> +{
>> >>> +
>> >>> +     if (WARN_ON(node >= nr_node_ids))
>> >>> +             return cpu_none_mask;
>> >>> +
>> >>> +     if (WARN_ON(node_to_cpumask_map[node] == NULL))
>> >>> +             return cpu_online_mask;
>> >>> +
>> >>> +     return node_to_cpumask_map[node];
>> >>> +}
>> >>> +EXPORT_SYMBOL(cpumask_of_node);
>> >>> +#endif
>> >>> +
>> >>> +static void map_cpu_to_node(unsigned int cpu, int nid)
>> >>> +{
>> >>> +     set_cpu_numa_node(cpu, nid);
>> >>> +     if (nid >= 0)
>> >>> +             cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
>> >>> +}
>> >>> +
>> >>> +static void unmap_cpu_to_node(unsigned int cpu)
>> >>> +{
>> >>> +     int nid = cpu_to_node(cpu);
>> >>> +
>> >>> +     if (nid >= 0)
>> >>> +             cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
>> >>> +     set_cpu_numa_node(cpu, NUMA_NO_NODE);
>> >>> +}
>> >>> +
>> >>> +void numa_clear_node(unsigned int cpu)
>> >>> +{
>> >>> +     unmap_cpu_to_node(cpu);
>> >>> +}
>> >>> +
>> >>> +/*
>> >>> + * Allocate node_to_cpumask_map based on number of available nodes
>> >>> + * Requires node_possible_map to be valid.
>> >>> + *
>> >>> + * Note: cpumask_of_node() is not valid until after this is done.
>> >>> + * (Use CONFIG_DEBUG_PER_CPU_MAPS to check this.)
>> >>> + */
>> >>> +static void __init setup_node_to_cpumask_map(void)
>> >>> +{
>> >>> +     unsigned int cpu;
>> >>> +     int node;
>> >>> +
>> >>> +     /* setup nr_node_ids if not done yet */
>> >>> +     if (nr_node_ids == MAX_NUMNODES)
>> >>> +             setup_nr_node_ids();
>> >>> +
>> >>> +     /* allocate and clear the mapping */
>> >>> +     for (node = 0; node < nr_node_ids; node++) {
>> >>> +             alloc_bootmem_cpumask_var(&node_to_cpumask_map[node]);
>> >>> +             cpumask_clear(node_to_cpumask_map[node]);
>> >>> +     }
>> >>> +
>> >>> +     for_each_possible_cpu(cpu)
>> >>> +             set_cpu_numa_node(cpu, NUMA_NO_NODE);
>> >>> +
>> >>> +     /* cpumask_of_node() will now work */
>> >>> +     pr_debug("Node to cpumask map for %d nodes\n", nr_node_ids);
>> >>> +}
>> >>> +
>> >>> +/*
>> >>> + *  Set the cpu to node and mem mapping
>> >>> + */
>> >>> +void numa_store_cpu_info(unsigned int cpu)
>> >>> +{
>> >>> +     map_cpu_to_node(cpu, numa_off ? 0 : cpu_to_node_map[cpu]);
>> >>> +}
>> >>> +
>> >>> +/**
>> >>> + * numa_add_memblk - Set node id to memblk
>> >>> + * @nid: NUMA node ID of the new memblk
>> >>> + * @start: Start address of the new memblk
>> >>> + * @size:  Size of the new memblk
>> >>> + *
>> >>> + * RETURNS:
>> >>> + * 0 on success, -errno on failure.
>> >>> + */
>> >>> +int __init numa_add_memblk(int nid, u64 start, u64 size)
>> >>> +{
>> >>> +     int ret;
>> >>> +
>> >>> +     ret = memblock_set_node(start, size, &memblock.memory, nid);
>> >>> +     if (ret < 0) {
>> >>> +             pr_err("NUMA: memblock [0x%llx - 0x%llx] failed to add on node %d\n",
>> >>> +                     start, (start + size - 1), nid);
>> >>> +             return ret;
>> >>> +     }
>> >>> +
>> >>> +     node_set(nid, numa_nodes_parsed);
>> >>> +     pr_info("NUMA: Adding memblock [0x%llx - 0x%llx] on node %d\n",
>> >>> +                     start, (start + size - 1), nid);
>> >>> +     return ret;
>> >>> +}
>> >>> +EXPORT_SYMBOL(numa_add_memblk);
>> >>> +
>> >>> +/* Initialize NODE_DATA for a node on the local memory */
>> >>> +static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>> >>> +{
>> >>> +     const size_t nd_size = roundup(sizeof(pg_data_t), SMP_CACHE_BYTES);
>> >>> +     u64 nd_pa;
>> >>> +     void *nd;
>> >>> +     int tnid;
>> >>> +
>> >>> +     pr_info("Initmem setup node %d [mem %#010Lx-%#010Lx]\n",
>> >>> +                     nid, start_pfn << PAGE_SHIFT,
>> >>> +                     (end_pfn << PAGE_SHIFT) - 1);
>> >>> +
>> >>> +     nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>> >>> +     nd = __va(nd_pa);
>> >>> +
>> >>> +     /* report and initialize */
>> >>> +     pr_info("  NODE_DATA [mem %#010Lx-%#010Lx]\n",
>> >>> +             nd_pa, nd_pa + nd_size - 1);
>> >>> +     tnid = early_pfn_to_nid(nd_pa >> PAGE_SHIFT);
>> >>> +     if (tnid != nid)
>> >>> +             pr_info("    NODE_DATA(%d) on node %d\n", nid, tnid);
>> >>> +
>> >>> +     node_data[nid] = nd;
>> >>> +     memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
>> >>> +     NODE_DATA(nid)->node_id = nid;
>> >>> +     NODE_DATA(nid)->node_start_pfn = start_pfn;
>> >>> +     NODE_DATA(nid)->node_spanned_pages = end_pfn - start_pfn;
>> >>> +}
>> >>> +
>> >>> +/**
>> >>> + * numa_reset_distance - Reset NUMA distance table
>> >>> + *
>> >>> + * The current table is freed.
>> >>> + * The next numa_set_distance() call will create a new one.
>> >>> + */
>> >>> +void __init numa_reset_distance(void)
>> >>> +{
>> >>> +     size_t size;
>> >>> +
>> >>> +     if (!numa_distance)
>> >>> +             return;
>> >>> +
>> >>> +     size = numa_distance_cnt * numa_distance_cnt *
>> >>> +             sizeof(numa_distance[0]);
>> >>> +
>> >>> +     memblock_free(__pa(numa_distance), size);
>> >>> +     numa_distance_cnt = 0;
>> >>> +     numa_distance = NULL;
>> >>> +}
>> >>> +
>> >>> +static int __init numa_alloc_distance(void)
>> >>> +{
>> >>> +     size_t size;
>> >>> +     u64 phys;
>> >>> +     int i, j;
>> >>> +
>> >>> +     size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
>> >>> +     phys = memblock_find_in_range(0, PFN_PHYS(max_pfn),
>> >>> +                                   size, PAGE_SIZE);
>> >>> +     if (WARN_ON(!phys))
>> >>> +             return -ENOMEM;
>> >>> +
>> >>> +     memblock_reserve(phys, size);
>> >>> +
>> >>> +     numa_distance = __va(phys);
>> >>> +     numa_distance_cnt = nr_node_ids;
>> >>> +
>> >>> +     /* fill with the default distances */
>> >>> +     for (i = 0; i < numa_distance_cnt; i++)
>> >>> +             for (j = 0; j < numa_distance_cnt; j++)
>> >>> +                     numa_distance[i * numa_distance_cnt + j] = i == j ?
>> >>> +                             LOCAL_DISTANCE : REMOTE_DISTANCE;
>> >>> +
>> >>> +     pr_debug("NUMA: Initialized distance table, cnt=%d\n",
>> >>> +                     numa_distance_cnt);
>> >>> +
>> >>> +     return 0;
>> >>> +}
>> >>> +
>> >>> +/**
>> >>> + * numa_set_distance - Set NUMA distance from one NUMA to another
>> >>> + * @from: the 'from' node to set distance
>> >>> + * @to: the 'to'  node to set distance
>> >>> + * @distance: NUMA distance
>> >>> + *
>> >>> + * Set the distance from node @from to @to to @distance.  If distance table
>> >>> + * doesn't exist, one which is large enough to accommodate all the currently
>> >>> + * known nodes will be created.
>> >>> + *
>> >>> + * If such table cannot be allocated, a warning is printed and further
>> >>> + * calls are ignored until the distance table is reset with
>> >>> + * numa_reset_distance().
>> >>> + *
>> >>> + * If @from or @to is higher than the highest known node or lower than zero
>> >>> + * at the time of table creation or @distance doesn't make sense, the call
>> >>> + * is ignored.
>> >>> + * This is to allow simplification of specific NUMA config implementations.
>> >>> + */
>> >>> +void __init numa_set_distance(int from, int to, int distance)
>> >>> +{
>> >>> +     if (!numa_distance)
>> >>> +             return;
>> >>> +
>> >>> +     if (from >= numa_distance_cnt || to >= numa_distance_cnt ||
>> >>> +                     from < 0 || to < 0) {
>> >>> +             pr_warn_once("NUMA: Warning: node ids are out of bound, from=%d to=%d distance=%d\n",
>> >>> +                         from, to, distance);
>> >>> +             return;
>> >>> +     }
>> >>> +
>> >>> +     if ((u8)distance != distance ||
>> >>> +         (from == to && distance != LOCAL_DISTANCE)) {
>> >>> +             pr_warn_once("NUMA: Warning: invalid distance parameter, from=%d to=%d distance=%d\n",
>> >>> +                          from, to, distance);
>> >>> +             return;
>> >>> +     }
>> >>> +
>> >>> +     numa_distance[from * numa_distance_cnt + to] = distance;
>> >>> +}
>> >>> +EXPORT_SYMBOL(numa_set_distance);
>> >>> +
>> >>> +int __node_distance(int from, int to)
>> >>> +{
>> >>> +     if (from >= numa_distance_cnt || to >= numa_distance_cnt)
>> >>> +             return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
>> >>> +     return numa_distance[from * numa_distance_cnt + to];
>> >>> +}
>> >>> +EXPORT_SYMBOL(__node_distance);
>> >>
>> >> Much of this is simply a direct copy/paste from x86. Why can't it be
>> >> moved to common code? I don't see anything arch-specific here.
>> > not same for all arch.
>
> See my earlier comment. Not having identical code for all architectures
> doesn't mean it's not worth having a common portion of that in core code.
> It makes it easier to maintain, easier to use and easier to extend.
>
>> >>> +/**
>> >>> + * dummy_numa_init - Fallback dummy NUMA init
>> >>> + *
>> >>> + * Used if there's no underlying NUMA architecture, NUMA initialization
>> >>> + * fails, or NUMA is disabled on the command line.
>> >>> + *
>> >>> + * Must online at least one node and add memory blocks that cover all
>> >>> + * allowed memory.  This function must not fail.
>> >>
>> >> Why can't it fail? It looks like the return value is ignored by numa_init.
no, numa_init is not ignoring return value.
    ret = init_func();
        if (ret < 0)
                return ret;


>> > this function adds all memblocks to node 0. which is unlikely that it will fail.
>
> Good thing it returns an int, then.
>
> Will
thanks
Ganapat