[RFC PATCH 4/4] arm64:numa: adding numa support for arm64 platforms.

Steve Capper steve.capper@linaro.org
Mon Oct 20 07:25:56 PDT 2014


On Fri, Oct 17, 2014 at 10:49:56PM +0530, Ganapatrao Kulkarni wrote:
> Hi All,
> 
> On Mon, Oct 6, 2014 at 11:22 PM, Ganapatrao Kulkarni
> > <gpkulkarni@gmail.com> wrote:
> >> On Mon, Oct 6, 2014 at 4:56 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> >> On Mon, Oct 06, 2014 at 06:14:36AM +0100, Ganapatrao Kulkarni wrote:
> >>> Hi Mark,
> >>>
> >>> On Fri, Oct 3, 2014 at 5:43 PM, Mark Rutland <mark.rutland@arm.com> wrote:
> >>> > On Thu, Sep 25, 2014 at 10:03:59AM +0100, Ganapatrao Kulkarni wrote:
> >>> >> Adding NUMA support for arm64-based platforms.
> >>> >> This version creates the NUMA mapping by parsing the DT.
> >>> >> CPU to node id mapping is derived from the cluster_id as defined in cpu-map.
> >>> >> Memory to node id mapping is derived from the nid property of the memory node.
> >>> >
> >>> > [...]
> >>> >
> >>> >> +/*
> >>> >> + * Too small node sizes may confuse the VM badly. Usually they
> >>> >> + * result from BIOS bugs. So don't recognize nodes as standalone
> >>> >> + * NUMA entities that have less than this amount of RAM listed:
> >>> >> + */
> >>> >> +#define NODE_MIN_SIZE (4*1024*1024)
> >>> >
> >>> > Why do these confuse the VM? what does BIOS have to do with arm64?
> >>> sneaked in from x86, will remove this.
> >>> >
> >>> >> +
> >>> >> +#define parent_node(node)      (node)
> >>> >
> >>> > Huh?
> >>> for Thunder, there is no hierarchy at the NUMA node level. shall i
> >>> put this under an ifdef or in a separate header file?
> >>
> >> I was confused by a node being its own parent, but that seems to be the
> >> case elsewhere for parent_node() implementations. Please at least have a
> >> comment that we're assuming a flat hierarchy (for now).
> > sure, will add the required comments.
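> > perhaps something along these lines (just a sketch of the comment):
> >
> > /*
> >  * Assume a flat NUMA hierarchy for now: every node is its own
> >  * parent. This will need revisiting if multi-level node
> >  * topologies turn up.
> >  */
> > #define parent_node(node)      (node)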
> >>
> >>> >
> >>> > [...]
> >>> >
> >>> >> @@ -168,6 +191,11 @@ void __init bootmem_init(void)
> >>> >>         min = PFN_UP(memblock_start_of_DRAM());
> >>> >>         max = PFN_DOWN(memblock_end_of_DRAM());
> >>> >>
> >>> >> +       high_memory = __va((max << PAGE_SHIFT) - 1) + 1;
> >>> >> +       max_pfn = max_low_pfn = max;
> >>> >> +
> >>> >> +       if (IS_ENABLED(CONFIG_NUMA))
> >>> >> +               arm64_numa_init();
> >>> >
> >>> > Is this function defined if !CONFIG_NUMA? Surely it must do nothing in
> >>> > that case anyway?
> >>> yes, the if is not required; will remove it.
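> >>> i.e. provide a !CONFIG_NUMA stub so the caller can stay
> >>> unconditional, something like (sketch only):
> >>>
> >>> #ifdef CONFIG_NUMA
> >>> void __init arm64_numa_init(void);
> >>> #else
> >>> static inline void arm64_numa_init(void) { }
> >>> #endif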
> >>> >
> >>> > [...]
> >>> >
> >>> >> +/*
> >>> >> + *  Set the cpu to node and mem mapping
> >>> >> + */
> >>> >> +void numa_store_cpu_info(unsigned int cpu)
> >>> >> +{
> >>> >> +       cpu_to_node_map[cpu] = cpu_topology[cpu].cluster_id;
> >>> >> +       cpumask_set_cpu(cpu, node_to_cpumask_map[cpu_to_node_map[cpu]]);
> >>> >> +       set_numa_node(cpu_to_node_map[cpu]);
> >>> >> +       set_numa_mem(local_memory_node(cpu_to_node_map[cpu]));
> >>> >> +}
> >>> >
> >>> > I don't like this. I think we need to be more explicit in the DT w.r.t.
> >>> > the relationship between memory and the CPU hierarchy.
> >>> >
> >>> > I can imagine that we might end up with systems with multiple levels of
> >>> > NUMA hierarchy (using MPIDR_EL1.Aff{3,2}), and I'd rather that we were
> >>> > explicit as possible from the start w.r.t. the relationship between
> >>> > memory and groups of CPUs such that we don't end up with multiple ways
> >>> > of specifying said relationship, and horrible edge cases around implicit
> >>> > definitions (e.g. the nid to cluster mapping).
> >>> are you recommending an explicit nid attribute on each cpu device node?
> >>
> >> I am recommending that we make the relationship explicit. If anything,
> >> using the cpu-map (with phandles) seems like a better approach.
> > yes, will add the mapping.
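> > e.g. an explicit phandle from the memory description back into
> > cpu-map, roughly like this (node names and the "numa-node" property
> > are made up for illustration):
> >
> > cpus {
> >         cpu-map {
> >                 cluster0: cluster0 {
> >                         core0 { cpu = <&CPU0>; };
> >                         core1 { cpu = <&CPU1>; };
> >                 };
> >         };
> > };
> >
> > memory@0 {
> >         device_type = "memory";
> >         reg = <0x0 0x0 0x0 0x80000000>;
> >         numa-node = <&cluster0>;        /* phandle into cpu-map */
> > };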
> >>
> >>> >> +/**
> >>> >> + * dummy_numa_init - Fallback dummy NUMA init
> >>> >> + *
> >>> >> + * Used if there's no underlying NUMA architecture, NUMA initialization
> >>> >> + * fails, or NUMA is disabled on the command line.
> >>> >> + *
> >>> >> + * Must online at least one node and add memory blocks that cover all
> >>> >> + * allowed memory.  This function must not fail.
> >>> >> + */
> >>> >> +static int __init dummy_numa_init(void)
> >>> >> +{
> >>> >> +       pr_info("%s\n",
> >>> >> +              numa_off ? "NUMA turned off" : "No NUMA configuration found");
> >>> >
> >>> > Why not print "NUMA turned off" in numa_setup?
> >>> we enter this function only when numa is turned off or the DT/ACPI-based
> >>> numa init fails.
> >>
> >> Sure. Moving the "NUMA turned off" print into numa_setup would mean you
> >> could just print "Using dummy NUMA layout" or something to that effect
> >> here -- the function has no need to care about the value of numa_off.
> > agreed.
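> > i.e. something like this in the early_param handler (a sketch,
> > assuming the handler is called numa_setup as mentioned above and
> > that numa_off is a bool):
> >
> > static int __init numa_setup(char *opt)
> > {
> >         if (!opt)
> >                 return -EINVAL;
> >         if (!strncmp(opt, "off", 3)) {
> >                 numa_off = true;
> >                 pr_info("NUMA turned off\n");
> >         }
> >         return 0;
> > }
> > early_param("numa", numa_setup);
> >
> > leaving dummy_numa_init() to print just "Using dummy NUMA layout".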
> >>
> >>> >
> >>> >> +       pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
> >>> >> +              0LLU, PFN_PHYS(max_pfn) - 1);
> >>> >> +
> >>> >> +       node_set(0, numa_nodes_parsed);
> >>> >> +       numa_add_memblk(0, 0, PFN_PHYS(max_pfn));
> >>> >> +
> >>> >> +       return 0;
> >>> >> +}
> >>> >> +
> >>> >> +/**
> >>> >> + * early_init_dt_scan_numa_map - parse memory node and map nid to memory range.
> >>> >> + */
> >>> >> +int __init early_init_dt_scan_numa_map(unsigned long node, const char *uname,
> >>> >> +                                    int depth, void *data)
> >>> >> +{
> >>> >> +       const char *type = of_get_flat_dt_prop(node, "device_type", NULL);
> >>> >> +       const __be32 *reg, *endp, *nid_prop;
> >>> >> +       int l, nid;
> >>> >> +
> >>> >> +       /* We are scanning "memory" nodes only */
> >>> >> +       if (type == NULL) {
> >>> >> +               /*
> >>> >> +                * The longtrail doesn't have a device_type on the
> >>> >> +                * /memory node, so look for the node called /memory@0.
> >>> >> +                */
> >>> >> +               if (depth != 1 || strcmp(uname, "memory@0") != 0)
> >>> >> +                       return 0;
> >>> >
> >>> > This has no place on arm64.
> >>> i am not sure that we can move this to drivers/of; at the moment this
> >>> is an arm64-specific binding.
> >>
> >> I meant the longtrail workaround, hence my comments on that below.
> > oh! thanks, will remove this.
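> > right, without the longtrail workaround the check collapses to
> > something like:
> >
> >         /* We are scanning "memory" nodes only */
> >         if (type == NULL || strcmp(type, "memory") != 0)
> >                 return 0;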
> >>
> >>> > We limited the longtrail workaround in the core memory parsing to
> >>> > PPC32 only in commit b44aa25d20e2ef6b (of: Handle memory@0 node on
> >>> > PPC32 only). This code doesn't need it enabled ever.
> >>> >
> >>> > Are you booting using UEFI? This isn't going to work when the memory map
> >>> > comes from UEFI and we have no memory nodes in the DTB.
> >>> tried with the bootwrapper; working on booting from UEFI.
> >>> yes, there is an issue with UEFI boot, since the memory node is removed.
> >>> i request the UEFI stub dev-team to suggest possible ways to address this.
> >>
> >> I've Cc'd a few people who have worked on the stub and/or EFI memory map
> >> stuff. It would be worth keeping them on Cc so as to keep them informed.
> > thanks.
> >>
> >> I believe that the EFI stub is doing the right thing by ensuring that
> >> the EFI memory map is used, so this is just another configuration that
> >> your binding has to consider.
> > going through the EFI stub; next is to boot the NUMA kernel using UEFI.
> > will include UEFI boot support in the v2 patch.
> >>
> >> Mark.
> Below is an example of the proposed NUMA bindings in DT.
> It covers the CPU-to-node mapping and the memory-range-to-node mapping,
> and also defines the proximity distance matrix between nodes.
> please let me know your comments so i can go ahead with the implementation.
> 
> numa-map {
>          /*
>           * #address-cells is used for the memory range base address in
>           * mem-map; #size-cells is used for all other values.
>           * #node-count gives the number of NUMA nodes in the system.
>           */
>          #address-cells = <2>;
>          #size-cells = <1>;
>          #node-count = <4>;
> 
>          /* Memory map: one <base-address node-id> entry per node's range */
>          mem-map = <0x0   0x00c00000 0>,
>                    <0x1   0x00000000 1>,
>                    <0x100 0x00000000 2>,
>                    <0x200 0x00000000 3>;
> 
>          /*
>           * CPU to node map for a 4-node, 16-CPU system:
>           * <first-cpu last-cpu node>
>           */
>          cpu-map = <0   3  0>,
>                    <4   7  1>,
>                    <8  11  2>,
>                    <12 15  3>;
> 
>          /*
>           * Proximity distance matrix for a 4-node system:
>           * <from-node to-node distance>
>           */
>          node-matrix = <0 0 10>,
>                        <0 1 20>,
>                        <0 2 30>,
>                        <0 3 10>,
>                        <1 0 20>,
>                        <1 1 10>,
>                        <1 2 30>,
>                        <1 3 10>,
>                        <2 0 30>,
>                        <2 1 20>,
>                        <2 2 10>,
>                        <2 3 10>,
>                        <3 0 10>,
>                        <3 1 20>,
>                        <3 2 30>,
>                        <3 3 10>;
> };

Hi Ganapat,
The above caught my attention.

For a 4-node system, do we really need all 16 distances? The implication
would be that the distance from node A to node B could differ from the
distance from B to A. Also, couldn't the distance from a node to itself
be safely assumed to be zero?

I think we should have a symmetric matrix with a zero diagonal, so
strictly only six values would need specifying for a 4-node system.
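
For example, keeping only the upper triangle of your matrix (syntax
hypothetical, distances taken from your example):

    node-matrix = <0 1 20>,
                  <0 2 30>,
                  <0 3 10>,
                  <1 2 30>,
                  <1 3 10>,
                  <2 3 10>;

with the B-to-A entries implied by symmetry and the diagonal implied
to be zero.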

Cheers,
-- 
Steve


