<p>Dave,</p>
<p>I will give it a try ASAP.  Unfortunately, ASAP won't be until Friday.</p>
<p>tim<br></p>
<div class="gmail_quote">On Nov 20, 2011 6:43 PM, "Dave Young" <<a href="mailto:dyoung@redhat.com">dyoung@redhat.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 11/19/2011 02:55 AM, Tejun Heo wrote:<br>
<br>
> Percpu allocator recorded the cpus which map to the first and last<br>
> units in pcpu_first/last_unit_cpu respectively and used them to<br>
> determine the address range of a chunk - e.g. it assumed that the<br>
> first unit has the lowest address in a chunk while the last unit has<br>
> the highest address.<br>
><br>
> This simply isn't true.  Groups in a chunk can have arbitrary positive<br>
> or negative offsets from the previous one and there is no guarantee<br>
> that the first unit occupies the lowest offset while the last one the<br>
> highest.<br>
><br>
> Fix it by actually comparing unit offsets to determine cpus occupying<br>
> the lowest and highest offsets.  Also, rename pcpu_first/last_unit_cpu<br>
> to pcpu_low/high_unit_cpu to avoid confusion.<br>
><br>
> The chunk address range is used to flush caches on vmalloc area<br>
> map/unmap and to decide whether a given address is in the first chunk<br>
> in per_cpu_ptr_to_phys(); the bug was discovered through an invalid<br>
> per_cpu_ptr_to_phys() translation for crash_note.<br>
><br>
> Kudos to Dave Young for tracking down the problem.<br>
<br>
<br>
Tejun, thanks<br>
<br>
Now, if addr is not in the first chunk it must be in the vmalloc<br>
area, so the logic below should be right:<br>
        if (in_first_chunk) {<br>
                if (!is_vmalloc_addr(addr))<br>
                        return __pa(addr);<br>
                else<br>
                        return page_to_phys(vmalloc_to_page(addr));<br>
        } else<br>
                if (!is_vmalloc_addr(addr)) /* not possible */<br>
                        return __pa(addr);<br>
                else<br>
                        return page_to_phys(pcpu_addr_to_page(addr));<br>
<br>
So how about simply removing the in-first-chunk check to simplify the<br>
code as follows:<br>
<br>
phys_addr_t per_cpu_ptr_to_phys(void *addr)<br>
{<br>
        if (!is_vmalloc_addr(addr))<br>
                return __pa(addr);<br>
        else<br>
                return page_to_phys(pcpu_addr_to_page(addr));<br>
}<br>
<br>
><br>
> Signed-off-by: Tejun Heo <<a href="mailto:tj@kernel.org">tj@kernel.org</a>><br>
> Reported-by: WANG Cong <<a href="mailto:xiyou.wangcong@gmail.com">xiyou.wangcong@gmail.com</a>><br>
> Reported-by: Dave Young <<a href="mailto:dyoung@redhat.com">dyoung@redhat.com</a>><br>
> LKML-Reference: <<a href="mailto:4EC21F67.10905@redhat.com">4EC21F67.10905@redhat.com</a>><br>
> Cc: stable @<a href="http://kernel.org" target="_blank">kernel.org</a><br>
> ---<br>
> Heh, that's a stupid bug.  Can you please verify this fixes the<br>
> problem?<br>
<br>
<br>
I can confirm that the per-cpu translation works with this patch, but I<br>
have not managed to finish the kdump test because kexec-tools refuses<br>
to load the crash kernel. I need to do more debugging on kexec-tools.<br>
<br>
Tim, could you help test whether this patch works for your kdump case?<br>
<br>
><br>
> Thank you very much.<br>
><br>
>  mm/percpu-vm.c |   12 ++++++------<br>
>  mm/percpu.c    |   34 ++++++++++++++++++++--------------<br>
>  2 files changed, 26 insertions(+), 20 deletions(-)<br>
><br>
> diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c<br>
> index ea53496..bfad724 100644<br>
> --- a/mm/percpu-vm.c<br>
> +++ b/mm/percpu-vm.c<br>
> @@ -143,8 +143,8 @@ static void pcpu_pre_unmap_flush(struct pcpu_chunk *chunk,<br>
>                                int page_start, int page_end)<br>
>  {<br>
>       flush_cache_vunmap(<br>
> -             pcpu_chunk_addr(chunk, pcpu_first_unit_cpu, page_start),<br>
> -             pcpu_chunk_addr(chunk, pcpu_last_unit_cpu, page_end));<br>
> +             pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),<br>
> +             pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));<br>
>  }<br>
><br>
>  static void __pcpu_unmap_pages(unsigned long addr, int nr_pages)<br>
> @@ -206,8 +206,8 @@ static void pcpu_post_unmap_tlb_flush(struct pcpu_chunk *chunk,<br>
>                                     int page_start, int page_end)<br>
>  {<br>
>       flush_tlb_kernel_range(<br>
> -             pcpu_chunk_addr(chunk, pcpu_first_unit_cpu, page_start),<br>
> -             pcpu_chunk_addr(chunk, pcpu_last_unit_cpu, page_end));<br>
> +             pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),<br>
> +             pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));<br>
>  }<br>
><br>
>  static int __pcpu_map_pages(unsigned long addr, struct page **pages,<br>
> @@ -284,8 +284,8 @@ static void pcpu_post_map_flush(struct pcpu_chunk *chunk,<br>
>                               int page_start, int page_end)<br>
>  {<br>
>       flush_cache_vmap(<br>
> -             pcpu_chunk_addr(chunk, pcpu_first_unit_cpu, page_start),<br>
> -             pcpu_chunk_addr(chunk, pcpu_last_unit_cpu, page_end));<br>
> +             pcpu_chunk_addr(chunk, pcpu_low_unit_cpu, page_start),<br>
> +             pcpu_chunk_addr(chunk, pcpu_high_unit_cpu, page_end));<br>
>  }<br>
><br>
>  /**<br>
> diff --git a/mm/percpu.c b/mm/percpu.c<br>
> index bf80e55..93b5a7c 100644<br>
> --- a/mm/percpu.c<br>
> +++ b/mm/percpu.c<br>
> @@ -116,9 +116,9 @@ static int pcpu_atom_size __read_mostly;<br>
>  static int pcpu_nr_slots __read_mostly;<br>
>  static size_t pcpu_chunk_struct_size __read_mostly;<br>
><br>
> -/* cpus with the lowest and highest unit numbers */<br>
> -static unsigned int pcpu_first_unit_cpu __read_mostly;<br>
> -static unsigned int pcpu_last_unit_cpu __read_mostly;<br>
> +/* cpus with the lowest and highest unit addresses */<br>
> +static unsigned int pcpu_low_unit_cpu __read_mostly;<br>
> +static unsigned int pcpu_high_unit_cpu __read_mostly;<br>
><br>
>  /* the address of the first chunk which starts with the kernel static area */<br>
>  void *pcpu_base_addr __read_mostly;<br>
> @@ -984,19 +984,19 @@ phys_addr_t per_cpu_ptr_to_phys(void *addr)<br>
>  {<br>
>       void __percpu *base = __addr_to_pcpu_ptr(pcpu_base_addr);<br>
>       bool in_first_chunk = false;<br>
> -     unsigned long first_start, first_end;<br>
> +     unsigned long first_low, first_high;<br>
>       unsigned int cpu;<br>
><br>
>       /*<br>
> -      * The following test on first_start/end isn't strictly<br>
> +      * The following test on unit_low/high isn't strictly<br>
>        * necessary but will speed up lookups of addresses which<br>
>        * aren't in the first chunk.<br>
>        */<br>
> -     first_start = pcpu_chunk_addr(pcpu_first_chunk, pcpu_first_unit_cpu, 0);<br>
> -     first_end = pcpu_chunk_addr(pcpu_first_chunk, pcpu_last_unit_cpu,<br>
> -                                 pcpu_unit_pages);<br>
> -     if ((unsigned long)addr >= first_start &&<br>
> -         (unsigned long)addr < first_end) {<br>
> +     first_low = pcpu_chunk_addr(pcpu_first_chunk, pcpu_low_unit_cpu, 0);<br>
> +     first_high = pcpu_chunk_addr(pcpu_first_chunk, pcpu_high_unit_cpu,<br>
> +                                  pcpu_unit_pages);<br>
> +     if ((unsigned long)addr >= first_low &&<br>
> +         (unsigned long)addr < first_high) {<br>
>               for_each_possible_cpu(cpu) {<br>
>                       void *start = per_cpu_ptr(base, cpu);<br>
><br>
> @@ -1233,7 +1233,9 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,<br>
><br>
>       for (cpu = 0; cpu < nr_cpu_ids; cpu++)<br>
>               unit_map[cpu] = UINT_MAX;<br>
> -     pcpu_first_unit_cpu = NR_CPUS;<br>
> +<br>
> +     pcpu_low_unit_cpu = NR_CPUS;<br>
> +     pcpu_high_unit_cpu = NR_CPUS;<br>
><br>
>       for (group = 0, unit = 0; group < ai->nr_groups; group++, unit += i) {<br>
>               const struct pcpu_group_info *gi = &ai->groups[group];<br>
> @@ -1253,9 +1255,13 @@ int __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai,<br>
>                       unit_map[cpu] = unit + i;<br>
>                       unit_off[cpu] = gi->base_offset + i * ai->unit_size;<br>
><br>
> -                     if (pcpu_first_unit_cpu == NR_CPUS)<br>
> -                             pcpu_first_unit_cpu = cpu;<br>
> -                     pcpu_last_unit_cpu = cpu;<br>
> +                     /* determine low/high unit_cpu */<br>
> +                     if (pcpu_low_unit_cpu == NR_CPUS ||<br>
> +                         unit_off[cpu] < unit_off[pcpu_low_unit_cpu])<br>
> +                             pcpu_low_unit_cpu = cpu;<br>
> +                     if (pcpu_high_unit_cpu == NR_CPUS ||<br>
> +                         unit_off[cpu] > unit_off[pcpu_high_unit_cpu])<br>
> +                             pcpu_high_unit_cpu = cpu;<br>
>               }<br>
>       }<br>
>       pcpu_nr_units = unit;<br>
><br>
> _______________________________________________<br>
> kexec mailing list<br>
> <a href="mailto:kexec@lists.infradead.org">kexec@lists.infradead.org</a><br>
> <a href="http://lists.infradead.org/mailman/listinfo/kexec" target="_blank">http://lists.infradead.org/mailman/listinfo/kexec</a><br>
<br>
<br>
<br>
--<br>
Thanks<br>
Dave<br>
</blockquote></div>