[PATCH] percpu: fix chunk range calculation

Mon Nov 21 12:01:46 EST 2011

Hello, Dave.

On Mon, Nov 21, 2011 at 09:45:51AM +0800, Dave Young wrote:
> On 11/19/2011 02:55 AM, Tejun Heo wrote:
> 
> > Percpu allocator recorded the cpus which map to the first and last
> > units in pcpu_first/last_unit_cpu respectively and used them to
> > determine the address range of a chunk - e.g. it assumed that the
> > first unit has the lowest address in a chunk while the last unit has
> > the highest address.
> > 
> > This simply isn't true.  Groups in a chunk can have arbitrary positive
> > or negative offsets from the previous one and there is no guarantee
> > that the first unit occupies the lowest offset while the last one the
> > highest.
> > 
> > Fix it by actually comparing unit offsets to determine cpus occupying
> > the lowest and highest offsets.  Also, rename pcu_first/last_unit_cpu
> > to pcpu_low/high_unit_cpu to avoid confusion.
> > 
> > The chunk address range is used to flush cache on vmalloc area
> > map/unmap and decide whether a given address is in the first chunk by
> > per_cpu_ptr_to_phys() and the bug was discovered by invalid
> > per_cpu_ptr_to_phys() translation for crash_note.
> > 
> > Kudos to Dave Young for tracking down the problem.
> 
> Tejun, thanks
> 
> Now that if addr is not in first trunk it must be in vmalloc area, below
> logic should be right:
> 	if (in_first_chunk) {
> 		if (!is_vmalloc_addr(addr))
> 			return __pa(addr);
> 		else
> 			return page_to_phys(vmalloc_to_page(addr));
> 	} else
> 		if (!is_vmalloc_addr(addr)) /* not possible */
> 			return __pa(addr);
> 		else
> 			return page_to_phys(pcpu_addr_to_page(addr));
> 
> So how about just simply remove in first chunk checking to simplify the
> code as followging:
> 
> phys_addr_t per_cpu_ptr_to_phys(void *addr)
> {
> 	if (!is_vmalloc_addr(addr))
> 		return __pa(addr);
> 	else
> 		return page_to_phys(pcpu_addr_to_page(addr));
> }

Yes, that's much simpler.  Hmmm... I'm still slightly reluctant
because the current code reflects better how percpu allocator actually
works.  It has special setup for the first chunk, which currently
supports either embedding in linear address space or vmalloc mapping,
and, from the second one, the backing allocator (currently either vm
or km) provides translation.

per_cpu_ptr_to_phys() has been pretty good at exposing wrong
assumptions in address translation mostly because this is one of few
points where the assumptions are visible in a verifiable way (I think
this is the second bug it has discovered).

This forces the verification that "if it isn't in one of the explicit
first chunk areas, it MUST follow backing allocator mapping" which can
discover both bugs in percpu allocator itself and
per_cpu_ptr_to_phys() callers.  e.g. with the proposed change, if the
caller passes in usual kernel address on percpu-vm which is not in the
first chunk, it will return __pa for the address even though that
address can't possibly be a percpu address, but the current code will
trip VIRTUAL_BUG_ON() in vmalloc_to_page().

Maybe we can add comment there how it can be further simplified and
why things look more complicated than necessary?

Thank you.

-- 
tejun