[PATCH] ARM: implement optimized percpu variable access
Will Deacon
will.deacon at arm.com
Wed Nov 28 07:34:58 EST 2012
Hi Jamie,
On Tue, Nov 27, 2012 at 01:02:03AM +0000, Jamie Lokier wrote:
> Will Deacon wrote:
> > That was a fun bit of debugging -- my hunch was right,
>
> Yes it was.
>
> I'll sum up what I found looking at the x86 version.
> Brief summary:
>
> 1. Due to preemption, it's not safe to cache per-CPU values within
> a function in general.
> 2. Except, if they are per-thread values (like current and
> current_thread_info) that don't depend on the CPU and just use
> per-CPU for efficient reading.
> 3. So, implement separate this_cpu_read_stable() and
> this_cpu_read() if you want GCC to cache certain values that are
> safe to cache.
> 4. Use asm volatile, not just an "m" constraint. I think x86 has a
> bug by using just "m" for this_cpu_read().
>
> Long version:
>
> - It's not really about the code in schedule(). It's about when
> context switch can happen. Which is every instruction in a
> preemptible context.
I disagree. You don't get magically pre-empted: control must pass through
context_switch to put you on a runqueue and choose a different task. With
PREEMPT, this can happen in response to an interrupt but the state of the
interrupted context is still correctly saved/restored via switch_to and
friends.
If a function accesses per-cpu data when preemptible(), then it must be
prepared for that pointer to be stale (and this case does seem to be
handled: see slab_alloc_node, called during kmalloc, for example). So actually,
the only case of note *is* __schedule. Why? Because preemption is disabled,
but half of the function may appear to execute on one CPU and the other half
(i.e. once the task has been rescheduled from a different runqueue) may
execute on a different CPU.
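For reference, callers that touch per-cpu data while preemptible either use
the preemption-safe this_cpu_* accessors or detect the migration and retry
(slab_alloc_node does the latter with its transaction id). A minimal sketch
of the first pattern, with made-up names rather than anything from this patch:

#include <linux/percpu.h>

DEFINE_PER_CPU(unsigned long, rx_packets);

static void account_rx(void)
{
	/*
	 * Fine to call while preemptible: this_cpu_inc() performs the
	 * read-modify-write as a single preemption-safe operation, so
	 * migrating before or after it just means some other CPU's
	 * counter gets bumped, which is acceptable for statistics.
	 */
	this_cpu_inc(rx_packets);
}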
The solution is to make the per-cpu offset reader hazard with
context-switch/barrier(). I tried Nico's suggestion of adding a memory
clobber and it seems to work pretty well, without any noticeable degradation
in quality of the generated code that I could spot:
diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 9c8d051..2e58a1d 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -24,13 +24,15 @@
 static inline void set_my_cpu_offset(unsigned long off)
 {
-	asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) );
+	/* Set TPIDRPRW */
+	asm volatile("mcr p15, 0, %0, c13, c0, 4" : : "r" (off) : "memory");
 }
 
 static inline unsigned long __my_cpu_offset(void)
 {
 	unsigned long off;
-	asm("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off) : );
+	/* Read TPIDRPRW */
+	asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
 	return off;
 }
 #define __my_cpu_offset __my_cpu_offset()
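To see why the "memory" clobber matters, consider something like the
following (counter is just an illustrative per-cpu variable, not anything in
the tree). barrier() is itself a memory clobber, so once the mrc clobbers
memory too, GCC can no longer cache the offset from before the barrier and
reuse it afterwards:

#include <linux/compiler.h>
#include <linux/percpu.h>

DEFINE_PER_CPU(int, counter);

static int demo(void)
{
	int a, b;

	a = __this_cpu_read(counter);	/* reads TPIDRPRW for the offset */
	barrier();			/* e.g. the barrier implied across a context switch */
	b = __this_cpu_read(counter);	/* with the clobber, TPIDRPRW is re-read here */

	return a + b;
}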
> - That seems like a nice trick, allowing some reads to be reused
> between instructions. It can be replicated even on other archs,
> using an "m" constraint on a dummy extern variable (no need to
> actually read anything!). But I'm not convinced this isn't a bug
> on x86. Consider this code:
The "m" constraint probably isn't what we want. Firstly, it will cause GCC
to emit instructions to calculate the address of whatever the dummy extern
variable is (probably a PC-relative load to get its address) and secondly,
GCC may generate a post-increment/decrement addressing mode with the
assumption that it will be evaluated exactly once in the asm block. You
could use "o", but you still have to load the address.
> #include <linux/percpu.h>
> #include <linux/preempt.h>
> #include <linux/spinlock.h>
>
> DEFINE_PER_CPU(int, myvar1) = 0;
> DEFINE_PER_CPU(int, myvar2) = 0;
>
> static spinlock_t mylock = SPIN_LOCK_UNLOCKED;
Guessing this should be raw?
> int mytest(void)
> {
> unsigned long flags;
> int x, y, z;
>
> x = this_cpu_read(myvar1);
If it's important that the CPU doesn't change, this read into x should happen
inside the critical section.
> spin_lock_irqsave(&mylock, flags);
>
> /*
> * These two values should be consistent due to irqsave: No
> * preemption, no interrupts. But on x86, GCC can reuse the x
> * above for the value of y. If preemption happened before the
> * irqsave, y and z are not consistent.
> */
> y = this_cpu_read(myvar1);
> z = this_cpu_read(myvar2);
> spin_unlock_irqrestore(&mylock, flags);
>
> return y + z;
> }
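Putting those two comments together, I'd expect something more like the
following (untested sketch, reusing Jamie's myvar1/myvar2 definitions) if all
three reads really need to see the same CPU:

static DEFINE_RAW_SPINLOCK(mylock);

int mytest(void)
{
	unsigned long flags;
	int x, y, z;

	raw_spin_lock_irqsave(&mylock, flags);
	/* All three reads now observe the same CPU's variables. */
	x = this_cpu_read(myvar1);
	y = this_cpu_read(myvar1);
	z = this_cpu_read(myvar2);
	raw_spin_unlock_irqrestore(&mylock, flags);

	return x + y + z;
}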
Will