[PATCH 1/1] [RFC] arm: add half-word __xchg
Jamie Lokier
jamie at shareable.org
Thu Mar 18 23:36:01 EDT 2010
Mathieu Desnoyers wrote:
> * Jamie Lokier (jamie at shareable.org) wrote:
> > Mathieu Desnoyers wrote:
> > > But.. thinking about it. It's bad to have a 2-byte xchg primitive that
> > > only works on UP and breaks the build on SMP. We should instead
> > > implement a workaround based on __cmpxchg4 to perform the 2-byte xchg().
> >
> > This exposes why there should be __cmpxchg_bool() versions, which do
> > not loop, preferably in the generic kernel API, because the workaround
> > using __cmpxchg4 has to add yet another pointless loop nesting to all
> > cmpxchg users.
>
> The workaround I propose is to use __cmpxchg4 in a loop for 2-byte xchg()
> fallback on arm; it is not related to cmpxchg() in any way. I don't
> think one single use-case justifies the creation of a __cmpxchg_bool()
> across all architectures.
>
> Also, I've never seen where having a cmpxchg primitive returning a
> boolean is more efficient than returning the value read, ever. If we
> have to loop, then we can re-use the value that has been read by
> cmpxchg, while the version returning a boolean would need a read to
> re-read the next value.
Your workaround is an example of where a bool-returning cmpxchg helps
(though we are both right; the best I can come up with is described at
the end).  Here is a version of __xchg2 as you've described it (please
ignore any irrelevant mistakes :-):
static inline u16 __xchg2(u16 *ptr, u16 new)
{
	int shift = (~(unsigned)ptr & 2) * 8;	/* Assumes big-endian. */
	u32 mask = ~(0x0000ffff << shift), add = new << shift;
	u32 *ptrbig = (u32 *)((char *)ptr - ((unsigned)ptr & 2));
	u32 oldbig = *ptrbig;

	while (1) {
		u32 tmp = __cmpxchg4(ptrbig, oldbig, (oldbig & mask) | add);
		if (tmp == oldbig)
			break;
		oldbig = tmp;
	}
	return (u16)(oldbig >> shift);
}
This is __cmpxchg4() from the __cmpxchg() macro in <asm/system.h>,
types changed for sanity:
static inline u32 __cmpxchg4(u32 *ptr, u32 old, u32 new)
{
	u32 oldval, res;

	do {
		asm volatile("@ __cmpxchg4\n"
		"	ldrex	%1, [%2]\n"
		"	mov	%0, #0\n"
		"	teq	%1, %3\n"
		"	strexeq	%0, %4, [%2]\n"
			: "=&r" (res), "=&r" (oldval)
			: "r" (ptr), "Ir" (old), "r" (new)
			: "memory", "cc");
	} while (res);

	return oldval;
}
Now let's imagine a __cmpxchg4_bool, which returns non-zero if the new
value was stored, and zero if the new value was not stored:
static inline int __cmpxchg4_bool(u32 *ptr, u32 old, u32 new)
{
	u32 tmp, failed;

	asm volatile("@ __cmpxchg4_bool\n"
	"	ldrex	%1, [%2]\n"
	"	mov	%0, #1\n"
	"	teq	%1, %3\n"
	"	strexeq	%0, %4, [%2]\n"
		: "=&r" (failed), "=&r" (tmp)
		: "r" (ptr), "Ir" (old), "r" (new)
		: "memory", "cc");

	return !failed;
}
And the alternative loop part of __xchg2 (the setup and return aren't
interesting):
	do {
		oldbig = *ptrbig;
	} while (!__cmpxchg4_bool(ptrbig, oldbig, (oldbig & mask) | add));
I'm sure you can already see where this is going.
Let's compile __xchg2 using __cmpxchg4 by hand. We're not interested
in the setup, just the loop, so let's imagine the code starting at
"oldbig = *ptrbig". The names are all symbolic register names:
	ldr	oldbig, [ptrbig]
@ start of loop in __xchg2
1:
	and	newbig, oldbig, mask
	orr	newbig, newbig, add
@ start of __cmpxchg4
2:
	ldrex	tmp, [ptrbig]
	mov	res, #0
	teq	tmp, oldbig
	strexeq	res, newbig, [ptrbig]
@ end of asm, but do..while(res) still needed
	teq	res, #0
	bne	2b
@ end of __cmpxchg4
@ remainder of loop in __xchg2
	teq	tmp, oldbig
	mov	oldbig, tmp
	bne	1b
The result is (oldbig >> shift).
Now for comparison, the equivalent part of __xchg2 using __cmpxchg4_bool:
@ start of loop in __xchg2
1:
	ldr	oldbig, [ptrbig]
	and	newbig, oldbig, mask
	orr	newbig, newbig, add
@ start of __cmpxchg4_bool
	ldrex	tmp, [ptrbig]
	mov	failed, #1
	teq	tmp, oldbig
	strexeq	failed, newbig, [ptrbig]
@ end of __cmpxchg4_bool (the !failed is optimised away by the caller's !)
@ remainder of loop in __xchg2
	teq	failed, #0
	bne	1b
As you can see, the bool version has one less level of loop nesting,
and because of that it is 9 instructions instead of 11 for the
non-bool version.  Although the bool version is shorter, and the
double-loop is pointless, you are right that the ldrex result can be
recycled to avoid a load (and that is probably faster).
Meaning the "ideal" API for the bool-flavoured call would return
_both_ the old value and whether the new value was stored.  A tidy way
to do that would be to take &old instead, and modify the pointed-to value.
Combining both observations:
static inline int __cmpxchg4_bool2(u32 *ptr, u32 *old, u32 new)
{
	u32 failed;

	asm volatile("@ __cmpxchg4_bool2\n"
	"	ldrex	%1, [%2]\n"
	"	mov	%0, #1\n"
	"	teq	%1, %3\n"
	"	strexeq	%0, %4, [%2]\n"
		: "=&r" (failed), "=&r" (*old)
		: "r" (ptr), "Ir" (*old), "r" (new)
		: "memory", "cc");

	return !failed;
}
The new loop part of __xchg2:
	u32 oldbig = *ptrbig;

	while (!__cmpxchg4_bool2(ptrbig, &oldbig, (oldbig & mask) | add)) {}
The result (assuming GCC does its job optimising *&oldbig to oldbig):
	ldr	oldbig, [ptrbig]
@ start of loop in __xchg2
1:
	and	newbig, oldbig, mask
	orr	newbig, newbig, add
@ start of __cmpxchg4_bool2
	ldrex	tmp, [ptrbig]
	mov	failed, #1
	teq	tmp, oldbig
	strexeq	failed, newbig, [ptrbig]
	mov	oldbig, tmp
@ end of __cmpxchg4_bool2 (the !failed is optimised away by the caller's !)
@ remainder of loop in __xchg2
	teq	failed, #0
	bne	1b
That eliminates the load and the double-loop.
Quite a lot of __cmpxchg calls in the kernel are like this, with an
unnecessary extra loop, but by no means all are. It's not just
__xchg2, but obviously all the callers would have to change to take
advantage of a __cmpxchg_bool2-like function.
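For illustration, here is a hypothetical caller of that shape (the
atomic_or4() helper is mine, made up for this mail, and uses the
__cmpxchg4 above); the caller's retry loop is the only one that is
really needed, yet every __cmpxchg4 call hides its own ldrex/strex
loop underneath:

static inline void atomic_or4(u32 *ptr, u32 bits)
{
	/* Hypothetical example, not kernel code: loop 1 is the caller's
	 * retry loop, loop 2 is the ldrex/strex loop inside __cmpxchg4. */
	u32 old = *ptr, tmp;

	while ((tmp = __cmpxchg4(ptr, old, old | bits)) != old)
		old = tmp;
}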
The difference becomes more evident when writing a __cmpxchg2
emulation using __cmpxchg4.  Then you get an unnecessary *triple* loop
at many call sites: from __cmpxchg4 itself, from the __cmpxchg2
wrapped around it, and from the caller, which is also looping.  Only
the caller's loop is necessary.
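To make the nesting concrete, here is a rough, hypothetical sketch of
a __cmpxchg2 emulated with __cmpxchg4, using the same big-endian
layout as __xchg2 above (not a proposed implementation); a caller's
own retry loop then sits on top of both loops visible below:

static inline u16 __cmpxchg2(u16 *ptr, u16 old, u16 new)
{
	/* Sketch only: loop 2 retries when some other part of the
	 * containing word changed under us; loop 3 is the ldrex/strex
	 * loop hidden inside __cmpxchg4. */
	int shift = (~(unsigned)ptr & 2) * 8;	/* big-endian, as above */
	u32 mask = ~(0x0000ffffu << shift);
	u32 *ptrbig = (u32 *)((char *)ptr - ((unsigned)ptr & 2));
	u32 oldbig = *ptrbig;

	while ((u16)(oldbig >> shift) == old) {
		u32 newbig = (oldbig & mask) | ((u32)new << shift);
		u32 tmp = __cmpxchg4(ptrbig, oldbig, newbig);

		if (tmp == oldbig)
			break;
		oldbig = tmp;	/* something else in the word changed; retry */
	}
	return (u16)(oldbig >> shift);
}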
Obviously we can only go so far. The shortest version of __xchg2 puts
the and+or between ldrex and strex, but that really isn't something
that can be exported safely in an arch-generic way:
@ start of loop in __xchg2_smallest
1:
	ldrex	oldbig, [ptrbig]
	and	newbig, oldbig, mask
	orr	newbig, newbig, add
	strex	failed, newbig, [ptrbig]
	teq	failed, #0
	bne	1b
The difference between 6 instructions and the original 11, together
with its simplicity and clarity, leads me to suggest using the last
one as the ARM version of __xchg2 when {ld,st}rexh are not available.
Its only downside is that it can't go in asm-generic.  That is, unless
we had an __atomic_and_xor() arch function for asm-generic to use :-)
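For reference, a rough sketch of how that loop might be wrapped in C
(my code, untested, assuming ARMv6 ldrex/strex and the same big-endian
layout as the examples above):

static inline u16 __xchg2_smallest(u16 *ptr, u16 new)
{
	/* Sketch only: the and/orr sit between ldrex and strex, and the
	 * C do..while re-runs the whole sequence when strex fails. */
	int shift = (~(unsigned)ptr & 2) * 8;
	u32 mask = ~(0x0000ffffu << shift), add = (u32)new << shift;
	u32 *ptrbig = (u32 *)((char *)ptr - ((unsigned)ptr & 2));
	u32 oldbig, newbig, failed;

	do {
		asm volatile("@ __xchg2_smallest\n"
		"	ldrex	%0, [%3]\n"
		"	and	%1, %0, %4\n"
		"	orr	%1, %1, %5\n"
		"	strex	%2, %1, [%3]\n"
			: "=&r" (oldbig), "=&r" (newbig), "=&r" (failed)
			: "r" (ptrbig), "r" (mask), "r" (add)
			: "memory", "cc");
	} while (failed);

	return (u16)(oldbig >> shift);
}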
-- Jamie