udelay() broken for SMP cores?
Russell King - ARM Linux
linux at arm.linux.org.uk
Sat Jan 8 18:24:27 EST 2011
On Thu, Apr 22, 2010 at 01:14:17AM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > Ok, since you seem to have a clear idea how to convert this into a double
> > nested loop, try converting it:
> >
> > @ 0 <= r0 <= 0x7fffff06
> >         ldr     r2, .LC0 (loops_per_jiffy)
> >         ldr     r2, [r2]                @ max = 0x01ffffff
> >         mov     r0, r0, lsr #14         @ max = 0x0001ffff
> >         mov     r2, r2, lsr #10         @ max = 0x00007fff
> >         mul     r0, r2, r0              @ max = 2^32-1
> >         movs    r0, r0, lsr #6
> >         moveq   pc, lr
> > 1:      subs    r0, r0, #1
> >         bhi     1b
> >         mov     pc, lr
> >
> > into two loops without losing the precision - note that the multiply
> > is part of a 'dividing by multiply+shift' technique.
>
>         ldr     r2, loops_per_jiffy
>         ldr     r3, microseconds_per_jiffy
>         mov     r4, r2
> 1:      subs    r4, r4, r3
>         bhi     1b
>         subs    r0, r0, #1
>         add     r4, r4, r2
>         bhi     1b
>         mov     pc, lr
>
> Goodnight :)
I thought I'd dig this out and give it a go - but it has problems. Let's
say usec_per_jiffy is 10000 (ie, HZ=100). Initially, at boot,
loops_per_jiffy is 1<<12, ie 4096.
If udelay() is used prior to calibration (it is - see things like OMAP/8250
console drivers which use udelay(1)), the initial loops_per_jiffy value
will be used.
So, r0 = 10000. r3 = 10000. r2 = 4096.
mov r4, r2 @ r4 := 4096
1: subs r4, r4, r3 @ r4 -= 10000 := -5904
bhi 1b @ not taken
subs r0, r0, #1 @ r0 -= 1 := 9999
add r4, r4, r2 @ r4 += 4096 := -1808 (or 4294965488)
bhi 1b @ taken
That's the first iteration. The next iteration:
1: subs r4, r4, r3 @ r4 -= 10000 := 4294955488
bhi 1b @ taken
1: subs r4, r4, r3 @ r4 -= 10000 := 4294945488
bhi 1b @ taken
... which means we have about 429493 loops to go ...
So this becomes an extremely slow loop - it only works when
loops_per_jiffy >= usec_per_jiffy.
Even with a value of 8192 (the first tried lpj in the calibration loop),
things eventually go wrong - r4 on each iteration goes -1808, -3616, ..
-9040 and then we're into the problem above - and this will be the case
for anyone with HZ=100.
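
The blow-up is easy to reproduce with a rough C model of that loop (names
invented for illustration, arithmetic unsigned 32-bit like the registers):
with the boot-time lpj of 4096 and usec_per_jiffy of 10000, even a udelay(10)
costs a few million trips around the inner loop.

#include <stdint.h>
#include <stdio.h>

/*
 * C model of the two-branch loop (illustration only; assumes usecs >= 1).
 * All values are unsigned 32-bit, as in the ARM registers, so once r4
 * wraps, each remaining microsecond costs roughly 2^32 / usec_per_jiffy
 * trips around the inner loop.
 */
static uint64_t model_loop(uint32_t usecs, uint32_t lpj, uint32_t usec_per_jiffy)
{
        uint32_t r0 = usecs, r4 = lpj, prev;
        uint64_t iterations = 0;

        do {
                do {                            /* 1: subs r4, r4, r3   */
                        prev = r4;
                        r4 -= usec_per_jiffy;
                        iterations++;
                } while (prev > usec_per_jiffy);/*    bhi 1b            */

                r0 -= 1;                        /* subs r0, r0, #1      */
                r4 += lpj;                      /* add  r4, r4, r2      */
        } while (r0 != 0);                      /* bhi 1b - flags come  */
                                                /* from the subs above  */
        return iterations;
}

int main(void)
{
        printf("healthy lpj=100000:  %llu iterations for udelay(10)\n",
               (unsigned long long)model_loop(10, 100000, 10000));
        printf("boot-time lpj=4096:  %llu iterations for udelay(10)\n",
               (unsigned long long)model_loop(10, 4096, 10000));
        return 0;
}
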
So, this solution has undesirable behaviours... and this is what I've
come up with - we scale the lpj up and the required delay down by powers
of two until the necessary precondition (lpj >= usec/jiffy) is met.

.LC0:   .long   loops_per_jiffy
        .long   (1000000 + (HZ / 2))/HZ

ENTRY(__delay)
        ldr     r3, .LC0 + 4            @ usec/jiffy
        mov     r2, r0
        mov     r0, r3
        b       2f
ENTRY(__udelay)
        ldr     r2, .LC0
        ldr     r3, .LC0 + 4            @ usec/jiffy
        ldr     r2, [r2]                @ lpj
2:      cmp     r2, r3
        movcc   r2, r2, lsl #1
        movcc   r0, r0, lsr #1
        bcc     2b
        mov     ip, r2
1:      subs    ip, ip, r3
        addls   ip, ip, r2
        sublss  r0, r0, #2
        bhi     1b
        mov     pc, lr
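
Ignoring the exact cycle counting, what the above does is roughly this in C
(names invented for illustration):

#include <stdint.h>

/*
 * Rough C rendering of the assembler above, purely to make the control
 * flow easier to follow; the names are invented.  usecs, lpj and
 * usec_per_jiffy correspond to r0, r2 and r3.
 */
static void udelay_model(uint32_t usecs, uint32_t lpj, uint32_t usec_per_jiffy)
{
        uint32_t ip, prev_ip, prev_usecs;

        /* 2: cmp/movcc/movcc/bcc - establish lpj >= usec/jiffy by
         * doubling lpj and halving the requested delay. */
        while (lpj < usec_per_jiffy) {
                lpj <<= 1;
                usecs >>= 1;
        }

        ip = lpj;
        for (;;) {
                prev_ip = ip;
                ip -= usec_per_jiffy;           /* subs ip, ip, r3      */
                if (prev_ip <= usec_per_jiffy) {/* the "ls" case        */
                        ip += lpj;              /* addls ip, ip, r2     */
                        prev_usecs = usecs;
                        usecs -= 2;             /* sublss r0, r0, #2    */
                        if (prev_usecs <= 2)    /* bhi 1b not taken     */
                                break;
                }
                /* in the "hi" case the branch is taken on the flags of
                 * the subs above, so we simply go round again */
        }
}
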
Note that I've also tweaked the loop a little so that the cycle count
around the loop is (in theory) the same whichever path it takes.
This way, I get the same lpj calibration value as the old way - which is
good, as with the old way we were calibrating just this loop:

1:      subs    r0, r0, #1
        bhi     1b
        mov     pc, lr
where r0 = lpj and the target delay time was 1 jiffy.
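
For reference, this is roughly what the generic calibrate_delay() loop does
(a simplified sketch - the real code goes on to refine the low bits), and it
is also where the 8192 mentioned earlier comes from: the search starts at
1 << 12 and doubles before the first __delay() call.

#include <linux/delay.h>
#include <linux/jiffies.h>

/*
 * Simplified sketch of the generic calibration loop, to show what lpj
 * measures: the largest (power-of-two stepped) loop count for which
 * __delay() still completes within one jiffy.
 */
static unsigned long calibrate_sketch(void)
{
        unsigned long lpj = 1 << 12;
        unsigned long ticks;

        while ((lpj <<= 1) != 0) {
                ticks = jiffies;                /* wait for a tick edge */
                while (ticks == jiffies)
                        ;
                ticks = jiffies;
                __delay(lpj);                   /* the loop being timed */
                if (jiffies != ticks)           /* overran the jiffy    */
                        break;
        }
        return lpj >> 1;                        /* last value that fit  */
}
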
Now, what sparked this off was:
> > We could go to ns delays, but then we have a big problem - the cost of
> > calculating the number of loops starts to become significant compared to
> > the delays - and that's a quality of implementation factor. In fact,
> > the existing cost has always been significant for short delays for
> > slower (sub-100MHz) ARMs.
>
> I'm surprised it makes much difference to, say, 20MHz ARMs because you
> could structure it as a nested loop, the inner one executed once per
> microsecond and calibrated to 1us. The delays don't have to be super
> accurate.
With this nested loop approach we can't go to ns resolution.
nsec_per_jiffy would be 10000000, and with an initial loops_per_jiffy
of 4096 or 8192, this would be extremely bad.
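
To put a rough number on one aspect of that, taking the power-of-two scaling
approach above as the example: an initial lpj of 4096 needs 12 doublings to
reach 10000000, so the requested delay loses its bottom 12 bits before
calibration has run.

#include <stdint.h>
#include <stdio.h>

/* Count the doublings (= halvings of the requested delay) that the
 * power-of-two scaling would need before calibration at ns resolution. */
static unsigned int doublings(uint32_t lpj, uint32_t nsec_per_jiffy)
{
        unsigned int n = 0;

        while (lpj < nsec_per_jiffy) {
                lpj <<= 1;
                n++;
        }
        return n;
}

int main(void)
{
        printf("lpj 4096: %u bits of delay lost\n", doublings(4096, 10000000));
        printf("lpj 8192: %u bits of delay lost\n", doublings(8192, 10000000));
        return 0;
}
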
That said, I do think your approach has merit - especially as we're
now seeing CPUs in the 2000 BogoMips range, and our existing solution
goes bad at 3355 BogoMips. As the board I have is something like 8
months old we've probably got what, 10 months left according to Moore's
law?