udelay() broken for SMP cores?
Russell King - ARM Linux
linux at arm.linux.org.uk
Sat Jan 8 18:24:27 EST 2011
On Thu, Apr 22, 2010 at 01:14:17AM +0100, Jamie Lokier wrote:
> Russell King - ARM Linux wrote:
> > Ok, since you seem to have a clear idea how to convert this into a double
> > nested loop, try converting it:
> >
> > @ 0 <= r0 <= 0x7fffff06
> >         ldr     r2, .LC0 (loops_per_jiffy)
> >         ldr     r2, [r2]                @ max = 0x01ffffff
> >         mov     r0, r0, lsr #14         @ max = 0x0001ffff
> >         mov     r2, r2, lsr #10         @ max = 0x00007fff
> >         mul     r0, r2, r0              @ max = 2^32-1
> >         movs    r0, r0, lsr #6
> >         moveq   pc, lr
> > 1:      subs    r0, r0, #1
> >         bhi     1b
> >         mov     pc, lr
> >
> > into two loops without losing the precision - note that the multiply
> > is part of a 'dividing by multiply+shift' technique.
>
>         ldr     r2, loops_per_jiffy
>         ldr     r3, microseconds_per_jiffy
>         mov     r4, r2
> 1:      subs    r4, r4, r3
>         bhi     1b
>         subs    r0, r0, #1
>         add     r4, r4, r2
>         bhi     1b
>         mov     pc, lr
>
> Goodnight :)
I thought I'd dig this out and give it a go - but it has problems. Let's
say usec_per_jiffy is 10000 (ie, HZ=100). Initially, at boot,
loops_per_jiffy is 1<<12, ie 4096.
If udelay() is used prior to calibration (it is - see things like OMAP/8250
console drivers which use udelay(1)), the initial loops_per_jiffy value
will be used.
So, r0 = 10000. r3 = 10000. r2 = 4096.
mov r4, r2 @ r4 := 4096
1: subs r4, r4, r3 @ r4 -= 10000 := -5904
bhi 1b @ not taken
subs r0, r0, #1 @ r0 -= 1 := 9999
add r4, r4, r2 @ r4 += 4096 := -1808 (or 4294965488)
bhi 1b @ taken
That's the first iteration. The next iteration:
1: subs r4, r4, r3 @ r4 -= 10000 := 4294955488
bhi 1b @ taken
1: subs r4, r4, r3 @ r4 -= 10000 := 4294945488
bhi 1b @ taken
... which means we have about 429493 loops to go ...
So this becomes an extremely slow loop - it only works when
loops_per_jiffy >= usec_per_jiffy.
Even with a value of 8192 (the first tried lpj in the calibration loop),
things eventually go wrong - r4 on each iteration goes -1808, -3616, ..
-9040 and then we're into the problem above - and this will be the case
for anyone with HZ=100.
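
The blow-up is easy to reproduce with a rough C model of that loop (names
invented for illustration, arithmetic unsigned 32-bit like the registers):
with the boot-time lpj of 4096 and usec_per_jiffy of 10000, even a udelay(10)
costs a few million trips around the inner loop.

#include <stdint.h>
#include <stdio.h>

/*
 * C model of the two-branch loop (illustration only; assumes usecs >= 1).
 * All values are unsigned 32-bit, as in the ARM registers, so once r4
 * wraps, each remaining microsecond costs roughly 2^32 / usec_per_jiffy
 * trips around the inner loop.
 */
static uint64_t model_loop(uint32_t usecs, uint32_t lpj, uint32_t usec_per_jiffy)
{
        uint32_t r0 = usecs, r4 = lpj, prev;
        uint64_t iterations = 0;

        do {
                do {                            /* 1: subs r4, r4, r3   */
                        prev = r4;
                        r4 -= usec_per_jiffy;
                        iterations++;
                } while (prev > usec_per_jiffy);/*    bhi 1b            */

                r0 -= 1;                        /* subs r0, r0, #1      */
                r4 += lpj;                      /* add  r4, r4, r2      */
        } while (r0 != 0);                      /* bhi 1b - flags come  */
                                                /* from the subs above  */
        return iterations;
}

int main(void)
{
        printf("healthy lpj=100000:  %llu iterations for udelay(10)\n",
               (unsigned long long)model_loop(10, 100000, 10000));
        printf("boot-time lpj=4096:  %llu iterations for udelay(10)\n",
               (unsigned long long)model_loop(10, 4096, 10000));
        return 0;
}
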
So, this solution has undesirable behaviours... and this is what I've
come up with - we scale the lpj up and the required delay down by powers
of two until the necessary precondition (lpj >= usec/jiffy) is met.

.LC0:   .long   loops_per_jiffy
        .long   (1000000 + (HZ / 2))/HZ

ENTRY(__delay)
        ldr     r3, .LC0 + 4            @ usec/jiffy
        mov     r2, r0
        mov     r0, r3
        b       2f
ENTRY(__udelay)
        ldr     r2, .LC0
        ldr     r3, .LC0 + 4            @ usec/jiffy
        ldr     r2, [r2]                @ lpj
2:      cmp     r2, r3
        movcc   r2, r2, lsl #1
        movcc   r0, r0, lsr #1
        bcc     2b
        mov     ip, r2
1:      subs    ip, ip, r3
        addls   ip, ip, r2
        sublss  r0, r0, #2
        bhi     1b
        mov     pc, lr
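
Ignoring the exact cycle counting, what the above does is roughly this in C
(names invented for illustration):

#include <stdint.h>

/*
 * Rough C rendering of the assembler above, purely to make the control
 * flow easier to follow; the names are invented.  usecs, lpj and
 * usec_per_jiffy correspond to r0, r2 and r3.
 */
static void udelay_model(uint32_t usecs, uint32_t lpj, uint32_t usec_per_jiffy)
{
        uint32_t ip, prev_ip, prev_usecs;

        /* 2: cmp/movcc/movcc/bcc - establish lpj >= usec/jiffy by
         * doubling lpj and halving the requested delay. */
        while (lpj < usec_per_jiffy) {
                lpj <<= 1;
                usecs >>= 1;
        }

        ip = lpj;
        for (;;) {
                prev_ip = ip;
                ip -= usec_per_jiffy;           /* subs ip, ip, r3      */
                if (prev_ip <= usec_per_jiffy) {/* the "ls" case        */
                        ip += lpj;              /* addls ip, ip, r2     */
                        prev_usecs = usecs;
                        usecs -= 2;             /* sublss r0, r0, #2    */
                        if (prev_usecs <= 2)    /* bhi 1b not taken     */
                                break;
                }
                /* in the "hi" case the branch is taken on the flags of
                 * the subs above, so we simply go round again */
        }
}
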
Note that I've also tweaked the loop a little so that the cycle count
around the loop is (in theory) the same whichever path it takes.
This way, I get the same lpj calibration value as the old way - which is
good, as with the old way we were calibrating just this loop:

1:      subs    r0, r0, #1
        bhi     1b
        mov     pc, lr
where r0 = lpj and the target delay time was 1 jiffy.
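
For reference, this is roughly what the generic calibrate_delay() loop does
(a simplified sketch - the real code goes on to refine the low bits), and it
is also where the 8192 mentioned earlier comes from: the search starts at
1 << 12 and doubles before the first __delay() call.

#include <linux/delay.h>
#include <linux/jiffies.h>

/*
 * Simplified sketch of the generic calibration loop, to show what lpj
 * measures: the largest (power-of-two stepped) loop count for which
 * __delay() still completes within one jiffy.
 */
static unsigned long calibrate_sketch(void)
{
        unsigned long lpj = 1 << 12;
        unsigned long ticks;

        while ((lpj <<= 1) != 0) {
                ticks = jiffies;                /* wait for a tick edge */
                while (ticks == jiffies)
                        ;
                ticks = jiffies;
                __delay(lpj);                   /* the loop being timed */
                if (jiffies != ticks)           /* overran the jiffy    */
                        break;
        }
        return lpj >> 1;                        /* last value that fit  */
}
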
Now, what sparked this off was:
> > We could go to ns delays, but then we have a big problem - the cost of
> > calculating the number of loops starts to become significant compared to
> > the delays - and that's a quality of implementation factor. In fact,
> > the existing cost has always been significant for short delays for
> > slower (sub-100MHz) ARMs.
>
> I'm surprised it makes much difference to, say, 20MHz ARMs because you
> could structure it as a nested loop, the inner one executed once per
> microsecond and calibrated to 1us. The delays don't have to be super
> accurate.
With this nested loop approach we can't go to ns resolution.
nsec_per_jiffy would be 10000000, and with an initial loops_per_jiffy
of 4096 or 8192, this would be extremely bad.
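
To put a rough number on one aspect of that, taking the power-of-two scaling
approach above as the example: an initial lpj of 4096 needs 12 doublings to
reach 10000000, so the requested delay loses its bottom 12 bits before
calibration has run.

#include <stdint.h>
#include <stdio.h>

/* Count the doublings (= halvings of the requested delay) that the
 * power-of-two scaling would need before calibration at ns resolution. */
static unsigned int doublings(uint32_t lpj, uint32_t nsec_per_jiffy)
{
        unsigned int n = 0;

        while (lpj < nsec_per_jiffy) {
                lpj <<= 1;
                n++;
        }
        return n;
}

int main(void)
{
        printf("lpj 4096: %u bits of delay lost\n", doublings(4096, 10000000));
        printf("lpj 8192: %u bits of delay lost\n", doublings(8192, 10000000));
        return 0;
}
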
That said, I do think your approach has merit - especially as we're
now seeing CPUs in the 2000 BogoMips range, and our existing solution
goes bad at 3355 BogoMips. As the board I have is something like 8
months old we've probably got what, 10 months left according to Moore's
law?