UBI leb_write_unlock NULL pointer Oops (continuation)

Thu Feb 20 12:26:42 EST 2014

>> Bill Pringlemeir wrote:

>> Disassembly of section .data:

>> 00000000 <.data>:
>> 0:   e48a7004        str     r7, [sl], #4
>> 4:   e5985004        ldr     r5, [r8, #4]
>> 8:   e15a0005        cmp     sl, r5
>> c:   0a000029        beq     0xb8
>> 10:   e595300c        ldr     r3, [r5, #12]

>> 'r5' is NULL.  It seems to be the same symptom.  If you run your ARM objdump
>> 	with -S on either vmlinux or '__up_write', it will help confirm that
>> 	it is the list corrupted again.  The assembler above should match.

On 20 Feb 2014, Thorsten.Wiedemer at lawo.com wrote:

> I don't have objdump running on my ARM system at the moment, but with
> 	rwsem-spinlock.c compiled with debug info, objdump -S -D gives for
> 	__up_write():
> ...
> 	sem->activity = 0;
> 29c:	e3a07000 	mov	r7, #0
> 2a0:	e1a0a008 	mov	sl, r8

> 2a4:	e48a7004 	str	r7, [sl], #4
> 2a8:	e5985004 	ldr	r5, [r8, #4]
> 	if (!list_empty(&sem->wait_list))
> 2ac:	e15a0005 	cmp	sl, r5
> 2b0:	0a000029 	beq	35c <__up_write+0xe0>
> 	/* if we are allowed to wake writers try to grant a single write lock
> 	 * if there's a writer at the front of the queue
> 	 * - we leave the 'waiting count' incremented to signify potential
> 	 *   contention
> 	 */
> 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
> 2b4:	e595300c 	ldr	r3, [r5, #12]
> {
> ...

> Seems to match ...

It doesn't matter where it runs.  I just want to make sure it is always
the 'waiter' variable.
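
To spell out why a NULL 'r5' points at 'waiter': here is a small
userspace sketch of the list_entry() arithmetic. The struct layouts are
my assumption, inferred from the disassembly (r5 is loaded as
'wait_list.next' and then used directly as 'waiter', so 'list' must sit
at offset 0). 'waiter_of' and 'flags_access_addr' are made-up helper
names, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types; the layout is an assumption. */
struct list_head {
	struct list_head *next, *prev;
};

struct rwsem_waiter {
	struct list_head list;	/* assumed first member, i.e. offset 0 */
	void *task;		/* stands in for struct task_struct * */
	unsigned int flags;
};

/* Same idea as the kernel's list_entry(); integer math avoids NULL arithmetic. */
#define list_entry(ptr, type, member) \
	((type *)((uintptr_t)(ptr) - offsetof(type, member)))

/* Hypothetical helper: map a list link back to its waiter. */
static struct rwsem_waiter *waiter_of(struct list_head *link)
{
	return list_entry(link, struct rwsem_waiter, list);
}

/* Hypothetical helper: the address a 'waiter->flags' load would touch. */
static uintptr_t flags_access_addr(struct list_head *link)
{
	return (uintptr_t)waiter_of(link) + offsetof(struct rwsem_waiter, flags);
}
```

With 4-byte pointers, 'flags' lands at offset 12, so a NULL
'wait_list.next' turns 'waiter->flags' into a load from address 12,
which would be the 'ldr r3, [r5, #12]' that faults.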

>> What is 'RAVENNA_streame'?  Is this your standard test and not the
>> '8k binary' copy test or are you doing the copy test with this
>> process also running?

> This is an application which runs parallel to our copy test. The last
> days, Emanuel set up another test environment which seems to reproduce
> the error more reliably (at least on some hardware, not on all).  At
> the moment, proprietary applications are running in parallel,
> but I'll try to strip it down to a sequence which I can provide you,
> if you like.

I think scheduling is important to this issue, that is why I asked.

> We could reproduce the error now with function tracing enabled, so we
> have two hopefully valuable traces. But they are rather big (around
> 4MB each). Shall I use pastebin and cut them into several pieces to
> provide them? Or off-list as email attachment?  The trace Emanuel
> posted Wednesday may not be valuable. Perhaps there is a (different)
> error triggered due to memory pressure caused by the function tracing.

After looking, the allocation is not due to memory pressure.  It is due
to different tasks waiting on the rwsem, each with its 'waiter'
allocated on the stack; I guess the task is gone, handling a signal or
something else.  However, the function traces are great.  As you note,
they are rather big, so it will take anyone some time to analyze them.

You could alter '__rwsem_do_wake',

static inline struct rw_semaphore *
__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
{
	struct rwsem_waiter *waiter;
	struct task_struct *tsk;
	int woken;

	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
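+       /* 'waiter' is NULL when wait_list.next is NULL; the disassembly
+        * above uses r5 for both, so 'list' is at offset 0. */
+       if (!waiter) {
+          printk(KERN_ERR "Bad rwsem\n");
+          printk(KERN_ERR "activity is %d.\n", sem->activity);
+          BUG();
+       }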
	if (waiter->type == RWSEM_WAITING_FOR_WRITE) {
		if (wakewrite)

... or something like that.

 * the rw-semaphore definition
 * - if activity is 0 then there are no active readers or writers
 * - if activity is +ve then that is the number of active readers
 * - if activity is -1 then there is one active writer
 * - if wait_list is not empty, then there are processes waiting...

Isn't it inconsistent to have a non-empty list while activity is 0?  The
check added to '__rwsem_do_wake' above tries to catch the moment we find
a 'NULL' in the 'wait_list'.  That always seems to be the issue, but it
is probably not the root cause.
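
If you want to catch the damage earlier than the NULL dereference, a
fuller walk of the list is possible.  The following is only a userspace
sketch of the idea, not kernel code: 'rw_semaphore_model',
'check_wait_list' and the 'max_nodes' bound are all names I made up for
illustration, and a real version would have to run under the
semaphore's wait lock.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Simplified model of the semaphore; not the real kernel struct. */
struct rw_semaphore_model {
	int activity;			/* 0 idle, >0 readers, -1 writer */
	struct list_head wait_list;	/* circular, head-anchored list */
};

enum wait_list_state { WAIT_LIST_EMPTY, WAIT_LIST_OK, WAIT_LIST_CORRUPT };

/*
 * Walk the list from the head and verify each forward link has a
 * matching back link.  'max_nodes' bounds the walk so that a list
 * which never returns to the head is reported instead of looping.
 */
static enum wait_list_state
check_wait_list(const struct rw_semaphore_model *sem, unsigned int max_nodes)
{
	const struct list_head *head = &sem->wait_list;
	const struct list_head *pos = head;
	unsigned int n;

	for (n = 0; n <= max_nodes; n++) {
		const struct list_head *next = pos->next;

		/* A NULL or mismatched link is the corruption we look for. */
		if (!next || next->prev != pos)
			return WAIT_LIST_CORRUPT;
		if (next == head)
			return n == 0 ? WAIT_LIST_EMPTY : WAIT_LIST_OK;
		pos = next;
	}
	return WAIT_LIST_CORRUPT;	/* no way back to the head */
}
```

On WAIT_LIST_CORRUPT you would print 'sem->activity' and BUG(), as in
the patch above, so that the trace stops close to the corruption.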

You can also put similar code in '__rwsem_wake_one_writer' if you
instead get the 'up_read()' fault.

Fwiw,
Bill Pringlemeir.