UBI leb_write_unlock NULL pointer Oops (continuation)

Richard Weinberger richard at nod.at
Tue Feb 4 02:46:17 EST 2014


On 04.02.2014 08:22, Artem Bityutskiy wrote:
> On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:
>> On 03.02.2014 13:51, Wiedemer, Thorsten (Lawo AG) wrote:
>>> Hi,
>>>
>>> I can reproduce it fairly regularly, but not really "quickly". At the moment, I can use a setup of about 70 identical devices.
>>> A test over the last weekend resulted in 6 devices showing the bug.
>>> We have multiple processes which write some data to the device at different intervals and sync it, because this data should be available after a power cut.
>>> Perhaps I can trigger the error more often by writing test processes with shorter write/sync intervals.
>>>
>>> If I have further access to the "big" setup for some days, I will try to make a test without preemption.
>>
>> Hmm, ok.
>> Please also apply this patch, just in case...
>>
>> diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
>> index 0e11671d..48fd2aa 100644
>> --- a/drivers/mtd/ubi/eba.c
>> +++ b/drivers/mtd/ubi/eba.c
>> @@ -301,6 +301,7 @@ static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
>>
>>  	spin_lock(&ubi->ltree_lock);
>>  	le = ltree_lookup(ubi, vol_id, lnum);
>> +	ubi_assert(le);
>>  	le->users -= 1;
>>  	ubi_assert(le->users >= 0);
>>  	up_write(&le->mutex);
> 
> The UBI LEB locking is a bit over-designed; it could be simplified, and
> maybe that would help in looking for the problem.
> 
> This report really does sound like there is something specific to
> Thorsten's system which corrupts memory.

Thorsten sees:
Dec 25 03:59:22 kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000000c
(leb_write_unlock+0x74/0xf0) from [<c02d0d10>] (ubi_eba_write_leb+0x94/0x820)

In July 2013 we got this report from a user:
[  300.554525] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
(leb_write_unlock+0xa0/0xf4) from [<802850e0>] (ubi_eba_write_leb+0x568/0x80c)

In both cases we fault at address 0000000c and leb_write_unlock() was called by ubi_eba_write_leb().

The same user also saw the issue in the read path:

[   38.471134] Unable to handle kernel NULL pointer dereference at virtual address 00000000
(leb_read_unlock+0xa0/0xf4) from [<80285cdc>] (ubi_eba_read_leb+0x404/0x480)

In that case the fault happened at 00000000 directly.
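
For illustration only: a small fault address such as 0000000c is what one would expect when a NULL struct pointer is dereferenced at a member that sits 12 bytes into the structure, while 00000000 corresponds to an access at the very start. The sketch below models this in userspace. The struct layout, the field names and the helper functions are all hypothetical and not copied from the UBI sources; which member the kernel actually touched in these oopses is not established here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT copied from the kernel: three 32-bit link
 * words followed by three 32-bit integers, roughly what a tree-linked
 * entry could look like on a 32-bit ARM build.
 */
struct model_entry {
	uint32_t link[3];   /* stands in for tree-node bookkeeping */
	int32_t  key_a;     /* offset 0x0c in this model */
	int32_t  key_b;     /* offset 0x10 in this model */
	int32_t  users;     /* offset 0x14 in this model */
};

/*
 * Address the CPU would try to access for a member at byte offset
 * 'off' when the struct pointer itself holds the value 'base'.
 */
static uintptr_t fault_address(uintptr_t base, size_t off)
{
	return base + off;
}

/* Offsets of two members of the model, exposed for inspection. */
static size_t model_offset_link(void)
{
	return offsetof(struct model_entry, link);
}

static size_t model_offset_key_a(void)
{
	return offsetof(struct model_entry, key_a);
}
```

With a NULL base, fault_address(0, model_offset_key_a()) gives 0xc and fault_address(0, model_offset_link()) gives 0 in this model, so comparing the reported fault addresses against the real structure offsets may narrow down which access went wrong.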

A bit too deterministic for memory corruption, IMHO.

> And it is difficult to debug this via the mailing list. Thorsten should
> start adding various checks like this and try to come closer to the
> root-cause.

Yeah.
We also need more oopses, maybe we can spot a pattern.
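
As a rough idea of the kind of check meant here, the following is a userspace sketch of an unlock path that reports which (vol_id, lnum) pair had no entry instead of dereferencing a NULL result. Everything in it (struct model_lock, model_lookup(), model_unlock(), the linear search) is a hypothetical stand-in and not the UBI implementation; in the kernel the lookup goes through ltree_lookup() under ltree_lock, which is not modeled.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a lock-tree entry; not the kernel structure. */
struct model_lock {
	int vol_id;
	int lnum;
	int users;
};

/* Linear search standing in for the tree lookup; returns NULL when absent. */
static struct model_lock *model_lookup(struct model_lock *tab, size_t n,
				       int vol_id, int lnum)
{
	for (size_t i = 0; i < n; i++)
		if (tab[i].vol_id == vol_id && tab[i].lnum == lnum)
			return &tab[i];
	return NULL;
}

/*
 * Drop one user. Returns 0 on success, -1 if the entry is missing or
 * has no users left. Both failures are logged together with the ids,
 * so the report shows which (vol_id, lnum) pair was affected.
 */
static int model_unlock(struct model_lock *tab, size_t n, int vol_id, int lnum)
{
	struct model_lock *le = model_lookup(tab, n, vol_id, lnum);

	if (!le) {
		fprintf(stderr, "unlock: no entry for vol %d LEB %d\n",
			vol_id, lnum);
		return -1;
	}
	if (le->users <= 0) {
		fprintf(stderr, "unlock: bad user count %d for vol %d LEB %d\n",
			le->users, vol_id, lnum);
		return -1;
	}
	le->users -= 1;
	return 0;
}
```

Logging the ids at the point of failure would at least tell us whether the missing entries cluster on particular volumes or LEBs across several reports.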

Thanks,
//richard


