UBI leb_write_unlock NULL pointer Oops (continuation)

Tue Feb 4 02:54:52 EST 2014

On Tue, 2014-02-04 at 08:46 +0100, Richard Weinberger wrote:
> Am 04.02.2014 08:22, schrieb Artem Bityutskiy:
> > On Mon, 2014-02-03 at 14:56 +0100, Richard Weinberger wrote:
> >> Am 03.02.2014 13:51, schrieb Wiedemer, Thorsten (Lawo AG):
> >>> Hi,
> >>>
> >>> I can reproduce it fairly regularly, but not really "quickly". At the moment, I can use a setup of about 70 identical devices.
> >>> A test over the last weekend resulted in 6 devices showing the bug.
> >>> What we have are multiple processes which write some data to the device at different intervals and sync it, because this data should be available after a power cut.
> >>> Perhaps I can force the error more often by writing test processes with shorter write/sync intervals.
> >>>
> >>> If I have further access to the "big" setup for some days, I will try to make a test without preemption.
> >>
> >> Hmm, ok.
> >> Please also apply this patch, just in case...
> >>
> >> diff --git a/drivers/mtd/ubi/eba.c b/drivers/mtd/ubi/eba.c
> >> index 0e11671d..48fd2aa 100644
> >> --- a/drivers/mtd/ubi/eba.c
> >> +++ b/drivers/mtd/ubi/eba.c
> >> @@ -301,6 +301,7 @@ static void leb_write_unlock(struct ubi_device *ubi, int vol_id, int lnum)
> >>
> >>  	spin_lock(&ubi->ltree_lock);
> >>  	le = ltree_lookup(ubi, vol_id, lnum);
> >> +	ubi_assert(le);
> >>  	le->users -= 1;
> >>  	ubi_assert(le->users >= 0);
> >>  	up_write(&le->mutex);
> > 
> > The UBI LEB locking is a bit over-designed; it could be simplified, and
> > maybe this would help in looking for the problem.
> > 
> > This report really does sound like there is something specific to
> > Thorsten's system which corrupts memory.

Maybe. Although sometimes corruptions are also deterministic: a
buffer overrun at the same place causes the same side effects, etc.

But in any case, the only way I know to deal with such issues is to
start adding various prints and assertions, and to try to get closer to
the root cause. Sometimes bisecting helps, but this case would be
difficult to bisect because the bug is hard to reproduce. Indeed, one
may see no failure during a day and mark the commit as 'good' while it
is actually 'bad'; the bug just happened not to manifest itself
quickly enough.
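
To put a rough number on that bisect risk, here is a minimal userspace
sketch. It assumes every device-run is independent and hits the bug with
the same probability p_fail, which is a simplification. The function
names are hypothetical and p_fail has to be estimated from your own
test runs.

```c
/*
 * Sketch only: hypothetical helper names, independent device-runs
 * assumed. p_fail is the assumed chance that ONE device-run on a
 * bad kernel hits the Oops.
 */

/*
 * Probability that a bad commit shows zero failures in `runs`
 * device-runs, i.e. that it gets wrongly marked 'good'.
 */
double false_good_probability(double p_fail, int runs)
{
	double q = 1.0;
	int i;

	for (i = 0; i < runs; i++)
		q *= 1.0 - p_fail;
	return q;
}

/*
 * Smallest number of device-runs for which the chance of a false
 * 'good' drops to max_false_good or below. Returns -1 if the
 * inputs make that impossible or meaningless.
 */
int runs_needed(double p_fail, double max_false_good)
{
	double q = 1.0;
	int runs = 0;

	if (p_fail <= 0.0 || p_fail > 1.0 || max_false_good <= 0.0)
		return -1;
	while (q > max_false_good) {
		q *= 1.0 - p_fail;
		runs++;
	}
	return runs;
}
```

Taking Thorsten's 6 failures on 70 devices over a weekend as p_fail of
about 6/70 per device-weekend (a crude estimate), one device tested for
one weekend would pass a bad commit about 91% of the time, while all 70
devices together would do so only about 0.2% of the time. So each
bisect step probably needs the whole setup, not a single board.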

-- 
Best Regards,
Artem Bityutskiy