UBIFS corruption after power cut - possibly unstable bits issue?

Tue Dec 1 01:12:49 PST 2015

On Mon, 30 Nov 2015 13:58:34 -0800
Tim Harvey <tharvey at gateworks.com> wrote:

> On Mon, Nov 16, 2015 at 7:01 AM, Tim Harvey <tharvey at gateworks.com> wrote:
> > On Tue, Nov 3, 2015 at 5:38 AM, Boris Brezillon
> > <boris.brezillon at free-electrons.com> wrote:
> >> Hi Tim,
> >>
> >> On Mon, 2 Nov 2015 12:31:11 -0800
> >> Tim Harvey <tharvey at gateworks.com> wrote:
> >>
> >>> On Mon, Nov 2, 2015 at 12:27 PM, Tim Harvey <tharvey at gateworks.com> wrote:
> >>> > [    8.635364] UBIFS (ubi0:0): recovery needed
> >>> > [    8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while
> >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry
> >>> > [    8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while
> >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry
> >>> > [    8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while
> >>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry
> >>> > ^^^^ non correctable ecc error on PEB 2254  - I verified that this was
> >>> > not the first time this PEB has been used
> >>
> >> I suspect one of the bits in PEB 2254 to be stuck at 0 (even after
> >> erasing the block the bit stays at 0). Have you tried to erase this
> >> block (flash_erase /dev/mtd2 0x23380000 1) and dump it in raw mode
> >> (nanddump -n -l 0x40000 -s 0x23380000 -f /tmp/dump /dev/mtd2)?
> >
> > Boris,
> >
> > I examined the bad PEB on several boards on which I have now
> > reproduced this issue and found no stuck bits (no 0's following
> > erase, no 1's following erase and raw write all ff's).
> >
> > So in this case it doesn't appear to be a bad block. Incidentally for
> > UBI/UBIFS, what is in charge of detecting bad blocks, how are they
> > detected, and when/how are they marked?
> >
> >>
> >>> >
> >>> > I've cc'd Huang, Elie, and Brian who were involved in the patch to
> >>> > detect bit-flips in gpmi-nand.c reads - perhaps they have some more
> >>> > ideas. I find it interesting that in one case that patch resolves the
> >>> > issue and in the other it does not.
> >>
> >> I posted a slightly reworked version of Huang's patch [1] a while ago
> >> addressing the "account for bitflips in OOB area" problem, but maybe we
> >> could do better (avoid this extra "read in raw mode" step, or use the
> >> generic nand_check_erased_ecc_chunk() function when ECC bytes are
> >> aligned).
> >>
> >> Best Regards,
> >>
> >> Boris
> >>
> >> [1]https://patchwork.ozlabs.org/patch/416543/
> >
> > At this point I likely need to reproduce this problem with additional
> > debugging enabled to show what last erased and/or wrote to the PEB's
> > that are corrupt. I will also try your patch as well and see if that
> > resolves anything.
> >
> > Regards,
> >
> > Tim
> 
> Boris,
> 
> I tried your patch [1] in a week-long test on 10 i.MX6 boards, booting
> over 60K times across temperature ranges, and the patch resolved many
> of the earlier rootfs mount failures (previously about 1% of boots
> failed to mount rootfs). I also saw no NAND corruption, where I would
> have expected to see it several times over that many boots, so I
> suspect the patch may have resolved that as well.
> 
> Can you re-submit your patch for inclusion and/or discussion?

I'm quite busy with other topics lately, but feel free to adapt/resubmit
the patch.

Best Regards,

Boris

-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
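
For anyone repeating the erase-and-dump check quoted above, here is a
minimal sketch of how the raw dump could be inspected. It is not part of
the original exchange. It assumes the 0x40000-byte PEB size implied by the
nanddump length, and that the dump was taken right after flash_erase, so
every byte is expected to read 0xff. The function names are hypothetical,
and the scan treats all bytes alike, whether or not the dump includes OOB
data.

```python
PEB_SIZE = 0x40000  # assumed from "-l 0x40000" in the nanddump command


def peb_offset(peb, peb_size=PEB_SIZE):
    """Byte offset of a PEB within the MTD partition."""
    return peb * peb_size


def find_cleared_bits(data, base=0):
    """Return (byte offset, bit index) for every bit that reads 0.

    After an erase, any entry in the result is a candidate stuck-at-0 bit.
    """
    cleared = []
    for i, byte in enumerate(data):
        if byte == 0xFF:
            continue
        for bit in range(8):
            if not (byte >> bit) & 1:
                cleared.append((base + i, bit))
    return cleared


def scan_dump(path, peb):
    """Scan a raw dump file of one erased PEB and report cleared bits."""
    with open(path, "rb") as f:
        data = f.read()
    return find_cleared_bits(data, base=peb_offset(peb))


# PEB 2254 maps to the offset used in the flash_erase/nanddump commands.
assert peb_offset(2254) == 0x23380000
```

Calling scan_dump("/tmp/dump", 2254) on the file written by the nanddump
command would return an empty list for a cleanly erased block. A bit that
shows up again at the same offset after repeated erases would point to a
stuck bit rather than a transient flip.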