UBIFS corruption after power cut - possibly unstable bits issue?

Tim Harvey tharvey at gateworks.com
Mon Nov 30 13:58:34 PST 2015


On Mon, Nov 16, 2015 at 7:01 AM, Tim Harvey <tharvey at gateworks.com> wrote:
> On Tue, Nov 3, 2015 at 5:38 AM, Boris Brezillon
> <boris.brezillon at free-electrons.com> wrote:
>> Hi Tim,
>>
>> On Mon, 2 Nov 2015 12:31:11 -0800
>> Tim Harvey <tharvey at gateworks.com> wrote:
>>
>>> On Mon, Nov 2, 2015 at 12:27 PM, Tim Harvey <tharvey at gateworks.com> wrote:
>>> > [    8.635364] UBIFS (ubi0:0): recovery needed
>>> > [    8.676203] ubi0 warning: ubi_io_read: error -74 (ECC error) while
>>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry
>>> > [    8.692460] ubi0 warning: ubi_io_read: error -74 (ECC error) while
>>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry
>>> > [    8.708741] ubi0 warning: ubi_io_read: error -74 (ECC error) while
>>> > reading 69632 bytes from PEB 2254:192512, read only 69632 bytes, retry
>>> > ^^^^ non correctable ecc error on PEB 2254  - I verified that this was
>>> > not the first time this PEB has been used
>>
>> I suspect one of the bits in PEB 2254 is stuck at 0 (even after
>> erasing the block the bit stays at 0). Have you tried to erase this
>> block (flash_erase /dev/mtd2 0x23380000 1) and dump it in raw mode
>> (nanddump -n -l 0x40000 -s 0x23380000 -f /tmp/dump /dev/mtd2)?
>
> Boris,
>
> I examined the bad PEB on several of the boards on which I have now
> reproduced this issue and found no stuck bits (no 0's following an
> erase, and no 1's following an erase and a raw write of all ff's).
>
> So in this case it doesn't appear to be a bad block. Incidentally for
> UBI/UBIFS, what is in charge of detecting bad blocks, how are they
> detected, and when/how are they marked?
>
>>
>>> >
>>> > I've cc'd Huang, Elie, and Brian who were involved in the patch to
>>> > detect bit-flips in gpmi-nand.c reads - perhaps they have some more
>>> > ideas. I find it interesting that in one case that patch resolves the
>>> > issue and in the other it does not.
>>
>> I posted a slightly reworked version of Huang's patch [1] a while ago
>> addressing the "account for bitflips in OOB area" problem, but maybe we
>> could do better (avoid this extra "read in raw mode" step, or use the
>> generic nand_check_erased_ecc_chunk() function when ECC bytes are
>> aligned).
>>
>> Best Regards,
>>
>> Boris
>>
>> [1]https://patchwork.ozlabs.org/patch/416543/
>
> At this point I likely need to reproduce this problem with additional
> debugging enabled to show what last erased and/or wrote to the PEBs
> that are corrupt. I will also try your patch and see if that
> resolves anything.
>
> Regards,
>
> Tim

Boris,

I tried your patch [1] in a week-long test on 10 i.MX6 boards, booting
over 60K times in total across temperature ranges. The patch resolved
many of the earlier failure-to-mount-rootfs errors (previously I would
see a failure rate of around 1%). In addition, I saw no NAND
corruption, where I would have expected to see it several times over
that many boots, so I suspect the patch may have resolved that as well.

Can you re-submit your patch for inclusion and/or discussion?

Regards,

Tim

[1] https://patchwork.ozlabs.org/patch/416543/
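
For anyone repeating the stuck-bit check quoted above (erase the PEB,
dump it raw, then look for bits that are not 1), here is a minimal
sketch of the two mechanical steps: turning a PEB number into the byte
offset passed to flash_erase/nanddump, and scanning the dump for zero
bits. The 0x40000 PEB size is taken from the -l argument in the quoted
nanddump command; the function names are purely illustrative and are
not part of mtd-utils.

```python
# Sketch only: helper names are hypothetical, not from mtd-utils.

# Assumption: 256 KiB erase blocks, as implied by "-l 0x40000" in the
# nanddump command quoted in this thread.
PEB_SIZE = 0x40000


def peb_offset(peb, peb_size=PEB_SIZE):
    """Byte offset of a physical eraseblock from the start of the MTD device."""
    return peb * peb_size


def find_zero_bits(data):
    """Return (byte_offset, bit_index) for every 0 bit in a raw dump.

    A freshly erased NAND block should read back as all 0xff, so any
    hit here is a bit that did not read as 1 after the erase.
    """
    hits = []
    for off, byte in enumerate(data):
        if byte != 0xFF:
            for bit in range(8):
                if not (byte >> bit) & 1:
                    hits.append((off, bit))
    return hits
```

hex(peb_offset(2254)) gives 0x23380000, which matches the offset used
in the quoted commands. To scan a dump, read it in binary mode, e.g.
find_zero_bits(open("/tmp/dump", "rb").read()). A single clean dump
does not rule out unstable bits, so the check would need repeating
over several erase/read cycles.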


