corrupted empty space - again

Wed May 15 04:07:57 EDT 2013

On Mon, 2013-04-22 at 16:19 +0200, Michał Przepłata wrote:
> Hello Artem,
> 
> I am having some issues similar to the ones reported earlier by others,
> on this and other lists (corrupted empty space), but I cannot see a
> reliable solution for this, hence I'm writing here for help.
> I hope you could point me to some solution.
> 
> To the point:
> - I am using kernel 2.6.35 (with some patches up to 12.2010), I know - it's old
> - chip is Samsung K9F1G08U0D (128MB SLC, 2kB pages, layout is using
> ECC for 4x512B subpages),
>   SoC is Freescale i.mx28
> - our device must finish booting and be running apps in under 10s; if
>   this is not met we are powered down (by the backend device)
> - I didn't run any MTD tests/bonnie++ yet for testing this chip/MTD
>   driver (could be useful for discovering other issues, but I don't
>   think it matters for the case described below)
> - on one of our devices we got UBIFS corruption for the R/W data
>   partition (in the empty space area); below is the original bug report
>   log (without further debug messages):
> (cut)
> [    0.230000] UBI: attaching mtd1 to ubi0
> [    0.230000] UBI: physical eraseblock size:   131072 bytes (128 KiB)
> [    0.230000] UBI: logical eraseblock size:    126976 bytes
> [    0.230000] UBI: smallest flash I/O unit:    2048
> [    0.230000] UBI: VID header offset:          2048 (aligned 2048)
> [    0.230000] UBI: data offset:                4096
> [    0.770000] UBI: attached mtd1 to ubi0
> [    0.770000] UBI: MTD device name:            "gpmi-nfc-general-use"
> [    0.770000] UBI: MTD device size:            117 MiB
> [    0.770000] UBI: number of good PEBs:        940
> [    0.770000] UBI: number of bad PEBs:         0
> [    0.770000] UBI: max. allowed volumes:       128
> [    0.770000] UBI: wear-leveling threshold:    4096
> [    0.770000] UBI: number of internal volumes: 1
> [    0.770000] UBI: number of user volumes:     3
> [    0.780000] UBI: available PEBs:             0
> [    0.780000] UBI: total number of reserved PEBs: 940
> [    0.780000] UBI: number of PEBs reserved for bad PEB handling: 9
> [    0.780000] UBI: max/mean erase counter: 407/123
> [    0.780000] UBI: image sequence number: 0
> [    0.780000] UBI: background thread "ubi_bgt0d" started, PID 30
> (cut)
> [    0.940000] VFS: Mounted root (squashfs filesystem) readonly on device 254:0.
> [    1.760000] UBIFS: recovery needed
> [    1.810000] UBIFS: recovery completed
> [    1.810000] UBIFS: mounted UBI device 0, volume 2, name "data"
> [    1.810000] UBIFS: file system size:   76947456 bytes (75144 KiB, 73 MiB, 606 LEBs)
> [    1.810000] UBIFS: journal size:       3809280 bytes (3720 KiB, 3 MiB, 30 LEBs)
> [    1.810000] UBIFS: media format:       w4/r0 (latest is w4/r0)
> [    1.810000] UBIFS: default compressor: zlib
> [    1.810000] UBIFS: reserved for root:  3634417 bytes (3549 KiB)
> [    3.460000] UBI error: ubi_io_read: error -74 while reading 126976 bytes from PEB 97:4096, read 126976 bytes
> [    3.470000] UBIFS error (pid 86): ubifs_scan: corrupt empty space at LEB 318:116009
> [    3.490000] UBIFS error (pid 86): ubifs_scanned_corruption: corruption at LEB 318:116009
> [    3.490000] 00000000: ffffffdf ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
> [    3.490000] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
> (cut)
> [    3.560000] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
> [    3.560000] UBIFS error (pid 86): ubifs_scan: LEB 318 scanning failed
> [    3.560000] UBIFS warning (pid 86): ubifs_ro_mode: switched to read-only mode, error -117
> [    3.560000] UBIFS error (pid 86): make_reservation: cannot reserve 137 bytes in jhead 2, error -117
> [    3.580000] UBIFS error (pid 86): do_writepage: cannot write page 1 of inode 1218, error -117
> (cut)
> 
> It seems that:
> a) the error is of the single bit-flip kind (read decay); I don't
>    currently suspect an unstable-bits issue during erasing/writing
> b) our NAND driver doesn't protect our empty space (no wonder, as the 13
>    ECC bytes used per 512B subpage should be left 0xFF until written
>    with real data)
> c) as checked, this is the first empty page (2kB) in this PEB; the
>    previous page contains some data (and nothing shows that we have
>    more than one page corrupted)
> 
> I have tried changing the NAND/MTD driver to return -EUCLEAN instead of
> -EBADMSG (to fix the problem below the UBI layer, pretending that we
> have a correctable bit-flip). Results (with UBI debug turned on):
> FAIL#1 - the error was still there (UBIFS corruption when mounting the
>   data partition, required for booting); scrubbing for this PEB was
>   initiated (ubi_wl_scrub_peb), but happened some time later, when left
>   running after artificially disconnecting the backend (I guess it was
>   scheduled to the ubi_bgt0d task)
> FAIL#2 - it seems that PEB 97 was rewritten to PEB 89, however the
>   corrupted empty space was also preserved (sic!) at the very same
>   offset, hence the error is still there (confirmed with nanddump)
> 
> That means that trying further to fix this in the NAND/MTD driver is
> futile. Am I right?
> 
> Questions:
> 1) is there any chance that merging recent UBI/UBIFS sources will make
> it go away?

No.

> 2) should I try to change UBI/UBIFS to deal with this problem? Ideally,
>    rewriting/recovering this PEB would happen immediately at the time of
>    discovery (in the UBI layer). Alternatively, immediately at the UBIFS
>    layer (in the ubifs_scan function, when the page is checked for
>    containing only 0xffffffff).
>    Could you point me to an example that would be proper for this?

Currently UBI/UBIFS assume that drivers fix up bit-flips in erased
areas.
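
To illustrate why a single flipped bit is enough to trip the scan, here
is a minimal Python sketch (not kernel code) of a strict "all 0xFF"
check, applied to the first row of the dump quoted above. The function
names are made up for this example.

```python
# Illustrative only: a strict "is this area erased?" check, similar in
# spirit to a scanner that expects empty space to read back as all 0xFF.

def count_zero_bits(buf: bytes) -> int:
    """Number of bits that read as 0 (fully erased NAND should have none)."""
    return sum(8 - bin(b).count("1") for b in buf)

def is_strictly_empty(buf: bytes) -> bool:
    """True only if every byte is 0xFF."""
    return all(b == 0xFF for b in buf)

# First row of the dump from the log: one word is ffffffdf, the rest ffffffff.
words = [0xFFFFFFDF] + [0xFFFFFFFF] * 7
row = b"".join(w.to_bytes(4, "big") for w in words)

print(count_zero_bits(row))    # 1
print(is_strictly_empty(row))  # False
```

One cleared bit in otherwise erased space makes the strict check fail,
which is why the layer below is expected to hide such bit-flips.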

Why does the fact that the driver does not protect the empty space not
worry you? Isn't it a flaw? If you have empty space with too many
bit-flips, and you write useful data there which you then cannot read,
isn't that a problem?

To me it really sounds like it is the job of the MTD layer and/or the
driver to protect the empty space. Perhaps MTD could provide some generic
methods for those drivers which do not have their own protection?
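
A rough sketch of what such a generic method could look like, written in
Python only to show the logic. Everything here is an assumption: the
names fixup_erased_chunk, fixup_erased_page and ErasedChunkError are
hypothetical, and the default threshold of 4 bit-flips per 512-byte
chunk is a made-up number that in practice would have to come from the
ECC strength of the chip/driver.

```python
from typing import Tuple

class ErasedChunkError(Exception):
    """Too many zero bits: the chunk cannot be treated as erased."""

def fixup_erased_chunk(chunk: bytes, max_bitflips: int) -> Tuple[bytes, int]:
    """Return (all-0xFF chunk, number of flipped bits) or raise."""
    zero_bits = sum(8 - bin(b).count("1") for b in chunk)
    if zero_bits > max_bitflips:
        # Not plausibly an erased chunk: report it, do not hide it.
        raise ErasedChunkError(f"{zero_bits} zero bits > {max_bitflips}")
    # Few enough flips: hand clean 0xFF data to the upper layers.
    return b"\xff" * len(chunk), zero_bits

def fixup_erased_page(page: bytes, chunk_size: int = 512,
                      max_bitflips: int = 4) -> Tuple[bytes, int]:
    """Apply the per-chunk fix-up to every ECC chunk of a page."""
    cleaned = bytearray()
    total_flips = 0
    for off in range(0, len(page), chunk_size):
        fixed, flips = fixup_erased_chunk(page[off:off + chunk_size],
                                          max_bitflips)
        cleaned += fixed
        total_flips += flips
    return bytes(cleaned), total_flips

# A 2 KiB page that is erased except for one flipped bit.
page = bytearray(b"\xff" * 2048)
page[3] = 0xDF
cleaned, flips = fixup_erased_page(bytes(page))
print(flips, cleaned == b"\xff" * 2048)
```

The point of the threshold is that a small number of flips is cleaned up
and can be reported as a corrected bit-flip, while a chunk with many zero
bits is still reported as an error rather than silently accepted.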

> 3) what about a band-aid solution (commenting out the ‘goto corrupted;’
>    line) explained in:
>    http://e2e.ti.com/support/embedded/linux/f/354/t/171839.aspx
>    Does UBI/UBIFS also check for all 0xFF in a page before writing (not
>    as part of any ‘extra checks’ debugging)? If so, then maybe such a
>    quick fix could be used (fixing the ubifs_scan issue), but best
>    followed by some soon-to-happen recovery that will recover this LEB
>    and erase/reuse the PEB?

I think this is a bad idea, because this way you also ignore real
corruptions, instead of noticing them right away.

-- 
Best Regards,
Artem Bityutskiy