corrupted empty space - again

Mon Apr 22 10:19:35 EDT 2013

Hello Artem,

I am having issues similar to those reported earlier by others on this
and other lists (corrupted empty space), but I cannot find a reliable
solution, so I am writing here for help.
I hope you can point me to one.

To the point:
- I am using kernel 2.6.35 (with some patches up to 12.2010). I know, it's old.
- The chip is a Samsung K9F1G08U0D (128MB SLC, 2kB pages, the layout uses
  ECC over 4x512B subpages); the SoC is a Freescale i.MX28.
- Our device must finish booting and have its apps running within 10s; if
  this is not met, we are powered down (by the backend device).
- I haven't run any MTD tests/bonnie++ on this chip/MTD driver yet (they
  could be useful for discovering other issues, but I don't think it
  matters for the case described below).
- On one of our devices we got UBIFS corruption on the R/W data partition
  (in the empty space area). Below is the original bug report log (without
  further debug messages):
(cut)
[    0.230000] UBI: attaching mtd1 to ubi0
[    0.230000] UBI: physical eraseblock size:   131072 bytes (128 KiB)
[    0.230000] UBI: logical eraseblock size:    126976 bytes
[    0.230000] UBI: smallest flash I/O unit:    2048
[    0.230000] UBI: VID header offset:          2048 (aligned 2048)
[    0.230000] UBI: data offset:                4096
[    0.770000] UBI: attached mtd1 to ubi0
[    0.770000] UBI: MTD device name:            "gpmi-nfc-general-use"
[    0.770000] UBI: MTD device size:            117 MiB
[    0.770000] UBI: number of good PEBs:        940
[    0.770000] UBI: number of bad PEBs:         0
[    0.770000] UBI: max. allowed volumes:       128
[    0.770000] UBI: wear-leveling threshold:    4096
[    0.770000] UBI: number of internal volumes: 1
[    0.770000] UBI: number of user volumes:     3
[    0.780000] UBI: available PEBs:             0
[    0.780000] UBI: total number of reserved PEBs: 940
[    0.780000] UBI: number of PEBs reserved for bad PEB handling: 9
[    0.780000] UBI: max/mean erase counter: 407/123
[    0.780000] UBI: image sequence number: 0
[    0.780000] UBI: background thread "ubi_bgt0d" started, PID 30
(cut)
[    0.940000] VFS: Mounted root (squashfs filesystem) readonly on device 254:0.
[    1.760000] UBIFS: recovery needed
[    1.810000] UBIFS: recovery completed
[    1.810000] UBIFS: mounted UBI device 0, volume 2, name "data"
[    1.810000] UBIFS: file system size:   76947456 bytes (75144 KiB, 73 MiB, 606 LEBs)
[    1.810000] UBIFS: journal size:       3809280 bytes (3720 KiB, 3 MiB, 30 LEBs)
[    1.810000] UBIFS: media format:       w4/r0 (latest is w4/r0)
[    1.810000] UBIFS: default compressor: zlib
[    1.810000] UBIFS: reserved for root:  3634417 bytes (3549 KiB)
[    3.460000] UBI error: ubi_io_read: error -74 while reading 126976 bytes from PEB 97:4096, read 126976 bytes
[    3.470000] UBIFS error (pid 86): ubifs_scan: corrupt empty space at LEB 318:116009
[    3.490000] UBIFS error (pid 86): ubifs_scanned_corruption: corruption at LEB 318:116009
[    3.490000] 00000000: ffffffdf ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
[    3.490000] 00000020: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
(cut)
[    3.560000] 00001fe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff  ................................
[    3.560000] UBIFS error (pid 86): ubifs_scan: LEB 318 scanning failed
[    3.560000] UBIFS warning (pid 86): ubifs_ro_mode: switched to read-only mode, error -117
[    3.560000] UBIFS error (pid 86): make_reservation: cannot reserve 137 bytes in jhead 2, error -117
[    3.580000] UBIFS error (pid 86): do_writepage: cannot write page 1 of inode 1218, error -117
(cut)
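
To make the numbers in the log easier to relate to the flash layout, here is
a minimal Python sketch that maps the reported LEB offset to a page and ECC
subpage inside the PEB. The geometry constants come from the UBI attach log
and the chip description above; the helper names (locate, flipped_bits) are
made up for illustration and are not kernel functions. It assumes that LEB
offset 0 sits at the UBI data offset inside the PEB, as the "PEB 97:4096"
read in the log suggests.

```python
# Geometry taken from the UBI attach log and the chip description.
PEB_SIZE = 131072      # physical eraseblock size
DATA_OFFSET = 4096     # UBI data offset (LEB data starts here in the PEB)
PAGE_SIZE = 2048       # smallest flash I/O unit
SUBPAGE_SIZE = 512     # ECC step (4 x 512B per page)

LEB_SIZE = PEB_SIZE - DATA_OFFSET   # should match the logged LEB size


def locate(leb_offset):
    """Map a byte offset inside a LEB to (page in PEB, subpage, byte in subpage)."""
    peb_offset = leb_offset + DATA_OFFSET
    page, in_page = divmod(peb_offset, PAGE_SIZE)
    subpage, byte = divmod(in_page, SUBPAGE_SIZE)
    return page, subpage, byte


def flipped_bits(word, width=32):
    """Count bits of a dumped word that differ from the erased (all-ones) state."""
    erased = (1 << width) - 1
    return bin(word ^ erased).count("1")


if __name__ == "__main__":
    print("LEB size:", LEB_SIZE)
    print("page, subpage, byte:", locate(116009))
    print("flipped bits in ffffffdf:", flipped_bits(0xFFFFFFDF))
```

With the values from the log this gives a LEB size of 126976, places offset
116009 in page 58 of the PEB (subpage 2, byte 297), and counts exactly one
flipped bit in the first dumped word.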

It seems that:
a) the error is a single bit-flip (read decay); I don't currently suspect
    an unstable-bits issue during erasing/writing
b) our NAND driver doesn't protect empty space (no wonder, as the 13 ECC
    bytes used per 512B subpage are left at 0xFF until real data is written)
c) as checked, this is the first empty page (2kB) in this PEB; the previous
    page contains some data (and nothing suggests that more than one page
    is corrupted)

I have tried changing the NAND/MTD driver to return -EUCLEAN instead of
-EBADMSG (to fix the problem below the UBI layer, pretending that we have a
correctable bit-flip). Results (with UBI debug turned on):
FAIL#1 - the error was still there (UBIFS corruption when mounting the data
  partition, which is required for booting); scrubbing of this PEB was
  initiated (ubi_wl_scrub_peb), but happened only some time later, when the
  device was left running after artificially disconnecting the backend
  (I guess it was scheduled to the ubi_bgt0d task)
FAIL#2 - it seems that PEB 97 was rewritten to PEB 89, however the corrupted
  empty space was also preserved (sic!) at the very same offset, hence the
  error is still there (confirmed with nanddump)

That means that trying further to fix this in the NAND/MTD driver is
futile. Am I right?
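
One plausible reading of FAIL#2 is that changing only the return code leaves
the flipped bit in the read buffer, so whatever copies the PEB copies the bit
along with it. The Python model below is a sketch of that idea under stated
assumptions, not driver code: fixup_failed_read and its threshold parameter
are hypothetical, and the errno values are the Linux ones seen in the log
(74 for EBADMSG, 117 for EUCLEAN). It treats a chunk whose ECC check failed
as erased when only a few bits are zero across data and ECC bytes, and in
that case also restores the buffer to 0xFF.

```python
ERASED = 0xFF
EUCLEAN = 117   # bit-flips were corrected, block should be scrubbed
EBADMSG = 74    # uncorrectable ECC error


def zero_bits(buf):
    """Number of 0 bits in buf, i.e. bits that differ from the erased state."""
    return sum(8 - bin(b).count("1") for b in buf)


def fixup_failed_read(data, ecc, threshold=1):
    """Hypothetical post-processing of one ECC chunk that failed correction.

    data: the 512B subpage as read; ecc: its ECC bytes from the OOB area.
    Returns (status, data). If the chunk looks erased apart from at most
    `threshold` flipped bits, the returned data is all 0xFF and the status
    is -EUCLEAN; otherwise the data is returned untouched with -EBADMSG.
    """
    if zero_bits(data) + zero_bits(ecc) <= threshold:
        return -EUCLEAN, bytes([ERASED]) * len(data)
    return -EBADMSG, bytes(data)


if __name__ == "__main__":
    chunk = bytearray(b"\xff" * 512)
    chunk[297] = 0xDF                      # the single flipped bit
    status, fixed = fixup_failed_read(chunk, b"\xff" * 13)
    print(status, fixed == b"\xff" * 512)
```

In this model the upper layers never see the flipped bit, but the bit is
still physically present in the flash until the PEB is erased, so the
-EUCLEAN status (and the scrubbing it triggers) remains necessary.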

Questions:
1) Is there any chance that merging recent UBI/UBIFS sources will make it
   go away? That is 2.5 years of code development, and aside from UBI/UBIFS
   I would probably be forced to merge other subsystems as well, which
   could result in a merging hell I would like to avoid.
   I have browsed through the GIT tree and I see that some 2 years ago a
   set of patches introduced a 'corrupted PEBs list' that, from what I
   understand, would make this (and only this!) PEB read-only
   (unfortunately forever, which will deplete the pool of reserved PEBs
   sooner, which is also not that nice).
   Would merging those patches make UBIFS continue with scanning, or will
   this still be scheduled to the bg task (i.e. useless in this case)?
2) Should I try to change UBI/UBIFS to deal with this problem? Ideally,
   rewriting/recovering this PEB would happen immediately at the time of
   discovery (in the UBI layer). Alternatively, immediately at the UBIFS
   layer (in the ubifs_scan function, when the page is checked for
   containing only 0xffffffff).
   Could you point me to an example that would be suitable for this?
3) What about the band-aid solution (commenting out the ‘goto corrupted;’
   line) explained in:
   http://e2e.ti.com/support/embedded/linux/f/354/t/171839.aspx
   Does UBI/UBIFS also check for all 0xFF in a page before writing (not as
   part of any ‘extra checks’ debugging)? If so, then maybe such a quick
   fix could be used (fixing the ubifs_scan issue), but best followed by
   some soon-to-happen recovery that will recover this LEB and
   erase/reuse the PEB?
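
To illustrate what questions 2 and 3 are asking for, here is a small Python
model of an empty-space check with a bit-flip tolerance. It is only a sketch:
classify_empty_space, max_bitflips and the verdict strings are hypothetical
and do not correspond to existing UBIFS code. With max_bitflips=0 it behaves
like a strict check and reports the first non-0xFF byte; with a tolerance it
flags the LEB for recovery instead of declaring corruption.

```python
def first_non_erased(buf, start):
    """Offset of the first byte != 0xFF at or after start, or None."""
    for offs in range(start, len(buf)):
        if buf[offs] != 0xFF:
            return offs
    return None


def classify_empty_space(buf, start, max_bitflips=0):
    """Classify the supposedly empty tail of a LEB, buf[start:].

    Returns (verdict, offset) where verdict is one of:
      "empty"          - every byte is 0xFF
      "needs_recovery" - at most max_bitflips bits are 0; the caller may
                         keep scanning but should move the data and erase
                         the PEB before that space is written
      "corrupt"        - more zero bits than tolerated
    """
    flips = sum(8 - bin(b).count("1") for b in buf[start:])
    if flips == 0:
        return "empty", None
    offs = first_non_erased(buf, start)
    if flips <= max_bitflips:
        return "needs_recovery", offs
    return "corrupt", offs


if __name__ == "__main__":
    leb = bytearray(b"\xff" * 126976)
    leb[116009] = 0xDF                       # single flipped bit
    print(classify_empty_space(leb, 116000))                  # strict check
    print(classify_empty_space(leb, 116000, max_bitflips=1))  # tolerant check
```

The "needs_recovery" verdict exists because tolerating the flip does not
remove it: NAND programming can only change bits from 1 to 0, so a page that
already contains a 0 bit cannot be written reliably until the PEB has been
erased.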

Thanks for your help and time!

Regards,
Michal Przeplata