corrupted empty space - again

Tue Apr 23 01:21:04 EDT 2013

> It seems that:
> a) the error is of the single bit-flip kind (read decay); I don't
>     currently suspect an unstable-bits issue during erasing/writing
> b) our NAND driver doesn't protect our empty space (no wonder, as the
>     13 bytes of ECC used per 512B subpage should be left 0xFF until
>     written with real data)

[Pekon]: This should not matter. Suppose bit 0 of a byte in an erased
page has flipped to 0, and you then write 0x77 to that byte. NAND
programming can only clear bits, so 0x76 would actually be stored on
the device, while the corresponding ECC is stored in the OOB/spare area.
When you read the byte back, it reads as 0x76, but the ECC stored in the
spare/OOB area was calculated for 0x77, so the driver would detect a
bit-flip and should correct it too.
So whether bit-flips occur on an erased page or after writing, both
can be detected and corrected by your driver.
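
A toy sketch of the 0x77/0x76 example above. It assumes a deliberately
simple single-error-correcting code (XOR of the 1-based positions of all
set bits), not the BCH code a real driver would use; toy_ecc,
nand_program and read_with_correction are hypothetical names used only
for illustration.

```python
def toy_ecc(data: bytes) -> int:
    """XOR of the 1-based positions of all set bits (toy code, not BCH)."""
    ecc = 0
    for i, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                ecc ^= i * 8 + bit + 1
    return ecc


def nand_program(cell: int, value: int) -> int:
    """NAND programming can only clear bits (1 -> 0), never set them."""
    return cell & value


def read_with_correction(stored: bytes, stored_ecc: int):
    """Return (data, corrected_bits); raise if the toy code cannot fix it."""
    syndrome = toy_ecc(stored) ^ stored_ecc
    if syndrome == 0:
        return stored, 0
    pos = syndrome - 1                  # 0-based position of the flipped bit
    if pos >= len(stored) * 8:
        raise ValueError("uncorrectable by this toy code")
    fixed = bytearray(stored)
    fixed[pos // 8] ^= 1 << (pos % 8)
    return bytes(fixed), 1


erased = 0xFE                           # erased byte whose bit 0 decayed to 0
on_flash = nand_program(erased, 0x77)   # what the device really holds
ecc = toy_ecc(bytes([0x77]))            # ECC is computed for the intended data
data, fixed_bits = read_with_correction(bytes([on_flash]), ecc)
```

The ECC is computed from the data the caller asked to write, not from
what the cells end up holding, which is why the pre-existing flip shows
up as an ordinary correctable error on read.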

[Pekon]: Another thing your driver should handle is whether it can
detect bit-flips in the OOB/spare area (the stored ECC bytes) itself.
The reason is that if the ECC you read back is itself corrupted, you
should not mistakenly "correct" data that was actually fine.
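
One way to picture this, as a sketch only: a toy code over a single byte
that stores each parity value together with its complement, so that a
flip in the data and a flip in the stored ECC produce different syndrome
patterns. This is an assumption made for illustration, not the BCH scheme
discussed in this thread, and toy_ecc and classify are hypothetical names.

```python
MASK = 0x7  # 3 index bits are enough to address the 8 bits of one byte


def toy_ecc(byte: int):
    """Return (p, q): XOR of set-bit indices and XOR of their complements."""
    p = q = 0
    for idx in range(8):
        if (byte >> idx) & 1:
            p ^= idx
            q ^= ~idx & MASK
    return p, q


def classify(byte: int, stored_ecc):
    """Decide what the syndrome means and return (verdict, data)."""
    p, q = toy_ecc(byte)
    sp, sq = p ^ stored_ecc[0], q ^ stored_ecc[1]
    if sp == 0 and sq == 0:
        return "clean", byte
    if sp ^ sq == MASK:
        # Both halves agree on one index: a single data bit flipped.
        return "data bit-flip corrected", byte ^ (1 << sp)
    if bin(sp).count("1") + bin(sq).count("1") == 1:
        # Exactly one syndrome bit set: the stored ECC itself is damaged.
        return "ecc bit-flip, data untouched", byte
    return "uncorrectable", byte


good_ecc = toy_ecc(0x77)
data_case = classify(0x76, good_ecc)                       # flip in data
ecc_case = classify(0x77, (good_ecc[0] ^ 1, good_ecc[1]))  # flip in ECC
multi_case = classify(0x77 ^ 0b11, good_ecc)               # two data flips
```

Without the second check, the flipped ECC bit would be read as "bit 1 of
the data is wrong" and good data would be damaged by the correction.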

> c) as checked, this is the first empty page (2kB) in this PEB; the
>     previous page contains some data (and nothing shows that we have
>     more than one page corrupted)
> 
> I have tried changing the NAND/MTD driver to return -EUCLEAN instead of
> -EBADMSG (to fix the problem below the UBI layer, pretending that we
> have a correctable bit-flip). Results (with UBI debug turned on):

[Pekon]: I think this is the wrong approach:
(a) -EBADMSG means the driver encountered more bit-flips than your
 ECC scheme can correct, so it is returning corrupted data to you.
(b) -EUCLEAN indicates that the data had bit-flips, but they were
 _already_ corrected by the driver, so the data is correct.
 In addition, the upper file-system layer can take preventive action
 (like scrubbing in UBIFS) to avoid accumulation of bit-flips in future.

So replacing -EBADMSG with -EUCLEAN would not fix the corrupted
data; rather, it would fool the upper FS layer into using the corrupted
data as if it were fixed.
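
The distinction can be sketched roughly as follows. This is a simplified
model, not the actual MTD code: read_status and upper_layer_action are
hypothetical names, the errno numbers are the Linux values, and the
default threshold of 1 is an assumption chosen to match the description
above (any corrected bit-flip is reported).

```python
EBADMSG = 74    # Linux errno value: ECC could not correct the data
EUCLEAN = 117   # Linux errno value: "structure needs cleaning"


def read_status(flips_per_step, ecc_strength=8, bitflip_threshold=1):
    """Map per-ECC-step bit-flip counts to an MTD-style return code."""
    if any(n > ecc_strength for n in flips_per_step):
        return -EBADMSG     # the data handed back is still corrupted
    if max(flips_per_step, default=0) >= bitflip_threshold:
        return -EUCLEAN     # data was corrected; ask the caller to scrub
    return 0


def upper_layer_action(ret):
    """What a layer above the driver may conclude from the return code."""
    if ret == -EBADMSG:
        return "do not trust data"
    if ret == -EUCLEAN:
        return "use data, schedule scrub"
    return "use data"


clean = read_status([0, 0, 0, 0])       # 2kB page, four 512B steps
corrected = read_status([1, 0, 0, 0])
broken = read_status([0, 9, 0, 0])      # 9 flips exceed a strength of 8
```

Returning -EUCLEAN for the third case changes only the label: the caller
then takes the "use data, schedule scrub" path with uncorrected bytes.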

> FAIL#1 - the error was still there (UBIFS corruption when mounting the
>   data partition, required for booting); scrubbing for this PEB was
>   initiated (ubi_wl_scrub_peb), but happened some time later, when
>   left running after artificially disconnecting the backend (I guess it
>   was scheduled to the ubi_bgt0d task)
> FAIL#2 - it seems that PEB 97 was rewritten to PEB 89; however, the
>   corrupted empty space was also preserved (sic!) at the very same
>   offset, hence the error is still there (confirmed with nanddump)

[Pekon]: This is why you are seeing the bit-flip getting copied to the
new PEB: you fooled the FS layer by returning -EUCLEAN.

> 
> That means that further trying to fix that in NAND/MTD driver is
> futile. Am I right?
> 

[Pekon]: First identify the root cause of the problem.
Possibility-1: are you actually seeing more bit-flips within the same
page than your ECC scheme is able to handle?
I assume you are using the BCH8 algorithm, so check whether your
number_of_bit-flips > 8 per 512B ECC step (subpage).
You can check this by dumping the page with and without ECC correction:
nanddump -s <offset> -p <device> -f "file1.hex"       (with correction)
nanddump -s <offset> -p -n <device> -f "file2.hex"    (without correction)
Solution-1: upgrade your ECC scheme to BCH-16 or similar.
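
To count the flips, something like the sketch below could be used. It
assumes the dumps were taken as raw binary (that is, without -p, which
pretty-prints hex) and have already been read into bytes objects;
bitflips_per_step is a hypothetical helper name, and the 512-byte step is
taken from the "13 bytes ECC per 512B subpage" layout quoted earlier. For
a page that should be empty, the expected content is simply all 0xFF.

```python
def bitflips_per_step(raw: bytes, expected: bytes, step: int = 512):
    """Count differing bits between two dumps, per ECC step of `step` bytes."""
    if len(raw) != len(expected):
        raise ValueError("dumps must have the same length")
    counts = []
    for off in range(0, len(raw), step):
        diff = 0
        for a, b in zip(raw[off:off + step], expected[off:off + step]):
            diff += bin(a ^ b).count("1")   # popcount of the differing bits
        counts.append(diff)
    return counts


# Example: a 2kB page that should be empty, with one decayed bit.
page = bytearray(b"\xff" * 2048)
page[600] = 0xFE                            # byte 600 lies in the second step
counts = bitflips_per_step(bytes(page), b"\xff" * 2048)
over_limit = [i for i, n in enumerate(counts) if n > 8]   # BCH8 limit
```

Any step listed in over_limit would be beyond what BCH8 can correct; an
empty list points towards Possibility-2 instead.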

Possibility-2: your NAND driver is not catching the bit-flips correctly.
Solution-2: fix your NAND driver, not MTD or UBI.

With regards, pekon