[PATCH v6 4/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes

Gupta, Pekon pekon at ti.com
Wed Jan 15 18:17:13 EST 2014


Hi Brian,


>From: Brian Norris [mailto:computersforpeace at gmail.com]
[...]
>>
>> This patch solves the following problems in ECC correction for BCHx_HW ecc-schemes.
>> Problem: Current implementation depends on a specific byte-position (reserved
>>          as 0x00) in ecc-layout to differentiate between programmed-pages v/s
>>          erased-pages.
>>       1) Not all ecc-scheme layouts have such a Reserved byte marker to
>>          differentiate between erased-page v/s programmed-page. Thus this is a
>>          customized solution.
>>       2) Reserved byte can itself be subjected to bit-flips causing erased-page
>>          to be misunderstood as programmed-page.
>>
>> Solution: This patch removes dependency on a single byte-position in ecc-layout
>>          to differentiate between erased-page v/s programmed-page.
>>          This patch 'assumes' any page to be 'erased':
>> 		(a) if        all(read_ecc)  == 0xff
>> 		(b) else if   all(read_data) == 0xff
>
>Your assumptions break down if somebody intentionally programmed a page
>with 0xff data. Then the ECC region will not be 0xff, but the data area
>may be all 0xff. Isn't this a major problem? (You can try testing this
>with nandwrite/nanddump using a page of 0xff.)
>
I did think about this, but I found that UBI doesn't really differentiate between
 (a) a page intentionally programmed with 0xff data.
 (b) a page left blank.
This is why we have the 'white-space-fixup' feature in UBIFS, which re-erases
pages that it thinks are blank. For UBI, any block which has a 'vid-header'
is assumed to contain some user-data.
Example: when a UBI image is flashed using the 'nand write' command from u-boot,
u-boot writes 0xff as data, along with its OOB, into the empty pages.
But when UBIFS scans all the blocks, it considers the blocks without a 'vid-header'
as erased, despite the fact that those pages are programmed with 0xff data.
(However, Artem can confirm this better.)


>>
>> Reasons for (a)
>>       -  An abrupt termination of page programming (like power failure)
>>          may result in partial write, leaving page in corrupted state with
>>          un-stable bits. As OOB region is programmed after the data-region,
>>          so if read_ecc[] == 0xff, then the page should be treated as erased.
>>
>>       -  Also, as ECC is not present, any bitflips in page cannot be detected.
>>
>> Reasons for (b)
>>       - Due to architecture of NAND cell, bit-flips cannot change programmed
>>         value from '0' -> '1'. So if all(read_data) == 0xff then it's confirmed
>>         that there is 'no bit-flips in data-region'. Hence, read_data[] == 0xff
>>         can be safely returned.
>>
I think this is a valid assumption, because converting all the programmed bits
back to '1's requires an erase operation, which internally is done at a slightly
higher voltage (as per my understanding of flash cell & floating-gate architecture).


>>       - if page was programmed-page with 0xff and 'calc_ecc[] != 0x00' then
>>         it means that page has bit-flips in OOB-region.
>>
Sorry, yes, maybe I wrote this incorrectly.. (please ignore this)


>>       - if page was erased-page and 'read_ecc[] != ecc_of_all_0xff' then
>>         it means that there are bit-flips in the OOB-region of the page.
Sorry, there is a typo here.. it should be calc_ecc[] != ecc_of_all_0xff.
	If (OOB == all(0xff))                       /* page is erased */
		If (calc_ecc[] != ecc_of_all_0xff)  /* erased-page has bit-flips */

So, I think the above assumption is also correct?
(And this is already followed in the existing OMAP driver, refer bch8_vector;
a rough sketch follows below.)
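A minimal sketch of that check, assuming a per-scheme constant holding the
ECC bytes which the engine produces for an all-0xff chunk (the names, the
13-byte BCH8 size and the helper itself are illustrative assumptions, not the
exact omap2-nand code):

	#include <linux/string.h>
	#include <linux/types.h>

	#define ECC_BYTES	13	/* e.g. BCH8 over 512B; adjust per scheme */

	/* ECC the engine computes for an all-0xff chunk (scheme specific) */
	static const u8 ecc_of_all_0xff[ECC_BYTES] = {
		0x00	/* placeholder: fill in the real per-scheme values */
	};

	/*
	 * An erased-looking page still carries bit-flips if the ECC
	 * calculated over the data actually read differs from the ECC
	 * of a genuinely all-0xff page.
	 */
	static bool erased_page_has_bitflips(const u8 *calc_ecc)
	{
		return memcmp(calc_ecc, ecc_of_all_0xff, ECC_BYTES) != 0;
	}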


>
>I think several assumptions and statements you are making are flawed,
>and (unless I'm wrong) you probably need to rethink the approach in this
>patch and the previous one.
>
>Are you mainly trying to (1) improve performance and (2) remove reliance
>on a single byte marker? I think the first point may be misguided and
>not important in this case, but the second looks like you can solve it
>well. What do you think of trying to tackle (2) without (1)? You would
>be keeping much of the existing hweight()-related code for determining
>bitflips across the whole data+OOB, but you could still have a smaller
>ECC-only check to perform a faster first pass.
>
Yes, I want to solve (2), "remove reliance on a single byte marker".
That is the priority.
But recently I have come across some MLC and new-technology NAND
devices which have very high bit-flip rates:
(a) Almost every page has bit-flips, and some pages have stuck bits too.
   All of this is within the correctable range, but performance will certainly
   take a hit.
(b) Even the erased-pages have lots of bit-flips in the OOB area, so
   it's difficult to differentiate between erased-pages v/s programmed-pages.
(c) And due to the bit-flips in erased-pages, the 'faster' ECC-only check will
   mostly fail, so we fall back to scanning the whole page and comparing each
   byte with 0xFF (a whole-page pass of the kind sketched below). Thus
   performance falls further.
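For reference, here is a rough sketch of that kind of hweight()-based
whole-page pass (the helper name, parameters and policy are illustrative
assumptions, not the current driver code): count how many bits in data+OOB
have flipped away from '1', and treat the page as erased only if the count
stays within the correctable strength.

	#include <linux/bitops.h>
	#include <linux/errno.h>
	#include <linux/types.h>

	/*
	 * Count the bits that are not 1 across data and OOB. Return the
	 * number of bit-flips if the page can still be treated as erased,
	 * or -EBADMSG if there are too many zero bits.
	 */
	static int erased_page_bitflips(const u8 *data, size_t data_len,
					const u8 *oob, size_t oob_len,
					unsigned int ecc_strength)
	{
		unsigned int flips = 0;
		size_t i;

		for (i = 0; i < data_len; i++)
			flips += hweight8(data[i] ^ 0xff);
		for (i = 0; i < oob_len; i++)
			flips += hweight8(oob[i] ^ 0xff);

		return flips <= ecc_strength ? (int)flips : -EBADMSG;
	}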

I'm unable to find a proper solution that avoids (2) and still handles (b).
So, in this patch I just do two checks:
- compare all ECC bytes read from OOB with 0xff, to see whether the OOB was
  ever programmed.
- compare all data bytes with 0xff, to see whether any data is present.
If either of the two returns true, then I consider the page as erased,
because UBIFS does not distinguish between an erased-page and a
programmed-page containing all(0xff) data. (A small sketch of these two
checks follows below.)
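A minimal sketch of those two checks (the function name and parameters are
illustrative assumptions; memchr_inv() returns NULL when every byte matches):

	#include <linux/string.h>
	#include <linux/types.h>

	/*
	 * Treat the page as erased if either the ECC bytes read from OOB
	 * or the main data area consist entirely of 0xff.
	 */
	static bool page_looks_erased(const u8 *read_data, size_t data_len,
				      const u8 *read_ecc, size_t ecc_len)
	{
		/* (a) all(read_ecc) == 0xff: the ECC area was never programmed */
		if (!memchr_inv(read_ecc, 0xff, ecc_len))
			return true;

		/* (b) all(read_data) == 0xff: no programmed bits in the data area */
		return !memchr_inv(read_data, 0xff, data_len);
	}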


[...]

>> +			/* (a) page can be 'assumed' erased if
>> +			 * all(read_ecc) == 0xff */
>
>Can you fix your comment style please? Note the multiline style used by
>the comments you are deleting.
>
OK.. but that style wastes 2 lines :-)
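For reference, the multi-line comment style being asked for would look like
this (indeed two lines longer):

	/*
	 * (a) page can be 'assumed' erased if
	 * all(read_ecc) == 0xff
	 */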


with regards, pekon


