[PATCH v7 3/6] mtd: nand: omap: ecc.correct: omap_elm_correct_data: fix erased-page detection for BCHx_HW ECC schemes
Gupta, Pekon
pekon at ti.com
Wed Feb 12 05:31:02 EST 2014
Hi Brian,
>From: Brian Norris [mailto:computersforpeace at gmail.com]
>>On Fri, Jan 17, 2014 at 12:43:14PM +0530, Pekon Gupta wrote:
>> As erased-pages do not have ECC stored in their OOB area, so they need to be
>> seperated out from programmed-pages, before doing BCH ECC correction.
>>
>> In current implementation of omap_elm_correct_data() which does ECC correction
>> for BCHx ECC schemes, this erased-pages are detected based on specific marker
>> byte (reserved as 0x00) in ecc-layout.
>> However, this approach has some limitation like;
>> 1) All ecc-scheme layouts do not have such Reserved byte marker to
>> differentiate between erased-page v/s programmed-page. Thus this is a
>> customized solution.
>> 2) Reserved marker byte can itself be subjected to bit-flips causing
>> erased-page to be misunderstood as programmed-page.
>>
>> This patch removes dependency on any marker byte in ecc-layout, to differentiate
>> between erased-page v/s programeed-page. Instead a page is considered erased if
>> (a) all(OOB) == 0xff, .i.e., number of zero-bits in read_ecc[] == 0
>> (b) number of zero-bits in (Data + OOB) is less than ecc-strength
>>
>> This patch also adds a generic function count_zero_bits(), to find number of
>> bits which are '0' in given buffer. This function is optimized for comparing
>> read-data with 0xff.
>>
>> Signed-off-by: Pekon Gupta <pekon at ti.com>
>> ---
[...]
>> +/**
>> * omap_elm_correct_data - corrects page data area in case error reported
>> * @mtd: MTD device structure
>> * @data: page data
>> @@ -1359,14 +1383,21 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>> {
>> struct omap_nand_info *info = container_of(mtd, struct omap_nand_info,
>> mtd);
>> - int eccbytes = info->nand.ecc.bytes;
>> - int eccsteps = info->nand.ecc.steps;
>> + struct nand_ecc_ctrl *ecc = &info->nand.ecc;
>> + int eccstrength = ecc->strength;
>> + int eccsize = ecc->size;
>> + int eccbytes = ecc->bytes;
>> + int eccsteps = ecc->steps;
>
>When I recommended adding the 'ecc' variable (as you did), I was
>suggesting you drop the next 4 variables. You don't need them when you
>can (with just as few characters) refer to ecc->field directly and
>easily. Stick with one or ther other -- the 4 local variables or the 1
>ecc pointer -- but you don't need both.
>
Oh sorry.. I misunderstood that. I'll fix this in next version.
>> @@ -1410,24 +1441,47 @@ static int omap_elm_correct_data(struct mtd_info *mtd, u_char *data,
>> }
>>
>> if (eccflag == 1) {
>> - /*
>> - * Set threshold to minimum of 4, half of ecc.strength/2
>> - * to allow max bit flip in byte to 4
>> - */
>> - unsigned int threshold = min_t(unsigned int, 4,
>> - info->nand.ecc.strength / 2);
>> + bitflip_count = 0;
>> + /* count zero-bits in OOB region */
>> + err = count_zero_bits(read_ecc, eccbytes,
>> + eccstrength, &bitflip_count);
>> + if (err) {
>> + /*
>> + * number of zero-bits in OOB > ecc-strength
>> + * either un-correctable or programmed-page
>> + */
>> + page_is_erased = false;
>> + } else if (bitflip_count == 0) {
>> + /* OOB == all(0xff) */
>> + page_is_erased = true;
>
>This else-if is still incorrect. You cannot assume that just because the
>OOB is all 0xff that the page is erased.
>
Now this part is most important, where I would like to get more clarity
and feedback before I proceed.
----------------------------
if OOB of an page is == all(0xff)
(a) As per general NAND spec, chip->write_buf() _only_ fills the internal buffers of
NAND device's with information to be written. Actual write to NAND array cells
happen only after 'NAND_CMD_PAGEPROG' or 'NAND_CMD_CACHEDPROG' is issued.
Thus both Main-area and OOB-area are programmed simultaneously inside NAND device.
if there is any disruption (like power-cut) in on-going NAND_CMD_PAGEPROG
cycle, then the data should be assumed to be garbage, and it may have un-stable bits.
(b) if OOB==all(0xff) then there is _no_ ECC stored in OOB to check the read_data
Hence driver cannot distinguish between genuine data and bit-flips.
Now based on (a) & (b), it's safe to assume that:
if (OOB == all 0xff)
/* page has no-data | garbage-data*/
Agree ?
(This is where I call it page_is_erased==true, Though I could have used better
variable name as page_has_no_valid_data = true).
----------------------------
And also, driver should return 'total number zero-bits in both Main-area + OOB'
if (0 < 'total number of zero-bits in both OOB + DATA region' < mtd->bitflip_threshold)
return -EUCLEAN;
else
return -EBADMSG;
** However there is a bug in current patch version which I just figured out.
I don't include the number of zero-bits of Main-area, if OOB==all(0xff).
+ } else if (bitflip_count == 0) {
+ /* OOB == all(0xff) */
+ page_is_erased = true;
+ /* missing */ bitflip_count += count_zero_bits(buf, eccsize,
+ eccstrength, &bitflip_count);
------------------------
>> + } else {
>> + /*
>> + * OOB has some bits as '0' but
>> + * number of zero-bits in OOB < ecc-strength.
>> + * It may be erased-page with bit-flips so,
>> + * further checks are needed for confirmation
>> + */
>> + /* count zero-bits in DATA region */
>> + buf = &data[eccsize * i];
>> + err = count_zero_bits(buf, eccsize,
>> + eccstrength, &bitflip_count);
>> + if (err) {
>> + /*
>> + * total number of zero-bits in OOB
>> + * and DATA exceeds ecc-strength
>> + */
>> + page_is_erased = false;
>> + } else {
>> + /* number of zero-bits < ecc-strength */
>> + page_is_erased = true;
>> + }
>> + }
>
>This whole block (where you call count_zero_bits() twice) is a
>convoluted and buggy way of just doing the following, AFAIU:
>
> page_is_erased = !count_zero_bits(read_ecc, ecc->bytes,
> ecc->strength, &bitflip_count) &&
> !count_zero_bits(&data[ecc->size * i], ecc->size,
> ecc->strength, &bitflip_count);
>
>You could modify that to your liking and add a comment, but it's much
>more concise than your version, and from what I can tell, it's actually
>correct.
>
Firstly, In you above statement 's/&&/||'
because as per above statement, if count_zero_bits(oob) == 0, then
my patch assumes 'page_has_no_valid_data = true'.
page_is_erased = !count_zero_bits(read_ecc, ecc->bytes,
ecc->strength, &bitflip_count) ||
!count_zero_bits(&data[ecc->size * i], ecc->size,
ecc->strength, &bitflip_count);
Secondly, You are missing the that here NAND driver is relaxing the limit of
number-of-acceptable bit-flips in an erased-page, because some MLC and
Toshiba SLC NANDs show bit-flips almost on every second already erased-page.
>> + if (err) {
>> + /*
>> + * total number of zero-bits in OOB
>> + * and DATA exceeds ecc-strength
>> + */
>> + page_is_erased = false;
>> + } else {
>> + /* number of zero-bits < ecc-strength */
>> + page_is_erased = true;
>> + }
In above patch I can replace ecc->strength with mtd->bitflip_threshold
to make it more user controllable. As bitflip_threshold is configurable via
/sys/class/mtd/mtdX/bitflip_threshold
What is your opinion ?
This was also the basis of my question to Artem [1]
[1] http://lists.infradead.org/pipermail/linux-mtd/2014-January/051556.html
With regards, pekon
More information about the linux-mtd
mailing list